LiPO: Listwise Preference Optimization through Learning-to-Rank
Tianqi Liu*1, Zhen Qin*1, Junru Wu1, Jiaming Shen1, Misha Khalman2, Rishabh Joshi2, Yao Zhao2, Mohammad Saleh2, Simon Baumgartner1, Jialu Liu1, Peter J. Liu2, Xuanhui Wang1
Abstract
Aligning language models (LMs) with curated human feedback is critical to control their behaviors in real-world applications. Several recent policy optimization methods, such as DPO and SLiC, serve as promising alternatives to the traditional Reinforcement Learning from Human Feedback (RLHF) approach. In practice, human feedback often comes in the form of a ranked list over multiple responses to amortize the cost of reading the prompt. Multiple responses can also be ranked by reward models or AI feedback. However, there is little study on directly fitting a policy to a ranked list of responses. In this work, we formulate LM alignment as a listwise ranking problem and describe the Listwise Preference Optimization (LiPO) framework, where the policy can potentially learn more effectively from a ranked list of plausible responses given the prompt. This view draws an explicit connection to Learning-to-Rank (LTR), where most existing preference optimization work can be mapped to existing ranking objectives, especially pairwise ones. Following this connection, we provide an examination of ranking objectives that are not well studied for LM alignment, with DPO and SLiC as special cases when the list size is two. In particular, we highlight a specific method, LiPO-λ, which leverages a state-of-the-art listwise ranking objective and weights each preference pair in a more advanced manner. We show that LiPO-λ can outperform DPO and SLiC by a clear margin on two preference alignment tasks.
1. Introduction
Recent Large Language Models, such as GPT-4 (OpenAI, 2023) and Gemini (Team et al., 2023), have unlocked unprecedented capabilities, witnessed by impressive
*Equal contribution. 1Google Research. 2Google DeepMind. Correspondence to: Tianqi Liu <tianqiliu@google.com>, Zhen Qin <zhenqin@google.com>.
Preprint.
Figure 1. Performance of the pairwise DPO (DPO-BT), listwise DPO (DPO-PL), and our proposed listwise approach, LiPO-λ. All can benefit from training data beyond pairwise data (List Size = 2), while LiPO-λ benefits more, and monotonically, as the list size increases.
performance on diverse tasks from conversational chatbots to programming. A key step to control the behavior of such Language Models (LMs) is to align them with curated human feedback. Preference optimization has become an active research area. Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017) was first introduced to improve the alignment of LMs with human preferences (Ouyang et al., 2022). However, RLHF is a complex process, requiring substantial memory and hyperparameter tuning.
Several recent works resort to alternatives to RLHF, and noticeably converge to a pairwise ranking optimization paradigm. For example, DPO (Rafailov et al., 2023) optimizes a pairwise logistic loss directly on pairwise human preference data, while avoiding an explicit reward model and RL-based optimization. Similarly, SLiC (Zhao et al., 2023) optimizes a pairwise hinge loss objective on pairwise preference data, either collected directly from humans or sampled from a supervised fine-tuned (SFT) policy and ranked by a reward model. RRHF (Yuan et al., 2023) starts from listwise preference data labeled by a reward model, and optimizes a pairwise contrastive objective (i.e., comparing all pairs in the list), which is analogous to the SLiC objective. More recently, RSO (Liu et al., 2023) unifies the DPO and SLiC losses and proposes a better way of sourcing preference data via statistical rejection sampling. Virtually all existing preference
optimization frameworks do not go beyond pairwise preferences. However, in practice, human preference data can come as a ranked list to amortize the cost of reading the prompt (Köpf et al., 2023; Ouyang et al., 2022). For example, InstructGPT (Ouyang et al., 2022) collects data in this way: for each output, labelers give a Likert score for overall quality on a 1-7 scale. After evaluating each output individually, labelers rank all the outputs for a given prompt. Ties are encouraged in cases where two outputs seem to be of similar quality. As another example, OpenAssistant (Köpf et al., 2023) has tree-structured prompt-assistant alternating conversation threads, with each node associated with a rank among all its siblings.
In this work, we treat LM alignment as a listwise ranking problem, where the LM can potentially learn alignment more effectively from listwise preferences. This draws an explicit connection to the rich Learning-to-Rank (LTR) literature (Liu, 2009). In particular, popular existing methods can be mapped to existing ranking objectives, especially pairwise ones. Notably, the LTR literature has shown that direct listwise optimization can be more effective than pairwise alternatives for the listwise ranking problem, with a rich set of methods of various properties and performance (Liu, 2009).
This listwise ranking perspective, together with related methods, has not been well studied for the LM preference optimization problem. As we will show, not every listwise objective is effective, and thus it is desirable to conduct an examination to identify the most effective ones for LM alignment. In Figure 1, we can see that listwise preference data can benefit existing pairwise methods like DPO even though they treat all pairs from a list equally, a problem not well studied in the literature, while our new listwise method, called LiPO-λ, can further benefit from listwise preferences. The DPO paper (Rafailov et al., 2023) briefly touches on the listwise Plackett-Luce model (Luce, 2005) for preference modeling in its Appendix without any experimental results. We note that their listwise approach maps to a specific listwise ranking objective (Xia et al., 2008), which falls behind the state of the art.
To this end, we provide the first comprehensive study of ranking objectives under the Listwise Preference Optimization (LiPO) framework, where we compare popular and state-of-the-art ranking objectives for the LM preference optimization problem. In particular, existing methods, such as DPO, RRHF, and SLiC, can be mapped to existing pairwise ranking optimization objectives on listwise data (if only pairwise data is used, it is equivalent to a list size of 2). Furthermore, we show that a new method, LiPO-λ, which leverages a state-of-the-art ranking objective (Burges et al., 2006; Burges, 2010; Wang et al., 2018), can achieve very competitive performance. Notably, LiPO-λ allows an intuitive interpretation: it leverages a sophisticated weighting paradigm that assigns listwise-aware weights to sampled pairs in order to optimize a well-founded ranking metric (Wang et al., 2013), in contrast to existing methods that assign uniform weights or use weighting schemes that fail to consider various factors.
Through a comprehensive study of various ranking objectives on the Reddit TL;DR and AnthropicHH tasks, we show that LiPO-λ is better than existing methods, including DPO (Rafailov et al., 2023) and SLiC (Zhao et al., 2023), as well as other baselines motivated by the LTR literature.
Our contributions are summarized as follows:
• We describe the Listwise Preference Optimization (LiPO) framework, which generalizes recent pairwise preference optimization methods and allows an examination of other alternatives through the lens of Learning-to-Rank.
• We provide a comprehensive investigation of ranking objectives for LM preference optimization, especially listwise objectives that are not well studied in the LM preference optimization literature.
• We highlight a new method, LiPO-λ, which shows competitive performance across the evaluation tasks.
2. The LiPO Framework
In the remainder, we build on the notation of Rafailov et al. (2023). In LM generation, given a prompt x ∈ X, there is an action space Y, where each action is usually called a response. A policy π ∈ Δ_Y^X associates to each prompt x a discrete probability distribution π(·|x) ∈ Δ_Y, where Δ_Y is the set of discrete distributions over Y.
2.1. Problem Formulation
The goal is to learn a policy π from training data, with a key consideration that the policy should align with human preferences. Existing work (Zhao et al., 2023; Rafailov et al., 2023) would start the discussion of learning from pairwise preference data here. However, we note that preference modeling can be treated as a more general listwise ranking problem: eventually the policy π is expected to virtually rank a list of discrete actions from Δ_Y, and learning from listwise preference data may be more effective for LM alignment. In fact, as discussed above, human preference data can come as a ranked list to amortize the cost of reading the prompt (Köpf et al., 2023; Ouyang et al., 2022). Thus, we describe a more general Listwise Preference Optimization (LiPO) framework.
Figure 2. An illustration of the Listwise Preference Optimization (LiPO) pipeline. For each prompt, LiPO samples a list of responses from a certain policy or a mix of policies. Then human raters or a reward-ranking model assign each response a ranking label. After that, LiPO optimizes the policy via a ranking loss to align with the preferences.

The training dataset is D = {x^(i), y^(i), ψ^(i)}_{i=1}^{N}: given a prompt x, we have a list of responses y = (y_1, ..., y_K) of
size K, which can be generated from the SFT policy or other sources. When K = 2, this reduces to pairwise data. We assume there are real-valued labels ψ = (ψ_1, ..., ψ_K) ∈ [0, 1]^K associated with the corresponding responses, which may come from human raters or be derived from a reward model. A higher label value indicates a better response. We use y_i ≻ y_j to denote a preference between two responses when ψ_i > ψ_j. In our work, we compute ψ_k as the average winning probability of y_k over all responses:

ψ_k = (1/K) Σ_{i=1}^{K} P(y_k ≻ y_i).    (1)

For human-ranked responses, we approximate the winning probability as P̂(y_i ≻ y_j) = 1(human prefers y_i over y_j). For a pairwise reward-ranking model (Zhao et al., 2023; Liu et al., 2023), we use the predicted winning probabilities directly. This aggregation and labeling schema is valid theoretically (Shah & Wainwright, 2018) and works well empirically (Qin et al., 2023) in the ranking literature.
Given the listwise preference data, LM alignment can be formulated as a Learning-to-Rank (LTR) problem. In LTR (Liu, 2009), the goal is to learn a ranking model π_θ that can output the relevance scores s for all documents given a query. In LM alignment, we treat x as the query and y as documents in the LTR setting. We then define the scores s = {s_1, ..., s_K}, where s_i is the following normalized score for (x, y_i), inspired by Rafailov et al. (2023):

s(π_θ, π_ref, β) = {s_1, ..., s_K} ≜ {β log [π_θ(y_1|x) / π_ref(y_1|x)], ..., β log [π_θ(y_K|x) / π_ref(y_K|x)]},    (2)

where π_θ is the policy parameterized by θ that is to be learned, π_ref is the SFT policy that initializes π_θ, and β is a hyperparameter that controls the KL divergence between π_θ and π_ref. In DPO (Rafailov et al., 2023), these scores are treated as the "implicit reward", and the goal of DPO is to align the "implicit reward" with human preference data directly. To simplify the notation, we omit the dependency of s upon (π_θ, π_ref, β) from now on.
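As a concrete illustration of Eq. (2), the normalized scores can be computed from per-response sequence log-probabilities under the policy and the reference (SFT) model. The following is a minimal sketch in JAX; the log-probability inputs are assumed to be precomputed elsewhere, and this is not the authors' implementation.

import jax.numpy as jnp

def normalized_scores(policy_logps, ref_logps, beta=0.05):
    """s_i = beta * (log pi_theta(y_i|x) - log pi_ref(y_i|x)) for a list of K responses.

    policy_logps, ref_logps: [K] arrays of summed token log-probabilities of each
    response under the policy and the reference (SFT) model, respectively.
    """
    return beta * (policy_logps - ref_logps)

# Example with K = 4 responses (made-up log-probabilities).
policy_logps = jnp.array([-12.3, -15.1, -11.8, -14.0])
ref_logps = jnp.array([-13.0, -14.9, -12.5, -14.2])
s = normalized_scores(policy_logps, ref_logps, beta=0.05)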
LTR algorithms learn π_θ using loss functions as their objectives. A ranking loss function is in general defined based on the labels ψ of the responses y and the predicted scores s:

L(π_θ; π_ref, β) = E_(x,y,ψ)∼D [ l(ψ, s) ].    (3)

Here l is the loss function for a single prompt x that takes the labels and scores as input and outputs a real value as the loss:

l : (ψ, s) → R.    (4)

A Learning-to-Rank algorithm finds the optimal π_θ that minimizes the overall loss in the space of ranking models. We call Eq. (3) the Listwise Preference Optimization (LiPO) framework in the context of LM alignment.
2.2. Ranking Losses in Existing Work
With the definition of s in Eq. (2), we show that several popular LM alignment methods can be mapped into the LiPO framework using different ranking losses from the LTR literature. In the DPO paper, two loss functions are proposed under two ranking frameworks: the Bradley-Terry (BT) model (Bradley & Terry, 1952) and the Plackett-Luce (PL) model (Plackett, 1975). We refer to the BT model-based pairwise loss as the DPO-BT loss, and to the PL model-based listwise loss as the DPO-PL loss. In the SLiC paper, the loss consists of two terms: a calibration loss and a regularization loss. RSO (Liu et al., 2023) improves the calibration loss by normalizing the likelihood with the reference policy. We use SLiC-norm to denote this improved calibration loss.
Pairwise preference losses. The pairwise logistic ranking loss (Burges et al., 2005) is one popular choice to fit a list of ranked data:

L_pair-logistic(π_θ; π_ref, β) = E_(x,y,ψ)∼D [ Σ_{ψ_i > ψ_j} log(1 + e^{−(s_i − s_j)}) ].    (5)

We connect the above loss with DPO-BT via the following proposition:

Proposition 2.1. When K = 2 and the pairwise logistic ranking loss is used, LiPO is equivalent to DPO-BT (Rafailov et al., 2023).
Proof. When K = 2, there are only two pairs: (y_1, y_2) and (y_2, y_1). We use y_w to denote the winning response and y_l to denote the losing response. Then Eq. (5) reduces to

E_(x,y_w,y_l)∼D [ log(1 + e^{−(s_w − s_l)}) ] = −E_(x,y_w,y_l)∼D [ log σ(s_w − s_l) ],    (6)

which is the same as Eq. (7) in Rafailov et al. (2023) if we substitute s_i with β log [π_θ(y_i|x) / π_ref(y_i|x)].
Similarly, we can connect the normalized SLiC loss (the hinge-norm loss in Liu et al. (2023)) with the pairwise hinge loss from RankSVM (Joachims, 2002):

L_pair-hinge(π_θ; π_ref, β) = E_(x,y,ψ)∼D [ Σ_{ψ_i > ψ_j} max(0, 1 − (s_i − s_j)) ].    (7)
Proposition 2.2. When K = 2 and the pairwise hinge ranking loss is used, LiPO is equivalent to SLiC-norm (Zhao et al., 2023; Liu et al., 2023).
Proof. When K = 2, there are only two pairs: (y_1, y_2) and (y_2, y_1). We use y_w to denote the winning response and y_l to denote the losing response. Then Eq. (7) reduces to

E_(x,y_w,y_l)∼D [ max(0, 1 − (s_w − s_l)) ],    (8)

which is the same as Eq. (10) in Liu et al. (2023) if we substitute s_i with β log [π_θ(y_i|x) / π_ref(y_i|x)] and set β = γ.
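For intuition, the two pairwise losses above can be written in a few lines of JAX over a score vector s and labels ψ, enumerating all pairs with ψ_i > ψ_j. This is only a sketch of Eq. (5) and Eq. (7); the experiments use the Rax library (see Appendix C).

import jax.numpy as jnp
from jax import nn

def all_pair_losses(s, psi):
    """Pairwise logistic (Eq. 5) and hinge (Eq. 7) losses summed over all pairs with psi_i > psi_j.

    s: [K] normalized scores from Eq. (2); psi: [K] labels in [0, 1].
    """
    diff = s[:, None] - s[None, :]            # diff[i, j] = s_i - s_j
    preferred = psi[:, None] > psi[None, :]   # pairs where y_i is labeled better than y_j
    logistic = jnp.where(preferred, nn.softplus(-diff), 0.0).sum()  # log(1 + e^{-(s_i - s_j)})
    hinge = jnp.where(preferred, jnp.maximum(0.0, 1.0 - diff), 0.0).sum()
    return logistic, hinge

s = jnp.array([0.3, -0.1, 0.2])
psi = jnp.array([0.9, 0.2, 0.5])
logistic_loss, hinge_loss = all_pair_losses(s, psi)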
Listwise preference losses. One can fit all pairs using the pairwise-logistic or pairwise-hinge losses. Another way is to directly fit a Maximum Likelihood Estimation (MLE) on the listwise ranked data. Xia et al. (2008) propose the list MLE ranking loss:

L_list-mle(π_θ; π_ref, β) = −E_(x, y_1, ..., y_K)∼D [ log Π_{k=1}^{K} exp(s_{τ(k)}) / Σ_{j=k}^{K} exp(s_{τ(j)}) ],    (9)

where τ(i) is the document ranked at the i-th position in the listwise permutation determined by the labels.

Proposition 2.3. When the list MLE loss is used, LiPO is equivalent to DPO-PL (Rafailov et al., 2023).

Proof. The list MLE loss is identical to the one in Eq. (20) in Rafailov et al. (2023) if we substitute s_i with β log [π_θ(y_i|x) / π_ref(y_i|x)].
2.3. Limitations of Existing Work
From the general LiPO formulation and the analysis of
recent policy optimization methods, we can see they map
to specific choices of existing ranking objectives. Through
the lens of LTR, we note there are two major concerns of
the discussed methods that may limit the effectiveness of
preference optimization.
First, all pairwise approaches, which dominate the current
preference optimization literature, ignore listwise permuta-
tion information beyond pairs. Considering all candidates
under the same prompt in a principled manner may allow
the policy to learn more effectively.
Second, virtually all existing methods, even the listwise one (DPO-PL), ignore the label values, i.e., they only focus on the optimal pairwise or listwise rank-ordering of the responses. This loses information and may incur confusing learning behavior. For example, two lists with labels (0.99, 0.50, 0.01) and (0.51, 0.50, 0.49) will be treated the same, leading to inefficiency during training and potentially hurting generalization.
3. LiPO-λ
Under the general LiPO framework, a rich family of other optimization objectives can be explored from the LTR literature. In this section, we propose a specific instantiation, LiPO-λ, which builds upon a state-of-the-art ranking objective, addresses the above limitations in a principled manner, and performs well empirically in our experiments.
LiPO-λ is based on the LambdaLoss method (Burges et al., 2006; Burges, 2010; Wang et al., 2018).
Specifically, the training objective of LiPO-λ is:

L_lambda-loss(π_θ; π_ref, β) = E_(x,y,ψ)∼D [ Σ_{ψ_i > ψ_j} Δ_{i,j} log(1 + e^{−(s_i − s_j)}) ],    (10)

where Δ_{i,j} = |G_i − G_j| · |1/D(τ(i)) − 1/D(τ(j))|.
Δ_{i,j} is called the Lambda weight. G is a gain function, with G_i = 2^{ψ_i} − 1 as the commonly used one. D is a rank discount function, with D(τ(i)) = log(1 + τ(i)) as the commonly used one, where τ(i) is the rank position of y_i in the ranking permutation induced by s. Thus it is a listwise method even though the formula can be written in terms of pairs; in other words, for each pair there are dependencies on the other items in the same list. One has the flexibility to change the gain and discount functions, but we use the original LambdaLoss configuration here for simplicity.
It has been shown that this loss function can optimize the well-founded DCG metric (Burges et al., 2006; Wang et al., 2018),

DCG = Σ_{i=1}^{K} G_i / D(τ(i)),

which has several ideal properties as a ranking metric, such as consistent distinguishability (Wang et al., 2013): for every pair of substantially different ranking policies, the ranking metric can decide which one is better in a consistent manner on almost all datasets.
Comparisons. There are several interesting findings when comparing LiPO-λ with other methods under the LiPO framework. First, the gain function G considers the label values ψ, which are ignored in virtually all existing methods. Second, compared with the list MLE loss, the permutation considered is induced by the model prediction scores, instead of the static labels. Burges et al. (2006) showed that considering such dynamic permutations based on model predictions during training can lead to smoother optimization landscapes for the non-smooth ranking objectives, resulting in better empirical performance than using the static ranks from labels. Last but not least, LiPO-λ can be treated as a weighted version of DPO-BT over all pairs in the ranked list, by comparison with Eq. (5). Instead of treating each pair equally, the Lambda weight is a listwise, permutation-aware weighting mechanism. One intuition for the Lambda weight is to weight each pair by the difference in the ranking metric when the two responses are swapped in the list (Burges, 2010).
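To make the Lambda weighting concrete, the sketch below forms Δ_{i,j} from the gain and discount functions above, with rank positions τ induced by the current scores s, and plugs it into the pairwise logistic loss. This is a simplified illustration of Eq. (10), not the exact Rax implementation used in the experiments.

import jax.numpy as jnp
from jax import nn

def lipo_lambda_loss(s, psi):
    """Lambda-weighted pairwise logistic loss (Eq. 10) with the DCG gain and discount.

    s: [K] normalized scores from Eq. (2); psi: [K] labels in [0, 1].
    """
    # Rank positions tau(i) in {1, ..., K}, induced by the current model scores (descending).
    ranks = jnp.argsort(jnp.argsort(-s)) + 1
    gain = 2.0 ** psi - 1.0                    # G_i = 2^{psi_i} - 1
    inv_discount = 1.0 / jnp.log(1.0 + ranks)  # 1 / D(tau(i)) with D(tau) = log(1 + tau)
    # Lambda weight Delta_{i,j} = |G_i - G_j| * |1/D(tau(i)) - 1/D(tau(j))|.
    delta = jnp.abs(gain[:, None] - gain[None, :]) * jnp.abs(
        inv_discount[:, None] - inv_discount[None, :])
    diff = s[:, None] - s[None, :]
    preferred = psi[:, None] > psi[None, :]    # only pairs with psi_i > psi_j contribute
    pair_loss = nn.softplus(-diff)             # log(1 + e^{-(s_i - s_j)})
    return jnp.where(preferred, delta * pair_loss, 0.0).sum()

s = jnp.array([0.3, -0.1, 0.2, 0.05])
psi = jnp.array([0.9, 0.2, 0.5, 0.4])
loss = lipo_lambda_loss(s, psi)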
4. Other Ranking Losses
Other existing or future ranking objectives may be studied under the LiPO framework. In our experiments, in addition to existing methods that can be treated as specific instantiations of LiPO, we also study two pointwise ranking objectives: the pointwise Mean Squared Error (MSE) loss,

L_point-mse(π_θ; π_ref, β) = E_(x,y,ψ)∼D [ Σ_{k=1}^{K} (ψ_k − s_k)^2 ],    (11)

and the pointwise sigmoid cross entropy loss,

L_point-sigmoid(π_θ; π_ref, β) = −E_(x,y,ψ)∼D [ Σ_{i=1}^{K} ( ψ_i log σ(s_i) + (1 − ψ_i) log(1 − σ(s_i)) ) ].    (12)
We also consider another popular listwise ranking objective, the softmax cross entropy loss as in ListNet (Cao et al., 2007):

L_softmax(π_θ; π_ref, β) = −E_(x,y,ψ)∼D [ Σ_{k=1}^{K} ( ψ_k / Σ_{j=1}^{K} ψ_j ) log ( exp(s_k) / Σ_{j=1}^{K} exp(s_j) ) ].    (13)
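A minimal sketch of the softmax cross entropy loss in Eq. (13), with the labels normalized into a target distribution; illustrative only.

import jax.numpy as jnp
from jax import nn

def softmax_ce_loss(s, psi):
    """Softmax cross entropy loss (Eq. 13): cross entropy between the label
    distribution psi / sum(psi) and the score distribution softmax(s)."""
    target = psi / jnp.sum(psi)
    log_probs = nn.log_softmax(s)
    return -jnp.sum(target * log_probs)

s = jnp.array([0.3, -0.1, 0.2])
psi = jnp.array([0.9, 0.2, 0.5])
loss = softmax_ce_loss(s, psi)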
5. Experiments
Tasks. We study different ranking losses unified under the LiPO framework on the Reddit TL;DR summarization (Stiennon et al., 2020) and AnthropicHH dialogue (Bai et al., 2022) datasets. The Reddit TL;DR summarization dataset contains both fine-tune data D_sft^tldr and human feedback data D_hf^tldr. D_sft^tldr contains 117k/6k/6k examples in the train, validation, and test splits. D_hf^tldr consists of 93k human preferences on decodes from multiple models. AnthropicHH is a dialogue dataset with x as a conversation between a human query and an AI assistant. We use the helpful slice D_hf^helpful, which has 161k/9k examples in the train and test splits. We use the positive responses as SFT targets.
Method. Starting from a T5-large (770M) SFT policy and a T5-XXL (11B) pairwise reward-ranking model (Zhao et al., 2023; Liu et al., 2023), we first construct response lists by sampling K = 8 responses for each prompt (with more ablations on list size in Section 5.2) using the SFT policy with temperature = 0.7 and top-k = 40. We then conduct all pair comparisons using the pairwise reward-ranking model, resulting in a winning probability matrix Ψ ∈ [0, 1]^{K×K}. We then compute ψ_k as ψ_k = (1/K) Σ_{i=1}^{K} Ψ_{ki}. To compute the normalized ranking scores s, we set β = 0.05.
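For illustration, computing the listwise labels ψ from the K×K winning probability matrix Ψ is a row-wise mean; a minimal sketch with made-up values:

import jax.numpy as jnp

# Hypothetical winning-probability matrix for K = 4 responses:
# Psi[k, i] = P(y_k beats y_i) as predicted by the pairwise reward-ranking model.
Psi = jnp.array([
    [0.5, 0.8, 0.6, 0.9],
    [0.2, 0.5, 0.3, 0.7],
    [0.4, 0.7, 0.5, 0.8],
    [0.1, 0.3, 0.2, 0.5],
])

# psi_k = (1/K) * sum_i Psi[k, i], i.e., Eq. (1) instantiated with model probabilities.
psi = Psi.mean(axis=1)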
We consider three types of loss functions under the LiPO framework: pointwise, pairwise, and listwise. Pointwise losses include point-mse (Eq. 11) and point-sigmoid (Eq. 12). Pairwise losses include pair-hinge (Eq. 7) and pair-logistic (Eq. 5). Listwise losses include list-mle (Eq. 9), softmax (Eq. 13), and lambda-loss (Eq. 10). We use the Rax library (Jagerman et al., 2022b) to compute the losses in JAX. See Appendix C for example usages.
For pairwise losses, we utilize all 64 pairs, and our ablations show that this improves performance over sampling a single pair each time, as commonly done in existing work, so this setting is not only fair but also benefits the pairwise methods. We use batch size 32 and learning rate 2e-5 with the Adafactor optimizer (Shazeer & Stern, 2018). For each run, we pick the checkpoint with the highest reward-ranking model win rate against the SFT target.
Evaluation. Our experiments use three different approaches to evaluate, following existing protocols (Rafailov et al., 2023; Liu et al., 2023): Proxy Reward Model, AutoSxS, and human evaluation. Proxy Reward Model computes the win rate of the generated response against the SFT target under the trained T5-XXL pairwise reward-ranking model. AutoSxS uses the large PaLM 2 instruction-tuned model (PaLM 2-L-IT) (Google et al., 2023) for few-shot in-context learning (details in Appendix A). Human Evaluation asks human raters to assign a quality score to each response and determine the best one among DPO-BT, DPO-PL, and LiPO-λ (details in Appendix B).
5.1. Performance Comparison on the Two Tasks
The main comparison results are shown in Table 1. We have the following observations. First, pointwise methods are not competitive, as expected, indicating that considering only the pointwise label values is not sufficient, and preference information is critical. Second, DPO-PL does not perform better than DPO-BT, showing that the choice of listwise ranking objective is important given listwise data. This aligns with the existing LTR literature, where list-mle is not a competitive ranking objective, as it enforces a listwise permutation without caring about label values. On the other hand, while DPO-BT does not consider label values either, its pairwise format can be less prone to ill-behaved listwise behaviors. Third, the listwise Softmax loss is not competitive. This is understandable, as the Softmax loss is most effective for optimizing listwise ranking with sparse labels, such as binary click data (Yu et al., 2015; Bai et al., 2023). For LM generation, the responses are sampled from a plausible set, so the labels are dense, which does not fit the Softmax loss well. Meanwhile, LiPO-λ shows strong performance and improves upon all baselines by effectively leveraging listwise data and label value information.
5.2. Ablation Studies and Analysis
To gain an in-depth understanding of the benefits brought by the listwise formulation and methods, we conduct ablation studies on the effects of listwise data, Lambda weight choices, and model sizes.
Figure 3.
Performance of different ranking losses with varying list
sizes on the Reddit TL;DR dataset. The shaded areas are 95%
confidence intervals.
Ablation study on list size. To better understand the effect of preference optimization on listwise data, we conduct analysis over multiple choices of list size on the Reddit TL;DR dataset. As illustrated in Figure 3, most methods can benefit from going beyond pairwise training data (List Size = 2) by leveraging more information under each prompt x. LiPO-λ with the lambda-loss is the only method that robustly benefits from larger list sizes, showing that it can more effectively leverage the rich listwise information.
Ablation study on Lambda weights. As discussed in Section 3, we use a specific choice of Lambda weights by setting the gain function to G_i = 2^{ψ_i} − 1 and the rank discount function to D(τ(i)) = log(1 + τ(i)), called the DCG weight, which can optimize the DCG ranking metric. In this ablation study, we try other options that may not have a clear connection to ranking metrics. Constant Weight assigns equal weights to all pairs, which reduces to DPO-BT. Constant δGain sets the gain function difference to a constant: |G_i − G_j| = 1, and Constant δDiscount sets the rank discount function difference to a constant: |1/D(τ(i)) − 1/D(τ(j))| = 1. The comparisons are shown in Figure 4. We can see that using the DCG weight is the most effective on both datasets, showing the importance of setting both functions appropriately in order to optimize well-founded ranking metrics.
Scale up the policy model. To understand how well LiPO scales up to larger policy models, we train a T5-XXL policy model and compare DPO-BT, DPO-PL, and LiPO-λ. Table 2 shows that all methods scale up well, and LiPO-λ is competitive on both tasks.
Approach     Ranking Loss    Proxy Reward (%)    AutoSxS (%)

Reddit TL;DR
–            point-mse       49.43 ± 1.18        39.94 ± 1.72
–            point-sigmoid   64.14 ± 1.16        49.28 ± 1.80
–            softmax         75.40 ± 0.98        58.60 ± 1.72
SLiC-norm    pair-hinge      87.23 ± 0.78        67.16 ± 1.62
DPO-BT       pair-logistic   88.52 ± 0.74        67.09 ± 1.65
DPO-PL       list-mle        88.27 ± 0.76        67.23 ± 1.60
LiPO-λ       lambda-loss     90.60 ± 0.65        68.06 ± 1.58

AnthropicHH
–            point-mse       57.55 ± 1.22        21.97 ± 1.47
–            point-sigmoid   71.35 ± 1.11        25.72 ± 1.50
–            softmax         73.21 ± 1.07        28.87 ± 1.65
SLiC-norm    pair-hinge      89.68 ± 0.72        42.07 ± 1.75
DPO-BT       pair-logistic   91.11 ± 0.66        44.80 ± 1.73
DPO-PL       list-mle        90.61 ± 0.72        43.25 ± 1.77
LiPO-λ       lambda-loss     92.60 ± 0.62        47.90 ± 1.70

Table 1. Comparison of different methods with the T5-large policy model to leverage listwise preference data. Proxy reward and few-shot PaLM 2-L-IT win rates against the SFT target text are reported. All methods use a preference list of size 8, and pairwise methods, including SLiC-norm and DPO-BT, use all pairs generated from the list and treat them equally.
Figure 4.
Performance on the Reddit TL;DR and AnthropicHH
datasets by using different Lambda weight choices. See text for
explanation of different options. The error bars denote 95% confi-
dence intervals.
Approach    Proxy Reward (%)    AutoSxS (%)

Reddit TL;DR
DPO-BT      96.22 ± 0.43        82.38 ± 1.20
DPO-PL      96.00 ± 0.45        81.96 ± 1.30
LiPO-λ      97.32 ± 0.36        83.79 ± 1.25

AnthropicHH
DPO-BT      97.48 ± 0.34        68.81 ± 1.63
DPO-PL      97.28 ± 0.37        68.84 ± 1.58
LiPO-λ      98.27 ± 0.29        69.81 ± 1.58

Table 2. Comparison of DPO-BT, DPO-PL, and LiPO-λ with the T5-XXL policy model to leverage listwise preference data. Proxy reward and few-shot PaLM 2-L-IT win rates against the SFT target text are reported. All methods use a preference list of size 8.
5.3. Human Evaluation Results
To further verify the improvements of LiPO-λ, we conduct
side-by-side human evaluation using Amazon Mechanical
Turk. Given a document and three responses generated
fromDPOBT ,DPOPL and LiPO-λ, raters are asked to assign
a pointwise overall quality (1-5) to each response, and to
choose the best one. Each task is replicated 3 times and
therefore judged by 3 different raters. To eliminate bias,
we anonymize all the models and randomly shuffle order of
responses for each task. We aggregate pointwise metrics by
averaging the ratings across all replicas, and we aggregate
the choice metric using majority vote. For more details
about the tasks, see Appendix.
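The aggregation of the human judgments described above is straightforward; a sketch in plain Python (the rating values and data structure are illustrative, not the authors' tooling):

from collections import Counter
from statistics import mean

# Hypothetical judgments from 3 raters for one task: per-model 1-5 ratings plus a "best" choice.
replicas = [
    {"ratings": {"DPO-BT": 4, "DPO-PL": 3, "LiPO-lambda": 5}, "best": "LiPO-lambda"},
    {"ratings": {"DPO-BT": 3, "DPO-PL": 4, "LiPO-lambda": 4}, "best": "LiPO-lambda"},
    {"ratings": {"DPO-BT": 4, "DPO-PL": 4, "LiPO-lambda": 4}, "best": "DPO-BT"},
]

# Pointwise metric: average rating per model across the replicas.
avg_quality = {m: mean(r["ratings"][m] for r in replicas) for m in replicas[0]["ratings"]}

# Choice metric: majority vote over the "best" selections.
preferred, _ = Counter(r["best"] for r in replicas).most_common(1)[0]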
In total, 67 different raters participated in the Reddit TL;DR evaluation study, with a median of 18 tasks per rater. The human evaluation results are shown in Table 3. LiPO-λ is shown to be better than DPO-BT and DPO-PL on both tasks.
Approach    Chosen as Preferred¹    Quality

Reddit TL;DR
DPO-BT      19%                     3.63
DPO-PL      16%                     3.67
LiPO-λ      40%                     3.80

AnthropicHH
DPO-BT      20%                     3.66
DPO-PL      20%                     3.66
LiPO-λ      27%                     3.72

Table 3. Human evaluation comparing three approaches on two tasks. LiPO-λ is preferred more often than DPO-BT and DPO-PL.
6. Related Work
LM Alignment. While self-supervised LMs learn to complete some interesting tasks (Radford et al., 2019), their performance on downstream tasks, such as acting as a conversational agent, can be significantly improved by alignment with human preference datasets. The pivotal Reinforcement Learning from Human Feedback (RLHF) framework (Christiano et al., 2017) first fits a reward function under a preference model such as the Bradley-Terry model (Bradley & Terry, 1952), then fine-tunes the LM to maximize the given reward using reinforcement learning algorithms. However, fine-tuning LMs with reinforcement learning is challenging in practice, involving training multiple LMs and sampling from the LM policy in the loop of training, which incurs significant computational costs and requires extensive hyperparameter tuning. A stream of recent work resorts to alternatives to RL-based preference optimization. As we discussed, two parallel works, SLiC (Zhao et al., 2023) and RRHF (Yuan et al., 2023), directly use human preference data or use a reward model to label preferences, and both use a pairwise hinge loss to align policy responses. One difference is that RRHF considers listwise data to start with, but its pairwise objective handles each pair from the list independently, which can be treated as a pairwise objective on listwise data. DPO (Rafailov et al., 2023) proposes to directly align the behavior of the LM without using a reward model, with the pairwise logistic loss. Later work (Liu et al., 2023) shows that DPO has a distribution drift issue due to its lack of a reward model, as the maximum likelihood estimator (MLE) of the target optimal policy requires labeled preference pairs sampled from that policy. The DPO paper further proposes a variant that learns from listwise preference data using the list MLE loss, which is analogous to PRO (Song et al., 2023). In this work, we mainly focus on the optimization objective perspective and show that recent works converge to a few known ranking objectives in the LTR literature.
¹ The proportions may not sum to 100% because there are cases where the preference is the same across all approaches.
Learning-to-Rank. The Learning-to-Rank (LTR) field has a rich literature due to its practical value in applications such as web search (Liu, 2009) and recommender systems (Karatzoglou et al., 2013). Traditional LTR work mainly focuses on developing more effective ranking objectives to optimize ranking metrics, ranging from pointwise and pairwise to listwise approaches (Liu, 2009). RankSVM (Joachims, 2002) and RankNet (Burges et al., 2005) leverage the pairwise hinge loss and the pairwise logistic loss, respectively, for the ranking problem. Listwise ranking objectives gained popularity thereafter to directly optimize listwise ranking metrics. ListMLE and the Softmax cross entropy loss are two representative listwise losses, proposed in Xia et al. (2008) and Cao et al. (2007). ListMLE only concerns the ordering under the Plackett-Luce model (Luce, 2005), and the Softmax cross entropy loss is effective on lists with sparse labels, such as click logs (Yu et al., 2015). LambdaRank (Burges et al., 2006) shows that weighting pairs with the listwise Lambda weight leads to strong empirical performance in terms of optimizing the non-smooth DCG metric, and it is unified under the LambdaLoss framework (Wang et al., 2018) with theoretical justification and a convergence proof. Other lines of LTR research focus on exploring new model architectures (Qin et al., 2020), learning from biased feedback (Joachims et al., 2017), etc. In this work, we focus on ranking objectives and show that existing LM alignment methods can be mapped to specific choices of ranking objectives. Following this connection, we study multiple listwise objectives, which have not been well studied for LM alignment.
7. Conclusion and Future Work
We describe the Listwise Preference Optimization (LiPO) framework for LM alignment with a ranked list over multiple responses for each prompt. It connects LM alignment with Learning-to-Rank techniques. In this framework, we generalize recent preference optimization methods and analyze the limitations of existing methods from the Learning-to-Rank perspective. We then provide a comprehensive investigation of various ranking objectives for LM preference optimization, especially listwise objectives that are not well studied in the literature. We highlight a new method, LiPO-λ, which builds upon state-of-the-art ranking objectives and shows competitive performance across multiple evaluation tasks.
Our work opens up a few interesting directions for future work. (1) The effectiveness of LambdaLoss is demonstrated empirically in this paper; this can inspire more theoretical understanding of the method for LM alignment. (2) It is also interesting to study how to do online learning, where the list of responses is elicited from the policy being trained, to reduce the distribution shift.
References
Bai, A., Jagerman, R., Qin, Z., Yan, L., Kar, P., Lin, B.-R.,
Wang, X., Bendersky, M., and Najork, M. Regression
compatible listwise objectives for calibrated ranking with
binary relevance. In Proceedings of the 32nd ACM In-
ternational Conference on Information and Knowledge
Management, pp. 4502–4508, 2023.
Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., Das-
Sarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T.,
et al. Training a helpful and harmless assistant with rein-
forcement learning from human feedback. arXiv preprint
arXiv:2204.05862, 2022.
Bradley, R. A. and Terry, M. E. Rank analysis of incom-
plete block designs: I. the method of paired comparisons.
Biometrika, 39(3/4):324–345, 1952.
Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M.,
Hamilton, N., and Hullender, G. Learning to rank using
gradient descent. In Proceedings of the 22nd international
conference on Machine learning, pp. 89–96, 2005.
Burges, C., Ragno, R., and Le, Q. Learning to rank with
nonsmooth cost functions. In Schölkopf, B., Platt, J.,
and Hoffman, T. (eds.), Advances in Neural Information
Processing Systems, volume 19. MIT Press, 2006.
Burges, C. J. From RankNet to LambdaRank to LambdaMART:
An overview. Learning, 11(23-581):81, 2010.
Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., and Li, H. Learning
to rank: from pairwise approach to listwise approach.
In Proceedings of the 24th international conference on
Machine learning, pp. 129–136, 2007.
Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg,
S., and Amodei, D. Deep reinforcement learning from
human preferences. Advances in neural information pro-
cessing systems, 30, 2017.
Donmez, P., Svore, K. M., and Burges, C. J. On the local
optimality of LambdaRank. In Proceedings of the 32nd
International ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR ’09, pp.
460–467, New York, NY, USA, 2009. Association for
Computing Machinery. ISBN 9781605584836. doi: 10.1145/1571941.1572021. URL https://doi.org/10.1145/1571941.1572021.
Google, Anil, R., Dai, A. M., Firat, O., Johnson, M., Lep-
ikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P.,
Chen, Z., Chu, E., Clark, J. H., Shafey, L. E., Huang,
Y., Meier-Hellstern, K., Mishra, G., Moreira, E., Omer-
nick, M., Robinson, K., Ruder, S., Tay, Y., Xiao, K.,
Xu, Y., Zhang, Y., Abrego, G. H., Ahn, J., Austin, J.,
Barham, P., Botha, J., Bradbury, J., Brahma, S., Brooks,
K., Catasta, M., Cheng, Y., Cherry, C., Choquette-Choo,
C. A., Chowdhery, A., Crepy, C., Dave, S., Dehghani,
M., Dev, S., Devlin, J., Díaz, M., Du, N., Dyer, E.,
Feinberg, V., Feng, F., Fienber, V., Freitag, M., Garcia,
X., Gehrmann, S., Gonzalez, L., Gur-Ari, G., Hand, S.,
Hashemi, H., Hou, L., Howland, J., Hu, A., Hui, J., Hur-
witz, J., Isard, M., Ittycheriah, A., Jagielski, M., Jia, W.,
Kenealy, K., Krikun, M., Kudugunta, S., Lan, C., Lee,
K., Lee, B., Li, E., Li, M., Li, W., Li, Y., Li, J., Lim, H.,
Lin, H., Liu, Z., Liu, F., Maggioni, M., Mahendru, A.,
Maynez, J., Misra, V., Moussalem, M., Nado, Z., Nham,
J., Ni, E., Nystrom, A., Parrish, A., Pellat, M., Polacek,
M., Polozov, A., Pope, R., Qiao, S., Reif, E., Richter,
B., Riley, P., Ros, A. C., Roy, A., Saeta, B., Samuel, R.,
Shelby, R., Slone, A., Smilkov, D., So, D. R., Sohn, D.,
Tokumine, S., Valter, D., Vasudevan, V., Vodrahalli, K.,
Wang, X., Wang, P., Wang, Z., Wang, T., Wieting, J., Wu,
Y., Xu, K., Xu, Y., Xue, L., Yin, P., Yu, J., Zhang, Q.,
Zheng, S., Zheng, C., Zhou, W., Zhou, D., Petrov, S., and
Wu, Y. PaLM 2 technical report, 2023.
Jagerman, R., Qin, Z., Wang, X., Bendersky, M., and Najork,
M. On optimizing top-k metrics for neural ranking mod-
els. In Proceedings of the 45th International ACM SIGIR
Conference on Research and Development in Information
Retrieval, pp. 2303–2307, 2022a.
Jagerman, R., Wang, X., Zhuang, H., Qin, Z., Bendersky,
M., and Najork, M. Rax: Composable learning-to-rank
using JAX. In Proceedings of the 28th ACM SIGKDD
Conference on Knowledge Discovery and Data Mining,
pp. 3051–3060, 2022b.
Joachims, T. Optimizing search engines using clickthrough
data. In Proceedings of the eighth ACM SIGKDD inter-
national conference on Knowledge discovery and data
mining, pp. 133–142, 2002.
Joachims, T., Swaminathan, A., and Schnabel, T. Unbiased
learning-to-rank with biased feedback. In Proceedings of
the tenth ACM international conference on web search
and data mining, pp. 781–789, 2017.
Karatzoglou, A., Baltrunas, L., and Shi, Y. Learning to
rank for recommender systems. In Proceedings of the 7th
ACM Conference on Recommender Systems, pp. 493–494,
2013.
Köpf, A., Kilcher, Y., von Rütte, D., Anagnostidis, S.,
Tam, Z.-R., Stevens, K., Barhoum, A., Duc, N. M., Stan-
ley, O., Nagyfi, R., et al. OpenAssistant conversations –
democratizing large language model alignment. arXiv
preprint arXiv:2304.07327, 2023.
Lee, H., Phatale, S., Mansoor, H., Lu, K., Mesnard, T.,
Bishop, C., Carbune, V., and Rastogi, A. Rlaif: Scal-
ing reinforcement learning from human feedback with ai
feedback. arXiv preprint arXiv:2309.00267, 2023.
Liu, T., Zhao, Y., Joshi, R., Khalman, M., Saleh, M.,
Liu, P. J., and Liu, J. Statistical rejection sam-
pling improves preference optimization. arXiv preprint
arXiv:2309.06657, 2023.
Liu, T.-Y. Learning to rank for information retrieval. Found.
Trends Inf. Retr., 2009.
Luce, R. D. Individual choice behavior: A theoretical
analysis. Courier Corporation, 2005.
OpenAI. GPT-4 technical report, 2023.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.,
Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A.,
et al. Training language models to follow instructions
with human feedback. Advances in Neural Information
Processing Systems, 35:27730–27744, 2022.
Plackett, R. L. The analysis of permutations. Journal of the
Royal Statistical Society Series C: Applied Statistics, 24
(2):193–202, 1975.
Qin, Z., Yan, L., Zhuang, H., Tay, Y., Pasumarthi, R. K.,
Wang, X., Bendersky, M., and Najork, M. Are neural
rankers still outperformed by gradient boosted decision
trees? In International Conference on Learning Repre-
sentations, 2020.
Qin, Z., Jagerman, R., Hui, K., Zhuang, H., Wu, J., Shen,
J., Liu, T., Liu, J., Metzler, D., Wang, X., et al. Large
language models are effective text rankers with pairwise
ranking prompting. arXiv preprint arXiv:2306.17563,
2023.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D.,
Sutskever, I., et al. Language models are unsupervised
multitask learners. OpenAI blog, 1(8):9, 2019.
Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D.,
Ermon, S., and Finn, C. Direct preference optimization:
Your language model is secretly a reward model. In Thirty-
seventh Conference on Neural Information Processing
Systems, 2023.
Shah, N. B. and Wainwright, M. J. Simple, robust and
optimal ranking from pairwise comparisons. Journal of
machine learning research, 18(199):1–38, 2018.
Shazeer, N. and Stern, M. Adafactor: Adaptive learning
rates with sublinear memory cost. In Proceedings of the
35th International Conference on Machine Learning, pp.
4596–4604, 2018.
Shu, L., Luo, L., Hoskere, J., Zhu, Y., Liu, C., Tong, S.,
Chen, J., and Meng, L. Rewritelm: An instruction-tuned
large language model for text rewriting. arXiv preprint
arXiv:2305.15685, 2023.
Song, F., Yu, B., Li, M., Yu, H., Huang, F., Li, Y., and Wang,
H. Preference ranking optimization for human alignment.
arXiv preprint arXiv:2306.17492, 2023.
Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R.,
Voss, C., Radford, A., Amodei, D., and Christiano,
P. F. Learning to summarize with human feedback. Ad-
vances in Neural Information Processing Systems, 33:
3008–3021, 2020.
Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu,
J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et al.
Gemini: a family of highly capable multimodal models.
arXiv preprint arXiv:2312.11805, 2023.
Wang, X., Li, C., Golbandi, N., Bendersky, M., and Najork,
M. The lambdaloss framework for ranking metric opti-
mization. In Proceedings of the 27th ACM international
conference on information and knowledge management,
pp. 1313–1322, 2018.
Wang, Y., Wang, L., Li, Y., He, D., and Liu, T.-Y. A theoreti-
cal analysis of NDCG type ranking measures. In Conference
on learning theory, pp. 25–54. PMLR, 2013.
Xia, F., Liu, T.-Y., Wang, J., Zhang, W., and Li, H. List-
wise approach to learning to rank: theory and algorithm.
In Proceedings of the 25th international conference on
Machine learning, pp. 1192–1199, 2008.
Yu, J., Tao, D., Wang, M., and Rui, Y. Learning to rank
using user clicks and visual features for image retrieval.
IEEE Transactions on Cybernetics, 45(4):767–779, 2015.
doi: 10.1109/TCYB.2014.2336697.
Yuan, H., Yuan, Z., Tan, C., Wang, W., Huang, S., and
Huang, F. RRHF: Rank responses to align language mod-
els with human feedback. In Thirty-seventh Conference
on Neural Information Processing Systems, 2023.
Zhao, Y., Joshi, R., Liu, T., Khalman, M., Saleh, M., and
Liu, P. J. SLiC-HF: Sequence likelihood calibration with
human feedback. arXiv preprint arXiv:2305.10425, 2023.
A. AutoSxS Details
A.0.1. DETAILS
The purpose of AutoSxS is to prevent artificially high reward scores from the Reward Model due to reward hacking on learned policies. Since the policy is trained using the information in the pairwise reward-ranking model, a higher win rate on the reward-ranking model does not necessarily mean a better policy. AutoSxS uses PaLM 2-L-IT few-shot in-context learning to infer 8 decoded samples, with 4 having the order of responses A and B flipped. The label contains three choices: A, B, and tie, with scores 1, 0, and 0.5, respectively. To ensure robustness, we use the average score to determine the win or loss only if its magnitude exceeds 0.35. AutoSxS has been demonstrated to be effective and consistent in DPO using GPT-4 as a zero-shot rater. In this work, we replace GPT-4 with PaLM 2-L-IT for our evaluation using few-shot prompts. The quality of PaLM 2-L-IT on similar tasks has been shown to be close to that of human raters (Lee et al., 2023; Shu et al., 2023). A systematic study of the consistency and quality of AutoSxS is beyond the scope of this work.
A.0.2. REDDIT TL;DR FEW-SHOT PROMPTS
task: Judge the quality of two TLDRs, choose the options among (A), (B) or same.
context: I’ve (M[21]) been in a relationship for a year and a half with F[22] and it really has never gone well. I think we
want different things and we are not overly compatible. I broke up with her about a year ago and she tried to kill herself so
we got back together. This week I met an F[19] who I think I’m really compatible with. She and I talked for a few hours and
we have a lot in common. I like her a lot, but she is currently a freshman and I am currently a senior so I will be graduating
in May and going on to a prestigious PhD program starting next fall.
So here are my questions: * What should I do in regards to my current relationship? I know I need to end it, but I just don’t
know how. * What should I do in regards to the other girl? * Do you think my feelings for the other girl stem from my
distaste for my current relationship?
I appreciate any help you give me.
tldr (A): I’m unhappy in my current relationship with a girl I just met, but don’t know how to end it. I have no idea what
I’m doing or what to do.
tldr (B): M[21] unhappy in relationship with F[22]. Met an F[19] in town with similar interests and I really like her. What
should I do in regards to current relationship/other girl?
explanation: tldr (A)’s second and third sentences convey similar idea and are redundant. tldr (B) mentions an important
piece of information of the new girl, contains more details than tldr (A) and is concise at the same time.
choose among (A), (B) or same: (B)
context: Before anything, not a sad story or anything, I don’t think she’s cheating or anything of the sorts. My country’s
equivalent to Valentine’s Day is coming and I had this pretty simple idea to surprise my girlfriend and it would involve
giving her some roses. The thing is, although I know she would appreciate my intention in and of itself, I don’t know if she
would like the actual flowers and such, so I wanted to find out if she likes roses and if she would like getting some, but
without her realizing it so as not to spoil the surprise. Any ideas on how to get that information out of her?
tldr (A): How do I find out if my girlfriend likes roses without her realizing it?
tldr (B): I want to surprise my girlfriend with some flowers when Valentine’s Day is around the corner, but I don’t know if
she would like the flowers or flowers themselves without her knowing.
explanation: tldr (A) is a concise that captures the main idea. tldr (B) also captures the main point with more details, but the
language ’flowers or flowers themselves’ is not fluent.
choose among (A), (B) or same: (A)
context: Okay, so my younger brothers were out and about when they passed some teenagers who yelled obscenities at
them. My father then went over and told them to knock it off, when they started yelling obscenities at him. My dad, with a
small amount of temper, got angry and yelled at them. They started recording it and made a video on YouTube where it
looked like he was just screaming at them. After that, we were able to get it taken down only to have it reuploaded with
blurred faces. We have in no way given consent to be in this video. Is there any way we can get them to take it doen?
tldr (A): my dad got angry at teenagers for yelling obscenities at him, they got a video on youtube and blurred faces, what
can we do to get it taken down?
tldr (B): My brothers were being verbally harassed by kids, father yelled at them, they made a video of it to get the video
taken down, it was like a blur with blurred faces.
explanation: tldr (A) mentions most main points of story while skipping some details like younger brothers being yelled at
and original videos taken down. tldr (B) has a major factual error, they didn’t make a video to get the video taken down.
choose among (A), (B) or same: (A)
context: Apologize for the throw away account.
My friend is interested in in making his way into the mainstream music industry as an A&R representative. He has a 4 year
degree in English but did some extra curricular activities around music business in college. He also has had 2 internships out
of college with small time record labels in the Chicago area. He has been trying for a few years now with no success. He
lives in Chicago but has moved to LA for a year. He returned recently to Chicago. It has been about 3 years since his last
internship, but he has done some online collaboration stuff. Still far and few in between.
My question is what path have any of you taken to make your way into the record industry? Is networking the only important
thing? Is there a set of skills that are specifically looked for?
tldr (A): My friend wants to make his way into the music industry as an A&R representative. What does he need to know to
make it happen?
tldr (B): friend wants to break into the music industry as an A&R rep. what is the best way to do this?
explanation: tldr (A) and (B) are basically paraphrase of each other. With (A) slightly more verbose and (B) slightly more
concise.
choose among (A), (B) or same: same
context: I was at a concert this weekend supporting my favorite artist. Most of the amphitheater was standing, however
*some* people in my section were not (including some people behind me and to my right). All of a sudden at the end of
the set I got basically shoved in the back and told to sit down. I told them absolutely not, I want to support my favorite
artist (who kept saying STAND UP, btw). I paid over $100 for these seats and I want to have a great time.. after all this IS a
concert. Not to mention, I heard them calling me names before they confronted me so now I was certainly not going to sit
down. Then a few songs later the woman sitting next to me told me what a bitch I was for not letting the people behind me
see. Was I in the wrong here?
tldr (A): I stood up at a concert, people in front of me didn’t and confronted me. Was I in the wrong?
tldr (B): I stood up at a concert even though I was told not to. I got yelled at by a woman who was sitting next to me. Was I
in the wrong?
explanation: tldr (A) doesn’t say who confronted the author or wrongly implied it was the people in front. tldr (B) is correct
but could mention that more people confronted the author than the woman sitting next.
choose among (A), (B) or same: (B)
A.0.3. ANTHROPICHH FEW-SHOT PROMPTS
task: For the following query to a chatbot, which response is more helpful? Choose among (A), (B) and same.
context: Human: Why can’t humans build a space station on the moon? Assistant: That’s a really good question. We don’t
know yet if we’ll be able to build a space station on the moon, though it seems quite feasible. But we can also look at the
science for something more ambitious, like a spaceship that’s large enough to make a round trip to Mars. That might be hard
to do! Even with the technology we have now, it’s not clear to scientists how you’d build a spaceship that would travel for
months in space. Human: I think we should start on the Moon first.
response (A): Assistant: I think you’re right that the Moon is a good place to start, since it’s a natural satellite of Earth.
There are a lot of interesting places on the Moon to study, but we have to go to Mars to check out the atmosphere, the
conditions on the ground, and the survival potential of humans, and to see if there are living things.
response (B): Assistant: The problem is that we don’t know yet if it’s even possible to build a space station on the moon.
But let’s assume we could do it, and see how far we could get with a spaceship to Mars.
explanation: Response (A) offers a more expansive and engaged exploration of the user’s sentiment about starting with the
Moon, while Response (B) reiterates previous information without addressing the user’s perspective or adding new insights.
choose among (A), (B) and same: (A)
context: Human: What is garam masala?
Assistant: Garam masala is a mixture of spices, typically consisting of cumin, pepper, garlic, and either coriander, ginger, or
both. It’s frequently used in curries or vegetable dishes, but it’s also added to some vegetable or meat sides, or used as a
spread.
Human: What other things can I add to it to make it unique?
Assistant: Try adding some of the following ingredients, in any combination you like: turmeric cinnamon cardamom celery
seeds fennel fresh chilies, like jalapeños or Thai chiles grated ginger green onions basil mint
Human: Is there a sweet garam masala?
Assistant: Yes, some garam masasalas are sweetened. They might contain sugar or honey, or they might be just made with
ground cinnamon or cardamom. You might also want to mix some sort of mango or pineapple relish into your garam masala.
Human: What is a good use for sweet garam masala?
response (A): Assistant: You might use it to flavor vegetable or meat stews, or chilies, or as a spread on toast. You might
mix some into yogurt or fruit salad, or serve it with nuts and chutney on crackers.
response (B): Assistant: Sweet garam masala is typically used in desserts like ice cream or fudge, or in sweet curries like
risotto or köfte. You can also use it in curries to add a sweet flavor. You can also use it as a spice rub for meats, like chicken
or fish. You can also use it as a spice mix in a curry, or you can use it to make a sweet sauce for a curry.
explanation: Response (B) offers specific and diverse uses of sweet garam masala, allowing the user to get a clearer idea of
its potential applications in cooking.
choose among (A), (B) and same: (B)
context: Human: What is the customary amount to tip a taxi driver?
response (A): Assistant: The standard tip for taxi drivers in the United States is typically a 20% tip. This is a fairly standard
tip that is given to taxi drivers for a job well done, and it’s also a nice gesture to show your appreciation for their service. The
tip can be more or less depending on the driver’s service and the level of professionalism they showed. If you’re considering
tipping a taxi driver, I would recommend giving them a tip of at least 20%.
response (B): Assistant: A standard 10% tip is recommended.
explanation: Response (A) provides a clear recommendation, contextualizes it to a specific country (the U.S.), and offers
additional guidance based on the quality of service. It’s more comprehensive than response (B).
choose among (A), (B) and same: (A)
B. HumanSxS Details
For task examples, see Figure 5 and Figure 6.
C. Python code for loss computation
Algorithm 1 shows how to compute the different ranking losses in Python using the Rax library.
Figure 5. Example of a human evaluation task on the Reddit TL;DR dataset.
Algorithm 1 Computing different ranking losses in Python
import rax
import jax.numpy as jnp

# Example scores (model predictions) and graded labels for a single list of 3 items.
scores = jnp.array([[2.0, 1.0, 3.0]])
labels = jnp.array([[1.0, 0.0, 0.0]])
point_mse_loss = rax.pointwise_mse_loss(scores, labels)
point_sigmoid_loss = rax.pointwise_sigmoid_loss(scores, labels)
pair_hinge_loss = rax.pairwise_hinge_loss(scores, labels)
pair_logistic_loss = rax.pairwise_logistic_loss(scores, labels)
list_mle_loss = rax.listmle_loss(scores, labels)
softmax_loss = rax.softmax_loss(scores, labels)
lambda_loss = rax.pairwise_logistic_loss(
scores, labels, lambdaweight_fn=rax.dcg_lambdaweight
)
Figure 6. Example of a human evaluation task on the AnthropicHH dialogue dataset.