
Self-Play Preference Optimization for Language Model
Alignment
Yue Wu∗†   Zhiqing Sun∗‡   Huizhuo Yuan∗§   Kaixuan Ji   Yiming Yang   Quanquan Gu∗∗
Abstract
Traditional reinforcement learning from human feedback (RLHF) approaches relying on
parametric models like the Bradley-Terry model fall short in capturing the intransitivity and
irrationality in human preferences. Recent advancements suggest that directly working with
preference probabilities can yield a more accurate reflection of human preferences, enabling
more flexible and accurate language model alignment. In this paper, we propose a self-play-
based method for language model alignment, which treats the problem as a constant-sum
two-player game aimed at identifying the Nash equilibrium policy. Our approach, dubbed
Self-Play Preference Optimization (SPPO), approximates the Nash equilibrium through iterative
policy updates and enjoys a theoretical convergence guarantee. Our method can effectively
increase the log-likelihood of the chosen response and decrease that of the rejected response, which
cannot be trivially achieved by symmetric pairwise loss such as Direct Preference Optimization
(DPO) and Identity Preference Optimization (IPO). In our experiments, using only 60k prompts
(without responses) from the UltraFeedback dataset and without any prompt augmentation,
by leveraging a pre-trained preference model PairRM with only 0.4B parameters, SPPO can
obtain a model from fine-tuning Mistral-7B-Instruct-v0.2 that achieves the state-of-the-art length-
controlled win-rate of 28.53% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms
the (iterative) DPO and IPO on MT-Bench and the Open LLM Leaderboard. Starting from a
stronger base model Llama-3-8B-Instruct, we are able to achieve a length-controlled win rate
of 38.77%. Notably, the strong performance of SPPO is achieved without additional external
supervision (e.g., responses, preferences, etc.) from GPT-4 or other stronger language models.
Codes are available at https://github.com/uclaml/SPPO.

∗Equal contribution
†Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095; e-mail: ywu@cs.ucla.edu
‡Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213; e-mail: zhiqings@cs.cmu.edu
§Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095; e-mail: hzyuan@cs.ucla.edu
Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095; e-mail: kaixuanji@cs.ucla.edu
Language Technologies Institute & Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213; e-mail: yiming@cs.cmu.edu
∗∗Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095; e-mail: qgu@cs.ucla.edu

1 Introduction
Large Language Models (LLMs) have shown remarkable capabilities in producing human-like text, fielding questions, and coding. Despite their advancements, these models encounter challenges in tasks requiring high levels of reliability, safety, and ethical alignment. To address these challenges, Reinforcement Learning from Human Feedback (RLHF), also known as Preference-based Reinforcement Learning (PbRL), presents a promising solution. This framework for policy optimization, highlighted in the work of Christiano et al. (2017) and more recently Ouyang et al. (2022), has led to significant empirical success in fine-tuning instruction-following LLMs, making them more aligned with human preferences and thus more helpful.
Most existing approaches to RLHF rely on either explicit or implicit reward models. Taking InstructGPT (Ouyang et al., 2022) as an example, a reference policy πref is first established, typically from supervised pre-training or instruction-based (supervised) fine-tuning. An explicit reward function is obtained by training a reward model based on human preference feedback data, employing the Bradley-Terry (BT) model (Bradley and Terry, 1952). Subsequently, reinforcement learning algorithms such as Proximal Policy Optimization (Schulman et al., 2017, PPO) are used to fine-tune the reference LLM πref by maximizing the expected reward function. The reward model provides a "reward score" r(y; x) for the given response y and prompt x, approximately reflecting how humans value these responses. More recently, methods like Direct Preference Optimization (Rafailov et al., 2024, DPO) have been introduced. These methods forgo the training of a separate reward model, instead using the log-likelihood ratio to implicitly represent the reward score, which is then integrated into the same Bradley-Terry model to directly optimize the LLM. Nonetheless, both the two-step RLHF algorithms and the one-step direct preference optimization methods fundamentally adhere to the reward maximization objective and are determined by parametric models such as the BT model.
Parametric preference models such as the Bradley-Terry model (Bradley and Terry, 1952) and the Thurstone model (Thurstone, 1927) provide reasonable approximations of human preferences, yet they fall short of fully capturing the complexity of human behavior. These models presuppose a monotonic and transitive relationship among preferences for different choices. However, empirical evidence suggests otherwise. For instance, Tversky (1969) observed that human decisions can be influenced by different factors and exhibit inconsistency. Such observations indicate that human preferences do not always adhere to a single, value-based hierarchy and can even appear irrational, for example exhibiting loops in preference relations. For LLMs, another piece of motivating evidence is that Munos et al. (2023) empirically showed that directly predicting the pairwise preference can achieve higher accuracy than predicting the preference via a BT-based reward model.
To address the inconsistency in human preferences, researchers have proposed to work directly with the preference probability and to design algorithms that can more flexibly represent human preferences in the ranking or bandit setting (e.g., Lou et al.). Recently, an emerging line of work (e.g., Wang et al., 2023; Munos et al., 2023; Swamy et al., 2024) also proposed to study RLHF for LLMs under such a general preference P(y ≻ y′ | x), where y and y′ are two different responses and x is the prompt. The goal is to identify the Nash equilibrium or von Neumann winner of the two-player constant-sum game

(π∗, π∗) = arg max_π min_{π′} E_{x∼X} [ E_{y∼π(·|x), y′∼π′(·|x)} [ P(y ≻ y′ | x) ] ],
where each player is an LLM that outputs responses and aims to maximize its probability of being
preferred over its opponent. The preference P(y ≻ y′ | x) is assumed to be given by an external source, such as human annotators or a strong language model. Munos et al. (2023) studied the KL-regularized version of this game and proposed to approximate the Nash equilibrium using an on-policy mirror descent algorithm.
Very recently, Swamy et al. (2024) proposed Self-play Preference Optimization (SPO)¹ for the same (unregularized) two-player constant-sum game. They provide a general reduction of preference optimization to no-regret online learning for the multi-step Markov Decision Process. When constrained to the bandit setting for LLMs, their proposed algorithmic framework reduces to the famous Hedge algorithm (Freund and Schapire, 1997), which admits the exponential update rule described in (4.1). To approximately solve the exponential update, Swamy et al. (2024) then proposed to employ typical policy optimization algorithms such as Proximal Policy Optimization (PPO) (Schulman et al., 2017) or Soft Actor-Critic (SAC) (Haarnoja et al., 2018) to maximize the win rate against the reference policy, and evaluated the performance of their self-play algorithms on robotic and game tasks. However, it typically requires considerable effort to apply PPO or SAC to large-scale fine-tuning of LLMs and to make them work stably. Therefore, it remains unclear how their self-play framework can be applied to large-scale language model alignment efficiently.
In this paper, motivated by these developments mentioned above, we propose a new self-play
algorithm that (1) enjoys provable guarantees to solve the two-player constant-sum game; and (2)
can scale up to large-scale efficient fine-tuning of large language models. In detail, we formulate
the RLHF problem as a constant-sum two-player game. Our objective is to identify the Nash
equilibrium policy, which consistently provides preferred responses over any other policy on average.
To identify the Nash equilibrium policy approximately, we adopt the classic online adaptive algorithm with multiplicative weights (Freund and Schapire, 1999) as a high-level framework that solves the two-player game. Further, each step of the high-level framework can be approximated by a self-play mechanism, where in each round the policy plays against itself from the previous round by fine-tuning it on synthetic data that are generated by the policy and annotated by the preference model.
Our contributions are highlighted as follows:

• Starting from the exponential weight update algorithm, which provably converges to the Nash equilibrium of the two-player constant-sum game, we propose the Self-Play Preference Optimization (SPPO) algorithm for large language model alignment. The algorithm provably converges to an approximate Nash equilibrium and admits a simple form of loss function for easy optimization.

• We compare our method with state-of-the-art methods including DPO, Identity Preference Optimization (IPO) (Azar et al., 2023), and Kahneman-Tversky Optimization (KTO) (Ethayarajh et al., 2024), and show that our SPPO loss function can effectively increase the log-likelihood of the chosen response and decrease that of the rejected response, which cannot be trivially achieved by symmetric pairwise losses such as DPO and IPO. Our experiments also confirm that SPPO outperforms iterative DPO and IPO on various benchmarks.

• Empirically, SPPO significantly enhances the well-aligned Mistral-7B-Instruct-v0.2 and Llama-3-8B-Instruct models, achieving an increase of over 11% in the length-controlled win rate against GPT-4-Turbo on the AlpacaEval 2.0 (Dubois et al., 2024) test set. Additionally, SPPO exhibits strong generalist abilities across different tasks, including MT-Bench, the Open LLM Leaderboard, and the PairRM score (Jiang et al., 2023). Unlike iterative DPO/IPO, which tends to show performance decay on other benchmarks when optimized towards the PairRM score, SPPO's performance gain is consistent. Notably, all the strong performances are achieved without external supervision (e.g., responses, preferences, etc.) from GPT-4 or other stronger language models. Despite using only the 60k prompts (without responses) from the UltraFeedback dataset (Cui et al., 2023) and forgoing any prompt augmentation, our method achieves performance comparable to GPT-4 on the AlpacaEval 2.0 win rate.

¹The SPO framework does not pertain to the efficient fine-tuning of LLMs. Our Self-Play Preference Optimization (SPPO) focuses on LLM alignment and was developed independently. To distinguish it from the SPO framework, we use the abbreviation SPPO.
Concurrent to our work, several studies, including Direct Nash Optimization (Rosset et al., 2024) and REBEL (Gao et al., 2024), have also explored using either cross-entropy loss or square loss minimization to approximate the exponential update. Specifically, they used the same trick proposed in DPO (Rafailov et al., 2024) to cancel out the log-partition factor and directly regress on the win-rate difference. A detailed description and comparison with these works can be found in Section 2. However, it has been shown theoretically and empirically by Pal et al. (2024) that such a pairwise loss may only drive the relative likelihood gap to be large, but may not necessarily drive up the likelihood of the preferred responses. Our method instead deals with the likelihood of a single response and can effectively match the likelihood of the response to its win rate.
2 Related Work
RLHF with Explicit/Implicit Reward Model Originally, reinforcement learning from human feedback (RLHF) was proposed by Christiano et al. (2017) as a methodology that first learns a reward model reflecting human preferences and then uses reinforcement learning algorithms to maximize the reward. This methodology was applied by Ouyang et al. (2022) to fine-tune instruction-following large language models and led to the popular ChatGPT.
The reward model in the works mentioned above assumes a parametric model such as the Bradley-Terry model (Bradley and Terry, 1952), which assigns a "score" representing how preferred a given response is. More recently, Rafailov et al. (2024) proposed to instead directly solve the closed-form solution of such a score implied by the Bradley-Terry model. The resulting Direct Preference Optimization (DPO) method is claimed to be more efficient and stable, yet it still implicitly assumes such a reward model that specifies the "score". In a similar spirit, Zhao et al. (2023) proposed to calibrate the score so that the score of the winner in a comparison has a margin over the score of the loser, which induces a different loss called SLiC. Similarly, Ethayarajh et al. (2024) derived a different loss function (called KTO) from the Kahneman-Tversky human utility function, which implicitly denotes a score of the given response. Liu et al. (2023) proposed Rejection Sampling Optimization (RSO), which utilizes a preference model to generate preference pairs with candidates sampled from the optimal policy; then preference optimization is applied on the sampled preference pairs. Hong et al. (2024) proposed the Odds Ratio Preference Optimization (ORPO) algorithm, which can perform supervised fine-tuning and preference alignment in one training session without maintaining an intermediate reference policy.
RLHF with General Preference Model Often, the human preference is not strictly transitive and cannot be sufficiently represented by a single numerical score. Azar et al. (2023) proposed a general preference optimization objective based on the preference probability between a pair of responses instead of a score of a single response. They further propose a learning objective based on an identity mapping of the preference probability, called IPO (Preference Optimization with Identity mapping), which aims to maximize the current policy's expected winning probability over a given reference policy. Munos et al. (2023) formulated the RLHF problem with general preference as a two-player, constant-sum game, where each player is a policy that aims to maximize the probability of its response being preferred against its opponent. They aim to identify the Nash equilibrium policy of this game and propose a mirror-descent algorithm that guarantees the last-iterate convergence of a policy with tabular representations.² Swamy et al. (2024) proposed to identify the Nash equilibrium policy for multi-step MDPs when a general preference model is present and show that the problem can be reduced to a two-player zero-sum Markov game.

²Due to the tabular representation, computing the normalizing factor is prohibitive and the algorithm is approximately executed by sampling one token instead of a full response.
Theory of RLHF There is also a line of research to analyze RLHF and provide theoretical guarantees. Zhu et al. (2023) studied the standard RLHF with separate reward-learning and model-tuning and proposed a pessimistic reward-learning process that provably learns a linear reward model. Wang et al. (2023) proposed a framework to reduce any RLHF problem with a reward model to a reward-based standard RL problem. Additionally, they proposed to identify the Nash equilibrium policy when a general preference model is present and showed that the problem can be reduced to a two-player zero-sum Markov game. Xiong et al. (2023) studied the reverse-KL regularized contextual bandit for RLHF in different settings and proposed efficient algorithms with finite-sample theoretical guarantees. Ye et al. (2024) studied the theoretical learnability of the KL-regularized Nash-Learning from Human Feedback (NLHF) by considering both offline and online settings and proposed provably efficient algorithms. Ji et al. (2024) proposed an active-query-based proximal policy optimization algorithm with regret bounds and query complexity based on the problem dimension and the sub-optimality gap.
Self-Play Fine-Tuning Most works mentioned above (e.g., Rafailov et al., 2024; Azar et al., 2023) consider a single optimization procedure starting from some reference policy. The same procedure may be applied repeatedly for multiple rounds in a self-play manner. In each round, new data are generated by the policy obtained in the last round; these new data are then used to train a new policy that can outperform the old policy.
Self-play fine-tuning can be applied to scenarios both with and without human preference data. For example, Singh et al. (2023) proposed an Expectation-Maximization (EM) framework where, in each round, new data are generated and annotated with a reward score; the new policy is obtained by fine-tuning the policy on the data with a high reward. Chen et al. (2024) proposed a self-play framework to fine-tune the model in a supervised way. In each round, new preference pairs are synthesized by labeling the policy-generated responses as losers and the human-generated responses as winners. Then DPO is applied in each round to fine-tune another policy based on these synthesized preference data. Yuan et al. (2024) proposed Self-Rewarding Language Models, where the language model itself is used to annotate preferences on its own responses. Iterative DPO is applied to fine-tune language models on these annotated data. These works show that iterative fine-tuning can significantly improve performance.
Swamy et al. (2024) considered a more general multi-step Markov Decision Process (MDP) setting and proposed Self-play Preference Optimization (SPO), an RLHF framework that can utilize any no-regret online learning algorithm for preference-based policy optimization. They then instantiated their framework with Soft Policy Iteration as an idealized variant of their algorithm, which reduces to the exponential weight update rule (4.1) when constrained to the bandit setting. The main difference is that they focus on the multi-round Markov decision process (MDP) in robotic and game tasks rather than on fine-tuning large language models, and they approximate the update using policy optimization methods such as PPO.
Concurrent to our work, Rosset et al. (2024) proposed the Direct Nash Optimization (DNO) algorithm based on the cross-entropy between the true and predicted win-rate gaps, and provided theoretical guarantees on the error of finite-sample approximation. However, their practical version still utilizes the iterative-DPO framework with the DPO loss instead of their derived DNO loss. Notably, in their experiments, they added GPT-4-generated responses as "gold samples" into their fine-tuning data, and used GPT-4 as a judge to assign a numerical score to each response for preference pair construction. In sharp contrast, our work does not require any strong external supervision besides a small-sized reward model. Another concurrent work (Gao et al., 2024) proposed REBEL, an iterative self-play framework via regressing the relative reward. When applied to the preference setting, it results in an algorithm similar to our SPPO, except that SPPO approximates the log-partition factor log Z_{π_t}(x) with η/2 while REBEL regresses on the win-rate difference (so that log Z_{π_t}(x) is canceled). Additionally, Calandriello et al. (2024) pointed out that optimizing the IPO loss (Azar et al., 2023) iteratively with self-play generated data is equivalent to finding the Nash equilibrium of the two-player game, and they proposed the IPO-MD algorithm based on this observation, which generates data with a mixture policy similar to the Nash-MD algorithm.
3 Preliminaries
We consider the preference learning scenario as follows. Given a text sequence (commonly referred to as a prompt) x = [x_1, x_2, . . .], two text sequences y = [y_1, y_2, . . .] and y′ are generated as responses to the prompt x. An autoregressive language model π, given the prompt x, generates a response y following the probability decomposition

π(y|x) = ∏_{i=1}^{N} π(y_i | x, y_{<i}).

Given the prompt x and two responses y and y′, a preference oracle (either a human annotator or a language model) will provide preference feedback o(y ≻ y′ | x) ∈ {0, 1} indicating whether y is preferred over y′. We denote P(y ≻ y′ | x) = E[o(y ≻ y′ | x)] as the probability of y "winning the duel" over y′. The KL divergence of two probability distributions with densities p and q is defined as

KL(p‖q) = E_{y∼p(y)} [ log( p(y) / q(y) ) ].
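To make the notation concrete, here is a minimal PyTorch sketch (our own illustration, not taken from the paper or its codebase) of how the sequence probability π(y|x) decomposes into token-level log-probabilities and how the KL divergence can be estimated from samples; the toy numbers are made up.

```python
import torch

def sequence_log_prob(token_log_probs: torch.Tensor) -> torch.Tensor:
    """log pi(y|x) = sum_i log pi(y_i | x, y_<i).

    token_log_probs: (batch, seq_len) log-probabilities of the generated tokens
    under the policy (padding positions assumed already masked to 0).
    """
    return token_log_probs.sum(dim=-1)

def kl_estimate(log_p: torch.Tensor, log_q: torch.Tensor) -> torch.Tensor:
    """Monte Carlo estimate of KL(p || q) = E_{y~p}[log p(y) - log q(y)],
    given sequence log-probabilities of samples y ~ p under both models."""
    return (log_p - log_q).mean()

# Toy usage with made-up numbers (two responses, three tokens each).
tok_lp_p = torch.tensor([[-1.2, -0.3, -0.8], [-0.5, -0.9, -1.1]])
tok_lp_q = torch.tensor([[-1.0, -0.6, -1.2], [-0.7, -0.8, -1.5]])
print(kl_estimate(sequence_log_prob(tok_lp_p), sequence_log_prob(tok_lp_q)))
```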
3.1 RLHF with Reward Models
Christiano et al. (2017) first learn a reward function r(y; x) following the Bradley-Terry model (Bradley and Terry, 1952). For a prompt-response-response triplet (x, y, y′), the Bradley-Terry model specifies the probability of y being chosen over y′ as

P(y ≻ y′ | x) = exp( r(y; x) ) / ( exp( r(y; x) ) + exp( r(y′; x) ) ) = σ( r(y; x) − r(y′; x) ),   (3.1)

where σ(x) = e^x / (e^x + 1) is the logistic function. The reward function associated with the Bradley-Terry model can be estimated by maximizing the log-likelihood log P(y ≻ y′ | x). Supposing the true reward function r(y; x) is available, Christiano et al. (2017) proposed to solve the following optimization problem with policy optimization algorithms in RL such as PPO (Schulman et al., 2017):

max_θ E_{x∼X, y∼π_θ(·|x)} [ r(y; x) ] − η^{−1} E_{x∼X} [ KL( π_θ(·|x) ‖ π_ref(·|x) ) ],   (3.2)

where X is the prompt distribution.
Rafailov et al. (2024) identified that the optimization problem above has a closed-form solution such that for any y,

π∗(y|x) ∝ π_ref(y|x) exp( η r(y; x) ),

which can be further converted to the DPO loss for any triplet (x, y_w, y_l) where the winner y_w is chosen over the loser y_l:

ℓ_DPO(x, y_w, y_l; θ; π_ref) := −log σ( η^{−1} [ log( π_θ(y_w|x) / π_ref(y_w|x) ) − log( π_θ(y_l|x) / π_ref(y_l|x) ) ] ).
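As a concrete illustration of (3.1) and the DPO loss above, the following PyTorch sketch (ours, not the official DPO implementation) maps reward differences to Bradley-Terry probabilities and computes ℓ_DPO from sequence log-probabilities; `eta` corresponds to η in (3.2), so the implicit reward scale is η^{-1}.

```python
import torch
import torch.nn.functional as F

def bt_prob(r_y: torch.Tensor, r_yp: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry probability P(y > y') = sigmoid(r(y;x) - r(y';x)), eq. (3.1)."""
    return torch.sigmoid(r_y - r_yp)

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, eta: float = 1.0):
    """l_DPO = -log sigmoid( (1/eta) * [log-ratio(y_w) - log-ratio(y_l)] ).

    The implicit reward of a response is (1/eta) * log(pi_theta / pi_ref)."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(margin / eta)

# Toy usage: sequence log-probs under the trained policy and the reference.
logp_w, logp_l = torch.tensor(-12.0), torch.tensor(-15.0)
ref_w, ref_l = torch.tensor(-13.0), torch.tensor(-14.0)
print(bt_prob(torch.tensor(1.0), torch.tensor(-0.5)))
print(dpo_loss(logp_w, logp_l, ref_w, ref_l))
```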
3.2 RLHF with General Preference
Following Munos et al. (2023) and related work, we aim to establish RLHF methods without a reward model, as human preferences can be non-transitive (Tversky, 1969). Under a general preference oracle P(y ≻ y′ | x), we follow Dudík et al. (2015) and aim to identify the von Neumann winner. More specifically, the von Neumann winner π∗ is the (symmetric) Nash equilibrium of the following two-player constant-sum game:

(π∗, π∗) = arg max_π min_{π′} E_{x∼X} [ E_{y∼π(·|x), y′∼π′(·|x)} [ P(y ≻ y′ | x) ] ].   (3.3)
In addition, we define the winning probability of one response y against a distribution of responses π as

P(y ≻ π | x) = E_{y′∼π(·|x)} [ P(y ≻ y′ | x) ],

and the winning probability of one policy π against another policy π′ as

P(π ≻ π′ | x) = E_{y∼π(·|x)} E_{y′∼π′(·|x)} [ P(y ≻ y′ | x) ].

Furthermore, we define P(π ≻ π′) = E_{x∼X} [ P(π ≻ π′ | x) ], where x is a prompt drawn from the prompt distribution X. The two-player constant-sum game (3.3) can then be simplified as

(π∗, π∗) = arg max_π min_{π′} P(π ≻ π′).
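These win-rate definitions can be approximated by simple Monte Carlo sampling. The sketch below is purely illustrative: `policy_a`, `policy_b`, and `preference` are placeholder callables standing in for LLM sampling and the preference oracle, not interfaces from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def win_prob(policy_a, policy_b, preference, prompts, n_samples=1000):
    """Monte Carlo estimate of P(pi_a > pi_b) = E_x E_{y~pi_a, y'~pi_b} P(y > y' | x).

    policy_a / policy_b: callables mapping a prompt to a sampled response.
    preference: callable returning P(y > y' | x) in [0, 1].
    """
    total = 0.0
    for _ in range(n_samples):
        x = prompts[rng.integers(len(prompts))]
        y, y_prime = policy_a(x), policy_b(x)
        total += preference(y, y_prime, x)
    return total / n_samples

# Toy example: two policies over three canned responses for one prompt.
responses = ["short answer", "detailed answer", "off-topic answer"]
pi_a = lambda x: responses[rng.choice(3, p=[0.2, 0.7, 0.1])]
pi_b = lambda x: responses[rng.choice(3, p=[0.5, 0.3, 0.2])]
pref = lambda y, yp, x: 0.5 if y == yp else (
    0.9 if y == "detailed answer" else (0.1 if yp == "detailed answer" else 0.5))
print(win_prob(pi_a, pi_b, pref, prompts=["example prompt"]))
```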
4 Self-Play Preference Optimization (SPPO)
In this section, we introduce the Self-Play Preference Optimization (SPPO) algorithm, derived from the following theoretical framework.

4.1 Theoretical Framework
There are well-known algorithms to approximately solve the Nash equilibrium in a constant-sum two-player game. In this work, we follow Freund and Schapire (1999) to establish an iterative framework that can asymptotically converge to the optimal policy on average. We start with a theoretical framework that conceptually solves the two-player game as follows:

π_{t+1}(y|x) ∝ π_t(y|x) exp( η P(y ≻ π_t | x) ),   for t = 1, 2, . . . .   (4.1)

(4.1) is an iterative framework that relies on the multiplicative weight update in each round t and enjoys a clear structure. Initially, we have a base policy π_1, usually from some supervised fine-tuned model. In each round, the updated policy π_{t+1} is obtained from the reference policy π_t following the multiplicative weight update. More specifically, a response y should have a higher probability weight if it has a higher average advantage over the current policy π_t.
Equivalently, (4.1) can be written as

π_{t+1}(y|x) = π_t(y|x) exp( η P(y ≻ π_t | x) ) / Z_{π_t}(x),   (4.2)

where Z_{π_t}(x) = Σ_y π_t(y|x) exp( η P(y ≻ π_t | x) ) is the normalizing factor (a.k.a. the partition function). For any fixed x and y, the ideal updated policy π_{t+1} should satisfy the following equation:

log( π_{t+1}(y|x) / π_t(y|x) ) = η · P(y ≻ π_t | x) − log Z_{π_t}(x).   (4.3)
Unlike the pairwise design in DPO or IPO, which cancels the log normalizing factor log Z_{π_t}(x) by differentiating (4.3) between y and y′, we choose to approximate (4.3) directly in terms of L2 distance:

π_{t+1} = argmin_π E_{x∼X, y∼π_t(·|x)} [ ( log( π(y|x) / π_t(y|x) ) − ( η P(y ≻ π_t | x) − log Z_{π_t}(x) ) )^2 ].   (4.4)
Estimation of the Probability The optimization objective (4.4) can be approximated with finite samples. We choose to sample K responses y_1, y_2, . . . , y_K ∼ π_t(·|x) for each prompt x, and denote the empirical distribution by π̂_t^K. The finite-sample optimization problem can be approximated as

π_{t+1} = argmin_π E_{x∼X, y∼π_t(·|x)} [ ( log( π(y|x) / π_t(y|x) ) − ( η P(y ≻ π̂_t^K | x) − log Z_{π̂_t^K}(x) ) )^2 ].   (4.5)

Specifically, P(y ≻ π̂_t^K | x) = (1/K) Σ_{k=1}^{K} P(y ≻ y_k | x) and Z_{π̂_t^K}(x) = E_{y∼π_t(·|x)} [ exp( η P(y ≻ π̂_t^K | x) ) ]. Z_{π̂_t^K}(x), treated as an expectation, can be further estimated by B new samples, with in total O(KB) queries of the preference oracle P. (4.5) is an efficiently tractable optimization problem. Informally speaking, when K → ∞, (4.5) will recover (4.4).
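A small illustrative sketch of the finite-sample quantities in (4.5), assuming a K×K matrix of pairwise preferences P(y_i ≻ y_k | x) has already been obtained from the preference oracle (this is our own code, not the paper's implementation).

```python
import numpy as np

def estimate_win_rates(pref_matrix: np.ndarray) -> np.ndarray:
    """P(y_i > pi_hat_K | x) = (1/K) sum_k P(y_i > y_k | x), for each of K responses.

    pref_matrix[i, k] = P(y_i > y_k | x); diagonal entries are ties (0.5)."""
    return pref_matrix.mean(axis=1)

def estimate_log_partition(win_rates: np.ndarray, eta: float) -> float:
    """log Z_hat(x) ~= log E_{y~pi_t}[exp(eta * P(y > pi_hat_K | x))],
    estimated with the same K samples drawn from pi_t."""
    return float(np.log(np.mean(np.exp(eta * win_rates))))

# Toy example with K = 3 responses.
P = np.array([[0.5, 0.8, 0.6],
              [0.2, 0.5, 0.3],
              [0.4, 0.7, 0.5]])
w = estimate_win_rates(P)          # approx. [0.633, 0.333, 0.533]
print(w, estimate_log_partition(w, eta=1.0))
```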
We have the following guarantee on the convergence of (4.4):
Theorem 4.1. Assume the optimization problem (4.4) is realizable. Denote by π_t the policy obtained via (4.4) and let the mixture policy be π̄_T = (1/T) Σ_{t=1}^{T} π_t. By setting η = Θ(1/√T), we have that

max_π P(π ≻ π̄_T) − min_π P(π ≺ π̄_T) = O(1/√T).

Theorem 4.1 guarantees the convergence of the mixture policy π̄_T towards the Nash equilibrium, in terms of the duality gap. The proof is based on Theorem 1 in Freund and Schapire (1999) with slight modification. For completeness, we include the proof in Appendix A.
Alternatively, we can avoid estimating log Z_{π̂_t^K}(x) by simply replacing it with η/2 in (4.5)³ to obtain a cleaner objective:

π_{t+1} = argmin_π E_{x∼X, y∼π_t(·|x)} [ ( log( π(y|x) / π_t(y|x) ) − η ( P(y ≻ π̂_t^K | x) − 1/2 ) )^2 ].   (4.6)

Intuitively, if a tie occurs (i.e., P(y ≻ π̂_t^K | x) = 1/2), we prefer that the model does not update the weight at y. If y wins over π̂_t^K on average (i.e., P(y ≻ π̂_t^K | x) > 1/2), then we increase the probability density at y to exploit the advantage of y over π̂_t^K. In our experiments, we choose to minimize the objective (4.6).

³Assuming the winning probability between any given pair is either 1 or 0 with equal chance, when K → ∞, we can show that indeed Z_{π̂_t^K}(x) → e^{η/2}.
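The per-example objective in (4.6) (and (4.7) below) reduces to a simple squared loss on the log-ratio. The following PyTorch sketch is our own illustration, assuming the summed sequence log-probabilities under π_θ and the frozen previous-iteration policy π_t are already available; the default η = 1e3 follows the value reported in the experiments.

```python
import torch

def sppo_loss(policy_logp: torch.Tensor,
              ref_logp: torch.Tensor,
              win_rate: torch.Tensor,
              eta: float = 1e3) -> torch.Tensor:
    """( log[pi_theta(y|x) / pi_t(y|x)] - eta * (P_hat(y > pi_t | x) - 1/2) )^2, eq. (4.6).

    policy_logp, ref_logp: sequence log-probs log pi_theta(y|x) and log pi_t(y|x).
    win_rate: estimated P_hat(y > pi_t | x) in [0, 1].
    """
    log_ratio = policy_logp - ref_logp
    target = eta * (win_rate - 0.5)
    return ((log_ratio - target) ** 2).mean()

# Toy batch of three (prompt, response) pairs.
policy_logp = torch.tensor([-10.0, -12.0, -9.0], requires_grad=True)
ref_logp = torch.tensor([-11.0, -11.5, -9.5])
win_rate = torch.tensor([0.8, 0.4, 0.6])
loss = sppo_loss(policy_logp, ref_logp, win_rate, eta=1.0)
loss.backward()
print(loss.item(), policy_logp.grad)
```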
4.2 The SPPO Algorithm
Based on the aforementioned theoretical framework, we propose the Self-Play Preference Optimization algorithm in Algorithm 1.

Algorithm 1 Self-Play Preference Optimization (SPPO)
1: input: base policy π_{θ_1}, preference oracle P, learning rate η, number of generated samples K.
2: for t = 1, 2, . . . do
3:   Generate synthetic responses by sampling x ∼ X and y_{1:K} ∼ π_t(·|x).
4:   Annotate the win rates P(y_k ≻ y_{k′} | x), ∀k, k′ ∈ [K].
5:   Select responses from y_{1:K} to form the dataset D_t = {(x_i, y_i, P̂(y_i ≻ π_t | x_i))}_{i∈[N]}.
6:   Optimize π_{θ_{t+1}} according to (4.6):

        θ_{t+1} ← argmin_θ E_{(x, y, P̂(y ≻ π_t|x)) ∼ D_t} [ ( log( π_θ(y|x) / π_t(y|x) ) − η ( P̂(y ≻ π_t | x) − 1/2 ) )^2 ].   (4.7)

7: end for

In each round t, Algorithm 1 first generates K responses y_1, y_2, . . . , y_K according to π_t(·|x) for each prompt x (Line 3). Then, the preference oracle P is queried to calculate the win rates among the K responses (Line 4). At Line 5, certain criteria can be applied to determine which responses should be kept in the constructed dataset D_t, forming the prompt-response-probability triplets (x, y, P̂(y ≻ π_t | x)). We will discuss these design choices later in Section 5.1. One straightforward design choice is to include all K responses in D_t, with each P̂(y_i ≻ π_t | x) estimated by comparing y_i to all K responses. In total, O(K²) queries will be made. Then the algorithm will optimize (4.6) on the dataset D_t (Line 6).
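The data-collection part of Algorithm 1 (Lines 3-5) can be sketched as follows; `generate_responses` and `preference_oracle` are hypothetical placeholders for sampling from π_t and querying PairRM, and the O(K²) pairwise preferences are reduced to per-response win rates as in (4.5).

```python
import numpy as np

def build_sppo_dataset(prompts, generate_responses, preference_oracle, K=5):
    """One round of synthetic data collection (Algorithm 1, Lines 3-5).

    generate_responses(x, K) -> list of K responses sampled from pi_t(.|x).
    preference_oracle(y, y_prime, x) -> P(y > y' | x) in [0, 1].
    Returns triplets (x, y_i, P_hat(y_i > pi_t | x)) for all K responses.
    """
    dataset = []
    for x in prompts:
        ys = generate_responses(x, K)
        # O(K^2) oracle queries to fill the pairwise preference matrix.
        P = np.array([[preference_oracle(ys[i], ys[k], x) for k in range(K)]
                      for i in range(K)])
        win_rates = P.mean(axis=1)               # P_hat(y_i > pi_t | x)
        dataset.extend((x, ys[i], float(win_rates[i])) for i in range(K))
    return dataset

# Toy usage with a stub generator and a stub oracle.
rng = np.random.default_rng(0)
gen = lambda x, K: [f"{x} :: response {i}" for i in range(K)]
oracle = lambda y, yp, x: 0.5 if y == yp else float(rng.uniform(0, 1))
print(build_sppo_dataset(["prompt A"], gen, oracle, K=3)[:2])
```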
Comparison with DPO, IPO, and KTO In practice, we utilize mini-batches of more than two responses to estimate the win rate of a given response, while the DPO and IPO losses focus on a single pair of responses. When only a pair of responses y_w and y_l is available, we have the pairwise symmetric loss based on the preference triplet (x, y_w, y_l) defined as:

ℓ_SPPO(x, y_w, y_l; θ; π_ref) := ( log( π_θ(y_w|x) / π_ref(y_w|x) ) − η ( P(y_w ≻ y_l | x) − 1/2 ) )^2
                               + ( log( π_θ(y_l|x) / π_ref(y_l|x) ) − η ( P(y_w ≺ y_l | x) − 1/2 ) )^2,   (4.8)

where P(y_w ≻ y_l | x) can be either a soft probability within [0, 1] or a hard label 1 indicating y_w ≻ y_l.
We now compare the SPPO loss to other baselines assuming a hard label y_w ≻ y_l is given. For ease of comparison, let

a = β log( π_θ(y_w|x) / π_ref(y_w|x) ),   b = β log( π_θ(y_l|x) / π_ref(y_l|x) ),   c = β KL( π_θ ‖ π_ref );

then we have

ℓ_DPO(y_w, y_l, x) = −log σ(a − b),   (4.9)
ℓ_IPO(y_w, y_l, x) = [ (a − b) − 1 ]^2,   (4.10)
ℓ_KTO(y_w, y_l, x) = σ(−a + c) + σ(b − c)   (simplified),   (4.11)

where σ(x) = e^x / (e^x + 1), and the SPPO loss can be written as

ℓ_SPPO(y_w, y_l, x) = (a − 1/2)^2 + (b + 1/2)^2.
It can be seen that SPPO not only pushes the gap between a and b to be 1, but also attempts to push the value of a to be close to 1/2 and the value of b to be close to −1/2, so that π_θ(y_w|x) > π_ref(y_w|x) and π_θ(y_l|x) < π_ref(y_l|x). We believe this is particularly important: when there are plenty of preference pairs, DPO and IPO can ensure the policy will converge to the target policy, but when the preference pairs are scarce (e.g., one pair for each prompt), there is no guarantee that the estimated reward of the winner a will increase and the estimated reward of the loser b will decrease. Instead, only the reward gap between the winner and the loser (i.e., a − b) will increase. This phenomenon is observed by Pal et al. (2024): DPO only drives the loser's likelihood to be small, while the winner's likelihood barely changes. We believe that fitting β log( π_{t+1}(y|x) / π_t(y|x) ) directly to P(y ≻ π_t | x) − 1/2 is more effective than IPO, which attempts to fit β log( π_{t+1}(y_w|x) / π_t(y_w|x) ) − β log( π_{t+1}(y_l|x) / π_t(y_l|x) ) to P(y_w ≻ π_t | x) − P(y_l ≻ π_t | x).
In addition, SPPO shares a similar spirit with KTO. The KTO loss pushes a to be large by minimizing σ(−a + c) and pushes b to be small by minimizing σ(b − c). In contrast, SPPO pushes a to be as large as 1/2 and b to be as small as −1/2.
On the other hand, we would like to note that although DPO and KTO can be extended to their iterative variants, they are not by nature iterative algorithms and do not have provable guarantees that they can reach the Nash equilibrium. In contrast, SPPO and IPO are by design capable of solving for the Nash equilibrium iteratively. SPPO is superior to IPO because its design explicitly alleviates the data sparsity issue, as discussed above and detailed in Pal et al. (2024).
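To make this comparison concrete, the sketch below evaluates the four losses as functions of the implicit rewards a, b (and c for the simplified KTO loss) for a single hard-labeled pair; it illustrates equations (4.9)-(4.11) and the pairwise SPPO loss, and is not library code.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def losses(a: float, b: float, c: float = 0.0) -> dict:
    """a = beta * log-ratio of the winner, b = beta * log-ratio of the loser,
    c = beta * KL(pi_theta || pi_ref) (used only by the simplified KTO loss)."""
    return {
        "DPO":  -math.log(sigmoid(a - b)),          # eq. (4.9)
        "IPO":  ((a - b) - 1.0) ** 2,               # eq. (4.10)
        "KTO":  sigmoid(-a + c) + sigmoid(b - c),   # eq. (4.11), simplified
        "SPPO": (a - 0.5) ** 2 + (b + 0.5) ** 2,    # pairwise SPPO loss
    }

# Both settings have the same gap a - b = 1, so DPO and IPO cannot tell them
# apart, while SPPO prefers the one where the winner's reward actually rises.
print(losses(a=0.5, b=-0.5))   # winner up, loser down: SPPO loss = 0
print(losses(a=-1.0, b=-2.0))  # both rewards fall:     SPPO loss > 0
```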
5 Experiments
We conduct extensive experiments to show the performance of our method and compare it with
other baselines.

5.1 Experiment Setup
Base Model and Datasets We follow the experimental setup of Snorkel⁴, a model that utilizes iterative DPO to achieve state-of-the-art performance on AlpacaEval benchmarks. Specifically, we use Mistral-7B-Instruct-v0.2 as our base model⁵. Mistral-7B-Instruct-v0.2 is an instruction fine-tuned version of the Mistral-7B-v0.2 model (Jiang et al., 2023). We also adopt UltraFeedback (Cui et al., 2023) as our source of prompts, which includes around 60k prompts from diverse resources. During generation, we follow the standard chat template of Mistral-7B. To avoid overfitting during fine-tuning, we split the dataset into three portions and use only one portion per iteration. These settings were also adopted for training the model Snorkel-Mistral-PairRM-DPO⁶ (Snorkel). We follow the splitting in Snorkel for a fair comparison. Additionally, we use Llama-3-8B-Instruct⁷ as a stronger base model, along with the same preference dataset and data splitting.
Preference Model We employ PairRM (Jiang et al., 2023), an efficient pairwise preference model of size 0.4B. PairRM is based on DeBERTa-V3 and trained on high-quality human-preference datasets. Results on benchmarks like the Auto-J pairwise dataset (Li et al., 2023) show that it outperforms most language-model-based reward models and performs comparably with larger reward models like UltraRM-13B (Cui et al., 2023). We refer the readers to the homepage on Hugging Face⁸ for detailed benchmark results. We therefore keep PairRM as our ranking model, following Snorkel, for a balance between accuracy and efficiency.
Specifically, PairRM outputs a "relative reward" s(y, y′; x) that reflects the strength difference between y and y′, i.e.,

P(y ≻ y′ | x) = exp( s(y, y′; x) ) / ( 1 + exp( s(y, y′; x) ) ).

Unlike a Bradley-Terry-based reward model, PairRM only assigns a relative reward, which is not guaranteed to be transitive (i.e., s(y_1, y_2; x) + s(y_2, y_3; x) ≠ s(y_1, y_3; x) in general). So it indeed models a general preference.
Response Generation and Selection During the generation phase in each iteration, we use top-p = 1.0 and temperature 1.0 to sample from the current policy. We sample with different random seeds to get K = 5 different responses for each prompt. Previous works utilizing iterative DPO choose 2 responses to form a pair for each prompt. For a fair comparison, we do not include all K = 5 responses in the preference data but choose two responses among them. Following Snorkel, we choose the winner y_w and loser y_l to be the responses with the highest and lowest PairRM scores, which are defined for each response y_i as

s_PairRM(y_i; x) := (1/K) Σ_{k=1}^{K} s(y_i, y_k; x).
⁴ https://huggingface.co/snorkelai/Snorkel-Mistral-PairRM-DPO
⁵ https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
⁶ https://huggingface.co/snorkelai/Snorkel-Mistral-PairRM-DPO
⁷ https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
⁸ https://huggingface.co/llm-blender/PairRM

Probability Estimation We then estimate the win rate over the distribution by the average win rate over all the sampled responses, as explained in (4.5):

P̂(y_i ≻ π_t | x_i) = (1/K) Σ_{k=1}^{K} P(y_i ≻ y_k | x),   ∀i ∈ [K].
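Putting the pieces above together, the following sketch converts pairwise relative rewards into win probabilities, selects the winner/loser by the PairRM score, and computes the win-rate estimates; `relative_reward` is a stand-in for a call to PairRM, not its actual API.

```python
import numpy as np

def pairrm_pipeline(responses, relative_reward, x):
    """Given K responses for prompt x and a pairwise relative-reward function
    s(y, y'; x), compute win probabilities, PairRM scores, winner/loser, and
    the per-response win-rate estimates P_hat(y_i > pi_t | x)."""
    K = len(responses)
    s = np.array([[relative_reward(responses[i], responses[k], x) for k in range(K)]
                  for i in range(K)])
    P = 1.0 / (1.0 + np.exp(-s))          # P(y_i > y_k | x) = sigmoid(s(y_i, y_k; x))
    pairrm_score = s.mean(axis=1)         # s_PairRM(y_i; x) = (1/K) sum_k s(y_i, y_k; x)
    winner = int(pairrm_score.argmax())
    loser = int(pairrm_score.argmin())
    win_rates = P.mean(axis=1)            # P_hat(y_i > pi_t | x), ties count as 0.5
    return winner, loser, win_rates

# Toy usage with a hand-crafted antisymmetric relative reward.
toy_s = {("A", "B"): 1.0, ("A", "C"): 0.5, ("B", "C"): -0.4}
def rel(y, yp, x):
    if y == yp:
        return 0.0
    return toy_s.get((y, yp), -toy_s.get((yp, y), 0.0))
print(pairrm_pipeline(["A", "B", "C"], rel, x="prompt"))
```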
Hyperparameter Tuning The experiments are conducted on 8×Nvidia A100 GPUs. For SPPO, we trained three iterations in total. In each iteration, we selected the model trained on the first epoch of the 20k prompts from UltraFeedback to proceed to the next iteration. For both Mistral-7B-Instruct-v0.2 and Llama-3-8B-Instruct, the global training batch size is set to 64, and η is set to 1e3. The learning rate schedule is determined by the following hyperparameters: learning rate = 5.0e-7, number of total training epochs = 18, warmup ratio = 0.1, linear schedule. The best hyperparameters for each model are selected using the average win rate (judged by PairRM-0.4B) on a held-out subset of UltraFeedback as the metric. For more details on the win-rate comparison using PairRM as a judge, please refer to Section 5.2.
Baselines We evaluate the following base models as well as baseline methods for fine-tuning LLMs:

• Mistral-7B-Instruct-v0.2: Mistral-7B-Instruct-v0.2 is an instruction fine-tuned version of the Mistral-7B-v0.2 model (Jiang et al., 2023). It is the starting point of our algorithm.

• Snorkel (Mistral-PairRM-DPO): We directly evaluate the uploaded checkpoint on Hugging Face⁹. This model is obtained by three rounds of iterative DPO from Mistral-7B-Instruct-v0.2.

• (Iterative) DPO: We also implement the iterative DPO algorithm ourselves. The experimental settings and model selection schemes align with those used for SPPO, except for the adoption of the DPO loss function as defined in (4.9). Hyperparameters are optimized to maximize the average win rate assessed by PairRM at each iteration. Note that the practical algorithm in Rosset et al. (2024) is essentially the same as iterative DPO.

• (Iterative) IPO: We implement the iterative IPO algorithm ourselves. The experimental setting and the model selection scheme are the same as for iterative DPO, except that the loss function is the IPO loss (4.10). For a fair comparison, hyperparameters for IPO are also selected by evaluation using the average PairRM win rate on the held-out subset of UltraFeedback.

• Self-Rewarding LM: Yuan et al. (2024) proposed to prompt the LLM itself as a preference judge to construct new preference pairs and iteratively fine-tune the LLM with the DPO algorithm. We use the AlpacaEval 2.0 win rate reported by Yuan et al. (2024) for comparison. Note that the Self-Rewarding LM is trained from Llama 2 70B.

• Llama-3-8B-Instruct: Llama-3-8B-Instruct is an instruction-tuned model optimized for dialogue use cases and outperforms many of the available open-source chat models on common industry benchmarks.

⁹ https://huggingface.co/snorkelai/Snorkel-Mistral-PairRM-DPO

Benchmarks Following previous works, we use AlpacaEval 2.0 (Dubois et al., 2024), MT-Bench (Zheng et al., 2023), and the Open LLM Leaderboard (Beeching et al., 2023) as our evaluation benchmarks.

• AlpacaEval 2.0 is an LLM-based automatic evaluation benchmark. It employs AlpacaFarm (Dubois et al., 2023) as its prompt set, composed of general human instructions. The model responses and the reference responses generated by GPT-4-Turbo are fed into a GPT-4-Turbo-based annotator to be judged. We follow the standard approach and report the win rate over the reference responses.

• MT-Bench (Zheng et al., 2023) is a collection of 80 high-quality multi-turn open-ended questions. The questions cover topics like writing, role-playing, math, coding, etc. The generated answer is judged by GPT-4 and given a score directly, without pairwise comparison.

• Open LLM Leaderboard (Beeching et al., 2023) consists of six datasets, each of which focuses on a facet of language model evaluation. In detail, the evaluation rubric includes math problem-solving, language understanding, human falsehood mimicking, and reasoning. We follow the standard evaluation process, use in-context learning to prompt the language model, and compute the average score over the six datasets to measure performance.
5.2 Experimental Results
We evaluate the models on the three benchmarks described above. We also compare models based
on the pre-trained preference model PairRM.
Evaluation using GPT-4 as a judge In the assessment of AI chatbots, human evaluation remains the benchmark for quality and accuracy (Askell et al., 2021). However, due to its limitations in scalability and reproducibility, we explore the alternative approach of using the advanced capabilities of GPT-4 (OpenAI et al., 2023) as an automatic evaluation tool. We conduct GPT-4-based automatic evaluation on AlpacaEval 2.0 (Li et al., 2023) and MT-Bench (Zheng et al., 2023) to measure the chatbot capability of our model. The results can be found in Tables 1 and 2. When analyzing the MT-Bench results in Figure 2, we observe that the performance of the models consistently improves along the alignment iterations.
Table 1 presents the AlpacaEval 2.0 evaluation results on 805 prompts. We also include one column indicating the length-controlled win rate and one column on the average length of each model, to account for the tendency of the LLM-based judge to favor longer outputs, an issue colloquially termed the "reward hacking" phenomenon. According to the table, Mistral-7B-SPPO Iter3 has the highest win rate, 28.53% for the length-controlled version and 31.02% overall. The performance gains over previous iterations are 7.69% (Mistral-7B-Instruct → Iter1), 2.10% (Iter1 → Iter2), and 1.64% (Iter2 → Iter3), respectively, indicating steady improvements across iterations, as illustrated in Figure 1. We also apply SPPO to a stronger baseline model, Llama-3-8B-Instruct, and the fine-tuned model Llama-3-8B-SPPO achieves a higher length-controlled win rate of 38.77% and an overall win rate of 39.85%.

Table 1: AlpacaEval 2.0 evaluation of various models (detailed in Section 5.1) in terms of both normal and length-controlled (LC) win rates in percentage (%). The Mistral-7B-SPPO Iter3 model achieves the highest LC win rate of 28.53% and a normal win rate of 31.02%. SPPO demonstrates steady performance gains across iterations and outperforms other baselines, which show a tendency to produce longer responses. Additionally, re-ranking with the PairRM reward model (best-of-16) at test time consistently enhances the performance across all models, and SPPO (best-of-16) achieves a high win rate without strong external supervision like GPT-4. We additionally include the results obtained from fine-tuning Llama-3-8B-Instruct, which also show steady performance improvement.
Model                                    | LC Win Rate   | Win Rate      | Avg. Len
Mistral-7B-Instruct-v0.2                 | 17.11         | 14.72         | 1676
Mistral-7B-Instruct-v0.2 (best-of-16)    | 22.45         | 17.94         | 1529
Snorkel (Mistral-PairRM-DPO)             | 26.39         | 30.22         | 2736
Snorkel (Mistral-PairRM-DPO best-of-16)  | 29.97         | 34.86         | 2616
Self-Rewarding 70B Iter1                 | -             | 9.94          | 1092
Self-Rewarding 70B Iter2                 | -             | 15.38         | 1552
Self-Rewarding 70B Iter3                 | -             | 20.44         | 2552
Mistral-7B-DPO Iter1                     | 23.81         | 20.44         | 1723
Mistral-7B-DPO Iter2                     | 24.23         | 24.46         | 2028
Mistral-7B-DPO Iter3                     | 22.30         | 23.39         | 2189
Mistral-7B-IPO Iter1                     | 23.78         | 20.77         | 1693
Mistral-7B-IPO Iter2                     | 21.08         | 23.38         | 2660
Mistral-7B-IPO Iter3                     | 20.06         | 22.47         | 2760
Mistral-7B-SPPO Iter1                    | 24.79 (+7.69) | 23.51 (+8.79) | 1855
Mistral-7B-SPPO Iter2                    | 26.89 (+2.10) | 27.62 (+4.11) | 2019
Mistral-7B-SPPO Iter3                    | 28.53 (+1.64) | 31.02 (+3.40) | 2163
Mistral-7B-SPPO Iter1 (best-of-16)       | 28.71 (+6.26) | 27.77 (+9.83) | 1901
Mistral-7B-SPPO Iter2 (best-of-16)       | 31.23 (+2.52) | 32.12 (+4.35) | 2035
Mistral-7B-SPPO Iter3 (best-of-16)       | 32.13 (+0.90) | 34.94 (+2.82) | 2174
Llama-3-8B-Instruct                      | 22.92         | 22.57         | 1899
Llama-3-8B-SPPO Iter1                    | 31.73 (+8.81) | 31.74 (+9.17) | 1962
Llama-3-8B-SPPO Iter2                    | 35.15 (+3.42) | 35.98 (+4.24) | 2021
Llama-3-8B-SPPO Iter3                    | 38.77 (+3.62) | 39.85 (+3.87) | 2066
The performance gains are more significant: 8.81% (Llama-3-8B-Instruct→Iter1), 3.42% (Iter1→
Iter2), and 3.62% (Iter2→Iter3), summing up to a total gain of 15.85%.
Additionally, the data indicates that SPPO achieves superior performance compared to the
iterative variants of DPO and IPO. The length-controlled win rate for SPPO reaches 28.53%,
outperforming DPO's best rate of 26.39% (by Snorkel) and IPO's rate of 25.45%. Notably, DPO and IPO training tend to significantly increase the average output length (2736 and 2654, respectively).

Table 2: AlpacaEval 2.0 leaderboard results of both normal and length-controlled (LC) win rates in
percentage (%). Mistral-7B-SPPO can outperform larger models and Mistral-7B-SPPO (best-of-16)
can outperform proprietary models such as GPT-4 (06/13). Llama-3-8B-SPPO exhibits even better
performance.
Model                                    | LC Win Rate | Win Rate
GPT-4 Turbo                              | 50.0        | 50.0
Claude 3 Opus                            | 40.5        | 29.1
Llama-3-8B-SPPO Iter3                    | 38.8        | 39.9
GPT-4 0314                               | 35.3        | 22.1
Llama 3 70B Instruct                     | 34.4        | 33.2
Mistral-7B-SPPO Iter3 (best-of-16)       | 32.1        | 34.9
GPT-4 0613                               | 30.2        | 15.8
Snorkel (Mistral-PairRM-DPO best-of-16)  | 30.0        | 34.9
Mistral Medium                           | 28.6        | 21.9
Mistral-7B-SPPO Iter3                    | 28.5        | 31.0
Claude 2                                 | 28.2        | 17.2
Snorkel (Mistral-PairRM-DPO)             | 26.4        | 30.2
Gemini Pro                               | 24.4        | 18.2
Mistral 8×7B v0.1                        | 23.7        | 18.1
Llama 3 8B Instruct                      | 22.9        | 22.6
GPT-3.5 Turbo 0613                       | 22.7        | 14.1
Vicuna 33B v1.3                          | 17.6        | 12.7
In contrast, SPPO shows a more moderate length increase, moving from 1676 in the base model to
2163 at the third iteration. This suggests that SPPO improves the performance while more effectively
controlling the tendency towards longer output lengths compared to DPO and IPO. Finally, we
present the best-of-16 results for each model, selected using the PairRM reward model. We find
that re-ranking with the preference model at test time can consistently improve the performance of
base model (Mistral-7B-Instruct-v0.2), DPO (Snorkel), and SPPO (Iter3) by 5.34%, 3.57%, and
3.6%, respectively. Notably, this shows that while SPPO significantly enhances model alignment
using PairRM-0.4B as the sole external supervision, it has not resulted in over-optimization against
the preference model (Gao et al., 2023). Future work will explore further improvements in model
alignment, potentially through additional iterations beyond the current three (following Snorkel’s
methodology).
In Table 2, we compare SPPO on the AlpacaEval 2.0 leaderboard with other state-of-the-art AI
chatbots. We found our SPPO model outperforms many competing models trained on proprietary
alignment data (e.g., Claude 2, Gemini Pro, & Llama 3 8B Instruct). When applied to Llama 3
8B Instruct, our Llama-3-8B-SPPO exhibits an even higher win rate. With test-time reranking,
Mistral-7B-SPPO Iter3 (best-of-16) is even competitive to GPT-4 0613 and Llama 3 70B Instruct.
[Figure 1 shows two panels: (a) length-controlled (LC) win rate (%) and (b) raw win rate (%) on AlpacaEval 2.0 across Iter1-Iter3, comparing Mistral-7B-SPPO with Mistral-7B-DPO, Mistral-7B-IPO, and Self-Rewarding 70B, with GPT-4 0314, Snorkel (Mistral-PairRM-DPO), and Mistral-7B-Instruct-v0.2 as reference lines.]
Figure 1: Win rate against GPT-4-Turbo with (a) and without (b) length control (LC) on AlpacaEval 2.0. SPPO demonstrates steady improvements on both LC and raw win rates.

In Figure 2, we report the MT-Bench results: Mistral-7B-SPPO Iter3 outperforms all baseline models, achieving an average score of 7.59. While we are not
certain why the MT-Bench performance drops at the first two iterations, the performance of SPPO
at the final iteration still improves over the base model. Since the length-controlled AlpacaEval
2.0 has a 98% Pearson correlation with human evaluations and 10× more evaluation prompts, it likely provides a more reliable evaluation than MT-Bench. To gain a deeper understanding of the MT-Bench performance, we plot the improvement in Figure 2 (right) for each prompt category. Mistral-7B-SPPO Iter3 demonstrates notable gains in Roleplay, Reasoning, Math, and Coding tasks.
Open LLM Leaderboard We further evaluate the capabilities of SPPO models using the Huggingface Open LLM Leaderboard (Beeching et al., 2023). This leaderboard encompasses 6 different datasets, each focusing on a specific capability of LLMs: Arc (Clark et al., 2018), HellaSwag (Zellers et al., 2019), Winogrande (Sakaguchi et al., 2021), MMLU (Hendrycks et al., 2021), TruthfulQA (Lin et al., 2021), and GSM8k (Cobbe et al., 2021). The models are prompted with zero or few-shot exemplars. The results, presented in Table 3, demonstrate that SPPO can enhance the performance of the base model on Arc, TruthfulQA, and GSM8k, and achieve state-of-the-art performance with an average score of 66.75. However, these improvements do not hold in subsequent alignment iterations: DPO, IPO, and SPPO's performance declines after the first or second iterations. This limitation may be attributed to the "alignment tax" phenomenon (Askell et al., 2021), which suggests that aligning with human preferences (simulated by the PairRM preference in our study) might not improve, or may even hurt, general performance. Improving language model capabilities through alignment iterations remains a topic for future research, and we posit that incorporating high-quality SFT annotations (Chen et al., 2024) could play a significant role in this endeavor.
Evaluation using PairRM as a judge As SPPO identifies the von Neumann winner (see (3.3)) in a two-player constant-sum game, we examine the pairwise preferences among SPPO models and other baselines. The pairwise win rates, measured by PairRM, are depicted in Figure 3. We observe that in all algorithms, namely DPO, IPO, and SPPO, the newer model iterations surpass the previous ones.

Model                        | 1st Turn | 2nd Turn | Average
Mistral-7B-Instruct-v0.2     | 7.78     | 7.25     | 7.51
Snorkel (Mistral-PairRM-DPO) | 7.83     | 7.33     | 7.58
Mistral-7B-DPO Iter1         | 7.45     | 6.58     | 7.02
Mistral-7B-DPO Iter2         | 7.57     | 6.56     | 7.06
Mistral-7B-DPO Iter3         | 7.49     | 6.69     | 7.09
Mistral-7B-SPPO Iter1        | 7.63     | 6.79     | 7.21
Mistral-7B-SPPO Iter2        | 7.90     | 7.08     | 7.49
Mistral-7B-SPPO Iter3        | 7.84     | 7.34     | 7.59

[Radar chart over the MT-Bench categories Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, and Humanities (scores 0-10) for Mistral-7B-Instruct-v0.2 and Mistral-7B-SPPO Iter1-Iter3.]
Figure 2:MT-Bench Evaluation.Left: Mistral-7B-SPPO Iter3 outperforms all baseline models
by achieving an average score of 7.59. Despite initial drops in performance in the first two iterations,
SPPO Iter3 improves upon the base model by the final iteration. Right: Radar chart of MT-Bench
results for Mistral-7B-SPPO. SPPO Iter3 improves across different MT-Bench categories, showing
significant gains in RolePlay, Reasoning, Math, and Coding tasks.
Table 3:Open LLM Leaderboard Evaluation . SPPO fine-tuning improves the base model’s
performance on different tasks, reaching a state-of-the-art average score of 66.75 for Mistral-7B
and 70.29 for Llama-3-8B. For Mistral-7B, subsequent iterations of DPO, IPO, and SPPO see
a decline in performance. It is possible that aligning with human preferences (simulated by the
PairRM preference model in our study) may not always enhance, and can even detract from, overall
performance.
Models                   | Arc   | TruthfulQA | WinoGrande | GSM8k | HellaSwag | MMLU  | Average
Mistral-7B-Instruct-v0.2 | 63.65 | 66.85      | 77.98      | 41.93 | 84.89     | 59.15 | 65.74
Snorkel                  | 66.04 | 70.86      | 77.74      | 36.77 | 85.64     | 60.83 | 66.31
Mistral-7B-DPO Iter1     | 63.14 | 68.39      | 77.19      | 40.33 | 85.25     | 59.41 | 65.62
Mistral-7B-DPO Iter2     | 64.16 | 67.84      | 76.09      | 39.95 | 85.23     | 59.03 | 65.38
Mistral-7B-DPO Iter3     | 65.19 | 67.89      | 77.27      | 32.30 | 85.49     | 59.00 | 64.52
Mistral-7B-IPO Iter1     | 64.68 | 68.60      | 77.98      | 43.75 | 85.08     | 59.04 | 66.52
Mistral-7B-IPO Iter2     | 62.12 | 66.30      | 77.51      | 39.20 | 83.15     | 59.70 | 64.66
Mistral-7B-IPO Iter3     | 62.97 | 67.12      | 77.51      | 37.45 | 83.69     | 59.57 | 64.72
Mistral-7B-SPPO Iter1    | 65.02 | 69.40      | 77.82      | 43.82 | 85.11     | 58.84 | 66.67
Mistral-7B-SPPO Iter2    | 65.53 | 69.55      | 77.03      | 44.35 | 85.29     | 58.72 | 66.75
Mistral-7B-SPPO Iter3    | 65.36 | 69.97      | 76.80      | 42.68 | 85.16     | 58.45 | 66.40
Llama-3-8B-Instruct      | 62.29 | 51.65      | 76.09      | 75.89 | 78.73     | 65.59 | 68.37
Llama-3-8B-SPPO Iter1    | 63.82 | 54.96      | 76.40      | 75.44 | 79.80     | 65.65 | 69.35
Llama-3-8B-SPPO Iter2    | 64.93 | 56.48      | 76.87      | 75.13 | 80.39     | 65.67 | 69.91
Llama-3-8B-SPPO Iter3    | 65.19 | 58.04      | 77.11      | 74.91 | 80.86     | 65.60 | 70.29

[Figure 3 is a heatmap of pairwise win rates (judged by PairRM-0.4B) among IPO Iter3, SPPO Iter3, Snorkel (Mistral-PairRM-DPO), DPO Iter3, SPPO Iter2, IPO Iter2, DPO Iter2, SPPO Iter1, DPO Iter1, IPO Iter1, and Mistral-7B-Instruct-v0.2, with rows and columns ordered from strongest to weakest under PairRM.]
0.8 Figure 3: Pairwise win rates among base model (Mistral-7B-Instruct-v0.2), DPO models, IPO
models, and SPPO models usingPairRM-0.4Bas a judge, which may favor models with longer
outputs. On benchmarks with more powerful judge models (e.g., GPT-4), such as AlpacaEval 2.0
and MT-Bench, SPPO outperforms other baseline algorithms by a large margin.
For example, SPPO Iteration 3 outperforms SPPO Iteration 2. Both SPPO and IPO consistently outperform DPO across all iterations. While SPPO is superior to IPO in the first two iterations, IPO exceeds SPPO in performance during the final iteration. Considering the superior performance of SPPO on standard benchmarks evaluated by GPT-4 or against ground-truth answers (e.g., AlpacaEval 2.0, MT-Bench, and the Open LLM Leaderboard), along with IPO's tendency to produce longer sequence outputs (see Avg. Len in Table 1), we believe this is due to IPO exploiting the length bias in PairRM that favors longer sequences. Conversely, SPPO models benefit from a more robust regularization within the multiplicative weight update framework.
5.3 Ablation Study
We study the effect of the mini-batch size when estimating the win rate P(y ≻ π_t | x). Specifically, for each prompt, we still generate 5 responses and choose the winner y_w and loser y_l according to the PairRM score. When estimating the probability, we vary the batch size to be K = 2, 3, 5.

Mini-Batch Size | Iteration | LC Win Rate | Raw Win Rate | Avg. Len (chars)
K = 2           | Iter1     | 23.85       | 23.53        | 1948
K = 2           | Iter2     | 26.91       | 27.24        | 1999
K = 2           | Iter3     | 28.26       | 28.22        | 1961
K = 5           | Iter1     | 24.79       | 23.51        | 1855
K = 5           | Iter2     | 26.89       | 27.62        | 2019
K = 5           | Iter3     | 28.53       | 31.02        | 2163

[Line plot of LC win rate (%) over Iter1-Iter3 for SPPO (K=2) and SPPO (K=5), with Snorkel (Mistral-PairRM-DPO) and Mistral-7B-Instruct-v0.2 as reference lines.]
Figure 4: AlpacaEval 2.0 evaluation of SPPO with different mini-batch sizes, in terms of both normal and length-controlled (LC) win rates in percentage (%). K = 2, 5 denote different mini-batch sizes used when estimating the win rate P(y ≻ π_t | x).
For K = 2, we estimate P(y ≻ π_t | x) with only the 2 samples y_w and y_l:

P̂(y_w ≻ π_t | x) = ( P(y_w ≻ y_w | x) + P(y_w ≻ y_l | x) ) / 2 = ( 1/2 + P(y_w ≻ y_l | x) ) / 2,

and P̂(y_l ≻ π_t | x) similarly. K = 5 indicates the original setting we use.
We compare the results on AlpacaEval 2.0, as shown in Figure 4. We find that the performance of SPPO is robust to the noise in estimating P(y ≻ π_t | x). While K = 5 initially outperforms K = 2 in the first iteration, the difference in their performance diminishes in subsequent iterations. Additionally, we observe that K = 2 exhibits a reduced tendency to increase output length.
6 Conclusions
This paper introduced Self-Play Preference Optimization (SPPO), an approach to fine-tuning Large Language Models (LLMs) from human/AI feedback. SPPO has demonstrated significant improvements over existing methods such as DPO and IPO across multiple benchmarks, including AlpacaEval 2.0, MT-Bench, and the Open LLM Leaderboard. By integrating a preference model and employing a batched estimation process, SPPO aligns LLMs more closely with human preferences and avoids common pitfalls such as "length bias" reward hacking.

Limitations Theoretically, approximating the optimal policy update via regression relies on the assumption that the model class is expressive enough and that the input space is well covered by the generated data. Replacing the log-partition factor with a constant can become inaccurate if the preference model behaves differently. Due to limited computational resources, the experiments are run on a single dataset, UltraFeedback, and the models are tested on a few benchmarks; with more computational resources, the proposed method could be further validated on more models, datasets, and benchmarks for a more holistic evaluation.

Acknowledgement
We would like to express our gratitude to Alekh Agarwal for his insightful comments that helped clarify the technical contributions of our work and its connection to and distinction from SPO (Swamy et al., 2024). We would also like to thank Wen Sun for pointing out the concurrent work (Gao et al., 2024) and for the valuable discussion on the design choice of objective functions.
A Proof of Theorem 4.1
Proof of Theorem 4.1. Supposing the optimization problem is realizable, we have exactly that

π_{t+1}(y|x) ∝ π_t(y|x) exp( η P(y ≻ π_t | x) ),   for t = 1, 2, . . . .   (A.1)

To prove that the exponential weight update can induce the optimal policy, we directly invoke a restated version of Theorem 1 in Freund and Schapire (1999):

Lemma A.1 (Theorem 1 in Freund and Schapire (1999), restated). For any oracle P and for any sequence of mixed policies µ_1, µ_2, . . . , µ_T, the sequence of policies π_1, π_2, . . . , π_T produced by (A.1) satisfies:

Σ_{t=1}^{T} P(π_t ≺ µ_t) ≤ min_π [ ( η / (1 − e^{−η}) ) Σ_{t=1}^{T} P(π ≺ µ_t) + KL(π‖π_0) / (1 − e^{−η}) ].
By setting µ_t = π_t, we have that

T/2 ≤ min_π [ ( ηT / (1 − e^{−η}) ) P(π ≺ π̄_T) + KL(π‖π_0) / (1 − e^{−η}) ],

where the LHS comes from P(π_t ≺ π_t) = 1/2 and the RHS comes from (1/T) Σ_{t=1}^{T} P(π ≺ π_t) = P(π ≺ π̄_T). Now rearranging terms gives

(1 − e^{−η}) / (2η) ≤ min_π [ P(π ≺ π̄_T) + KL(π‖π_0) / (ηT) ].

We can naively bound the KL divergence as KL(π‖π_0) ≤ ‖log π_0(·)‖_∞, which can be seen as a (large) constant. By choosing η = √( ‖log π_0(·)‖_∞ / T ), we have

1/2 − (1/4) √( ‖log π_0(·)‖_∞ / T ) + O(T^{−1}) ≤ min_π P(π ≺ π̄_T) + √( ‖log π_0(·)‖_∞ / T ),

where the LHS comes from the Taylor expansion (1 − e^{−η}) / (2η) = 1/2 − η/4 + O(η²). Notice that the 1/2 on the LHS is already the value of the symmetric two-player constant-sum game. This shows that for appropriately chosen η and T, the mixture policy π̄_T is close to the minimax optimal policy (Nash equilibrium).
The optimality gap is thus bounded by

max_π P(π ≻ π̄_T) − min_π P(π ≺ π̄_T)
= max_π ( 1 − P(π ≺ π̄_T) ) − min_π P(π ≺ π̄_T)
= 2 ( 1/2 − min_π P(π ≺ π̄_T) )
= O(1/√T).
B Response Examples in Different Iterations

Table 4: Generation example of our fine-tuned model by SPPO at different iterations.
Prompt You will be given a definition of a task first, then some input of the task. You are
given a dialog between 2 or more individuals. The dialog will include information
as to the relationship between 2 of the participants. At the end, you will be asked
to provide an answer as to the relationship of those 2 participants. In this scenario,
’0’ will represent a spousal relationship, ’1’ will represent siblings, and ’2’ will
represent a child/parent relationship. Speaker 1: I’ll be right with you. Okay?
Thanks, Wendy. Speaker 2: Oh my God! How cute is the on-call doctor? Speaker
3: Ooh, so cute, that I’m thinking about jamming this pen in my eye. Speaker 4:
Dr. Burke will see you know. Speaker 2: Oh no-no-no, not Dr. Burke. Dr. Burke
is out of town. The-the on-call doctor will see me now. Speaker 4: Dr. Richard
Burke is out of town. Dr. Timothy Burke, his son, will see you now. Speaker 5:
Ready? What is the relationship between Speaker 5 and Richard Burke? Output:
Mistral-7BThe relationship between Speaker 5 and Richard Burke is that of a parent and
child, as Richard Burke is identified as the son of another character in the dialog.
Iteration 1The relationship between Speaker 5 and Richard Burke is that of a father and
child. Based on the dialogue, it was revealed that Dr. Timothy Burke is the on-call
doctor and the son of Dr. Richard Burke.
Iteration 2 The relationship between Speaker 5 and Richard Burke is that of a father and
child. The conversation reveals that Dr. Timothy Burke is the on-call doctor and
he is identified as the son of Dr. Richard Burke.
Iteration 3 Speaker 5 is the individual asking about the relationship, so they are not directly
related to Dr. Richard Burke. The information provided in the dialog reveals
that Dr. Timothy Burke is the on-call doctor and the son of Dr. Richard Burke.
Therefore, the answer is: The relationship between Speaker 5 and Dr. Richard
Burke is that of a child (Dr. Timothy Burke) and parent. Speaker 5 is inquiring
about the relationship, but they are not themselves a part of it. The specified
relationship codes (0 for spousal, 1 for sibling, 2 for child/parent) align with
this description: Speaker 5 (0, 1, or 2) and Dr. Richard Burke (2) connect as a
parent-child duo. Output: 2 (child/parent relationship) In this scenario, Speaker
5’s role is to determine the relationship between herself and Dr. Richard Burke,
but since they are not related (she is an outside inquirer), the correct answer
pertains only to the relationship identified within the dialog: Dr. Richard Burke
(father) and Dr. Timothy Burke (son).

Table 5: Another generation example of our fine-tuned model by SPPO at different iterations.
Prompt Q: Context: Roman Brady is a fictional character from the long running NBC soap
opera, Days of Our Lives. The role was originated in 1981 by Wayne Northrop
who played the role until 1984, was filled by Drake Hogestyn from 1986 – 1991,
again by Northrop from 1991 – 1994, and is currently being portrayed by series
veteran Josh Taylor, who stepped into the role in 1997. Question: who is roman
brady on days of our lives? Let’s give stream of consciousness below
Mistral-7B Roman Brady is a fictional character on the NBC soap opera ”Days of Our Lives.”
He was first portrayed by Wayne Northrop from 1981 to 1984. Later, Drake
Hogestyn took over the role from 1986 to 1991. Northrop returned to the role from
1991 to 1994. Since 1997, the character is being portrayed by Josh Taylor. Roman
Brady is a significant character in the show’s long-running storylines.
Iteration 1 Roman Brady is a fictional character on the NBC soap opera ”Days of Our Lives.”
He was first portrayed by Wayne Northrop from 1981 to 1984. Drake Hogestyn
took over the role from 1986 to 1991. Northrop returned to play Roman again
from 1991 to 1994. Since 1997, the character has been portrayed by Josh Taylor.
So, in summary, Roman Brady is a long-standing character on ”Days of Our Lives”
who has been played by three different actors over the years: Wayne Northrop
(1981-1984, 1991-1994), Drake Hogestyn (1986-1991), and currently by Josh Taylor
(1997-present).
Iteration 2 Roman Brady is a fictional character on the NBC soap opera ”Days of Our Lives.”
He was first portrayed by Wayne Northrop from 1981 to 1984. Drake Hogestyn
took over the role from 1986 to 1991. Northrop returned to play Roman again
from 1991 to 1994. Since 1997, the character has been portrayed by Josh Taylor.
Overall, Roman Brady is a long-standing and complex figure in the show’s history,
with multiple actors bringing their unique interpretations to the role.
Iteration 3 Roman Brady is a fictional character on the NBC soap opera ”Days of Our Lives.”
He was originally played by Wayne Northrop from 1981 to 1984. Drake Hogestyn
took over the role from 1986 to 1991. Northrop returned to portray Roman
once more from 1991 to 1994. Since 1997, the character has been portrayed by
Josh Taylor. Roman Brady is a complex and prominent figure in the long-running
daytime drama. Initially played by Northrop in the early 1980s, his tenure included
involvement in various storylines, such as romantic relationships and business
dealings. Hogestyn’s portrayal during the late 1980s and early 1990s introduced
new dimensions to the character. Northrop’s second stint, from 1991 to 1994,
further developed Roman’s narrative. Since 1997, Josh Taylor has assumed the
role, continuing to contribute to the evolving story of Roman Brady on ”Days of
Our Lives.” Throughout these different portrayals, Roman has remained a central
character, known for his intricate relationships, business ventures, and personal
struggles.

References
Askell, A.,Bai, Y.,Chen, A.,Drain, D.,Ganguli, D.,Henighan, T.,Jones, A.,Joseph,
N.,Mann, B.,DasSarma, N. et al.(2021). A general language assistant as a laboratory for
alignment.arXiv preprint arXiv:2112.00861.
Azar, M. G.,Rowland, M.,Piot, B.,Guo, D.,Calandriello, D.,Valko, M.andMunos,
R.(2023). A general theoretical paradigm to understand learning from human preferences.arXiv
preprint arXiv:2310.12036.
Beeching, E.,Fourrier, C.,Habib, N.,Han, S.,Lambert, N.,Rajani, N.,Sanseviero,
O.,Tunstall, L.andWolf, T.(2023a). Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard.
Beeching, E.,Fourrier, C.,Habib, N.,Han, S.,Lambert, N.,Rajani, N.,Sanseviero, O.,
Tunstall, L.andWolf, T.(2023b). Open llm leaderboard.Hugging Face.
Bradley, R. A.andTerry, M. E.(1952). Rank Analysis of Incomplete Block Designs: I. The
Method of Paired Comparisons. Biometrika 39 324–345.
Calandriello, D.,Guo, D.,Munos, R.,Rowland, M.,Tang, Y.,Pires, B. A.,Richemond,
P. H.,Lan, C. L.,Valko, M.,Liu, T. et al.(2024). Human alignment of large language
models through online preference optimisation.arXiv preprint arXiv:2403.08635.
Chen, Z.,Deng, Y.,Yuan, H.,Ji, K.andGu, Q.(2024). Self-play fine-tuning converts weak
language models to strong language models.arXiv preprint arXiv:2401.01335.
Christiano, P. F.,Leike, J.,Brown, T.,Martic, M.,Legg, S.andAmodei, D.(2017).
Deep reinforcement learning from human preferences.Advances in neural information processing
systems30.
Clark, P.,Cowhey, I.,Etzioni, O.,Khot, T.,Sabharwal, A.,Schoenick, C.andTafjord,
O.(2018). Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv
preprint arXiv:1803.05457.
Cobbe, K.,Kosaraju, V.,Bavarian, M.,Chen, M.,Jun, H.,Kaiser, L.,Plappert, M.,
Tworek, J.,Hilton, J.,Nakano, R. et al.(2021). Training verifiers to solve math word
problems.arXiv preprint arXiv:2110.14168.
Cui, G.,Yuan, L.,Ding, N.,Yao, G.,Zhu, W.,Ni, Y.,Xie, G.,Liu, Z.andSun, M.
(2023). Ultrafeedback: Boosting language models with high-quality feedback.arXiv preprint
arXiv:2310.01377.
Dubois, Y.,Galambosi, B.,Liang, P.andHashimoto, T. B.(2024a). Length-controlled
alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475.
Dubois, Y.,Li, C. X.,Taori, R.,Zhang, T.,Gulrajani, I.,Ba, J.,Guestrin, C.,Liang,
P. S.andHashimoto, T. B.(2024b). Alpacafarm: A simulation framework for methods that
learn from human feedback.Advances in Neural Information Processing Systems36.

Dudík, M.,Hofmann, K.,Schapire, R. E.,Slivkins, A.andZoghi, M.(2015). Contextual
dueling bandits. InConference on Learning Theory. PMLR.
Ethayarajh, K.,Xu, W.,Muennighoff, N.,Jurafsky, D.andKiela, D.(2024). Kto: Model
alignment as prospect theoretic optimization.arXiv preprint arXiv:2402.01306.
Freund, Y.andSchapire, R. E.(1997). A decision-theoretic generalization of on-line learning
and an application to boosting. Journal of Computer and System Sciences 55 119–139.
Freund, Y.andSchapire, R. E.(1999). Adaptive game playing using multiplicative weights.
Games and Economic Behavior 29 79–103.
Gao, L.,Schulman, J.andHilton, J.(2023). Scaling laws for reward model overoptimization.
InInternational Conference on Machine Learning. PMLR.
Gao, Z.,Chang, J. D.,Zhan, W.,Oertell, O.,Swamy, G.,Brantley, K.,Joachims, T.,
Bagnell, J. A.,Lee, J. D.andSun, W.(2024). Rebel: Reinforcement learning via regressing
relative rewards.arXiv preprint arXiv:2404.16767.
Haarnoja, T.,Zhou, A.,Abbeel, P.andLevine, S.(2018). Soft actor-critic: Off-policy
maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference
on machine learning. PMLR.
He, P.,Gao, J.andChen, W.(2021). Debertav3: Improving deberta using electra-style
pre-training with gradient-disentangled embedding sharing.
Hendrycks, D.,Burns, C.,Basart, S.,Zou, A.,Mazeika, M.,Song, D.andSteinhardt, J.
(2020). Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300.
Hong, J.,Lee, N.andThorne, J.(2024). Reference-free monolithic preference optimization with
odds ratio.arXiv preprint arXiv:2403.07691.
Ji, K.,He, J.andGu, Q.(2024). Reinforcement learning from human feedback with active queries.
arXiv preprint arXiv:2402.09401.
Jiang, A. Q.,Sablayrolles, A.,Mensch, A.,Bamford, C.,Chaplot, D. S.,Casas, D.
d. l.,Bressand, F.,Lengyel, G.,Lample, G.,Saulnier, L. et al.(2023a). Mistral 7b.
arXiv preprint arXiv:2310.06825.
Jiang, D.,Ren, X.andLin, B. Y.(2023b). Llm-blender: Ensembling large language models with
pairwise ranking and generative fusion.arXiv preprint arXiv:2306.02561.
Li, J.,Sun, S.,Yuan, W.,Fan, R.-Z.,Zhao, H.andLiu, P.(2023a). Generative judge for
evaluating alignment.arXiv preprint arXiv:2310.05470.
Li, X.,Zhang, T.,Dubois, Y.,Taori, R.,Gulrajani, I.,Guestrin, C.,Liang, P.and
Hashimoto, T. B.(2023b). Alpacaeval: An automatic evaluator of instruction-following models.
https://github.com/tatsu-lab/alpaca_eval.
Lin, S.,Hilton, J.andEvans, O.(2021). Truthfulqa: Measuring how models mimic human
falsehoods.arXiv preprint arXiv:2109.07958.

Liu, T.,Zhao, Y.,Joshi, R.,Khalman, M.,Saleh, M.,Liu, P. J.andLiu, J.(2023).
Statistical rejection sampling improves preference optimization. arXiv preprint arXiv:2309.06657.
Lou, H.,Jin, T.,Wu, Y.,Xu, P.,Gu, Q.andFarnoud, F.(2022). Active ranking without
strong stochastic transitivity.Advances in neural information processing systems.
Munos, R.,Valko, M.,Calandriello, D.,Azar, M. G.,Rowland, M.,Guo, Z. D.,Tang,
Y.,Geist, M.,Mesnard, T.,Michi, A. et al.(2023). Nash learning from human feedback.
arXiv preprint arXiv:2312.00886.
OpenAI,Achiam, J.,Adler, S.,Agarwal, S.,Ahmad, L.,Akkaya, I.,Aleman, F. L.,
Almeida, D.,Altenschmidt, J.,Altman, S.,Anadkat, S. et al.(2023). Gpt-4 technical
report.arXiv preprint arXiv:2303.08774.
Ouyang, L.,Wu, J.,Jiang, X.,Almeida, D.,Wainwright, C.,Mishkin, P.,Zhang, C.,
Agarwal, S.,Slama, K.,Ray, A. et al.(2022). Training language models to follow instructions
with human feedback. Advances in Neural Information Processing Systems 35 27730–27744.
Pal, A.,Karkhanis, D.,Dooley, S.,Roberts, M.,Naidu, S.andWhite, C.(2024). Smaug:
Fixing failure modes of preference optimisation with dpo-positive. arXiv preprint arXiv:2402.13228.
Rafailov, R.,Sharma, A.,Mitchell, E.,Manning, C. D.,Ermon, S.andFinn, C.(2024).
Direct preference optimization: Your language model is secretly a reward model.Advances in
Neural Information Processing Systems36.
Rosset, C.,Cheng, C.-A.,Mitra, A.,Santacroce, M.,Awadallah, A.andXie, T.(2024).
Direct nash optimization: Teaching language models to self-improve with general preferences.
arXiv preprint arXiv:2404.03715.
Sakaguchi, K.,Bras, R. L.,Bhagavatula, C.andChoi, Y.(2021). Winogrande: An
adversarial winograd schema challenge at scale. Communications of the ACM 64 99–106.
Schulman, J.,Wolski, F.,Dhariwal, P.,Radford, A.andKlimov, O.(2017). Proximal
policy optimization algorithms.arXiv preprint arXiv:1707.06347.
Singh, A.,Co-Reyes, J. D.,Agarwal, R.,Anand, A.,Patil, P.,Liu, P. J.,Harrison,
J.,Lee, J.,Xu, K.,Parisi, A. et al.(2023). Beyond human data: Scaling self-training for
problem-solving with language models.arXiv preprint arXiv:2312.06585.
Swamy, G.,Dann, C.,Kidambi, R.,Wu, Z. S.andAgarwal, A.(2024). A minimaximalist
approach to reinforcement learning from human feedback.arXiv preprint arXiv:2401.04056.
Thurstone, L.(1927). A law of comparative judgment. Psychological Review 34 273.
Tversky, A.(1969). Intransitivity of preferences. Psychological Review 76 31.
Wang, Y.,Liu, Q.andJin, C.(2024). Is rlhf more difficult than standard rl? a theoretical
perspective.Advances in Neural Information Processing Systems36.

Wu, Y.,Jin, T.,Di, Q.,Lou, H.,Farnoud, F.andGu, Q.(2023). Borda regret minimization for
generalized linear dueling bandits. InICML 2023 Workshop The Many Facets of Preference-Based
Learning.
Xiong, W.,Dong, H.,Ye, C.,Zhong, H.,Jiang, N.andZhang, T.(2023). Gibbs sampling from
human feedback: A provable kl-constrained framework for rlhf.arXiv preprint arXiv:2312.11456.
Xu, J.,Lee, A.,Sukhbaatar, S.andWeston, J.(2023). Some things are more cringe than
others: Preference optimization with the pairwise cringe loss.arXiv preprint arXiv:2312.16682.
Ye, C.,Xiong, W.,Zhang, Y.,Jiang, N.andZhang, T.(2024). A theoretical analysis
of nash learning from human feedback under general kl-regularized preference.arXiv preprint
arXiv:2402.07314.
Yuan, W.,Pang, R. Y.,Cho, K.,Sukhbaatar, S.,Xu, J.andWeston, J.(2024). Self-
rewarding language models.arXiv preprint arXiv:2401.10020.
Zellers, R.,Holtzman, A.,Bisk, Y.,Farhadi, A.andChoi, Y.(2019). Hellaswag: Can a
machine really finish your sentence?arXiv preprint arXiv:1905.07830.
Zhao, Y.,Joshi, R.,Liu, T.,Khalman, M.,Saleh, M.andLiu, P. J.(2023). Slic-hf: Sequence
likelihood calibration with human feedback.arXiv preprint arXiv:2305.10425.
Zheng, L.,Chiang, W.-L.,Sheng, Y.,Zhuang, S.,Wu, Z.,Zhuang, Y.,Lin, Z.,Li, Z.,Li,
D.,Xing, E. et al.(2023). Judging llm-as-a-judge with mt-bench and chatbot arena.Advances
in Neural Information Processing Systems36.
Zheng, L.,Chiang, W.-L.,Sheng, Y.,Zhuang, S.,Wu, Z.,Zhuang, Y.,Lin, Z.,Li, Z.,Li,
D.,Xing, E. et al.(2024). Judging llm-as-a-judge with mt-bench and chatbot arena.Advances
in Neural Information Processing Systems36.
Zhu, B.,Jiao, J.andJordan, M. I.(2023). Principled reinforcement learning with human
feedback from pairwise or k-wise comparisons. arXiv preprint arXiv:2301.11270.