Direct Nash Optimization:
Teaching Language Models to Self-Improve with General Preferences
Corby Rosset∗, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah∗, Tengyang Xie∗
Microsoft Research
Abstract
This paper studies post-training large language models (LLMs) using preference feedback from a powerful oracle to help a model iteratively improve over itself. The typical approach for post-training LLMs involves Reinforcement Learning from Human Feedback (RLHF), which traditionally separates reward learning and subsequent policy optimization. However, such a reward maximization approach is limited by the nature of "point-wise" rewards (such as those of the Bradley-Terry model), which cannot express complex intransitive or cyclic preference relations. While advances in RLHF show that reward learning and policy optimization can be merged into a single contrastive objective for stability, they still remain tethered to the reward maximization framework. Recently, a new wave of research sidesteps the reward maximization presumptions in favor of directly optimizing over "pair-wise" or general preferences. In this paper, we introduce Direct Nash Optimization (DNO), a provable and scalable algorithm that marries the simplicity and stability of contrastive learning with the theoretical generality of optimizing general preferences. Because DNO is a batched on-policy algorithm using a regression-based objective, its implementation is straightforward and efficient. Moreover, DNO enjoys monotonic improvement across iterations, which helps it improve even over a strong teacher (such as GPT-4). In our experiments, a resulting 7B parameter Orca-2.5 model aligned by DNO achieves a state-of-the-art win-rate against GPT-4-Turbo of 33% on AlpacaEval 2.0 (even after controlling for response length), an absolute gain of 26% (7% → 33%) over the initializing model. It outperforms models with far more parameters, including Mistral Large, Self-Rewarding LM (70B parameters), and older versions of GPT-4. Our ablation studies analyze critical design decisions surrounding the choice of preference pairs and the use of LLMs as preference annotators. These results underscore the promise of DNO for LLM post-training, as well as offer actionable insights for the AI research community.
1 Introduction
The field of artificial intelligence is evolving towards advanced models that can understand, reason, follow complex
instructions, and create nuanced content, while aligning with human values and preferences. Large Language Models
(LLMs) have demonstrated
remarkable capabilities in generating human-like text, answering questions, and coding, yet they still face challenges
in tasks that require a high degree of reliability, safety, and ethical alignment. To address these challenges, fine-tuning
LLMs using Reinforcement Learning from Human Feedback (RLHF) (e.g., Christiano et al., 2017; Ouyang et al., 2022) has demonstrated strong potential for making LLMs more helpful by aligning them with human values.
The RLHF framework has been long studied in the context of preference-based reinforcement learning (RL) or RL
from human preferences (e.g., Christiano et al., 2017). The conventional methods for RLHF typically assume that the preference is determined by a scalar reward function through some model, such as the frequently used Bradley-Terry (BT) model (Bradley and Terry, 1952).1 RLHF then optimizes toward the preference in a two-step procedure: reward learning, and policy
optimization (through RL) to maximize the learned reward. Under certain conditions, the two-step procedure can be
∗Correspondence to {corbyrosset, hassanam, tengyangxie}@microsoft.com
1We use “reward model” to denote a framework that translates preferences into rewards, e.g., Bradley-Terry, while “reward function” is a (possibly
learned) function that outputs reward scalars.
Figure 1: Direct Nash Optimization achieves state-of-the-art results for a 7B parameter large language model, being the first to surpass 30% in both raw win-rate and length-controlled (LC) win-rate against GPT-4-Turbo. Win Rate and LC Win Rate have 0.93 to 0.98 correlation with ChatBot Arena scores.
streamlined into a single-step contrastive learning approach (Rafailov et al., 2023), eliminating the need for explicit reward learning. Algorithms of this kind (e.g., DPO) leverage the insight that a policy can be expressed equivalently by an "internal reward function" for which the policy is optimal, so they reduce the RLHF problem to regressing the policy's internal reward function to that of the preference model. These algorithms are originally
offline, and boast enhanced stability and ease of optimization. Nonetheless, two-step RLHF algorithms and their
single-step contrastive variants still fundamentally rely on the reward maximization framework, wherein reward-based
preferences are governed by, e.g., the BT model.
The reward maximization framing poses a major limitation. Reward functions, defined to output a scalar score r(x, y) for a single response y to input x, cannot express general preferences y ≻ y′ | x between a pair of outputs in all cases, e.g., intransitive or cyclic preferences (Elo, 1978). Hence, LLMs trained under reward maximization cannot always align with human preference. Furthermore, recent works show that even in settings where preferences can be perfectly expressed under reward-based BT models, optimizing towards rewards yields problematic behaviors; we refer the reader to, e.g., Azar et al. (2023) for more details. Lastly, reward functions in practice can quickly become "stale" as the distribution of the policy shifts under training (e.g., Ross et al., 2011), leaving them vulnerable to "reward hacking" (Amodei et al., 2016).
In response to these weaknesses, an appealing line of work on RLHF proposes to directly optimize the general preference function itself, instantiated as some oracle. These studies re-frame RLHF as finding a Nash equilibrium of a two-player game with "payoffs" given by a regularized (Munos et al., 2023) or un-regularized (Swamy et al., 2024) general preference function. To solve it, they further approximate the Nash equilibrium using single-player algorithms by leveraging the symmetry of the preference function. They then define the reward of a response as the expected win-rate against the policy's own behavior, as judged by the preference function, e.g., r(x, y) = E_{y′∼π(·|x)}[P(y ≻ y′ | x)]. Hence, rewards are maximized by responses that are preferred over the policy's expected response, and a Nash equilibrium is achieved when both players deploy a π⋆ that is preferred over any competing policy. However, these proposed single-player algorithms are primarily (purely) on-policy, and they sometimes require a separately estimated preference function or a time-varying reward function. How to scale these algorithms up faithfully is still under-investigated.
We are motivated to overcome two separate challenges: the limited expressivity of reward-based RLHF, and the lack of clarity on how to scale up optimization with respect to general preferences. Recent advances in reward-based optimization, e.g., DPO, already have efficient and scalable implementations; we seek a similarly efficient solution under the framework of general preferences.
We propose a provable and scalable RLHF algorithm, Direct Nash Optimization (DNO) (Algorithm 1), that achieves the best of both worlds, combining the scalability of contrastive objectives with the theoretical soundness of general preference optimization. DNO is designed as a batched on-policy algorithm with a regression-based learning objective; this design choice makes DNO stable and scalable, striking a balance between deployment efficiency and adaptability. We summarize the key ingredients and insights of DNO at a high level below.
1. Foremost, DNO operates on the premise that the notion of reward ought to be expressed as an expected win-rate with regard to a general preference function.2
2. Because directly optimizing this expected win-rate with online RL is sample-inefficient or unstable, we decompose the learning procedure into a sequence of "batched on-policy" iterations, wherein each step instead optimizes a simple regression objective.
3. The regression target is the expected win-rate compared with the policy itself (as defined in Line 3 of Algorithm 1). By sampling outputs from the current policy to use for training (i.e., "self-play"), this procedure incentivizes self-improving behavior.
4. The framework also admits off-policy samples into training, importantly, samples from a more powerful teacher (see the choice of µ1 and µ2 in Algorithm 1).
5. To ensure stability and computational efficiency, the regression is only performed on preference pairs with a sufficiently large margin (for theoretical explanation, see Section 4; in practice, see Section 5).
6. DNO repeats this procedure for multiple iterations to let the policy optimize toward the general preference. Since each step involves a regression problem, it can be easily implemented at scale.
Theoretically, we prove that DNO converges to the intended Nash equilibrium on average, and that it can improve monotonically across iterations (see Section 3.1). Furthermore, our finite-sample analysis shows that the approximation error at any iteration between the learned policy and its target is tightly bounded (Theorem 1).
On the practical side, we provide a scalable implementation of DNO (Algorithm 2): an iterative self-improving algorithm with contrastive updates, which approximates Algorithm 1 by sampling multiple online outputs from the policy being trained, using GPT-4 as the preference oracle, comparing on-policy samples to GPT-4's own (teacher) outputs, and training only on pairs with a "large margin" (for theoretical explanation, see Section 4; in practice, see Section 5).
The primary distinction of our work over the related Nash-MD (Munos et al., 2023) and SPO (Swamy et al., 2024) is that they both exhibit sample-efficiency issues (two-timescale updates or sample-inefficient RL steps), and both use purely on-policy samples. We resolve the efficiency issue with a sample-efficient objective that works in practice, and DNO is more flexible, able to incorporate off-policy samples from, e.g., a powerful teacher.
Most importantly, DNO works in practice: we provide comprehensive empirical evaluations, resulting in state-of-the-art performance:
• The resulting 7B parameter Orca-2.5 model, aligned using the practical implementation of DNO (Algorithm 2), achieves the state-of-the-art win-rate of any 7B model, exceeding 33% against GPT-4-Turbo on AlpacaEval 2.0, even after controlling for length. This is an absolute gain of over 26% (7% → 33%) compared to the initializing model. It outperforms several recent advanced closed-source models, including Mistral Large and GPT-4-0613, as well as open-source models with far more (10×) parameters, such as Self-Rewarding LM (Yuan et al., 2024), which has 70B parameters.
• Our thorough ablation studies in Section 5 examine critical design choices surrounding the loss function (supervised fine-tuning or contrastive), the training paradigm (with or without on-policy samples), preference annotator quality (large margin or not), and training pair construction (self-play, teacher-vs-student, etc.). Our findings highlight that the carefully-crafted design choices encoded in Algorithm 2 are critical to the reported gains.
• We show examples of outputs across iterations which demonstrate qualitative improvements such as better addressing nuanced issues and presumptuous questions (Table 5), better organization and clarity while refraining from making misleading statements (Table 6), and higher information density in answers (Table 7).
We hope that the results presented herein will provide clarity to the community regarding the use of AI feedback for
post-training LLMs.
2E.g., for a fixed y | x, the expected win-rate of y against the policy itself is E_{y′∼π(·|x)}[P(y ≻ y′ | x)].
2 Preliminaries
This section provides an overview of the RL from human feedback (RLHF) pipeline. We do not differentiate between
RLHF and RLAIF, as the distinction is outside our scope of discussion; thus, we will uniformly refer to both concepts as RLHF. However, we want to make a clear delineation between two subtly different settings: RLHF that maximizes point-wise reward functions, and RLHF that optimizes general preferences. It should be noted that this discussion applies more broadly to the general contextual bandit setup as well.
Throughout this paper, we use x ∈ X to denote the input (i.e., the prompt) received by the LLM from a space X. We do not consider distribution shift over the prompts, following the standard contextual bandit setup of RLHF, and we use ρ to denote the distribution of the prompts. We use y ∈ Y to denote the response from the LLM given the prompt x (this corresponds to the action in the contextual bandit setup). We also use π : X → ∆(Y) to denote the policy, which is an LLM here, and Π is the policy class.
Our discussion throughout this paper will also regularly involve the following three learning paradigms, which are
originally introduced and commonly used in the RL literature:
(1)Offline: The learning algorithm operates without any active data collection, e.g., sampling from the current
policy. The algorithm relies solely on an offline dataset for training.
(2)Purely on-policy: technically, online on-policy. In this setup, learning takes place by sampling outputs from the
latest policy and immediately updating it based on the newly collected data. No data reuse or additional offline
data is considered.
(3)Batched on-policy3: This setup is the middle of the offline and purely on-policy setups, striking a balance
between deployment efficiency and adaptability. It involves iterative online data collection and can use other
offline data. Its distinctive feature is that here the data collection in each iteration occurs in a batched fashion
(e.g., akin to a dataset scale, much larger than the size of a typical mini-batch), and the amount of policy change
can be more significant (e.g., running gradient steps over multiple epochs of a dataset, as opposed to tens of
updates).
2.1 RLHF Based on Reward Models
One typical approach to conducting RLHF is a two-step procedure through a reward function (Christiano et al., 2017). Suppose a preference dataset D_pref := {(x, y+, y−)} is given, where (y+, y−) ∼ πref(· | x), πref is some reference policy such as the policy obtained after supervised fine-tuning (SFT), and the preference y+ ≻ y− | x is labeled by some human or AI annotator. In RLHF with reward functions, the preference is assumed to be generated based on some latent reward r⋆. The first step is to learn a reward function r ∈ R under some reward model assumption, where R is the reward class. A number of reward model assumptions have been studied, and the Bradley-Terry (BT) model (Bradley and Terry, 1952) is the most commonly used one. The BT model assumes the probability of y+ ≻ y− | x satisfies
\[
P(y^+ \succ y^- \mid x) := \frac{\exp(r^\star(x, y^+))}{\exp(r^\star(x, y^+)) + \exp(r^\star(x, y^-))}.
\]
This leads to the maximum-likelihood reward learning objective:
\[
\hat{r} \leftarrow \operatorname*{argmax}_{r \in \mathcal{R}}\ \mathbb{E}_{(x, y^+, y^-) \sim \mathcal{D}_{\mathrm{pref}}}\big[\log \sigma\big(r(x, y^+) - r(x, y^-)\big)\big], \tag{1}
\]
where σ(·) := exp(·) / (1 + exp(·)) is the sigmoid function. After that, the LLM is fine-tuned with RL using the learned r̂,
\[
\hat{\pi} \leftarrow \operatorname*{argmax}_{\pi \in \Pi}\ \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{pref}},\, y \sim \pi(\cdot \mid x)}\Big[\hat{r}(x, y) - \beta \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\Big], \tag{2}
\]
3We acknowledge an abuse of terminology. Our algorithm is not entirely online, as it only involves batched data collection. It is also not strictly on-policy because it uses examples from other policies, such as a teacher. While "offline" or "off-policy" may be technically more accurate, they might lead to misunderstanding among readers and detract from the emphasis we want to place on the samples collected from the current policy, which constitute the majority of our training data.
where the KL penalty term, E_{x∼D_pref, y∼π(·|x)}[β log(π(y|x) / πref(y|x))], is used to mitigate overoptimization of the reward model (Ouyang et al., 2022), controlled by the coefficient β. For the purposes of our discussion, we will call the objective above the "PPO objective", and this two-step learning procedure "PPO".
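For concreteness, the sketch below (PyTorch) shows the reward-learning step of Eq. (1) as a plain negative log-likelihood under the Bradley-Terry model. The dummy scores stand in for a hypothetical reward model's outputs; this is a minimal sketch, not the paper's implementation.

```python
# A minimal sketch (PyTorch) of the maximum-likelihood reward learning in Eq. (1):
# minimize -log sigma(r(x, y+) - r(x, y-)) over the preference dataset.
import torch
import torch.nn.functional as F

def bt_reward_loss(r_pos: torch.Tensor, r_neg: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the Bradley-Terry model for a batch of pairs."""
    return -F.logsigmoid(r_pos - r_neg).mean()

# Dummy scalar scores standing in for reward_model(x, y+) and reward_model(x, y-):
r_pos = torch.tensor([1.2, 0.3, 2.0])
r_neg = torch.tensor([0.1, 0.5, 1.5])
loss = bt_reward_loss(r_pos, r_neg)  # back-propagate this through the reward model
```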
DPO. Direct Preference Optimization (DPO) was proposed by Rafailov et al. (2023) as an alternative RLHF approach that combines the two-step procedure of PPO into a single objective. It utilizes the closed-form solution of π̂ in Eq. (2), so that π̂ can be solved directly from
\[
\hat{\pi} \leftarrow \operatorname*{argmax}_{\pi \in \Pi}\ \mathbb{E}_{(x, y^+, y^-) \sim \mathcal{D}_{\mathrm{pref}}} \log \sigma\Big(\beta \log \frac{\pi(y^+ \mid x)}{\pi_{\mathrm{ref}}(y^+ \mid x)} - \beta \log \frac{\pi(y^- \mid x)}{\pi_{\mathrm{ref}}(y^- \mid x)}\Big).
\]
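Assuming the summed token log-probabilities of each response under the policy and a frozen reference policy are available (how they are computed from the LLM is not shown here), a minimal sketch of the DPO objective above is:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pos: torch.Tensor, logp_neg: torch.Tensor,
             ref_logp_pos: torch.Tensor, ref_logp_neg: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # beta * [ log pi(y+|x)/pi_ref(y+|x) - log pi(y-|x)/pi_ref(y-|x) ]
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -F.logsigmoid(margin).mean()  # maximize log sigma(margin)
```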
2.2 RLHF with General Preferences
We now introduce the setup for directly optimizing a general preference function, as well as provide an overview of
existing solutions to achieve this goal (mostly by leveraging the symmetry of the preferences), especially those proposed
by Munos et al. (2023) and Swamy et al. (2024).
Here we assume that the learner is given query access to a general preference function P(y ≻ y′ | x) ∈ [0, 1], for any (x, y, y′) ∈ X × Y × Y. This function indicates the probability that response y is preferred over y′ given the context x. In practice, this setup can be viewed as the theoretical model of RLAIF, human-in-the-loop RLHF, or distillation fine-tuning.
One common difficulty in optimizing a general preference function is its intransitivity: e.g., it is possible that P(a ≻ b) = P(b ≻ c) = P(c ≻ a) = 1 for some options (a, b, c) (for details see, e.g., Swamy et al., 2024). Therefore, the learning goal of optimizing general preferences can be taken to be the Nash equilibrium of the two-player zero-sum game whose payoffs are given by the general preference function P. Such a Nash equilibrium is formally defined by the Minimax Winner, MW (see, e.g., Fishburn, 1984), or the von Neumann Winner (see, e.g., Dudík et al., 2015),
\[
\mathrm{MW}(\mathcal{P}) := \operatorname*{argmax}_{\pi \in \Pi}\, \operatorname*{argmin}_{\pi' \in \Pi}\, \mathcal{P}(\pi \succ \pi') = \Big(\operatorname*{argmax}_{\pi \in \Pi} \min_{\pi' \in \Pi} \mathcal{P}(\pi \succ \pi'),\ \operatorname*{argmin}_{\pi' \in \Pi} \max_{\pi \in \Pi} \mathcal{P}(\pi \succ \pi')\Big), \tag{3}
\]
where
\[
\mathcal{P}(\pi \succ \pi') := \mathbb{E}_{x \sim \rho,\, y \sim \pi(\cdot \mid x),\, y' \sim \pi'(\cdot \mid x)}\big[\mathcal{P}(y \succ y' \mid x)\big].
\]
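As a small numeric illustration of why general preferences cannot be reduced to point-wise rewards, the sketch below builds an intransitive ("rock-paper-scissors") preference matrix for a single prompt and checks that the uniform policy satisfies the minimax-winner condition of Eq. (3); the soft update it applies anticipates Eq. (4) below. The matrix values are our own toy numbers.

```python
import numpy as np

# P[i, j] = P(y_i > y_j | x) for three responses {a, b, c} to one prompt x.
# Intransitive preferences: a beats b, b beats c, c beats a.
P = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])

pi = np.ones(3) / 3              # candidate policy: uniform over {a, b, c}
r = P @ pi                       # r(y) = E_{y'~pi}[P(y > y' | x)]
print(r)                         # [0.5 0.5 0.5]: no response beats the policy itself

eta = 1.0
pi_next = pi * np.exp(r / eta)   # soft policy iteration step (cf. Eq. (4))
pi_next /= pi_next.sum()
print(np.allclose(pi_next, pi))  # True: uniform is a fixed point, i.e., the minimax /
                                 # von Neumann winner, while no point-wise reward maximizer exists
```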
SPO. To approximate the Nash equilibrium defined in Eq. (3), Swamy et al. (2024) proposed a single-player algorithm, SPO, which applies results from no-regret online learning. The SPO algorithm is essentially the following two-step iterative process: for each t = 1, 2, . . . , T,
\[
\text{(i)}\quad r_t(x, y) \leftarrow \mathbb{E}_{y' \sim \pi_t(\cdot \mid x)}\big[\mathcal{P}(y \succ y' \mid x)\big], \quad \forall (x, y) \in \mathcal{X} \times \mathcal{Y};
\]
\[
\text{(ii)}\quad \pi_{t+1}(\cdot \mid x) \leftarrow \frac{1}{Z_t(x)}\, \pi_t(\cdot \mid x)\, \exp\!\Big(\frac{r_t(x, \cdot)}{\eta}\Big), \quad \forall x \in \mathcal{X}, \tag{4}
\]
where η is the learning rate, π1 is the uniform policy, i.e., π1(· | x) ← unif(Y), ∀x ∈ X, and Z_t(x) := \sum_{y \in \mathcal{Y}} \pi_t(y \mid x) \exp\big(\frac{r_t(x, y)}{\eta}\big) is the partition function for iteration t.
Using the no-regret update of soft policy iteration shown in Eq. (4), Swamy et al. (2024) proved that the uniform mixture of π1:T from SPO is an approximation of the Nash equilibrium MW(P), as defined in Eq. (3).
Nash-MD. Munos et al. (2023) proposed Nash-MD to approximate the Nash equilibrium of a KL-regularized preference function,
\[
\mathcal{P}^{\pi, \pi'}_{\tau}(y \succ y' \mid x) := \mathcal{P}(y \succ y' \mid x) - \tau \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \tau \log \frac{\pi'(y' \mid x)}{\pi_{\mathrm{ref}}(y' \mid x)}, \tag{5}
\]
\[
\mathcal{P}_{\tau}(\pi \succ \pi') := \mathbb{E}_{x \sim \rho,\, y \sim \pi(\cdot \mid x),\, y' \sim \pi'(\cdot \mid x)}\Big[\mathcal{P}^{\pi, \pi'}_{\tau}(y \succ y' \mid x)\Big] \tag{6}
\]
\[
= \mathcal{P}(\pi \succ \pi') - \tau\, \mathbb{E}_{x \sim \rho}\big[D_{\mathrm{KL}}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)\big] + \tau\, \mathbb{E}_{x \sim \rho}\big[D_{\mathrm{KL}}\big(\pi'(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)\big].
\]
Following this, Munos et al. (2023) demonstrate that the Nash equilibrium of MW(Pτ) can be approximated using a mirror-descent (Nemirovskij and Yudin, 1983) inspired algorithm, Nash-MD, which has a last-iterate guarantee. The Nash-MD algorithm can be viewed as the two-step iterative process: for each t = 1, 2, . . . , T,
\[
\text{(i)}\quad r_t(x, y) \leftarrow \mathbb{E}_{y' \sim \pi^{\tau}_t(\cdot \mid x)}\big[\mathcal{P}(y \succ y' \mid x)\big], \quad \forall (x, y) \in \mathcal{X} \times \mathcal{Y};
\]
\[
\text{(ii)}\quad \pi_{t+1}(\cdot \mid x) \leftarrow \frac{1}{Z_t(x)}\, \pi^{\tau}_t(\cdot \mid x)\, \exp\!\Big(\frac{r_t(x, \cdot)}{\eta}\Big), \quad \forall x \in \mathcal{X},
\]
where η is the learning rate and π^τ_t is the geometric mixture between πt and πref,
\[
\pi^{\tau}_t(y \mid x) := \frac{\pi_t(y \mid x)^{1 - \tau/\eta}\, \pi_{\mathrm{ref}}(y \mid x)^{\tau/\eta}}{\sum_{y' \in \mathcal{Y}} \pi_t(y' \mid x)^{1 - \tau/\eta}\, \pi_{\mathrm{ref}}(y' \mid x)^{\tau/\eta}}, \quad \forall (x, y) \in \mathcal{X} \times \mathcal{Y}, \tag{7}
\]
and Z_t(x) := \sum_{y \in \mathcal{Y}} \pi^{\tau}_t(y \mid x) \exp\big(\frac{r_t(x, y)}{\eta}\big) is the partition function for iteration t.
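A small numpy sketch of the geometric mixture in Eq. (7), for a toy distribution over a handful of responses to one prompt; the distributions and hyperparameters are illustrative.

```python
import numpy as np

def geometric_mixture(pi_t: np.ndarray, pi_ref: np.ndarray, tau: float, eta: float) -> np.ndarray:
    """Eq. (7): pi_t^tau(y|x) proportional to pi_t(y|x)^(1 - tau/eta) * pi_ref(y|x)^(tau/eta)."""
    w = tau / eta
    mix = pi_t ** (1.0 - w) * pi_ref ** w
    return mix / mix.sum()

pi_t   = np.array([0.70, 0.15, 0.10, 0.05])   # current iterate pi_t(. | x)
pi_ref = np.array([0.25, 0.25, 0.25, 0.25])   # reference policy pi_ref(. | x)
print(geometric_mixture(pi_t, pi_ref, tau=0.1, eta=0.5))  # pulled partway back toward pi_ref
```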
3 Direct Nash Optimization
While the no-regret update of soft policy iteration used in SPO and Nash-MD has inspired many standard (deep) reinforcement learning algorithms (e.g., NPG, Kakade, 2001; TRPO, Schulman et al., 2015; PPO, Schulman et al., 2017; SAC, Haarnoja et al., 2018), its faithful implementation usually involves a two-timescale update. This can lead to complex hyperparameter tuning and unstable performance. In this section, we propose a direct and iterative algorithm, Direct Nash Optimization (Algorithm 1), to approximate the Nash equilibrium MW(P). This algorithm is primarily inspired by SPO. It can also be readily adapted to Nash-MD for approximating the Nash equilibrium MW(Pτ) with a last-iterate guarantee; we discuss this extension later.
Algorithm 1 Direct Nash Optimization (DNO)
input: General preference function P, learning rate η, number of iterations T, prompt distribution ρ.
1: Initialize π1 ← unif(Y).
2: for iteration t = 1, 2, . . . , T do
3:   Compute rt(x, y) ← E_{y′∼πt(·|x)}[P(y ≻ y′ | x)], ∀(x, y) ∈ X × Y.
4:   Obtain πt+1 by
\[
\pi_{t+1} \leftarrow \operatorname*{argmax}_{\pi \in \Pi}\ \mathbb{E}_{(x, y_1, y_2) \sim \mathcal{D}_t}\bigg[\sigma\big(r_t(x, y_1) - r_t(x, y_2)\big)\log\sigma\Big(\eta \log \frac{\pi(y_1 \mid x)}{\pi_t(y_1 \mid x)} - \eta \log \frac{\pi(y_2 \mid x)}{\pi_t(y_2 \mid x)}\Big) + \sigma\big(r_t(x, y_2) - r_t(x, y_1)\big)\log\sigma\Big(\eta \log \frac{\pi(y_2 \mid x)}{\pi_t(y_2 \mid x)} - \eta \log \frac{\pi(y_1 \mid x)}{\pi_t(y_1 \mid x)}\Big)\bigg], \tag{8}
\]
   where Dt is generated by x ∼ ρ, y1 ∼ µ1,t(· | x), y2 ∼ µ2,t(· | x); µ1,t and µ2,t can be either off-policy (e.g., pre-defined) or on-policy (based on πt).
5: end for
6: return π̄ = unif(π1:T).
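A minimal PyTorch sketch of the regression objective in Eq. (8) (Line 4 of Algorithm 1), assuming the summed log-probabilities of y1 and y2 under the candidate policy π and the current iterate πt, together with the preference-based rewards rt(x, y1), rt(x, y2) ∈ [0, 1], have already been computed:

```python
import torch
import torch.nn.functional as F

def dno_regression_loss(logp_y1, logp_y2, logp_t_y1, logp_t_y2, r1, r2, eta: float) -> torch.Tensor:
    # eta * Delta_{pi,t}(x, y1, y2): difference of log-ratios against the current policy pi_t
    delta = eta * ((logp_y1 - logp_t_y1) - (logp_y2 - logp_t_y2))
    z = torch.sigmoid(r1 - r2)  # soft regression target sigma(r_t(x,y1) - r_t(x,y2))
    # Binary cross-entropy between target z and prediction sigma(delta), i.e. the negative of Eq. (8):
    return -(z * F.logsigmoid(delta) + (1 - z) * F.logsigmoid(-delta)).mean()
```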
3.1 Derivation of Eq. (8)
Most practical algorithms inspired by soft policy iteration, including the original practical version of SPO, adopt the following approach: "pushing" π towards the subsequent learning goal in each iteration (we will refer to this as the soft policy iteration target throughout the paper):
\[
\pi^{\star}_{t+1}(\cdot \mid x) := \frac{1}{Z_t(x)}\, \pi_t(\cdot \mid x)\, \exp\!\Big(\frac{r_t(x, \cdot)}{\eta}\Big), \tag{9}
\]
where Z_t(x) = \sum_{y \in \mathcal{Y}} \pi_t(y \mid x) \exp\big(\frac{r_t(x, y)}{\eta}\big) is the partition function. This can be realized by minimizing a distance metric between π and π⋆_{t+1}. For example, the PPO algorithm for RLHF (e.g., Ouyang et al., 2022) essentially minimizes the reverse KL divergence as follows,
\[
(\pi^{\mathrm{PPO}}_{t+1} \leftarrow)\ \operatorname*{argmin}_{\pi \in \Pi}\ \mathbb{E}_{x \sim \rho}\big[D_{\mathrm{KL}}\big(\pi(\cdot \mid x)\,\|\,\pi^{\star}_{t+1}(\cdot \mid x)\big)\big]
\]
\[
= \operatorname*{argmax}_{\pi \in \Pi}\ \mathbb{E}_{x \sim \rho,\, y \sim \pi(\cdot \mid x)}\Big[\eta \log \frac{\pi^{\star}_{t+1}(y \mid x)}{\pi_t(y \mid x)} - \eta \log \frac{\pi(y \mid x)}{\pi_t(y \mid x)}\Big]
\]
\[
= \operatorname*{argmax}_{\pi \in \Pi}\ \mathbb{E}_{x \sim \rho,\, y \sim \pi(\cdot \mid x)}\Big[r_t(x, y) - \eta \log Z_t(x) - \eta \log \frac{\pi(y \mid x)}{\pi_t(y \mid x)}\Big]
\]
\[
= \operatorname*{argmax}_{\pi \in \Pi}\ \mathbb{E}_{x \sim \rho,\, y \sim \pi(\cdot \mid x)}\Big[r_t(x, y) - \eta \log \frac{\pi(y \mid x)}{\pi_t(y \mid x)}\Big]. \qquad (\Leftrightarrow \text{PPO objective, as } Z_t \text{ is independent of } \pi)
\]
However, implementing the above approach typically necessitates on-policy sampling from the current policy π. Ignoring the Z_t(x) term can also lead to high variance in the empirical gradient estimation; this is a persistent issue in actor-critic style algorithms, which usually calls for an additional baseline that itself requires on-policy estimation. When rt also varies over iterations, as in SPO or Nash-MD, we then need to update the policy, baseline, and reward online simultaneously. These challenges have hindered the scalability of existing algorithms based on learning the Nash equilibrium of general preference functions.
Regressing "internal rewards" towards preference-based rewards. Different from the approaches above, which focus on "pushing" π → π⋆_{t+1}, we now consider the following mechanism: regressing r_{π,t} → r_t, where r_{π,t} is the internal reward function of a given π at iteration t:
\[
r_{\pi, t}(x, y) := \eta \log \frac{\pi(y \mid x)}{\pi_t(y \mid x)} + \eta \log Z_t(x). \tag{10}
\]
This can be interpreted as a reparameterization trick: π is exactly the soft policy iteration target (refer to Eq. (9)) induced by πt and the defined r_{π,t}. Therefore, regressing this specifically parameterized r_{π,t} to r_t allows us to directly optimize the soft policy iteration target with respect to r_{π,t} and πt. This idea is inspired by techniques from inverse RL (e.g., Guided Cost Learning, Finn et al., 2016) as well as recent advances in RLHF (DPO, Rafailov et al., 2023).
To avoid the issues arising from the partition function Z_t(x), we consider learning from the (x, y1, y2) tuple, where y1 and y2 are both responses to the textual input x. Note that, due to the offline nature of the regression objective, the sampling distribution of y1 and y2 does not impact the learning objective (i.e., r_{π,t} → r_t), though it may affect the sample complexity for coverage reasons, as we discuss later; whereas pushing π → π⋆_{t+1} requires sampling y on-policy, as previously discussed. Therefore, given an arbitrary (x, y1, y2) tuple, we regress the "prediction" ẑ to the "goal" z (both defined below), using the binary logarithmic (cross-entropy) loss to measure the prediction error (see, e.g., Foster and Krishnamurthy, 2021),
\[
\hat{z} := \sigma\big(r_{\pi,t}(x, y_1) - r_{\pi,t}(x, y_2)\big) = \sigma\Big(\underbrace{\eta \log \frac{\pi(y_1 \mid x)}{\pi_t(y_1 \mid x)} - \eta \log \frac{\pi(y_2 \mid x)}{\pi_t(y_2 \mid x)}}_{=:\,\Delta_{\pi,t}(x, y_1, y_2)}\Big), \qquad z := \sigma\big(\underbrace{r_t(x, y_1) - r_t(x, y_2)}_{=:\,\Delta^{\star}_t(x, y_1, y_2)}\big); \tag{11}
\]
\[
\ell_{\pi,t}(x, y_1, y_2) := z \log(1/\hat{z}) + (1 - z)\log\big(1/(1 - \hat{z})\big)
= -\sigma\big(\Delta^{\star}_t(x, y_1, y_2)\big)\log\big[\sigma\big(\Delta_{\pi,t}(x, y_1, y_2)\big)\big] - \sigma\big(\Delta^{\star}_t(x, y_2, y_1)\big)\log\big[\sigma\big(\Delta_{\pi,t}(x, y_2, y_1)\big)\big].
\]
Therefore, we obtain the following objective to learnπt+1,
argmin
π∈Π
LDt
(π;πt)
:= argmin
π∈Π
E
(x,y1,y2)∼Dt
[ℓπ,t(x, y1, y2)]
= argmax
π∈Π
E
(x,y1,y2)∼Dt
h
σ
`
∆
⋆
t(x, y1, y2)
´
log
ˆ
σ
`
∆π,t(x, y1, y2)
´˜
+σ
`
∆
⋆
t(x, y2, y1)
´
log
ˆ
σ
`
∆π,t(x, y2, y1)
´˜
i
= argmax
π∈Π
E
(x,y1,y2)∼Dt
"
σ(rt(x, y1)−rt(x, y2)) log
»
σ
„
ηlog
π(y1|x)
πt(y1|x)
−ηlog
π(y2|x)
πt(y2|x)
«–
(12)
7
+σ(rt(x, y2)−rt(x, y1)) log
ˇ
σ
`
ηlog
π(y2|x)
πt(y2|x)
−ηlog
π(y1|x)
πt(y1|x)
´˘
#
.
Here,Dtis generated byx∼ρ, y1∼µ1,t(· |x), y2∼µ2,t(· |x)with some policiesµ1,tandµ2,t. It should be noted
thatµ1,tandµ2,tfor eacht∈[T]are parts of our algorithm’s design decisions. We will provide choices for them in
Section 3.2
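A tiny numeric check of the reparameterization behind Eq. (10): for a toy discrete distribution, the internal reward η log(π/πt) + η log Zt of the soft-policy-iteration target π⋆_{t+1} recovers rt exactly. The numbers below are arbitrary.

```python
import numpy as np

eta  = 0.5
pi_t = np.array([0.5, 0.3, 0.2])            # current policy over 3 responses for one prompt
r_t  = np.array([0.9, 0.4, 0.1])            # preference-based rewards (expected win-rates)

Z_t = np.sum(pi_t * np.exp(r_t / eta))      # partition function
pi_star = pi_t * np.exp(r_t / eta) / Z_t    # soft policy iteration target, Eq. (9)

internal_reward = eta * np.log(pi_star / pi_t) + eta * np.log(Z_t)   # Eq. (10) with pi = pi*_{t+1}
print(np.allclose(internal_reward, r_t))    # True: regressing the internal reward onto r_t
                                            # therefore recovers the soft policy iteration target
```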
Monotonic improvement from the batched on-policy updates. One key distinction between DNO and existing algorithms for learning the Nash equilibrium (such as SPO and Nash-MD) is that those algorithms aim to approach the Nash equilibrium in a purely on-policy manner, which can be potentially unstable and may need to incorporate two-timescale updates (that change the reward function used in the inner problem more frequently). In contrast, DNO is a batched on-policy algorithm with single-timescale updates.
From a purely theoretical perspective, DNO may seem to require many iterations to ensure the convergence of π̄ to the Nash equilibrium, which could be costly. Additionally, DNO only converges on average, and it is unrealistic in practice to deploy the uniform mixture policy π̄ (note that, as inspired by Munos et al. (2023), DNO could be extended to regularized preferences with last-iterate convergence, as discussed later). However, from a practical perspective, we can leverage the following two desirable properties of the LLM setting to eliminate these concerns and ensure monotonic improvement over the DNO iterations.
Firstly, the soft policy iteration target π⋆_{t+1} is the maximizer of the KL-regularized objective ℓt(π) := P(π ≻ πt) − η E_{x∼ρ}[D_KL(π(· | x) ∥ πt(· | x))], i.e., π⋆_{t+1} = argmax_π ℓt(π). Notice that ℓt(πt) = 0.5 and E_{x∼ρ}[D_KL(π(· | x) ∥ πt(· | x))] ≥ 0. This means
\[
0.5 \le \ell_t(\pi^{\star}_{t+1}) = \mathcal{P}(\pi^{\star}_{t+1} \succ \pi_t) - \eta\, \mathbb{E}_{x \sim \rho}\big[D_{\mathrm{KL}}\big(\pi^{\star}_{t+1}(\cdot \mid x)\,\|\,\pi_t(\cdot \mid x)\big)\big]
\ \Longrightarrow\ \mathcal{P}(\pi^{\star}_{t+1} \succ \pi_t) \ge 0.5 + \eta\, \mathbb{E}_{x \sim \rho}\big[D_{\mathrm{KL}}\big(\pi^{\star}_{t+1}(\cdot \mid x)\,\|\,\pi_t(\cdot \mid x)\big)\big].
\]
Hence π⋆_{t+1} is guaranteed to be more preferred than πt with respect to the preference P, and there is even a computable lower bound on the amount of improvement: η E_{x∼ρ}[D_KL(π⋆_{t+1}(· | x) ∥ πt(· | x))]. Therefore, if πt+1 learned from Line 4 of Algorithm 1 approximates π⋆_{t+1} well (which is proved in Theorem 1), we can expect the policy to be monotonically improved over the DNO iterations. Note that this monotonic improvement guarantee is exclusive to our design choice of batched on-policy updates in DNO, because the alternatives are either unclear or unstable: it is undefined how to perform iterative updates offline, and one gradient update from a purely online algorithm may not accurately approximate the soft policy iteration target π⋆_{t+1}. Secondly, in practice we usually have validation data available, which allows us to deploy the best policy over π_{1:(T+1)}.
3.2 Theoretical Analysis
One of our major proposals is to use a regression-based objective to approximate the explicit soft policy iteration; in
this section we show the approximation error from this regression is tightly bounded with finite-sample analysis. The
following proposition discusses how well the solution of the regression-based objective (defined in
Algorithm 1) can approximate the soft policy iteration (Eq. (9)) in terms of the total variation metric at each iteration.
Theorem 1 (informal). Fix an arbitrary iteration t ∈ [T]. Suppose πt+1 is obtained from Eq. (8), and π⋆_{t+1} is defined in Eq. (9). Then, under mild assumptions (realizability and boundedness, formally introduced in the appendix), we have
\[
\mathbb{E}_{x \sim \rho}\Big[\big(D_{\mathrm{TV}}\big(\pi_{t+1}(\cdot \mid x),\ \pi^{\star}_{t+1}(\cdot \mid x)\big)\big)^2\Big] \le \mathcal{O}\bigg(\frac{C_t\, R^2_{\max}\, \log(|\Pi|/\delta)}{N}\bigg),
\]
where the concentrability coefficient C_t is defined as
\[
C_t := \frac{\mathbb{E}_{x \sim \rho,\, y_1 \sim \pi^{\star}_{t+1}(\cdot \mid x),\, y_2 \sim \pi_{t+1}(\cdot \mid x)}\Big[\Big(\log \frac{\pi^{\star}_{t+1}(y_1 \mid x)}{\pi_{t+1}(y_1 \mid x)} - \log \frac{\pi^{\star}_{t+1}(y_2 \mid x)}{\pi_{t+1}(y_2 \mid x)}\Big)^2\Big]}{\mathbb{E}_{x \sim \rho,\, y_1 \sim \mu_{1,t}(\cdot \mid x),\, y_2 \sim \mu_{2,t}(\cdot \mid x)}\Big[\Big(\log \frac{\pi^{\star}_{t+1}(y_1 \mid x)}{\pi_{t+1}(y_1 \mid x)} - \log \frac{\pi^{\star}_{t+1}(y_2 \mid x)}{\pi_{t+1}(y_2 \mid x)}\Big)^2\Big]}.
\]
If πt = π⋆_t for all t ∈ [T], the reader can refer to (Swamy et al., 2024, Section 3) for the convergence of π̄ (returned by Algorithm 1) to the Nash equilibrium. We expect the total variation difference between πt and π⋆_t guaranteed by Theorem 1 to be small, so that the convergence argument of Swamy et al. (2024) carries over approximately.
Note that we present the concentrability coefficient C_t as data-dependent, with πt+1 (learned from data) appearing in its definition. We intend for this data-dependent C_t to guide the design choices of µ1,t and µ2,t for the purpose of sample efficiency. The formal statement and detailed proof of Theorem 1, which do not involve πt+1, are deferred to the appendix. Although C_t shares a similar expression with the concentrability coefficient in offline reinforcement learning (e.g., Chen and Jiang, 2019), the policies µ1,t and µ2,t are flexible here due to the generative nature of large language models. This flexibility allows for additional intervention, enhancing sample efficiency.
We note that the value of C_t can always be bounded by
\[
C_t \le \max_{(x, y) \in \mathcal{X} \times \mathcal{Y}} \frac{\pi^{\star}_{t+1}(y \mid x)\, \pi_{t+1}(y \mid x)}{\mu_{1,t}(y \mid x)\, \mu_{2,t}(y \mid x)}
\]
in the worst case.
However, as πt+1 is likely to be restricted within a certain region, for instance because fine-tuning will not significantly alter the behavior of the language model, we anticipate that this coefficient will not depend on the per-(x, y) worst case. On the other hand, as a direct observation, we notice that the ideal selection of µ1,t and µ2,t should be close to the soft policy iteration target π⋆_{t+1} (assuming π⋆_{t+1} and πt+1 are close). Interestingly, this theoretical observation coincides with recent empirical results: Liu et al. (2024b) suggest that using statistical rejection sampling to sample from the soft policy iteration target (which is almost equivalent to sampling y1 and y2 from π⋆_{t+1}) could benefit preference tuning. However, in our case, if we use similar statistical rejection sampling techniques on πt to sample from π⋆_{t+1} (and πt+1), the cost of rejection sampling is likely to be comparable to the concentrability coefficient C_t obtained when choosing µ1,t and µ2,t to be πt. This suggests that πt and π⋆_{t+1} (via rejection sampling) are comparable options for µ1,t and µ2,t in terms of sample efficiency. On the other hand, as we will demonstrate in the next section, since rt is defined based on πt (as shown in Line 3 of Algorithm 1), choosing µ1,t and µ2,t to be πt can easily adapt to such a reward rt.
Another interesting observation is that, despite the resemblance of Eq. (8) to Bradley-Terry style reward modeling with MLE, the target distributions used to measure distribution shift are quite different. This disparity is due to the different objectives: fitting the soft policy iteration versus reward estimation. For Bradley-Terry style reward modeling using MLE, the desired distributions of y1 and y2 should be two distinct distributions. However, in our case, where the learning goal is to fit the soft policy iteration, we may prefer y1 and y2 from two (near) on-policy distributions as discussed above, as long as we expect the learned πt+1 to be accurate enough. To the best of our knowledge, this is the first theoretical result that illustrates the importance of on-policy sampling beyond policy-optimization style algorithms for RLHF.
4 Practical Algorithm – Iterative Contrastive Self-Improvement
In this section, we shift our focus to the algorithmic design of a practically scalable version of DNO, following the principles discussed in the last section. A primary challenge in implementing the conceptual algorithm DNO (Algorithm 1) stems from the need to compute the expectation with respect to the preference function P under the current policy πt. Perhaps surprisingly, as we will show, all we need is a properly implemented iterative DPO-like contrastive learning algorithm.
We present our practical implementation of DNO in Algorithm 2 (DNO-Prct), a batched on-policy algorithm that conducts self-improvement iteratively via contrastive learning. One key consideration in our algorithmic design is that we only need to use the reward function rt implicitly, through specifically designed on-policy sampling, data filtering, and pair construction. While these specific design choices make DNO-Prct seem similar to simply performing DPO iteratively, there are significant reasons for these design decisions, as we discuss below.
Batched on-policy sampling. The use of batched on-policy sampling in Algorithm 2 frees us from the explicit use of rt (defined as E_{y′∼πt(·|x)}[P(y ≻ y′ | x)] in Algorithm 1). This means we essentially choose µ1 and µ2 in DNO to be πt in DNO-Prct, but we are free to let them vary slightly, e.g., as a mixture with other policies such as a stronger teacher. Specifically, it is unrealistic to assume in practice that we can access the exact value of P(y ≻ y′ | x) for a given (x, y, y′) tuple. Based on the definition of rt and the fact that {y¹_t, y²_t, . . . , y^K_t} are sampled from πt, DNO-Prct essentially uses the following sample-based approach to estimate rt:
\[
r_t(x, y) \approx \frac{1}{K} \sum_{y' \in \{y^1_t, \dots, y^K_t,\, y_{\mathrm{gold}}\} \setminus y} \mathbb{1}_{\mathcal{P}}\big(\text{is } y \text{ better than } y' \text{ on } x\big),
\]
for any x and y ∈ {y¹_t, . . . , y^K_t, y_gold}, where 1_P denotes one sample from P with output in {0, 1}. This is implemented in Lines 4 and 5 of Algorithm 2, and its precise implementation is discussed in Section 5. On the other hand, as we discussed in the last section, batched on-policy sampling from πt is an appropriate option for sample efficiency when we use the regression objective (Eq. (8)) to approximate the soft policy iteration target (see Section 3.2).
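The sketch below illustrates this sample-based win-rate estimate; the `prefers` callable stands in for one {0, 1} sample from the preference oracle (e.g., a GPT-4-Turbo judge) and is an assumed interface, not an actual API.

```python
from typing import Callable, List

def estimate_reward(x: str, y: str, candidates: List[str],
                    prefers: Callable[[str, str, str], bool]) -> float:
    """Approximate r_t(x, y) = E_{y'~pi_t}[P(y > y' | x)] by the pairwise win-rate of y
    against the other candidates (the K on-policy samples plus y_gold)."""
    others = [y2 for y2 in candidates if y2 is not y]
    if not others:
        return 0.5
    wins = sum(1 for y2 in others if prefers(x, y, y2))  # one oracle sample per comparison
    return wins / len(others)
```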
Algorithm 2 DNO-Prct: Practical Implementation of DNO via Iterative Contrastive Self-Improvement
input: General preference function P, learning rate η̃, number of iterations T, reference policy πref, prompt distribution ρ.
1: Initialize π1 ← πref.
2: for iteration t = 1, 2, . . . , T do
3:   Construct Dt = {(x, y_gold)}, where x ∼ ρ and y_gold ∼ πgold(· | x).
4:   Sample batched on-policy responses: sample K outputs per prompt using the current πt: {y¹_t, y²_t, . . . , y^K_t} ∼ πt(· | x), ∀x ∈ Dt.
5:   Rank responses: for each x ∈ Dt, rank the corresponding {y¹_t, y²_t, . . . , y^K_t, y_gold} using the pair-wise win-rate obtained by sampling from the general preference function P.
6:   Filter preference pairs: construct Dt+1 = {(x, y⁺_t, y⁻_t)}, for all x ∈ Dt+1, where (y⁺_t, y⁻_t) are large-margin pairs (based on the win-rate ranking) within the responses for x from the previous step.
7:   Contrastive learning: obtain πt+1 by
\[
\pi_{t+1} \leftarrow \operatorname*{argmax}_{\pi \in \Pi}\ \mathbb{E}_{(x, y^+_t, y^-_t) \sim \mathcal{D}_{t+1}} \log \sigma\Big(\tilde{\eta} \log \frac{\pi(y^+_t \mid x)}{\pi_t(y^+_t \mid x)} - \tilde{\eta} \log \frac{\pi(y^-_t \mid x)}{\pi_t(y^-_t \mid x)}\Big). \tag{13}
\]
8: end for
9: return the best of π_{1:(T+1)} on the validation data.
Preference pair construction. Another key design choice in Algorithm 2 is the use of a purely contrastive loss (Eq. (13)), whereas the regression target in Eq. (8) is σ(rt(x, y) − rt(x, y′)) for a given (x, y, y′) tuple, which is not necessarily in {0, 1}. As discussed above, it is unrealistic to expect access to the exact value of P(y ≻ y′ | x), so it is also unlikely that we can obtain an accurate value of the regression target σ(rt(x, y) − rt(x, y′)). Thus, we add an additional data filtering step to address this issue, as in Line 6 of Algorithm 2. Ideally, we want the selected (x, y⁺, y⁻) tuple to satisfy σ(rt(x, y⁺_t) − rt(x, y⁻_t)) ≈ 1, so that the {0, 1} contrastive target closely matches the regression target. However, this requires rt(x, y⁺_t) − rt(x, y⁻_t) → ∞, whereas we know rt(x, y) ∈ [0, 1], ∀(x, y) ∈ X × Y. From the derivation of DNO in Section 3.1, it is clear that scaling up rt and η by the same absolute constant c does not affect the soft policy iteration target of Eq. (9), but it slightly changes the DNO objective (Eq. (8)) by rt → c·rt and η → c·η =: η̃. This scaling strategy helps us sidestep the problem of bounded rt, and in this sense, we may expect the proper η̃ in DNO-Prct to be relatively larger than, e.g., η in Eq. (8). However, an overly large η̃ amplifies the error of treating the soft regression target as binary (especially in light of the derivation of Eq. (11)). So, to avoid the proper η̃ being too large, we only use pairs with large margin, as in Line 6 of Algorithm 2, so that rt(x, y⁺_t) − rt(x, y⁻_t) is not too small. This decision is also supported empirically by techniques like RLCD (Yang et al., 2023) and Axiomatic Preference Models (Rosset et al., 2023), which highlight the importance of having a large margin or clear directional differences between positive and negative LLM responses when training preference models.
Relationship between DNO-Prct and DPO. The reader may discern that DNO-Prct (Algorithm 2), the practical implementation of DNO, can be described as an iterative version of the DPO algorithm. Such similarity is by design, intended to harness the simplicity and effectiveness of DPO (Rafailov et al., 2023) and build on empirical advancements from recent work that applies DPO iteratively (e.g., Yuan et al., 2024; Chen et al., 2024). Our experiments point to the importance of several design choices which help accommodate general preferences, such as rankings derived from pair-wise win-rates. More interestingly, our findings point to a surprising connection: "a meticulously designed iterative DPO algorithm" can approach the Nash equilibrium of any given general preference.
Our general algorithmic framework, DNO (Algorithm 1), is broader than and fundamentally different from iterative DPO. For example, the DNO framework could also be directly extended to the regularized preference case (as discussed above) or equipped with other advanced sampling techniques (e.g., RSO, Liu et al., 2024b), as suggested by Theorem 1. Lastly, although KL-regularization towards πt (i.e., proximal optimization) is used in both DNO and DPO, it arises for fundamentally different reasons.
Figure 2: Comparison of various post-training techniques (AlpacaEval 2.0 win-rate vs. GPT-4-Turbo over iterations or epochs), showing that Direct Nash Optimization (DNO) is the most effective. Methods shown: DNO 7B (More Data, standard, and Restrictive variants), SPIN 7B, Iterative KTO 7B, Offline DPO, SFT on Positives, Self-Rewarding 70B, the Orca-2.5 initializer, Mistral Large, and GPT-4 0314. All methods with colorful error bands are 1) implemented by ourselves, 2) initialized with a 7B parameter Orca-2.5 LLM, and 3) "batched on-policy" (except SFT and Offline DPO, which are epochs), all else being equal.
For DNO, the KL-regularization originates from online learning: no-regret learning through mirror descent (Nemirovskij and Yudin, 1983) or follow-the-regularized-leader (FTRL) (e.g., Kalai and Vempala, 2005; Hazan et al., 2016). For DPO and PPO, the KL-regularization is an approximation of the total variation penalty used to ensure monotonic improvement of the policy (Kakade and Langford, 2002; Schulman et al., 2015). This approach was later simplified by Schulman et al. (2017, PPO), and recently used for post-training LLMs (Ouyang et al., 2022).
5 Experiments
We evaluate the practical implementation of DNO given in Algorithm 2 (hereafter we use DNO to denote DNO-Prct for simplicity). Once the input dataset {xi ∈ X} is chosen, each iteration of DNO proceeds in three phases: sampling outputs from the current policy, annotating outputs for preference pair generation, and then training the next policy on the new training pairs. Iteration 0 is defined to start by sampling from the initial SFT model to produce training data for iteration 1.
5.1 Experimental Setup
Data: We mainly use UltraFeedback (Cui et al., 2023), which consists of 60k prompts, several models' outputs to those prompts, and preference annotations from GPT-4-Turbo. This dataset thus provides a source of offline preferences. For our iterative experiments, we split this dataset into three non-overlapping partitions of the inputs to be used for separate iterations of batched on-policy learning. For each input, we also collect the GPT-4-Turbo output, if it was not already present in the original dataset, to be reserved for y_gold.
Every experiment except one in this study uses solely UltraFeedback. The exception is one "scaled up" experiment with about 10x more data sourced from a mixture of aggregated datasets including Anthropic HH (Bai et al., 2022), UltraChat (Ding et al., 2023), MetaMathQA (Yu et al., 2023), EvolInstruct (Xu et al., 2023), UltraFeedback (Cui et al., 2023), and Orca-2 (Mitra et al., 2023). Note that we only use the input prompts from these datasets and collect GPT-4-Turbo responses for all 600k of them.
| Technique | Epoch or Iter | AlpacaEval 2: Len-controlled Win Rate | AlpacaEval 2: Win Rate vs. GPT-4 | Avg. len (chars) | MT-Bench: 1st Turn | MT-Bench: 2nd Turn | MT-Bench: Avg |
| Orca-2.5 SFT | Epoch 1 | 10.76 | 6.99 | 1174 | 7.72 | 6.02 | 6.88 |
| Orca-2.5 SFT on Positives | Epoch 4 | 11.62 | 7.96 | 1420 | 7.62 | 6.23 | 6.92 |
| Offline DPO (ours) | Epoch 4 | 19.49 | 18.22 | 1884 | 7.69 | 7.08 | 7.38 |
| Self-Rewarding 70B | Iter 3 | - | 20.44 | 2552 | - | - | 7.25 |
| SPIN (ours) | Iter 3 | 16.18 | 16.13 | 1922 | 7.58 | 7.53 | 7.55 |
| DNO-Restrictive | Iter 3 | 21.61 | 19.21 | 1795 | 7.59 | 7.35 | 7.46 |
| DNO-Lookahead | Epoch 1 | 18.58 | 18.28 | 1907 | 8.09 | 7.32 | 7.70 |
| DNO | Iter 3 | 22.59 | 24.97 | 2228 | 7.62 | 7.35 | 7.48 |

Table 1: AlpacaEval 2.0 and MT-Bench results in our controlled setting after training on UltraFeedback.
Sampling from the Policy: At the end of training in each iteration, we sample 5 outputs from the resulting student policy using top-p sampling with p = 0.95 and temperature 0.7. Several works have shown the benefit of sampling and comparing multiple diverse outputs from the policy (e.g., Yuan et al., 2024; Wang et al., 2023). We implement a simple defect-detection system which flags any sample with a high amount of repeated n-grams as an automatic negative.
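A simple version of such a repeated-n-gram check is sketched below; the n-gram size and threshold are illustrative assumptions, not the exact values used in our pipeline.

```python
def has_repetition_defect(text: str, n: int = 4, max_dup_frac: float = 0.2) -> bool:
    """Flag a sampled response as an automatic negative if too many of its n-grams repeat."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return False
    dup_frac = 1.0 - len(set(ngrams)) / len(ngrams)  # fraction of duplicated n-grams
    return dup_frac > max_dup_frac
```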
Preference Annotation: We use GPT-4-Turbo "as a judge" to label preferences among the 5 policy samples and the 1 gold sample (which is also from GPT-4-Turbo). The annotation prompt contains a few minor modifications from that used by Yuan et al. (2024). It implements an additive scoring framework on a 6-point scale, where a score of 6 represents the highest quality answer according to dimensions like "correctness", "expert knowledge", and "conciseness". By following this rubric, GPT-4 acting as an annotator represents a best-effort general preference model, because it compares multiple candidate responses side-by-side in the context window and stratifies them along meaningful dimensions of quality.
Training Pair Construction: Adhering to the large-margin requirement motivated in Section 4, we enforce two criteria. Firstly, the positives must be high quality in an absolute sense, and secondly, the negatives must be directionally worse by a large margin. On the 6-point annotation scale, only samples that score a 5 or 6 are allowed to be positives. For the positives that meet this criterion, if any, we then construct all pairs such that the negative is at least 2 points lower. If the positive happens to be from the student, we relax this constraint to a 1-point margin, since the GPT-4-Turbo teacher outputs rarely receive a score less than 5 (as shown by the average teacher score in Table 2).
Additionally, we are motivated to preserve the preference behavior from previous iterations so that new policies do not inadvertently regress to past bad behavior. To enforce this, we incorporate an exponentially decaying proportion of prior iterations' training pairs into the current iteration, i.e., we sample at most 30% of training pairs from iteration t−1, 15% from t−2, and so on. We do not re-inference outputs for those inputs with the most recent policy. Recall that each iteration's inputs are non-overlapping with the splits for other iterations.
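The pair-construction rules above can be summarized in the following sketch; the `Sample` container and function names are ours, and the decay schedule mirrors the 30%/15% replay description.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Sample:
    text: str
    score: float          # annotator score on the 6-point scale
    from_student: bool    # sampled from the current policy (vs. the GPT-4-Turbo teacher)

def build_pairs(x: str, samples: List[Sample]) -> List[Tuple[str, str, str]]:
    pairs = []
    for pos in samples:
        if pos.score < 5:                           # positives must score 5 or 6
            continue
        margin = 1.0 if pos.from_student else 2.0   # relaxed margin for student positives
        for neg in samples:
            if pos.score - neg.score >= margin:
                pairs.append((x, pos.text, neg.text))
    return pairs

def replay_fraction(iters_ago: int) -> float:
    """Exponentially decaying share of pairs reused from earlier iterations:
    at most 30% from t-1, 15% from t-2, and so on."""
    return 0.30 * (0.5 ** (iters_ago - 1)) if iters_ago >= 1 else 0.0
```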
Training: To prevent overfitting, we train our batched on-policy methods for at most one epoch on the newly constructed pairs. Our effective batch size is fixed to 64 for all experiments. Our learning rate, beta, and alpha are found with brief hyperparameter searches. For most experiments, the learning rate is 5E-5, beta is either 0.1 or 0.05, and alpha is 0.005. We found that at higher iterations, the learning rate needs to be lowered. In SFT (supervised fine-tuning) experiments, our learning rate is 5E-6 and we mask out the loss on the inputs. We use an open-source training framework to run our experiments.
Evaluation: Our primary goal is to train a policy that is competitive with the most powerful state-of-the-art language models. Hence, AlpacaEval 2.0 (Dubois et al., 2024) is an appropriate benchmark because it computes win-rate against GPT-4-Turbo in a head-to-head fashion on a dataset of 805 input prompts and is shown to correlate with human preferences (0.93 Spearman correlation with Chatbot Arena). While it is known that auto-eval methods also correlate with spurious features such as length, a new version of AlpacaEval 2.0 corrects for this with a length-controlled win-rate that has an even higher Spearman correlation (0.98) with Chatbot Arena.4
| Technique | Iter | # inputs | Student length (words) | Best-of-n student win-rate | Avg. # student wins | Avg. student score | Avg. teacher score | T≻S pairs | S≻T pairs | S≻S pairs |
| DNO-Restrictive | Iter 0 | 19.6k | 162 +/- 190 | 15.9% | 0.486 | 3.46 | 4.99 | 42.4k | - | - |
| DNO-Restrictive | Iter 1 | 19.9k | 359 +/- 350 | 34.2% | 1.11 | 4.86 | 4.77 | 17.5k | - | - |
| DNO-Restrictive | Iter 2 | 19.8k | 256 +/- 207 | 35.0% | 1.31 | 5.21 | 4.87 | 9.9k | - | - |
| DNO | Iter 0 | 19.6k | 162 +/- 190 | 15.9% | 0.486 | 3.46 | 4.99 | 30.7k | 4.2k | 25.9k |
| DNO | Iter 1 | 19.9k | 671 +/- 546 | 34.6% | 1.22 | 4.61 | 4.62 | 20.3k | 19.4k | 62.9k |
| DNO | Iter 2 | 19.8k | 361 +/- 251 | 43.6% | 1.90 | 5.25 | 4.59 | 7.1k | 32.4k | 10.9k |

Table 2: The dynamics of how sampled outputs from a previous iteration's policy compare to their teacher (annotation columns), and how many new training pairs they give rise to in the next iteration (pair columns: teacher-preferred T≻S, student-preferred-over-teacher S≻T, and student-vs-student S≻S). The crucial point is that DNO constructs new pairs where the student is compared to the teacher, whereas DNO-Restrictive, SPIN, and IPO-MD do not.
We also evaluate on MT-Bench (Zheng et al., 2023), which asks the LLM-as-a-judge to first explain its reasoning before providing a scalar score from 1-10 for the candidate response, over a bank of 80 questions. One crucial difference between AlpacaEval 2.0 and MT-Bench is that the former asks GPT-4-Turbo to predict which of two side-by-side responses humans would prefer, weighted by the logits to represent its uncertainty, whereas MT-Bench asks the model to first generate a justification and then output a score from 1-10, but it neither defines the ratings (e.g., how a 7 differs from a 5) nor accounts for uncertainty in the logits of the score.
We also evaluate on the OpenLLM leaderboard (Beeching et al., 2023), which measures reasoning ability on downstream NLP tasks like coding and question answering by evaluating the accuracy of the multiple-choice answer option with the highest logit. Since our training data is primarily instruction-following and the model is not trained to output just the sole answer option, this benchmark is not the primary target of this study; nonetheless, DNO trained on instruction-tuning tasks ought to show no regression on reasoning tasks.
5.2 Results and Analysis
We run several head-to-head experiments that control for hyperparameters and input data. We often refer to the policy
being trained as the “student” and GPT-4 as a “teacher”; GPT-4 is also used as an annotator when prompted.
SFT Baselines: The first baseline is Orca-2.5 itself, a mistralai/Mistral-7B-v0.1 raw pretrained model fine-tuned on a new collection of Orca-2 data (Mitra et al., 2023). This model was fine-tuned for three epochs and achieves the scores shown at the top of Table 1. All other experiments in this study are initialized with Epoch 1 of Orca-2.5, which is the solid horizontal line in Figure 2.
The second baseline is continued SFT of Orca-2.5 towards the positives in UltraFeedback (masking out the loss over the input prompts). If the original positive in that dataset was not from GPT-4-Turbo, we replace it with one that is. This is the red line in Figure 2. It is clear that even offline contrastive training methods are more beneficial than additional SFT, showing that the difference between the positive and negative output provides more valuable training signal than the positive in isolation.
Large Margin Filtering of Training Pairs: We ran a simple experiment of Offline DPO for one epoch on UltraFeedback data. In the control, we trained on all 63k preference pairs in the original dataset, whereas in the treatment we filtered for the 42k pairs that met a large-margin requirement enforcing that the positive's score exceed that of the negative by at least 1.0 (out of 10) according to their GPT-4-Turbo annotator. All else was equal. Even though the treatment was trained for fewer steps on less data, it achieved an AlpacaEval 2.0 win rate of 11.60 vs. 9.60 for the control, showing that fewer, higher-quality preference pairs are better than a larger quantity of noisy pairs (not shown in the tables).
On-Policy is Better than Off-Policy: One of the critical questions in this study is whether to sample "on-policy" outputs from the current student to use in training pairs, or whether "off-policy" outputs collected from models other than the student will suffice.
4https://github.com/tatsu-lab/alpaca_eval
| Technique | Epoch or Iter | ARC-C (25-shot) | GSM8K (5-shot) | HellaSwag (10-shot) | MMLU (5-shot) | TruthfulQA (0-shot) | WinoGrande (5-shot) | Avg |
| Orca-2.5 | Epoch 1 | 0.609 | 0.635 | 0.818 | 0.614 | 0.489 | 0.738 | 0.652 |
| Orca-2.5 | Epoch 3 | 0.624 | 0.641 | 0.826 | 0.624 | 0.506 | 0.746 | 0.661 |
| SPIN (ours) | Iter 3 | 0.668 | 0.448 | 0.862 | 0.623 | 0.601 | 0.759 | 0.660 |
| DNO | Iter 1 | 0.657 | 0.572 | 0.834 | 0.623 | 0.568 | 0.755 | 0.668 |
| DNO | Iter 2 | 0.663 | 0.562 | 0.845 | 0.624 | 0.580 | 0.753 | 0.671 |
| DNO | Iter 3 | 0.672 | 0.542 | 0.852 | 0.622 | 0.606 | 0.753 | 0.675 |

Table 3: Results on Open-LLM Leaderboard reasoning tasks, which we do not expect to decrease.
We ran 4 epochs of Offline DPO on UltraFeedback (filtered for large margin), and as shown in Figure 2, on-policy methods, especially DNO, surpass the off-policy DPO, even though DPO was trained for 4 epochs while the on-policy models were granted only three iterations. Recall that each iteration of batched on-policy training sees only a third of the UltraFeedback input data, whereas an epoch of Offline DPO sees the entire dataset.
Higher Quality Annotators: In our study, we use GPT-4-Turbo to provide the annotations for preference pairs. The Self-Rewarding Language Model, in contrast, uses a Llama-2-70B (Touvron et al., 2023) model trained to also give feedback as the annotator, which in their study starts off with a 65% agreement rate with human-labeled preferences, improving to 80% in the last iteration (Yuan et al., 2024). While it was not reported how well GPT-4-Turbo's annotations agree with their held-out human labels, we believe that starting with a higher-quality annotator will lead to higher quality policies. Since both studies use UltraFeedback data, and our annotation prompt is based on theirs, we believe this is a valid comparison.
We observe that DNO initialized with a 7B base model outperforms the 70B parameter Self-Rewarding model over the same number of training iterations (24.97 vs. 20.44 win-rate on AlpacaEval 2.0, and 7.46 vs. 7.25 on MT-Bench), at least in part due to the higher quality preference annotations. See the dark blue band versus the gray line in Figure 2 and the corresponding rows in Table 1. Moreover, unlike Self-Rewarding LM, we saw a slight gain rather than a drop on reasoning benchmarks like ARC-Challenge (Clark et al., 2018) and HellaSwag (Zellers et al., 2019). Granted, the OpenLLM evaluation predicts the answer with the max logit corresponding to one of the multiple-choice options, which is not congruous with how these techniques are trained.
Training Pair Construction: One of the most critical implementation questions in this study is how to construct training pairs that help the student policy exceed a strong teacher like GPT-4-Turbo. One approach, Self-Play Fine-Tuning (SPIN), removes the preference annotation step and automatically assigns the teacher output to be the positive and all student samples to be negatives (Chen et al., 2024). We find in our re-implementation of SPIN that this is detrimental, presumably because this automatic assignment can lead to noisy training pairs in cases where the student might actually be preferred. The resulting win-rate of SPIN is only 16.13 after three epochs of iterative training, compared to 24.97 for DNO, as shown in Table 1, all else being equal. Similar results hold in the OpenLLM results in Table 3.
In a second experiment, which we denote DNO-Restrictive, we annotate all preference pairs with GPT-4-Turbo as usual, but only admit training pairs where the teacher's output is the preferred one. The difference between DNO and DNO-Restrictive is illustrated in Table 2: DNO-Restrictive admits no pairs in which the student is preferred; the same is effectively true for SPIN, except that SPIN would admit a greater quantity of noisy teacher-vs-student examples even when the teacher is dis-preferred: in the last iteration of DNO-Restrictive, only 9.9k instances exist of the teacher being preferred over the student, whereas SPIN would have automatically created about 100k (5 samples × 20k inputs). While DNO-Restrictive is slightly better (19.21 win-rate) than SPIN, it still does not give the student a chance to compare its behavior to a powerful teacher. The absence of this signal is a major oversight, since the last row of Table 2 shows that by Iter 3, over 64% of the DNO training data (32k pairs) consists of cases where the student is in fact preferred over the teacher, a number which increases with iteration. We conclude it is imperative to "allow the student to become the teacher", i.e., to learn from comparisons where its own outputs are preferred over those of a more powerful teacher.
One curious phenomenon in Table 2 is that the annotator assigns slightly lower scores to the teacher as the student improves; we are not sure whether this is an innocuous artifact of preference annotation, or symptomatic of a deeper problem.
| Technique | Epoch or Iter | AlpacaEval 2: Len-controlled Win Rate | AlpacaEval 2: Win Rate vs. GPT-4 | Avg. len (chars) | MT-Bench: 1st Turn | MT-Bench: 2nd Turn | MT-Bench: Avg |
| Orca-2.5 SFT | Epoch 1 | 10.76 | 6.99 | 1174 | 7.72 | 6.02 | 6.88 |
| Orca-2.5 SFT | Epoch 2 | 15.29 | 7.88 | 1060 | 7.56 | 6.38 | 6.98 |
| Orca-2.5 SFT | Epoch 3 | 15.90 | 8.17 | 1058 | 7.53 | 6.73 | 7.13 |
| DNO-More-Data | Iter 1 | 8.96 | 10.67 | 2795 | 7.00 | 6.06 | 6.53 |
| DNO-More-Data | Iter 2 | 14.61 | 16.94 | 2782 | 7.62 | 7.23 | 7.42 |
| DNO-More-Data | Iter 3 | 21.81 | 26.74 | 2539 | 7.74 | 6.66 | 7.21 |
| DNO-More-Data | Iter 4 | 22.93 | 29.08 | 3033 | 7.54 | 6.92 | 7.24 |
| DNO-More-Data | Iter 5 | 32.06 | 34.98 | 2856 | 7.10 | 6.39 | 6.75 |
| DNO-More-Data | Iter 6 | 33.05 | 34.38 | 2683 | 7.28 | 6.65 | 6.97 |

Table 4: DNO-More-Data is trained on 10x more instruction data than DNO. It is still initialized with Epoch 1 of the Orca-2.5 SFT model, so the delta it provides in AlpacaEval 2.0 win rate is 27.39 absolute (22.29 length-controlled).
Also, the total quantity of new "large margin" training pairs (not counting those sampled from previous iterations) in DNO tends to decrease as the policy improves across iterations, but we do not have enough data to quantify how this relates to a change in quality.
Lookahead to Future Iterations: As a curiosity, we experimented with whether a model could benefit from knowing which training pairs it would generate if it could look into the future. We tested this by running three iterations of DNO, accumulating all the preference pairs across iterations, combining and shuffling them, and then re-starting training from the initial model. In essence, this turns the batched on-policy DNO into an offline learning algorithm we denote as DNO-Lookahead. We trained it for one epoch on the three iterations' worth of preference data. It deteriorated more than we expected on AlpacaEval 2.0 win-rate (24.97 to 18.18); however, even more surprisingly, the MT-Bench numbers improved significantly (7.48 to 7.70). While the reasons for the relatively low correlation between MT-Bench and AlpacaEval 2.0 are not entirely clear, it is important to consider the disparity in the size of the datasets: given that MT-Bench consists of merely 80 examples, whereas AlpacaEval 2.0 contains 10x more, we regard the findings from AlpacaEval 2.0 with greater confidence.
DNO Scales with More Data: One of the reasons we split UltraFeedback into three non-overlapping partitions is to avoid overfitting. Another strategy to avoid overfitting is to collect more data, so we increased the amount of instruction data by a factor of 10 based on publicly available datasets. We split this large mixture of datasets into six non-overlapping partitions of roughly 100k inputs each (and inference GPT-4-Turbo outputs for all inputs), and show that DNO-More-Data scales well in this expanded regime (see the purple line in Figure 2 and the results in Table 4).
We make some notes on the behavior of this experiment: because each iteration builds on outputs of the previous iteration, if there are any anomalies or errors in critical components such as preference annotation, those errors will propagate, and the only way to combat them is to "roll back" to the iteration that introduced them. This can result in wasted time and cost, which are both already very high. We suspect that the "depth" of iterations matters more than the "width", or number of samples within each iteration, and furthermore, that having an equal number of inputs per iteration may not be optimal, but we did not test this thoroughly. From an efficiency standpoint, although this algorithm is "batched", some optimizations can be made, such as starting to annotate sampled policy outputs as soon as they are ready instead of waiting for all inference jobs to finish.
“Exploding” LengthsIt is known that contrastive LLM training techniques, especiallyDPO, lead to longer outputs
from the model which is widely suspected to be a form of “reward hacking”. Curiously,
jump comes after the first round of contrastive training (Iteration 1), where lengths explode by at least a factor of 2 over
the initializing SFT model, before inching down again in the next iteration. We interpret this “length spike” as wasted
computation optimizing towards a spurious signal; we wish we were better equipped to control this phenomenon.
6 Related Work
We divide the space of related work according to whether the techniques use SFT or contrastive losses, and whether they operate in offline or online update settings.
Online RLHF algorithms: RLHF innovated how to align language models with human preferences (Christiano et al., 2017; Stiennon et al., 2020; Ouyang et al., 2022), but it is unstable to train and memory-intensive, requiring all three of the parameterized policy model, reward model, and advantage model to be on device for training.
Reward-model Augmented SFT: Since the introduction of RLHF, several emergent techniques apply reward models in various ways, such as to filter training data or rank responses. Reward rAnked Finetuning (RAFT) (Dong et al., 2023) and RRHF (Yuan et al., 2023b) offer the conceptually simplest solution for offline preference learning, which is to sample multiple outputs from a policy, rank them with a reward model, and then finetune on the best sampled output using SFT. This resembles the iterative behavior-cloning technique DAgger (Ross et al., 2011).
Offline Contrastive Preference Learning: There exist several loss functions for contrastive preference learning, first introduced in the offline setting, namely Direct Preference Optimization (DPO; Rafailov et al., 2023) and Calibrated Sequence Likelihood Estimation, a.k.a. SLiC (Zhao et al., 2023). Azar et al. (2023) make it clear that point-wise reward estimates are no substitute for pair-wise preferences, and that a policy can easily overfit to deterministic preferences without proper regularization. They derive a more general objective for RLHF, IPO, to directly optimize offline preference probabilities.
Statistical Rejection Sampling Optimization (RSO) generates multiple samples from an initial model, ranks them to create training pairs, and optimizes them under a unified framework encompassing DPO and SLiC (Liu et al., 2024b). Inspired by the learning-to-rank literature, Listwise Preference Optimization (LIPO) extends pair-wise preference learning to list-wise (Liu et al., 2024a). Preference Ranking Optimization (PRO) also learns towards list-wise preferences (Song et al., 2024). The KTO algorithm takes a different approach from DPO: it does not assume that a pair of good-vs-bad outputs for the same input exists, but rather that a pool of good outputs and a pool of bad outputs for any inputs exist, and it optimizes an “unpaired” loss (Ethayarajh et al., 2024).
Iterative Reward-based Finetuning: Reinforced Self-Training (ReST) is one of the first methods to explore iterative self-improving training strategies, framed as a two-stage procedure: a “Grow” step that samples from the current policy, and an “Improve” step that uses a reward model to filter ever-higher quality samples that are then used to improve the policy with offline RL (Gulcehre et al., 2023). A follow-up work explores the use of AI feedback rather than reward ranking (Singh et al., 2023).
On-policy Contrastive Learning: Self-Rewarding Language Models (Yuan et al., 2024) is in practice very similar to DNO. They study the benefits of batched iterative training on preferences derived from a recent policy’s sampled outputs, but in their work they use the policy itself as the annotator, which starts off being able to provide only weak preference signals. Self-Play Fine-Tuning, a.k.a. SPIN (Chen et al., 2024), and Adversarial Preference Optimization, a.k.a. APO (Cheng et al., 2023), are both iterative LLM training techniques that are compatible with contrastive losses, but they make a very limiting assumption that the teacher is better than the student (without regard to any annotator feedback).
The Cringe Loss (Adolphs et al., 2022) is a token-level loss function that contrasts the correct next token with a hard-negative token from the vocabulary that has a high logit weight but is still incorrect. The Pairwise Cringe Loss (Xu et al., 2023b) applies the cringe loss to an iterative self-improving style of training.
On-Policy General Preference Optimization: Wang et al. (2023) consider finding the von Neumann winner of general preferences via multi-agent RL from a theoretical perspective. Nash-MD optimizes a policy towards the Nash equilibrium of a generalized preference model using policy gradients, showing that by sampling from a mixture of policies, one can converge to the Nash equilibrium in the last iteration (Munos et al., 2023). Self-play Preference Optimization (SPO) is another online two-player mini-max game that converges to a Nash equilibrium with no-regret guarantees (Swamy et al., 2024). However, these techniques are not as data efficient as contrastive losses and are difficult to implement faithfully without cumbersome two-timescale updates (Munos et al., 2023). A concurrent improvement, IPO-MD, mitigates these difficulties by using purely on-policy IPO updates and is empirically evaluated on an article summarization task (Calandriello et al., 2024). Guo et al. (2024) also propose to eliminate rewards in online AI feedback (OAIF) by using another LLM to annotate which of two online-sampled outputs from the current
policy is preferred. However, all the above studies only consider training pairs constructed between self-play “student vs student” samples, or between the student and the initial πref. That is, there is no concept of a more powerful “teacher” to compare against in their training pairs. We showed in our ablation studies that omitting such comparisons against a stronger teacher can hinder performance.
7 Conclusion
In this paper we achieve dual goals of post-training LLMs against a more general class of preference models while
providing a practical and scalable implementation with finite-sample analysis. Our strong empirical results are based on
the insight that optimizing general preference functions can be reduced to finding the Nash equilibrium of a two-player
game with the payoff as the preference, and further solved by a single-play algorithm. Most techniques to optimize for
this objective use soft policy iteration, which is difficult to implement faithfully and may require unstable on-policy
and two-timescale updates. Our contribution, Direct Nash Optimization, addresses these challenges by approximating
soft policy iteration updates with a regression-based contrastive objective in a batched manner, which is a much more
stable and forgiving learning objective, and we establish a concentration bound of $\widetilde{\mathcal{O}}(1/N)$ on the squared total variation error between the learned policy and its target of the soft policy iteration update at any given iteration $t$. Theoretically, DNO converges to the Nash equilibrium on average, but in practice enjoys monotonic improvement across iterations. Training a 7B parameter LLM with DNO achieves state-of-the-art performance on AlpacaEval 2.0, exceeding both
Mistral Large and older versions of GPT-4. We illuminate many of the practical design choices that will aid future
development of iterative self-improving algorithms.
References
Leonard Adolphs, Tianyu Gao, Jing Xu, Kurt Shuster, Sainbayar Sukhbaatar, and Jason Weston. The cringe loss:
Learning what language not to model.arXiv preprint arXiv:2211.05826, 2022.
Riad Akrour, Marc Schoenauer, and Michèle Sebag. April: Active preference learning-based reinforcement learning.
InMachine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2012, Bristol,
UK, September 24-28, 2012. Proceedings, Part II 23, pages 116–131. Springer, 2012.
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in
ai safety.arXiv preprint arXiv:1606.06565, 2016.
Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and
Rémi Munos. A general theoretical paradigm to understand learning from human preferences.arXiv preprint
arXiv:2310.12036, 2023.
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort,
Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk,
Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt,
Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann,
and Jared Kaplan. Training a helpful and harmless assistant with reinforcement learning from human feedback.
arXiv preprint arXiv:2204.05862, 2022a.
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna
Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Her-
nandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey
Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas
Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna
Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan,
Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam Mc-
Candlish, Tom Brown, and Jared Kaplan. Constitutional ai: Harmlessness from ai feedback.arXiv preprint
arXiv:2212.08073, 2022b.
Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanse-
viero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.
Quentin Bertrand, Wojciech Marian Czarnecki, and Gauthier Gidel. On the limitations of the elo, real-world games
are transitive, not additive. InInternational Conference on Artificial Intelligence and Statistics, pages 2905–2921.
PMLR, 2023.
Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired
comparisons.Biometrika, 39(3/4):324–345, 1952.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom
Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark
Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish,
Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners.Advances in neural
information processing systems, 33:1877–1901, 2020.
Sébastien Bubeck. Convex optimization: Algorithms and complexity.Foundations and Trends®in Machine Learning,
8(3-4):231–357, 2015.
Daniele Calandriello, Daniel Guo, Remi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey
Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, et al. Human alignment of large language models through
online preference optimisation.arXiv preprint arXiv:2403.08635, 2024.
Nicolo Cesa-Bianchi and Gábor Lugosi.Prediction, learning, and games. Cambridge university press, 2006.
Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. InInternational
Conference on Machine Learning, pages 1042–1051. PMLR, 2019.
Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language
models to strong language models.arXiv preprint arXiv:2401.01335, 2024.
Pengyu Cheng, Yifan Yang, Jian Li, Yong Dai, and Nan Du. Adversarial preference optimization.arXiv preprint
arXiv:2311.08045, 2023.
Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement
learning from human preferences.Advances in neural information processing systems, 30, 2017.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord.
Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv:1803.05457v1, 2018.
Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun.
Ultrafeedback: Boosting language models with high-quality feedback.arXiv preprint arXiv:2310.01377, 2023.
Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen
Zhou. Enhancing chat language models by scaling high-quality instructional conversations.arXiv preprint
arXiv:2305.14233, 2023.
Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, SHUM
KaShun, and Tong Zhang. Raft: Reward ranked finetuning for generative foundation model alignment.Transactions
on Machine Learning Research, 2023.
Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy S.
Liang, and Tatsunori B. Hashimoto. Alpacafarm: A simulation framework for methods that learn from human
feedback.Advances in Neural Information Processing Systems, 36, 2023.
Miroslav Dudík, Katja Hofmann, Robert E. Schapire, Aleksandrs Slivkins, and Masrour Zoghi. Contextual dueling
bandits. InConference on Learning Theory, pages 563–587. PMLR, 2015.
Arpad E. Elo.The rating of chessplayers, past and present. Arco Pub., New York, 1978. ISBN 0668047216
9780668047210. URLhttp://www.amazon.com/Rating-Chess-Players-Past-Present/dp/0668047216.
Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as
prospect theoretic optimization.arXiv preprint arXiv:2402.01306, 2024.
Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial
networks, inverse reinforcement learning, and energy-based models.arXiv preprint arXiv:1611.03852, 2016a.
Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy
optimization. InInternational conference on machine learning, pages 49–58. PMLR, 2016b.
Peter C. Fishburn. Probabilistic social choice based on simple voting comparisons.The Review of Economic Studies,
51(4):683–692, 1984.
Dylan J. Foster and Akshay Krishnamurthy. Efficient first-order contextual bandits: Prediction, allocation, and
triangular discrimination.Advances in Neural Information Processing Systems, 34:18907–18919, 2021.
Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to
boosting.Journal of computer and system sciences, 55(1):119–139, 1997.
Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles L. Isbell, and Andrea L. Thomaz. Policy shaping:
Integrating human feedback with reinforcement learning.Advances in neural information processing systems, 26,
2013.
Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya
Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando
de Freitas. Reinforced self-training (rest) for language modeling.arXiv preprint arXiv:2308.08998, 2023.
Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas
Mesnard, Yao Zhao, Bilal Piot, Johan Ferret, and Mathieu Blondel. Direct language model alignment from online
ai feedback.arXiv preprint arXiv:2402.04792, 2024.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep
reinforcement learning with a stochastic actor. InInternational conference on machine learning, pages 1861–1870.
PMLR, 2018.
Elad Hazan et al. Introduction to online convex optimization.Foundations and Trends®in Optimization, 2(3-4):
157–325, 2016.
Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. InProceedings of the
Nineteenth International Conference on Machine Learning, pages 267–274, 2002.
Sham M. Kakade. A natural policy gradient.Advances in neural information processing systems, 14, 2001.
Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems.Journal of Computer and System
Sciences, 71(3):291–307, 2005.
W Bradley Knox and Peter Stone. Tamer: Training an agent manually via evaluative reinforcement. In2008 7th IEEE
international conference on development and learning, pages 292–297. IEEE, 2008.
Gerald H. Kramer. On a class of equilibrium conditions for majority rule.Econometrica: Journal of the Econometric
Society, pages 285–297, 1973.
Germain Kreweras. Aggregation of preference orderings. InMathematics and Social Sciences I: Proceedings of
the seminars of Menthon-Saint-Bernard, France (1–27 July 1960) and of Gösing, Austria (3–27 July 1962), pages
73–79, 1965.
Tor Lattimore and Csaba Szepesvári.Bandit algorithms. Cambridge University Press, 2020.
Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and
Abhinav Rastogi. Rlaif: Scaling reinforcement learning from human feedback with ai feedback.arXiv preprint
arXiv:2309.00267, 2023.
Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh,
Simon Baumgartner, Jialu Liu, Peter J. Liu, and Xuanhui Wang. Lipo: Listwise preference optimization through
learning-to-rank.arXiv preprint arXiv:2402.01878, 2024a.
Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J. Liu, and Jialu Liu. Statistical rejection
sampling improves preference optimization. InThe Twelfth International Conference on Learning Representations,
2024b.
Arindam Mitra, Luciano Del Corro, Shweti Mahajan, Andres Codas, Clarisse Simoes, Sahaj Agarwal, Xuxi Chen,
Anastasia Razdaibiedina, Erik Jones, Kriti Aggarwal, Hamid Palangi, Guoqing Zheng, Corby Rosset, Hamed
Khanpour, and Ahmed Awadallah. Orca 2: Teaching small language models how to reason.arXiv preprint
arXiv:2311.11045, 2023.
Arindam Mitra, Hamed Khanpour, Corby Rosset, and Ahmed Awadallah. Orca-math: Unlocking the potential of slms
in grade school math.arXiv preprint arXiv:2402.14830, 2024.
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver,
and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. InInternational conference on
machine learning, pages 1928–1937. PMLR, 2016.
Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel
Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev,
Olivier Bachem, Daniel J. Mankowitz, Doina Precup, and Bilal Piot. Nash learning from human feedback.arXiv
preprint arXiv:2312.00886, 2023.
Arkadij Semenovič Nemirovskij and David Borisovich Yudin.Problem complexity and method efficiency in optimiza-
tion. Wiley-Interscience, 1983.
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo
Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie
Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel
Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman,
Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany
Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby
Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah
Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan,
Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix,
Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun
Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua
Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes
Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu,
Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny
Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider,
Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik
Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis,
Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel
Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia
Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer,
Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David
Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan
Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano,
Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki,
Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos,
Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde
de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power,
Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra
Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar,
Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov,
Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina
Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers,
Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng,
Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea
Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann,
Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel
Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin
Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng,
Juntang Zhuang, William Zhuk, and Barret Zoph. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini
Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens,
Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow
instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022.
Art B. Owen.Monte Carlo theory, methods and examples.https://artowen.su.domains/mc/, 2013.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct pref-
erence optimization: Your language model is secretly a reward model.Advances in Neural Information Processing
Systems, 36, 2023.
Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.
Corby Rosset, Guoqing Zheng, Victor Dibia, Ahmed Awadallah, and Paul Bennett. Axiomatic preference modeling for
longform question answering. InProceedings of the 2023 Conference on Empirical Methods in Natural Language
Processing, pages 11445–11475, 2023.
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization.
InInternational conference on machine learning, pages 1889–1897. PMLR, 2015.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization
algorithms.arXiv preprint arXiv:1707.06347, 2017.
Shai Shalev-Shwartz et al. Online learning and online convex optimization.Foundations and Trends®in Machine
Learning, 4(2):107–194, 2012.
Paul B. Simpson. On defining areas of voter choice: Professor tullock on stable voting.The Quarterly Journal of
Economics, 83(3):478–490, 1969.
Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J. Liu, James
Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, Abhishek Kumar, Alex Alemi, Alex Rizkowsky, Azade Nova, Ben
Adlam, Bernd Bohnet, Gamaleldin Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper
Snoek, Jeffrey Pennington, Jiri Hron, Kathleen Kenealy, Kevin Swersky, Kshiteej Mahajan, Laura Culp, Lechao
Xiao, Maxwell L. Bileschi, Noah Constant, Roman Novak, Rosanne Liu, Tris Warkentin, Yundi Qian, Yamini
Bansal, Ethan Dyer, Behnam Neyshabur, Jascha Sohl-Dickstein, and Noah Fiedel. Beyond human data: Scaling
self-training for problem-solving with language models.arXiv preprint arXiv:2312.06585, 2023.
Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, and Houfeng Wang. Preference ranking
optimization for human alignment. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38,
pages 18990–18998, 2024.
Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei,
and Paul F. Christiano. Learning to summarize with human feedback.Advances in Neural Information Processing
Systems, 33:3008–3021, 2020.
Gokul Swamy, Christoph Dann, Rahul Kidambi, Zhiwei Steven Wu, and Alekh Agarwal. A minimaximalist approach
to reinforcement learning from human feedback.arXiv preprint arXiv:2401.04056, 2024.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov,
Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen,
Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj
Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor
Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut
Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra,
Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan
Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams,
Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan
Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and
fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023.
Hoang Tran, Chris Glaze, and Braden Hancock. snorkelai/snorkel-mistral-pairrm-dpo, 2024. https://huggingface.co/snorkelai/Snorkel-Mistral-PairRM-DPO.
Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang,
Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush,
and Thomas Wolf. Zephyr: Direct distillation of lm alignment.arXiv preprint arXiv:2310.16944, 2023.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny
Zhou. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171,
2022.
Yuanhao Wang, Qinghua Liu, and Chi Jin. Is rlhf more difficult than standard rl?arXiv preprint arXiv:2306.14111,
2023.
Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes Fürnkranz. A survey of preference-based reinforcement
learning methods.Journal of Machine Learning Research, 18(136):1–46, 2017.
Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. Bellman-consistent pessimism for
offline reinforcement learning.Advances in neural information processing systems, 34:6683–6694, 2021.
Tengyang Xie, Dylan J. Foster, Yu Bai, Nan Jiang, and Sham M. Kakade. The role of coverage in online reinforcement
learning. InThe Eleventh International Conference on Learning Representations, 2023.
Wei Xiong, Hanze Dong, Chenlu Ye, Han Zhong, Nan Jiang, and Tong Zhang. Iterative preference learning from
human feedback: Bridging theory and practice for rlhf under kl-constraint.arXiv preprint arXiv:2312.11456, 2023.
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm:
Empowering large language models to follow complex instructions, 2023a.
Jing Xu, Andrew Lee, Sainbayar Sukhbaatar, and Jason Weston. Some things are more cringe than others: Preference
optimization with the pairwise cringe loss.arXiv preprint arXiv:2312.16682, 2023b.
Kevin Yang, Dan Klein, Asli Celikyilmaz, Nanyun Peng, and Yuandong Tian. Rlcd: Reinforcement learning from
contrast distillation for language model alignment.arXiv preprint arXiv:2307.12950, 2023.
Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian
Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models, 2023.
Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-
rewarding language models.arXiv preprint arXiv:2401.10020, 2024.
Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, and Chang Zhou. Scaling relationship on
learning mathematical reasoning with large language models.arXiv preprint arXiv:2308.01825, 2023a.
Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. Rrhf: Rank responses to align
language models with human feedback without tears.arXiv preprint arXiv:2304.05302, 2023b.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish
your sentence?arXiv preprint arXiv:1905.07830, 2019.
Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, and Wen Sun. Provable offline preference-based
reinforcement learning. InThe Twelfth International Conference on Learning Representations, 2024.
Yao Zhao, Misha Khalman, Rishabh Joshi, Shashi Narayan, Mohammad Saleh, and Peter J. Liu. Calibrating se-
quence likelihood improves conditional language generation. InThe Eleventh International Conference on Learning
Representations, 2023.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li,
Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench
and chatbot arena.Advances in Neural Information Processing Systems, 36, 2023.
Appendix
A Extension to Regularized Preferences
In this section, we discuss how to extend the DNO framework to the case of regularized preferences, defined as
$$ P^{\pi,\pi'}_{\tau}(y \succ y' \mid x) = P(y \succ y' \mid x) - \tau \log\frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \tau \log\frac{\pi'(y' \mid x)}{\pi_{\mathrm{ref}}(y' \mid x)}. $$
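As a minimal illustration (not the training code), the regularized preference above can be computed directly from the base preference probability and per-sequence log-probabilities. The sketch below assumes these quantities are already available as plain floats; the function and argument names are ours.

def regularized_preference(p_win, logp_pi_y, logp_ref_y, logp_pi_prime_yp, logp_ref_yp, tau):
    """KL-regularized preference P^{pi,pi'}_tau(y > y' | x).

    p_win: base preference probability P(y > y' | x).
    logp_pi_y: log pi(y | x);          logp_ref_y: log pi_ref(y | x).
    logp_pi_prime_yp: log pi'(y' | x); logp_ref_yp: log pi_ref(y' | x).
    tau: coefficient of KL-regularization.
    """
    return (p_win
            - tau * (logp_pi_y - logp_ref_y)
            + tau * (logp_pi_prime_yp - logp_ref_yp))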
This regularized preference was first introduced and solved by Munos et al. (2023) via the Nash-MD algorithm introduced earlier. One can notice that the only difference between SPO and Nash-MD is that SPO uses the last-iteration policy $\pi_t$ both for constructing the reward $r_t$ and for performing the soft policy iteration update, whereas Nash-MD uses the smoothed version $\pi^{\tau}_t$,
$$ \pi^{\tau}_t(y \mid x) := \frac{\pi_t(y \mid x)^{1 - \tau/\eta}\, \pi_{\mathrm{ref}}(y \mid x)^{\tau/\eta}}{\sum_{y' \in \mathcal{Y}} \pi_t(y' \mid x)^{1 - \tau/\eta}\, \pi_{\mathrm{ref}}(y' \mid x)^{\tau/\eta}}, \quad \forall (x, y) \in \mathcal{X} \times \mathcal{Y}, \qquad (14) $$
for both. This allows Nash-MD to obtain a last-iteration guarantee.
On the other hand, due to the symmetry of regularized preferences, if we consider the on-average convergence case, SPO can likely be adapted in a simpler way as follows: for each $t = 1, 2, \ldots, T$,
$$ \text{(i)}\quad r_t(x, y) \leftarrow \mathbb{E}_{y' \sim \pi_t(\cdot \mid x)}\big[P(y \succ y' \mid x)\big], \quad \forall (x, y) \in \mathcal{X} \times \mathcal{Y}, $$
$$ \text{(ii)}\quad \pi_{t+1}(\cdot \mid x) \leftarrow \frac{1}{Z_t(x)}\, \pi^{\tau}_t(\cdot \mid x) \exp\!\left(\frac{r_t(x, \cdot)}{\eta}\right), \quad \forall x \in \mathcal{X}, $$
where $Z_t(x) := \sum_{y \in \mathcal{Y}} \pi^{\tau}_t(y \mid x) \exp\!\left(\frac{r_t(x, y)}{\eta}\right)$ is the partition function for iteration $t$. Here, the smoothed policy $\pi^{\tau}_t$ is only used in the soft policy iteration step, and this coincides with the OMD algorithm of Munos et al. (2023).
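For concreteness, the following is a minimal NumPy sketch of this (i)–(ii) update on a small finite response set, assuming the pair-wise preference probabilities are available as a matrix (a toy illustration for a single prompt, not the sequence-level implementation):

import numpy as np

def smoothed_policy(pi_t, pi_ref, tau, eta):
    """Geometric mixture pi^tau_t from Eq. (14) over a finite response set."""
    unnorm = pi_t ** (1.0 - tau / eta) * pi_ref ** (tau / eta)
    return unnorm / unnorm.sum()

def spo_regularized_step(pi_t, pi_ref, P, tau, eta):
    """One step of the (i)-(ii) update above for a fixed prompt x.

    pi_t, pi_ref: arrays of shape (|Y|,), current and reference distributions.
    P: matrix of shape (|Y|, |Y|) with P[i, j] = P(y_i > y_j | x).
    """
    r_t = P @ pi_t                        # (i) r_t(x, y) = E_{y' ~ pi_t}[P(y > y' | x)]
    pi_tau = smoothed_policy(pi_t, pi_ref, tau, eta)
    unnorm = pi_tau * np.exp(r_t / eta)   # (ii) exponential tilt of the smoothed policy
    return unnorm / unnorm.sum()          # normalize by the partition function Z_t(x)

# Toy example with a cyclic (non-transitive) preference over three responses.
pi_ref = np.ones(3) / 3
pi_t = np.array([0.5, 0.3, 0.2])
P = np.array([[0.5, 0.9, 0.2],
              [0.1, 0.5, 0.8],
              [0.8, 0.2, 0.5]])
print(spo_regularized_step(pi_t, pi_ref, P, tau=0.1, eta=1.0))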
Algorithm 3 DNO (Regularized Preferences Version)
input: General preference function $P$, learning rate $\eta$, coefficient of KL-regularization $\tau$, number of iterations $T$, prompt distribution $\rho$.
1: Initialize $\pi_1 \leftarrow \mathrm{unif}(\mathcal{A})$.
2: for iteration $t = 1, 2, \ldots, T$ do
3:   Compute $r_t(x, y)$ by,
     Option I: (for on-average convergence)
       $r_t(x, y) \leftarrow \mathbb{E}_{y' \sim \pi_t(\cdot \mid x)}\big[P(y \succ y' \mid x)\big], \ \forall (x, y) \in \mathcal{X} \times \mathcal{Y}$.
     Option II: (for last-iteration convergence)
       $r_t(x, y) \leftarrow \mathbb{E}_{y' \sim \pi^{\tau}_t(\cdot \mid x)}\big[P(y \succ y' \mid x)\big], \ \forall (x, y) \in \mathcal{X} \times \mathcal{Y}$, where $\pi^{\tau}_t$ is defined in Eq. (14).
4:   Obtain $\pi_{t+1}$ by,
     $$ \pi_{t+1} \leftarrow \operatorname*{argmax}_{\pi \in \Pi}\ \mathbb{E}_{(x, y_1, y_2) \sim \mathcal{D}_t}\bigg\{ \sigma\big(r_t(x, y_1) - r_t(x, y_2)\big) \log\bigg[\sigma\bigg(\eta \log\frac{\pi(y_1 \mid x)}{\widetilde{\pi}^{\tau}_t(y_1 \mid x)} - \eta \log\frac{\pi(y_2 \mid x)}{\widetilde{\pi}^{\tau}_t(y_2 \mid x)}\bigg)\bigg] + \sigma\big(r_t(x, y_2) - r_t(x, y_1)\big) \log\bigg[\sigma\bigg(\eta \log\frac{\pi(y_2 \mid x)}{\widetilde{\pi}^{\tau}_t(y_2 \mid x)} - \eta \log\frac{\pi(y_1 \mid x)}{\widetilde{\pi}^{\tau}_t(y_1 \mid x)}\bigg)\bigg] \bigg\}, $$
     where $\mathcal{D}_t$ is generated by $x \sim \rho$, $y_1 \sim \mu_{1,t}(\cdot \mid x)$, $y_2 \sim \mu_{2,t}(\cdot \mid x)$ with some policies $\mu_{1,t}$ and $\mu_{2,t}$, and $\widetilde{\pi}^{\tau}_t(y \mid x) := \pi_t(y \mid x)^{1 - \tau/\eta}\, \pi_{\mathrm{ref}}(y \mid x)^{\tau/\eta}, \ \forall (x, y) \in \mathcal{X} \times \mathcal{Y}$ (the unnormalized version of $\pi^{\tau}_t(y \mid x)$ defined in Eq. (14)).
5: end for
6: return $\bar{\pi} = \mathrm{unif}(\pi_{1:T})$.
Based on the discussion above, we can then obtain the extension of DNO to regularized preferences in Algorithm 3, and its practical implementation in Algorithm 4.
Algorithm 4 DNO-Prct (Regularized Preferences Version)
input: General preference function $P$, learning rate $\widetilde{\eta}$, coefficient of KL-regularization $\tau$, number of iterations $T$, reference policy $\pi_{\mathrm{ref}}$, seed dataset $\mathcal{D}_0 = \{(x, y_{\mathrm{gold}})\}$ where $x \sim \rho$ and $y_{\mathrm{gold}} \sim \pi_{\mathrm{gold}}(\cdot \mid x)$.
1: Initialize $\pi_1 \leftarrow \pi_{\mathrm{ref}}$.
2: for iteration $t = 1, 2, \ldots, T$ do
3:   Sample batched on-policy responses:
     Option I: (for on-average convergence) Sample $K$ outputs per prompt using the current $\pi_t$: $\{y^1_t, y^2_t, \ldots, y^K_t\} \sim \pi_t(\cdot \mid x), \ \forall x \in \mathcal{D}_0$.
     Option II: (for last-iteration convergence) Sample $K$ outputs per prompt using the smoothed current policy $\pi^{\tau}_t$: $\{y^1_t, y^2_t, \ldots, y^K_t\} \sim \pi^{\tau}_t(\cdot \mid x), \ \forall x \in \mathcal{D}_0$, where $\pi^{\tau}_t$ is defined in Eq. (14) after substituting $\eta \to \widetilde{\eta}$.
4:   Rank responses: For each $x \in \mathcal{D}_0$, rank the corresponding $\{y^1_t, y^2_t, \ldots, y^K_t, y_{\mathrm{gold}}\}$ using the pair-wise win rate obtained by sampling from the general preference function $P$.
5:   Filter and construct preference pairs: Construct $\mathcal{D}_t = \{(x, y^+_t, y^-_t)\}$ for all $x \in \mathcal{D}_0$, where $(y^+_t, y^-_t)$ are large-margin pairs (based on the win-rate rank) within the responses for $x$ from the previous step.
6:   Contrastive learning: Obtain $\pi_{t+1}$ by,
     $$ \pi_{t+1} \leftarrow \operatorname*{argmax}_{\pi \in \Pi}\ \mathbb{E}_{(x, y^+_t, y^-_t) \sim \mathcal{D}_t} \log\bigg[\sigma\bigg(\widetilde{\eta} \log\frac{\pi(y^+_t \mid x)}{\widetilde{\pi}^{\tau}_t(y^+_t \mid x)} - \widetilde{\eta} \log\frac{\pi(y^-_t \mid x)}{\widetilde{\pi}^{\tau}_t(y^-_t \mid x)}\bigg)\bigg], $$
     where $\widetilde{\pi}^{\tau}_t(y \mid x) := \pi_t(y \mid x)^{1 - \tau/\widetilde{\eta}}\, \pi_{\mathrm{ref}}(y \mid x)^{\tau/\widetilde{\eta}}, \ \forall (x, y) \in \mathcal{X} \times \mathcal{Y}$ (the unnormalized version of $\pi^{\tau}_t(y \mid x)$ defined in Eq. (14), after substituting $\eta \to \widetilde{\eta}$).
7: end for
8: return the best of $\pi_{1:(T+1)}$ on the validation data.
Note that, similar to Nash-MD, the last-iteration option of both Algorithm 3 and Algorithm 4 requires sampling from the smoothed policy $\pi^{\tau}_t$ (the mixture between $\pi_t$ and $\pi_{\mathrm{ref}}$ defined in Eq. (14)). One solution to address this is to sample at the token level between $\pi_t$ and $\pi_{\mathrm{ref}}$ instead, as suggested by Munos et al. (2023).
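For concreteness, a minimal PyTorch-style sketch of the contrastive update in line 6 of Algorithm 4 is given below. It assumes the per-sequence log-probabilities under the trained policy, the previous iterate $\pi_t$, and the reference policy $\pi_{\mathrm{ref}}$ have already been computed (summed over tokens); it sketches the objective only, not the full training loop, and the function and variable names are ours.

import torch
import torch.nn.functional as F

def dno_prct_regularized_loss(logp_pos, logp_neg,
                              logp_t_pos, logp_t_neg,
                              logp_ref_pos, logp_ref_neg,
                              eta, tau):
    """Negative of the line-6 objective of Algorithm 4 (to be minimized).

    Each tensor has shape (batch,) and holds summed per-sequence log-probabilities:
    logp_* under the trained policy pi, logp_t_* under the previous iterate pi_t,
    and logp_ref_* under pi_ref, for the preferred (pos) and dispreferred (neg)
    responses. eta plays the role of eta-tilde; tau is the KL coefficient.
    """
    # log of the unnormalized smoothed policy: (1 - tau/eta) log pi_t + (tau/eta) log pi_ref
    log_pi_tau_pos = (1.0 - tau / eta) * logp_t_pos + (tau / eta) * logp_ref_pos
    log_pi_tau_neg = (1.0 - tau / eta) * logp_t_neg + (tau / eta) * logp_ref_neg

    margin = eta * (logp_pos - log_pi_tau_pos) - eta * (logp_neg - log_pi_tau_neg)
    return -F.logsigmoid(margin).mean()

# Toy usage with random (negative) per-sequence log-probabilities for 4 pairs.
args = [-torch.rand(4) * 10 for _ in range(6)]
loss = dno_prct_regularized_loss(*args, eta=0.05, tau=0.01)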
B Detailed Proofs
In this section, we provide detailed proofs for our theoretical results. Note that the definitions and assumptions presented here heavily adopt ideas related to version space and concentrability from the reinforcement learning theory literature. Nevertheless, the descriptions provided herein are intentionally simplified to elucidate the core insights into the algorithmic design. A full and exhaustive theoretical analysis falls outside the primary scope of this paper. We now make the following definitions and assumptions.
Definition 1 (Feasible solution space). For each iteration $t \in [T]$, we define $\Pi_t \subseteq \Pi$ as the feasible solution space for iteration $t$: the $\pi_t$ obtained by the algorithm is guaranteed to belong to $\Pi_t$, regardless of the randomness of the data sampling procedure. Here, $\Pi_t$ only contains policies that have a small empirical loss, which can be further converted to a small population loss under standard concentration procedures.
Definition 2 (Concentrability coefficient over the feasible solution space). For all $t \in [T]$, suppose $\Pi_t$ is defined in Definition 1, and $\mu_{1,t}$ and $\mu_{2,t}$ are some given data-generating policies. Now, for any $t \in [T]$, we define $C_t$ to be the concentrability coefficient at iteration $t$ over its feasible solution space, where
$$ \frac{\mathbb{E}_{x \sim \rho,\, y_1 \sim \pi^{\star}_{t+1}(\cdot \mid x),\, y_2 \sim \pi_{t+1}(\cdot \mid x)}\left[\left(\log\frac{\pi^{\star}_{t+1}(y_1 \mid x)}{\pi_{t+1}(y_1 \mid x)} - \log\frac{\pi^{\star}_{t+1}(y_2 \mid x)}{\pi_{t+1}(y_2 \mid x)}\right)^2\right]}{\mathbb{E}_{x \sim \rho,\, y_1 \sim \mu_{1,t}(\cdot \mid x),\, y_2 \sim \mu_{2,t}(\cdot \mid x)}\left[\left(\log\frac{\pi^{\star}_{t+1}(y_1 \mid x)}{\pi_{t+1}(y_1 \mid x)} - \log\frac{\pi^{\star}_{t+1}(y_2 \mid x)}{\pi_{t+1}(y_2 \mid x)}\right)^2\right]} \leq C_t, $$
for any $\pi_{t+1} \in \Pi_{t+1}$ and any $\pi^{\star}_{t+1} \in \left\{ \frac{1}{Z_{\pi}(x)}\, \pi(\cdot \mid x) \exp\!\left(\frac{r_{\pi}(x, \cdot)}{\eta}\right) : \pi \in \Pi_t \right\}$; and here we use the definitions $r_{\pi}(x, y) := \mathbb{E}_{y' \sim \pi(\cdot \mid x)}\big[P(y \succ y' \mid x)\big], \ \forall (x, y) \in \mathcal{X} \times \mathcal{Y}$, and $Z_{\pi}(x) = \sum_{y \in \mathcal{Y}} \pi(y \mid x) \exp\!\left(\frac{r_{\pi}(x, y)}{\eta}\right), \ \forall x \in \mathcal{X}$.
Definition 2 adapts the standard notion of a concentrability coefficient from the reinforcement learning theory literature to our setup.
Assumption 1 (Realizability over the feasible solution space). For any $\pi \in \Pi_t$, where $\Pi_t$ is defined in Definition 1, and for all $t \in [T]$, we assume the following soft policy iteration update belongs to the policy class $\Pi$:
$$ \pi^{\mathrm{SPI}}(\cdot \mid x) := \frac{1}{Z_{\pi}(x)}\, \pi(\cdot \mid x) \exp\!\left(\frac{r_{\pi}(x, \cdot)}{\eta}\right), $$
where $r_{\pi}(x, y) := \mathbb{E}_{y' \sim \pi(\cdot \mid x)}\big[P(y \succ y' \mid x)\big], \ \forall (x, y) \in \mathcal{X} \times \mathcal{Y}$, and $Z_{\pi}(x) = \sum_{y \in \mathcal{Y}} \pi(y \mid x) \exp\!\left(\frac{r_{\pi}(x, y)}{\eta}\right), \ \forall x \in \mathcal{X}$, is the partition function.
Assumption 2 (Boundedness over the feasible solution space). Suppose $\Pi_t$ is defined in Definition 1 for all $t \in [T]$; then we assume $\log\frac{\pi(y \mid x)}{\pi_t(y \mid x)} \in [-R_{\max}, R_{\max}]$ for all $\pi \in \Pi$, $\pi_t \in \Pi_t$, and $(x, y) \in \mathcal{X} \times \mathcal{Y}$.
Assumption 2 may appear restrictive. Nonetheless, it is important to note that the value of $\log\frac{\pi(y \mid x)}{\pi_t(y \mid x)}$ is directly measurable and controllable in practice, which is different from the common use case, such as maximum likelihood problems.
Theorem 2 (Formal version of the guarantee stated in the main text). Suppose Assumptions 1 and 2 hold, and fix an arbitrary iteration $t \in [T]$. Suppose $\pi_{t+1}$ is obtained from the regression step at iteration $t$, and $\pi^{\star}_{t+1}$ is the corresponding soft policy iteration target. Then, we have
$$ \mathbb{E}_{x \sim \rho}\Big[\big(D_{\mathrm{TV}}\big(\pi_{t+1}(\cdot \mid x), \pi^{\star}_{t+1}(\cdot \mid x)\big)\big)^2\Big] \leq \mathcal{O}\!\left(\frac{C_t R_{\max}^2 \log(|\Pi|/\delta)}{N}\right), $$
where $C_t$ is defined in Definition 2.
Proof of Theorem 2. We will now present the proof using the following two-step procedure.
Step 1: From regression with log loss to a squared error bound. By standard results on regression with the logarithmic loss, we know
$$ \mathbb{E}_{(x, y_1, y_2) \sim \mathcal{D}_t}\bigg[ \sigma\big(\Delta^{\star}_t(x, y_1, y_2)\big) \log\frac{\sigma\big(\Delta^{\star}_t(x, y_1, y_2)\big)}{\sigma\big(\Delta_{\pi_{t+1}, t}(x, y_1, y_2)\big)} + \sigma\big(\Delta^{\star}_t(x, y_2, y_1)\big) \log\frac{\sigma\big(\Delta^{\star}_t(x, y_2, y_1)\big)}{\sigma\big(\Delta_{\pi_{t+1}, t}(x, y_2, y_1)\big)} \bigg] \lesssim \frac{\log(|\Pi|/\delta)}{N}. \qquad (15) $$
Note that similar results could also apply beyond finite $\Pi$. For simplicity, we omit the detailed discussion in our paper. For more in-depth discussions about regression with the logarithmic loss, the reader can refer to, e.g., Foster and Krishnamurthy (2021).
Next, by Pinsker's inequality, we have for any $z, \widehat{z} \in [0, 1]$,
$$ \frac{(z - \widehat{z})^2}{2} \leq z \log\left(\frac{z}{\widehat{z}}\right) + (1 - z)\log\left(\frac{1 - z}{1 - \widehat{z}}\right). $$
Substituting $z$ and $\widehat{z}$ with the corresponding terms in Eq. (15), we obtain
$$ \mathbb{E}_{(x, y_1, y_2) \sim \mathcal{D}_t}\Big[\big(\sigma(r_t(x, y_1) - r_t(x, y_2)) - \sigma\big(r_{\pi_{t+1}, t}(x, y_1) - r_{\pi_{t+1}, t}(x, y_2)\big)\big)^2\Big] \lesssim \frac{\log(|\Pi|/\delta)}{N}, \qquad (16) $$
where $a \lesssim b$ means $a \leq c \cdot b$ for some absolute constant $c$. Then, by the standard concentration argument for the squared loss (e.g., Lemma A.4 of Xie et al. (2021) with $\gamma = 0$), we have
$$ \mathbb{E}_{(x, y_1, y_2) \sim \rho \times \mu_{1:2, t}}\Big[\big(\sigma(r_t(x, y_1) - r_t(x, y_2)) - \sigma\big(r_{\pi_{t+1}, t}(x, y_1) - r_{\pi_{t+1}, t}(x, y_2)\big)\big)^2\Big] \lesssim \frac{\log(|\Pi|/\delta)}{N}, \qquad (17) $$
where we use "$\times$" as shorthand for the joint distribution for the sake of simplicity; for example, $(x, y_1, y_2) \sim \rho \times \mu_{1:2, t}$ is shorthand for $x \sim \rho$, $y_1 \sim \mu_{1,t}(\cdot \mid x)$, $y_2 \sim \mu_{2,t}(\cdot \mid x)$.
By the definition of $r_t$, we know $r_t(x, y) \in [0, 1]$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$. Thus, by a variant of the mean value theorem, we know
$$ \big|r_t(x, y_1) - r_t(x, y_2) - r_{\pi_{t+1}, t}(x, y_1) + r_{\pi_{t+1}, t}(x, y_2)\big| \leq \frac{\eta R_{\max}}{1 - \sigma(1)}\, \big|\sigma(r_t(x, y_1) - r_t(x, y_2)) - \sigma\big(r_{\pi_{t+1}, t}(x, y_1) - r_{\pi_{t+1}, t}(x, y_2)\big)\big|, \qquad (18) $$
for any $(x, y_1, y_2) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{Y}$, where $R_{\max}$ is introduced in Assumption 2. This is because: let $a := r_t(x, y_1) - r_t(x, y_2) \in [-1, 1]$ and $b := r_{\pi_{t+1}, t}(x, y_1) - r_{\pi_{t+1}, t}(x, y_2) \in [-\eta R_{\max}, \eta R_{\max}]$; then we can directly verify that the slope we need to bound, $|a - b| / |\sigma(a) - \sigma(b)|$, reaches its maximum at $a = 1$ and $b = \eta R_{\max}$.
Combining Eqs. (17) and (18), we obtain
$$ \mathbb{E}_{(x, y_1, y_2) \sim \rho \times \mu_{1:2, t}}\Big[\big(r_t(x, y_1) - r_t(x, y_2) - r_{\pi_{t+1}, t}(x, y_1) + r_{\pi_{t+1}, t}(x, y_2)\big)^2\Big] \lesssim \frac{\eta^2 R_{\max}^2 \log(|\Pi|/\delta)}{N}. \qquad (19) $$
Step 2: Concentration in the policy space. We now reason about the concentration of $\pi_{t+1}$ toward $\pi^{\star}_{t+1}$ based on Eq. (19), where $\pi^{\star}_{t+1}$ is the soft policy iteration target defined above and $\pi_{t+1}$ is the policy corresponding to the learned $r_{\pi_{t+1}, t}$. By the definition of $r_{\pi, t}$, we have
$$ \begin{aligned} & r_t(x, y_1) - r_t(x, y_2) - r_{\pi_{t+1}, t}(x, y_1) + r_{\pi_{t+1}, t}(x, y_2) \\ &= r_t(x, y_1) - r_t(x, y_2) - \eta \log\frac{\pi_{t+1}(y_1 \mid x)}{\pi_t(y_1 \mid x)} + \eta \log\frac{\pi_{t+1}(y_2 \mid x)}{\pi_t(y_2 \mid x)} \\ &= \eta \log\frac{\pi^{\star}_{t+1}(y_1 \mid x)}{\pi_{t+1}(y_1 \mid x)} - \eta \log\frac{\pi^{\star}_{t+1}(y_2 \mid x)}{\pi_{t+1}(y_2 \mid x)}. \end{aligned} $$
This implies
$$ \mathbb{E}_{(x, y_1, y_2) \sim \rho \times \mu_{1:2, t}}\left[\left(\eta \log\frac{\pi^{\star}_{t+1}(y_1 \mid x)}{\pi_{t+1}(y_1 \mid x)} - \eta \log\frac{\pi^{\star}_{t+1}(y_2 \mid x)}{\pi_{t+1}(y_2 \mid x)}\right)^2\right] \lesssim \frac{\eta^2 R_{\max}^2 \log(|\Pi|/\delta)}{N} $$
$$ \Longrightarrow\quad \mathbb{E}_{(x, y_1, y_2) \sim \rho \times \pi^{\star}_{t+1} \times \pi_{t+1}}\left[\left(\log\frac{\pi^{\star}_{t+1}(y_1 \mid x)}{\pi_{t+1}(y_1 \mid x)} - \log\frac{\pi^{\star}_{t+1}(y_2 \mid x)}{\pi_{t+1}(y_2 \mid x)}\right)^2\right] \lesssim \frac{C_t R_{\max}^2 \log(|\Pi|/\delta)}{N}, \qquad (20) $$
where the last step follows from the definition of $C_t$ (Definition 2).
On the other hand, we have
$$ \begin{aligned} & \mathbb{E}_{(x, y_1, y_2) \sim \rho \times \pi^{\star}_{t+1} \times \pi_{t+1}}\left[\left(\log\frac{\pi^{\star}_{t+1}(y_1 \mid x)}{\pi_{t+1}(y_1 \mid x)} - \log\frac{\pi^{\star}_{t+1}(y_2 \mid x)}{\pi_{t+1}(y_2 \mid x)}\right)^2\right] \\ &= \mathbb{E}_{(x, y_1, y_2) \sim \rho \times \pi^{\star}_{t+1} \times \pi_{t+1}}\left[\left(\log\frac{\pi^{\star}_{t+1}(y_1 \mid x)}{\pi_{t+1}(y_1 \mid x)}\right)^2 + \left(\log\frac{\pi^{\star}_{t+1}(y_2 \mid x)}{\pi_{t+1}(y_2 \mid x)}\right)^2 - 2\left(\log\frac{\pi^{\star}_{t+1}(y_1 \mid x)}{\pi_{t+1}(y_1 \mid x)}\right)\left(\log\frac{\pi^{\star}_{t+1}(y_2 \mid x)}{\pi_{t+1}(y_2 \mid x)}\right)\right] \\ &= \mathbb{E}_{(x, y) \sim \rho \times \pi^{\star}_{t+1}}\left[\left(\log\frac{\pi^{\star}_{t+1}(y \mid x)}{\pi_{t+1}(y \mid x)}\right)^2\right] + \mathbb{E}_{(x, y) \sim \rho \times \pi_{t+1}}\left[\left(\log\frac{\pi^{\star}_{t+1}(y \mid x)}{\pi_{t+1}(y \mid x)}\right)^2\right] \\ &\qquad + 2\,\mathbb{E}_{x \sim \rho}\Bigg[\underbrace{\mathbb{E}_{y \sim \pi^{\star}_{t+1}(\cdot \mid x)}\left[\log\frac{\pi^{\star}_{t+1}(y \mid x)}{\pi_{t+1}(y \mid x)}\right]}_{= D_{\mathrm{KL}}(\pi^{\star}_{t+1}(\cdot \mid x)\,\|\,\pi_{t+1}(\cdot \mid x))} \cdot \underbrace{\mathbb{E}_{y \sim \pi_{t+1}(\cdot \mid x)}\left[\log\frac{\pi_{t+1}(y \mid x)}{\pi^{\star}_{t+1}(y \mid x)}\right]}_{= D_{\mathrm{KL}}(\pi_{t+1}(\cdot \mid x)\,\|\,\pi^{\star}_{t+1}(\cdot \mid x))}\Bigg] \\ &\geq \mathbb{E}_{(x, y) \sim \rho \times \pi^{\star}_{t+1}}\left[\left(\log\frac{\pi^{\star}_{t+1}(y \mid x)}{\pi_{t+1}(y \mid x)}\right)^2\right] + \mathbb{E}_{(x, y) \sim \rho \times \pi_{t+1}}\left[\left(\log\frac{\pi^{\star}_{t+1}(y \mid x)}{\pi_{t+1}(y \mid x)}\right)^2\right]. \qquad (21) \end{aligned} $$
Next, we fix an arbitrary $\widetilde{x} \in \mathcal{X}$, and we have
$$ \begin{aligned} & \mathbb{E}_{y \sim \pi^{\star}_{t+1}(\cdot \mid \widetilde{x})}\left[\left(\log\frac{\pi^{\star}_{t+1}(y \mid \widetilde{x})}{\pi_{t+1}(y \mid \widetilde{x})}\right)^2\right] + \mathbb{E}_{y \sim \pi_{t+1}(\cdot \mid \widetilde{x})}\left[\left(\log\frac{\pi^{\star}_{t+1}(y \mid \widetilde{x})}{\pi_{t+1}(y \mid \widetilde{x})}\right)^2\right] \\ &\geq \left(\mathbb{E}_{y \sim \pi^{\star}_{t+1}(\cdot \mid \widetilde{x})}\left[\left|\log\frac{\pi^{\star}_{t+1}(y \mid \widetilde{x})}{\pi_{t+1}(y \mid \widetilde{x})}\right|\right]\right)^2 + \left(\mathbb{E}_{y \sim \pi_{t+1}(\cdot \mid \widetilde{x})}\left[\left|\log\frac{\pi^{\star}_{t+1}(y \mid \widetilde{x})}{\pi_{t+1}(y \mid \widetilde{x})}\right|\right]\right)^2 \quad \text{(by Jensen's inequality)} \\ &\gtrsim \left(\mathbb{E}_{y \sim \pi^{\star}_{t+1}(\cdot \mid \widetilde{x})}\left[\left|\log\frac{\pi^{\star}_{t+1}(y \mid \widetilde{x})}{\pi_{t+1}(y \mid \widetilde{x})}\right|\right] + \mathbb{E}_{y \sim \pi_{t+1}(\cdot \mid \widetilde{x})}\left[\left|\log\frac{\pi^{\star}_{t+1}(y \mid \widetilde{x})}{\pi_{t+1}(y \mid \widetilde{x})}\right|\right]\right)^2, \qquad (22) \end{aligned} $$
where $a \gtrsim b$ means $a \geq c \cdot b$ for some absolute constant $c$.
We now recall the definition of the $f$-divergence: $D_f(p, q) := \mathbb{E}_{y \sim q}[f(p(y)/q(y))]$ for two distributions $p$ and $q$, where $f : \mathbb{R}_+ \to \mathbb{R}$ is convex with $f(1) = 0$. Thus, we can notice that
$$ \mathbb{E}_{y \sim p}\left[\left|\log\frac{p(y)}{q(y)}\right|\right] + \mathbb{E}_{y \sim q}\left[\left|\log\frac{p(y)}{q(y)}\right|\right] = D_{f_1}(p, q), \quad \text{where } f_1(u) := (1 + u) \cdot |\log(u)|, \ u \in \mathbb{R}_+. \qquad (23) $$
On the other hand, by the definition of the total variation distance, we know
$$ D_{\mathrm{TV}}(p, q) = D_{f_2}(p, q), \quad \text{where } f_2(u) := \tfrac{1}{2}\,|u - 1|, \ u \in \mathbb{R}_+. \qquad (24) $$
It is easy to verify that
$$ f_1(u) - f_2(u) = (1 + u) \cdot |\log(u)| - \tfrac{1}{2}\,|u - 1| \geq 0, \quad \text{for all } u \in \mathbb{R}_+, \qquad (25) $$
as it is a convex function on $\mathbb{R}_+$ with a minimum of $0$ at $u = 1$.
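As an aside, the relation between these two divergences is easy to check numerically; the following small sketch compares $D_{f_1}$ and $D_{\mathrm{TV}}$ on random discrete distributions (a pure illustration, not part of the proof).

import numpy as np

def d_f1(p, q):
    """D_{f1}(p, q) with f1(u) = (1 + u) * |log u|, over a finite support."""
    u = p / q
    return float(np.sum(q * (1.0 + u) * np.abs(np.log(u))))

def d_tv(p, q):
    """Total variation distance, i.e. D_{f2} with f2(u) = |u - 1| / 2."""
    return float(0.5 * np.sum(np.abs(p - q)))

rng = np.random.default_rng(0)
for _ in range(5):
    p, q = rng.dirichlet(np.ones(10)), rng.dirichlet(np.ones(10))
    assert d_f1(p, q) >= d_tv(p, q)  # f1 >= f2 pointwise implies D_f1 >= D_TV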
Therefore, combining Eqs. (22), (23), (24), and (25), we obtain
$$ \mathbb{E}_{y \sim \pi^{\star}_{t+1}(\cdot \mid \widetilde{x})}\left[\left(\log\frac{\pi^{\star}_{t+1}(y \mid \widetilde{x})}{\pi_{t+1}(y \mid \widetilde{x})}\right)^2\right] + \mathbb{E}_{y \sim \pi_{t+1}(\cdot \mid \widetilde{x})}\left[\left(\log\frac{\pi^{\star}_{t+1}(y \mid \widetilde{x})}{\pi_{t+1}(y \mid \widetilde{x})}\right)^2\right] \gtrsim \Big[D_{\mathrm{TV}}\big(\pi_{t+1}(\cdot \mid \widetilde{x}), \pi^{\star}_{t+1}(\cdot \mid \widetilde{x})\big)\Big]^2. \qquad (26) $$
We then apply Eqs. (21) and (26), followed by Eq. (20):
$$ \mathbb{E}_{x \sim \rho}\Big[\big(D_{\mathrm{TV}}\big(\pi_{t+1}(\cdot \mid x), \pi^{\star}_{t+1}(\cdot \mid x)\big)\big)^2\Big] \lesssim \mathbb{E}_{(x, y_1, y_2) \sim \rho \times \pi^{\star}_{t+1} \times \pi_{t+1}}\left[\left(\log\frac{\pi^{\star}_{t+1}(y_1 \mid x)}{\pi_{t+1}(y_1 \mid x)} - \log\frac{\pi^{\star}_{t+1}(y_2 \mid x)}{\pi_{t+1}(y_2 \mid x)}\right)^2\right] \lesssim \frac{C_t R_{\max}^2 \log(|\Pi|/\delta)}{N}, $$
where the last step follows from Eq. (20). This completes the proof.
C Additional Experimental Details
Batched Prompting: We show the annotation prompt in Figure 3. For the sake of efficiency, we “batch” requests to GPT-4, meaning that instead of sending every pair of candidate responses to be annotated separately, we show all candidates side-by-side and ask GPT-4 to apply the scoring rubric to each one in the same context window.
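As a small sketch of how the annotator output can be consumed (the margin threshold and bookkeeping here are illustrative assumptions, not the exact pipeline), the dictionary format requested in Figure 3 can be parsed and turned into a large-margin preference pair as follows:

import json

def build_preference_pair(gpt4_reply, candidates, min_margin=2):
    """Parse the {"idX_justification": ..., "idX_score": ...} reply described in
    Figure 3 and return a (preferred, dispreferred) pair if the margin is large enough.

    gpt4_reply: JSON string returned by the annotator.
    candidates: list of candidate responses, where candidates[i] has id "id{i+1}".
    min_margin: illustrative threshold for what counts as a "large margin" pair.
    """
    scores = json.loads(gpt4_reply)
    ranked = sorted(
        ((scores[f"id{i + 1}_score"], text) for i, text in enumerate(candidates)),
        reverse=True,
    )
    (best_score, best), (worst_score, worst) = ranked[0], ranked[-1]
    if best_score - worst_score >= min_margin:
        return best, worst
    return None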
Cost Analysis: We also did a brief cost analysis for the scaled-up experiment on 600k training inputs. The major line items are the cost of sampling outputs from the policy, annotating them with GPT-4 to construct training pairs, and then training the next iteration against those pairs. For each of the six iterations:
1. Sampling outputs from the current policy varies with the average response length, costing about $6,000 based on spot pricing.
2. Annotating the candidates with GPT-4 involved roughly 450M tokens, with an average of about 60M completion tokens, amounting to about $34,000 based on the version of the endpoint we were using.
3. Training the next iteration of the policy against the resulting preference pairs.
Review the user’s question and the corresponding response using the additive 6-point scoring system described below. Points
are accumulated based on the satisfaction of each criterion:
1.
the question or other random question/answer pairs or examples, etc.
2.
incomplete or contains some irrelevant content.
3.
completely resolve the query or provide a direct answer.
4.
direct way, and is largely devoid of errors, extraneous content, or excess wordiness.
5.
aspects of the user’s question in a helpful, well-organized, focused, and succinct style.
6.
without any unnecessary information or wordiness, reflecting expert knowledge, amounting to a superbly high-quality,
focused, and insightful answer.
### User: <instructions>
### Candidate Responses:
<candidate answers>
### Instructions: After examining the user’s instruction and the response:
1.
2.
Remember to assess from the AI Assistant perspective, utilizing web search knowledge as necessary. To evaluate the response
in alignment with this additive scoring model, we’ll systematically attribute points based on the outlined criteria. Remember to
score EACH candidate answer separately. Format your justification and score only as a python dictionary with no extraneous
output e.g. {“id1_justification”: “...”, “id1_score”: 3, “id2_justification” : ...} with “idX” corresponding to the id of the X’th
answer.
{“id1_justification” :
Figure 3: The annotation prompt submitted to GPT-4 to collect multiple side-by-side preference scores for DNO. One of the candidates is also an output from GPT-4 itself, along with several outputs from the current policy.
Example 1
Question: When will the earth run out of fresh water?
Example 1: Initial SFT model output
The earth will not run out of fresh water. There is a finite amount of fresh water on the planet, but it is constantly being
replenished through the water cycle. The water cycle involves the evaporation of water from the earth’s surface, its condensation
into clouds, and its precipitation back to the earth’s surface as rain, snow, or other forms of precipitation. This cycle ensures
that fresh water is always available, although the distribution of fresh water may vary depending on factors such as climate,
geography, and human use.
Example 1: DNO Iter 1
It is difficult to predict exactly when the Earth will run out of fresh water, as it depends on various factors such as climate
change, population growth, water usage, and technological advancements in water management and conservation. However,
current estimates suggest that we are already facing a global water crisis, with many regions experiencing water scarcity and stress.
According to the United Nations, about 2.2 billion people worldwide lack access to safe drinking water, and this number is
expected to increase due to population growth and climate change. The World Wildlife Fund (WWF) estimates that by 2025,
half of the world’s population will be living in water-stressed areas.
While the Earth’s total water supply remains constant, the availability of fresh water is decreasing due to overuse, pollution,
and climate change. It is important to note that fresh water is a renewable resource, but it takes time for it to replenish naturally.
Therefore, it is crucial to manage and conserve water resources to ensure their sustainability for future generations.
To address the global water crisis, it is necessary to implement sustainable water management practices, improve water
infrastructure, promote water conservation and efficiency, and raise awareness about the importance of water conservation.
In summary, while the Earth will not run out of water, the availability of fresh water is becoming increasingly scarce in many
regions. It is important to take action to manage and conserve water resources to ensure their sustainability for future generations.
It is also important to note that the availability of fresh water varies depending on the location and the specific water sources.
Some regions may face water scarcity sooner than others, and it is important to address these issues on a local level.
In conclusion, while it is difficult to predict exactly when the Earth will run out of fresh water, it is clear that we are already
facing a global water crisis. It is important to take action to manage and conserve water resources to ensure their sustainability
for future generations.
Reference:
United Nations. (2019). World Water Development Report 2019. Retrieved from https://www.unwater.org/publications/world-
water-development-report-2019/
World Wildlife Fund. (2019). Water Scarcity. Retrieved from https://www.worldwildlife.org/issues/water-scarcity
Note: The information provided in this response is based on current scientific knowledge and available data. However, it is
important to note that the field of water resources management is constantly evolving, and new research and findings may
emerge in the future.
Note 2: The response provided is based on the information available at the time of writing and may not reflect the most current
data or research. It is important to consult the latest scientific literature and data sources for the most up-to-date information on
the topic.
Note 3: The response provided is based on the assumption that the question is asking about the availability of fresh water
for human consumption. However, it is important to note that fresh water is also essential for ecosystems and other living
organisms. The availability of fresh water for all purposes is a complex issue that requires a holistic approach to management
and conservation.
Note 4: The response provided is based on the assumption that the question is asking about the availability of fresh water on a
global scale.
Example 1: DNO Iter 2
It is not accurate to say that the Earth will "run out" of fresh water, but rather that we could potentially deplete or
contaminate fresh water resources to a point where they can no longer sustain us. Fresh water is a renewable resource, but
its availability depends on various factors such as climate, geography, and human activities. Here are some key points to consider:
1. Water cycle: The water cycle, which includes evaporation, condensation, precipitation, and runoff, constantly replenishes
fresh water resources. However, the rate of replenishment varies depending on the region and climate.
2. Groundwater: Groundwater is a significant source of fresh water, but it is not infinite. Over-extraction of groundwater can
lead to depletion and contamination. It takes thousands of years for groundwater to replenish naturally.
3. Surface water: Surface water, such as rivers and lakes, is also a renewable resource, but it can be affected by climate change,
deforestation, and other human activities. Droughts and changes in precipitation patterns can reduce the availability of surface
water.
4. Contamination: Contamination from pollution, agricultural runoff, and other sources can render fresh water unusable. This
includes both groundwater and surface water.
5. Population growth: As the global population continues to grow, the demand for fresh water is increasing. This can lead to
over-extraction, contamination, and other issues that can impact the availability of fresh water.
6. Climate change: Climate change is expected to exacerbate water scarcity in many regions due to changes in precipitation
patterns, increased evaporation, and melting of glaciers and snowpacks.
In conclusion, while fresh water is a renewable resource, its availability is not infinite and can be impacted by various factors. It
is essential to manage and conserve fresh water resources sustainably to ensure their long-term availability for future generations.
So, instead of asking when the Earth will "run out" of fresh water, it is more accurate to ask how we can manage and conserve
fresh water resources to ensure their long-term sustainability.
Example 1: DNO Iter 3
The Earth as a planet does not "run out" of fresh water, as it continues to cycle through the hydrological cycle, which includes
processes such as evaporation, condensation, precipitation, and infiltration. However, the availability and accessibility of fresh
water for human use are significant concerns.
The freshwater crisis is primarily driven by population growth, urbanization, agricultural expansion, and climate change, which
lead to increased demand, pollution, and variability in water availability. According to the United Nations, approximately 2.2
billion people lack access to safe drinking water services, and 4.2 billion people experience severe water scarcity for at least
one month a year.
While the Earth will not run out of fresh water, the distribution and availability of fresh water for human use are be-
coming increasingly challenging. To address this issue, it is crucial to implement sustainable water management practices,
reduce water waste, and promote equitable access to fresh water resources.
In summary, the Earth will not run out of fresh water, but the availability and accessibility of fresh water for human
use are becoming increasingly limited, and it is essential to address this challenge through sustainable water management
practices and equitable access to resources.
Table 5: Outputs of the various DNO models across iterations for the question: When will the earth run out of fresh water? The output for Iter 1 is a bit too long, as shown above. We believe this could be addressed with better hyperparameter tuning or preference data. We find the initial SFT model's response to be missing important points about how the premise of the question is best addressed by highlighting access to fresh water. The last response, for Iter 3, is more informative and specific than the initial SFT response.
Example 2
Question: what data structure would you use to design a url lookup system to quickly determine whether a url exists, or
if not, its closest prefix. Explain.
Example 2: Initial SFT model output
To design a URL lookup system that can quickly determine whether a URL exists or find its closest prefix, we can use a Trie
data structure.
A Trie is a tree-like data structure that stores strings in a hierarchical manner. Each node in the Trie represents a character in
the string, and the edges represent the characters themselves. The root node represents an empty string, and each subsequent
node represents a character appended to the previous node’s string.
In our URL lookup system, we can use a Trie to store all the existing URLs. Each node in the Trie will represent a URL prefix,
and the edges will represent the characters in the URL.
To determine whether a URL exists, we can traverse the Trie starting from the root node. If we reach a node that represents a
complete URL, we know that the URL exists. If we reach a node that represents an incomplete URL, we know that the URL
does not exist.
To find the closest prefix of a URL, we can traverse the Trie starting from the root node. We can keep track of the longest prefix
we have encountered so far. Whenever we encounter a node that represents a URL prefix, we compare its length with the length
of the longest prefix we have encountered so far. If the length of the current prefix is greater, we update the longest prefix.
By using a Trie, we can efficiently determine whether a URL exists or find its closest prefix in constant time complexity. This
is because the Trie allows us to traverse the data structure in a hierarchical manner, and we can quickly determine whether a
URL exists or find its closest prefix by following the appropriate edges.
Example 2: DNO Iter 1
To design a URL lookup system that quickly determines whether a URL exists or finds its closest prefix, you can use a
combination of a Trie (Prefix Tree) and a Bloom Filter. This approach provides efficient lookup and prefix matching while
maintaining a low memory footprint.
1.
character in the string, and the edges represent the characters themselves. The root node is empty, and each subsequent
node contains a single character. This data structure allows for efficient prefix matching and string lookup.
2.
whether an element is a member of a set. It provides a fast way to check if a URL exists in the system, but it may produce
false positives.
Here’s how you can use these data structures together:
1.
characters in the URL.
2.
When a URL is added to the system, it is hashed using the hash functions, and the resulting bits are set in the Bloom
Filter.
3.
the system. If the URL is present in the Bloom Filter, look up the URL in the Trie. If the URL is found in the Trie, it
exists in the system. If the URL is not found in the Trie, it may be a prefix of an existing URL.
4.
Trie until you find the longest prefix that matches the input URL. This will give you the closest prefix.
This approach provides efficient lookup and prefix matching while maintaining a low memory footprint. The Trie allows for
efficient string lookup, and the Bloom Filter provides a fast way to check if a URL exists in the system. However, the Bloom
Filter may produce false positives, so it should be used in conjunction with the Trie to ensure accuracy.
Note: The Trie and Bloom Filter can be optimized further by using a compressed Trie or a Bloom Filter with a variable-size bit
array. These optimizations can reduce the memory footprint and improve the performance of the system.
Example 2: DNO Iter 2
To design a URL lookup system that can quickly determine if a URL exists or find its closest prefix, you can use a data structure
called a Trie (also known as a Prefix Tree). A Trie is a tree-like data structure that stores strings as nodes, where each edge
is labeled with a single character. This data structure is particularly efficient for searching and inserting strings with common
prefixes.
Here’s how you can use a Trie to implement a URL lookup system:
1. Create a TrieNode class that stores a flag marking the end of a URL and
a dictionary to store child nodes.
2. To insert a URL, traverse the Trie character by character.
If the character is not present in the current node's dictionary, create a new node and add it.
3. To search for a URL, traverse the Trie character by character. If the character is present in the
current node's dictionary, move to the corresponding child node. If the character is not present, return the closest prefix
by traversing the Trie upwards from the last valid node.
Here’s an example implementation in Python:
class TrieNode:
    def __init__(self, char="", parent=None):
        self.char = char                # character on the edge leading to this node
        self.parent = parent            # parent node, used to rebuild prefixes
        self.is_end_of_word = False
        self.children = {}


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, url):
        current_node = self.root
        for char in url:
            if char not in current_node.children:
                current_node.children[char] = TrieNode(char, current_node)
            current_node = current_node.children[char]
        current_node.is_end_of_word = True

    def search(self, url):
        current_node = self.root
        for char in url:
            if char in current_node.children:
                current_node = current_node.children[char]
            else:
                return self.find_closest_prefix(current_node)
        return url if current_node.is_end_of_word else None

    def find_closest_prefix(self, current_node):
        prefix = ""
        while current_node is not None and current_node.parent is not None:
            prefix = current_node.char + prefix
            current_node = current_node.parent
        return prefix
This implementation allows you to quickly determine if a URL exists or find its closest prefix by searching the Trie. The
‘insert‘ method inserts URLs into the Trie, and the ‘search‘ method searches for a URL or returns its closest prefix if it doesn't exist.
Example 2: DNO Iter 3
To design a URL lookup system that can quickly determine if a URL exists or find its closest prefix, you can use a data structure
called the Trie (also known as Prefix Tree). A Trie is a tree-like data structure where each node represents a character in the key
(URL in this case) and each branch corresponds to a possible character in the alphabet.
Here’s how you can implement a URL lookup system using a Trie:
1.
the URL exists.
2.
characters of the URL. If a character doesn’t exist in the Trie, create a new node for it.
3.
(a)
(b)
(c)
(d)
(e)
To search for a URL or find its closest prefix, follow these steps:
1.
2.
3.
4.
Table 6: Outputs of the various DNO models across iterations for an interview question about designing a URL lookup
system. The last response, from Iter 3, is more informative and clearer, and doesn't contain misleading
information (searching a trie runs in time linear in the length of the string, not constant time). The response for
Iter 1 contains an implementation of a Trie, which may be unnecessary because the user did not ask for it.
Example 3
Question: What factors led to the outbreak of ww1
Example 3: Initial SFT model output
There were several factors that led to the outbreak of World War I:
1. Nationalism: The rise of nationalism in Europe led to a desire for territorial expansion and the desire to unite people of
the same ethnicity or language under one nation. This led to tensions between countries and a desire to assert dominance over
neighboring territories.
2. Imperialism: The scramble for colonies in Africa and Asia led to competition between European powers, which further
fueled tensions and rivalries.
3. Militarism: The arms race between European powers led to a buildup of military forces and weapons, which created a sense
of insecurity and fear among neighboring countries.
4. Alliances: The complex system of alliances between European powers created a web of interdependencies that made it
difficult for any one country to remain neutral in the event of a conflict.
5. Assassination of Archduke Franz Ferdinand: The assassination of Archduke Franz Ferdinand of Austria-Hungary by a
Serbian nationalist in 1914 was the immediate trigger for the war, but it was the culmination of these other factors that led to
the outbreak of the war.
Example 3: DNO-More-Data Iter 1
The outbreak of World War I was the result of a complex interplay of factors, including:
1.
rivalries. The scramble for Africa and the race to acquire overseas territories created a sense of competition and mistrust
among nations.
2.
a buildup of military power. The development of new technologies, such as machine guns, artillery, and submarines,
increased the destructive potential of warfare.
3.
in one region could escalate into a larger war. The Triple Entente (France, Russia, and Britain) and the Triple Alliance
(Germany, Austria-Hungary, and Italy) were the main alliances that formed the basis of the conflict.
4.
contributed to the outbreak of war. The desire for self-determination and independence led to conflicts and tensions
among different ethnic groups.
5.
Serbian nationalist in 1914 was the immediate trigger for the war. The complex web of alliances and rivalries among
European powers led to a series of diplomatic maneuvers and ultimatums, which ultimately resulted in the declaration of
war.
6.
leading up to the war. The economic rivalry between Germany and Britain, particularly in the context of colonial
competition, added to the existing tensions.
7.
Alexander I of Serbia in 1903, created a sense of uncertainty and instability. These political upheavals contributed to the
overall atmosphere of tension and conflict.
8.
the breakdown of communication and trust. The inability of leaders to find peaceful solutions to disputes contributed to
the escalation of tensions and the eventual outbreak of war.
9.
also played a role in the outbreak of war. The lack of clear communication and misunderstandings between different
countries led to a series of missteps and escalations.
10.
where conflicts in one region could have a domino effect on other regions. The interconnectedness of these factors led
to a chain reaction of events that ultimately led to the outbreak of World War I.
Example 3: DNO-More-Data Iter 6
The outbreak of World War 1, which started on July 28, 1914, resulted from a complex web of events and circumstances that
had been developing for years prior. Here are some of the key factors that led to the war:
1.
along their borders. This led to a sense of insecurity and a belief that preventive war was necessary to secure national
interests.
2.
the Triple Alliance (Germany, Austria-Hungary, and Italy). These alliances created a complex web of obligations that
meant if one member was attacked, all members were obliged to come to their aid.
3.
tensions between the major powers.
4.
assassination of Archduke Franz Ferdinand of Austria, which triggered the war, was a result of nationalist tensions in the
Balkans.
5.
the first country feel more insecure, leading to a spiral of actions and reactions that escalated the situation.
6.
Serbian nationalist group, provided the immediate trigger for the war.
7.
to Austria-Hungary declaring war on Serbia.
These factors combined to create a powder keg that was waiting for a spark to ignite the war. Once the spark was lit, the
complex web of alliances meant that the conflict quickly escalated into a global war.
Table 7: Outputs of the first and last DNO-More-Data iterations for the third example, on the factors that led to the
outbreak of WW1. The last response, from Iter 6, has a higher information density; it recalls more key facts and entities, including
the specific date of the start of WW1, the Triple Entente, the Triple Alliance, Gavrilo Princip, and the Black Hand. Iteration
1 also contains some of this information, but is too wordy. The initial SFT model's answer seems a lot more superficial.
Teaching Language Models to Self-Improve with General Preferences
Corby Rosset
∗
Ching-An Cheng Arindam Mitra Michael Santacroce
Ahmed Awadallah
∗
Tengyang Xie
∗
Microsoft Research
Abstract
This paper studies post-training large language models (LLMs) using preference feedback from a powerful oracle
to help a model iteratively improve over itself. The typical approach for post-training LLMs involves Reinforcement
Learning from Human Feedback (RLHF), which traditionally separates reward learning and subsequent policy opti-
mization. However, such a reward maximization approach is limited by the nature of “point-wise” rewards (such as that
of the Bradley-Terry model), which fails to express complex intransitive or cyclic preference relations. While advances
on RLHF show reward learning and policy optimization can be merged into a single contrastive objective for stability,
they yet still remain tethered to the reward maximization framework. Recently, a new wave of research sidesteps the
reward maximization presumptions in favor of directly optimizing over “pair-wise” orgeneral preferences. In this
paper, we introduceDirect Nash Optimization(DNO), aprovableandscalablealgorithm that marries thesimplicity
andstabilityof contrastive learning withtheoretical generalityfrom optimizing general preferences. BecauseDNOis
abatched on-policyalgorithm using a regression-based objective, its implementation is straightforward and efficient.
Moreover,DNOenjoysmonotonic improvementacross iterations which helps it improve even over a strong teacher (such
as GPT-4). In our experiments, a resulting 7B parameterOrca-2.5model aligned byDNOachieves the state-of-the-art
win-rate againstGPT-4-Turboof 33% on AlpacaEval 2.0 (even after controlling for response length), an absolute
gain of 26% (7%→33%) over the initializing model. It outperforms models with far more parameters, including
Mistral Large, Self-Rewarding LM (70B parameters), and older versions of GPT-4. Our ablation studies analyze
critical design decisions surrounding the choice of preference pairs, and the use of LLMs-as-preference-annotators.
These results underscore the promise ofDNOin the LLMs post-training, as well as offer actionable insights for the AI
research community.
1 Introduction
The field of artificial intelligence is evolving towards advanced models that can understand, reason, follow complex
instructions, and create nuanced content, while aligning with human values and preferences. Large Language Models
(LLMs) (e.g.,,;,;,;,) have demonstrated
remarkable capabilities in generating human-like text, answering questions, and coding, yet they still face challenges
in tasks that require a high degree of reliability, safety, and ethical alignment. To address these challenges, fine-tuning
LLMs using Reinforcement Learning from Human Feedback (RLHF) (Christiano et al.,;,;
et al.,) has demonstrates strong potential for making LLMs more helpful by aligning them with human values.
The RLHF framework has been long studied in the context of preference-based reinforcement learning (RL) or RL
from human preferences (e.g.,,;,;,;,;
Christiano et al.,). The conventional methods for RLHF typically assume that the preference is determined by
a scalar reward function through some model, such as the frequently used Bradley-Terry (BT) model (Bradley and
Terry,). 1RLHF then optimizes toward the preference in a two-step procedure: reward learning, and policy
optimization (through RL) to maximize the learned reward. Under certain conditions, the two-step procedure can be
∗
Correspondence to{corbyrosset,hassanam,tengyangxie}@microsoft.com
1We use “reward model” to denote a framework that translates preferences into rewards, e.g., Bradley-Terry, while “reward function” is a (possibly
learned) function that outputs reward scalars.
Figure 1: Direct Nash Optimization achieves state-of-the-art results for a 7B parameter large language model, being
the first to surpass 30% in both raw win-rate and length-controlled (LC) win-rate against GPT-4-Turbo. Win Rate and
LC Win Rate have 0.93 to 0.98 correlation with ChatBot Arena scores. (The figure plots Win Rate against LC Win Rate, both on axes from roughly 0% to 45%.)
streamlined into a single-step contrastive learning approach (Rafailov et al., 2023), eliminating the need for explicit
reward learning. Algorithms of this kind (e.g., DPO) leverage the insight that a policy can be
expressed equivalently by an "internal reward function" that the policy is optimal to, so they reduce the RLHF problem
to regressing the policy's internal reward function to that of the preference model. These algorithms are originally
offline, and boast enhanced stability and ease of optimization. Nonetheless, two-step RLHF algorithms and their
single-step contrastive variants still fundamentally rely on the reward maximization framework, wherein reward-based
preferences are governed by, e.g., the BT model.
The reward maximization framing poses a major limitation. Reward functions, defined to output a scalar score $r(x, y)$
for a single response $y$ to input $x$, cannot express general preferences $y \succ y' \mid x$ between a pair of outputs in all cases,
e.g., intransitive or cyclic preferences (Elo, 1978). Hence, LLMs trained under reward maximization cannot always
align with human preference. Furthermore, recent works show that even in settings where preferences can be perfectly
expressed under the reward-based BT models, optimizing towards rewards yields problematic behaviors; we refer the
reader to several recent (2023) analyses for more details. Lastly, reward functions in
practice can quickly become "stale" as the distribution of the policy shifts under training (Ross et al., 2011, among others), leaving them vulnerable to "reward hacking" (Amodei et al., 2016).
In response to these weaknesses, an appealing line of work on RLHF proposes to directly optimize the general preference
function itself, instantiated as some oracle. These studies re-frame RLHF as finding a Nash equilibrium of a two-player
game with "payoffs" from a regularized (Munos et al., 2023) or un-regularized (Swamy et al., 2024) general preference
function. To solve it, they further approximate such a Nash equilibrium using single-player algorithms by leveraging the
symmetry of the preference function. They then define the reward of a response as the expected win-rate against
the policy's own behavior, as judged by the preference function, e.g., $r(x, y) = \mathbb{E}_{y' \sim \pi(\cdot \mid x)}[P(y \succ y' \mid x)]$. Hence,
rewards are maximized by responses that are preferred over the policy's expected response, and a Nash equilibrium
is achieved when both players deploy a $\pi^{\star}$ that is preferred over any competing policy. However, these proposed
single-player algorithms are primarily (purely) on-policy, and they sometimes require a separately estimated preference
function or a time-varying reward function. How to scale these algorithms up faithfully is still under-investigated.
We are motivated to overcome two separate challenges: the limited expressivity of reward-based RLHF, and the
lack of clarity on how to scale up optimizing with respect to general preferences. Recent advances in reward-based
optimization e.g., DPO, already have efficient and scalable implementations – we seek a similarly efficient solution
under the framework of general preferences.
We propose a provable and scalable RLHF algorithm, Direct Nash Optimization (DNO) (Algorithm 1), that achieves
the best of both worlds, combining the scalability of contrastive objectives with the theoretical soundness of general
preference optimization. DNO is designed as a batched on-policy algorithm with a regression-based learning objective;
this design choice makes DNO stable and scalable, striking a balance between deployment efficiency and adaptability.
We summarize the key ingredients and insights of DNO at a high level below.
1.
the notion of reward ought to be expressed as expected win-rates with regard to a general preference function.2
2.
is sample-inefficient or unstable, we decompose the learning procedure into a sequence of “batched on-policy”
iterations, wherein each step instead optimizes a simple regression objective.
3.
the expected win-rate compared with itself (as defined in). By sampling outputs from the
current policy to use for training (i.e., “self-play”), this procedure incentivizes self-improving behavior.
4.
powerful teacher (See choice ofµ1andµ2in).
5.
regression is only performed on preference pairs with a sufficiently large margin (for theoretical explanation, see
Section 4; in practice, see).
6.DNOrepeats this procedure for multiple iterations to let the policy optimize toward the general preference. Since
each step involves a regression problem it can be easily implemented at scale.
Theoretically, we prove that DNO converges to the intended Nash equilibrium on average, and that it can improve monotonically
across iterations (see Section 3.1). Furthermore, our finite-sample analysis shows that the approximation error at any
iteration between the learned policy and the target is tightly bounded (Theorem 1).
On the practical side, we provide a scalable implementation of DNO (Algorithm 2): an iterative self-improving algorithm
with contrastive updates, which approximates Algorithm 1 by sampling multiple online outputs from the policy being trained, using GPT-4 as the preference oracle, comparing on-policy
samples to GPT-4's own (teacher) outputs, and training only on pairs with a "large margin" (for the theoretical
explanation, see Section 4; in practice, see Section 5).
The primary distinction of our work over related works of Nash-MD (Munos et al.,) and SPO (Swamy et al.,)
is that they both exhibit sample efficiency issues (two timescale updates or sample-inefficient RL steps), and both use
purely on-policy samples. We resolve the efficiency issue with a sample-efficient objective that works in practice, and
DNOis more flexible to incorporate off-policy samples from e.g., a powerful teacher.
Most importantly,DNOworks in practice – we provide comprehensive empirical evaluations, resulting in state-of-the-art
performance:
•The resulting 7B parameter Orca-2.5 model, aligned using the practical implementation of DNO (Algorithm 2),
achieves the state-of-the-art win-rate of any 7B model, exceeding 33% against GPT-4-Turbo on
AlpacaEval 2.0, even after controlling for length. This is an over 26% absolute gain (7% → 33%) compared to
the initialized model. It outperforms several recent advanced closed-source models, including Mistral Large
and GPT-4-0613, as well as open-source models with far more (10×) parameters, such as Self-Rewarding LM
(Yuan et al., 2024), which has 70B parameters.
•Our thorough ablation studies in Section 5 analyze critical design decisions surrounding the choice of loss
function (supervised finetuning or contrastive), training paradigm (with or without on-policy samples), preference
annotator quality (large margin or not), and training pair construction (self-play, teacher-vs-student, etc.). Our
findings highlight that the carefully-crafted methods encoded in DNO are central to these results.
•We show some examples of outputs across iterations which demonstrate qualitative improvements such as better
addressing nuanced issues and presumptuous questions (Table 5), better organization and clarity while refraining
from making misleading statements (Table 6), and higher information density in answers (Table 7).
We hope that the results presented herein will provide clarity to the community regarding the use of AI feedback for
post-training LLMs.
2 E.g., for a fixed $y \mid x$, the expected win-rate of $y$ against the policy itself is $\mathbb{E}_{y' \sim \pi(\cdot \mid x)}[P(y \succ y' \mid x)]$.
2 Preliminaries
This section provides an overview of the RL from human feedback (RLHF) pipeline. We do not differentiate between
RLHF and RLAIF (e.g.,,;,), as the distinction is outside our scope of discussion. Thus,
we will uniformly refer to both concepts as RLHF. However, we want to make a clear delineation between two subtle
differences: RLHF maximizing point-wise reward functions, and RLHF optimizing general preferences. It should be
noted that this discussion is more broadly applicable in scope to general contextual bandits setup as well.
Throughout this paper, we use $x \in \mathcal{X}$ to denote the input (i.e., the prompt) received by the LLM from a space $\mathcal{X}$. In
this paper, we do not consider distribution shift over the prompts, following the standard contextual bandits setup of
RLHF, and we use $\rho$ to denote the distribution of the prompts. We use
$y \in \mathcal{Y}$ to denote the response from the LLM given the prompt $x$ (this corresponds to the action in the contextual bandits
setup). We also use $\pi : \mathcal{X} \to \Delta(\mathcal{Y})$ to denote the policy, which is an LLM here, and $\Pi$ is the policy class.
Our discussion throughout this paper will also regularly involve the following three learning paradigms, which are
originally introduced and commonly used in the RL literature:
(1)Offline: The learning algorithm operates without any active data collection, e.g., sampling from the current
policy. The algorithm relies solely on an offline dataset for training.
(2)Purely on-policy: technically, online on-policy. In this setup, learning takes place by sampling outputs from the
latest policy and immediately updating it based on the newly collected data. No data reuse or additional offline
data is considered.
(3)Batched on-policy3: This setup is the middle of the offline and purely on-policy setups, striking a balance
between deployment efficiency and adaptability. It involves iterative online data collection and can use other
offline data. Its distinctive feature is that here the data collection in each iteration occurs in a batched fashion
(e.g., akin to a dataset scale, much larger than the size of a typical mini-batch), and the amount of policy change
can be more significant (e.g., running gradient steps over multiple epochs of a dataset, as opposed to tens of
updates).
2.1 RLHF Based on Reward Models
One typical approach to conducting RLHF is a two-step procedure through a reward function (Christiano et al., 2017).
Suppose a preference dataset $\mathcal{D}_{\mathrm{pref}} := \{(x, y^{+}, y^{-})\}$ is given, where $(y^{+}, y^{-}) \sim \pi_{\mathrm{ref}}(\cdot \mid x)$, $\pi_{\mathrm{ref}}$ is some reference
policy such as the policy obtained after supervised fine-tuning (SFT), and a preference $y^{+} \succ y^{-} \mid x$ is labeled by
some human or AI annotator. In RLHF with reward functions, the preference is assumed to be generated based on
some latent reward $r^{\star}$. The first step is to learn a reward function $r \in \mathcal{R}$ under some reward model assumption, where
$\mathcal{R}$ is the reward class. A number of reward model assumptions have been studied, and the Bradley-Terry (BT) model
(Bradley and Terry, 1952) is the most commonly used one. The BT model assumes the probability of $y^{+} \succ y^{-} \mid x$
satisfies
\[
P(y^{+} \succ y^{-} \mid x) := \frac{\exp\big(r^{\star}(x, y^{+})\big)}{\exp\big(r^{\star}(x, y^{+})\big) + \exp\big(r^{\star}(x, y^{-})\big)}.
\]
This leads to the maximum-likelihood reward learning objective:
\[
\hat{r} \leftarrow \arg\max_{r \in \mathcal{R}} \; \mathbb{E}_{(x, y^{+}, y^{-}) \sim \mathcal{D}} \Big[ \log \sigma\big(r(x, y^{+}) - r(x, y^{-})\big) \Big], \quad (1)
\]
where $\sigma(\cdot) := \frac{\exp(\cdot)}{1 + \exp(\cdot)}$ is the sigmoid function. After that, the LLM is finetuned using the learned $\hat{r}$ with RL,
\[
\hat{\pi} \leftarrow \arg\max_{\pi \in \Pi} \; \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{pref}},\, y \sim \pi(\cdot \mid x)} \Big[ \hat{r}(x, y) - \beta \log \frac{\pi(x, y)}{\pi_{\mathrm{ref}}(x, y)} \Big], \quad (2)
\]
3 We acknowledge an abuse of terminology. Our algorithm is not entirely online, as it only performs batched data collection. It is also not strictly
on-policy, because it uses examples from other policies, like a teacher. While "offline" or "off-policy" may be technically more accurate, those terms might
lead to misunderstanding among readers and detract from the emphasis we want to place on collecting samples from the current policy, which
constitute the majority of our training data.
where the KL penalty term, $\mathbb{E}_{x \sim \mathcal{D}_{\mathrm{pref}},\, y \sim \pi(\cdot \mid x)}\big[\beta \log \frac{\pi(x, y)}{\pi_{\mathrm{ref}}(x, y)}\big]$, is used to mitigate overoptimization of the reward
model (Ouyang et al., 2022), controlled by the coefficient $\beta$. For the purposes of our discussion, we will refer to the
objective above as the "PPO objective", and to this two-step learning procedure as "PPO".
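To make Eq. (1) concrete, the following is a minimal PyTorch sketch of the Bradley-Terry maximum-likelihood reward objective. The `reward_model` callable, which maps an encoded (prompt, response) pair to a scalar, and the toy linear head in the usage snippet are illustrative assumptions, not part of the paper.

import torch
import torch.nn.functional as F

def bt_reward_loss(reward_model, x, y_pos, y_neg):
    # r(x, y+) and r(x, y-): scalar rewards for the preferred and dispreferred responses.
    r_pos = reward_model(x, y_pos)
    r_neg = reward_model(x, y_neg)
    # Negative log-likelihood of the BT model, Eq. (1): -log sigma(r(x, y+) - r(x, y-)).
    return -F.logsigmoid(r_pos - r_neg).mean()

# Toy usage: random features stand in for (prompt, response) encodings (assumption).
torch.manual_seed(0)
head = torch.nn.Linear(16, 1)                                   # hypothetical scalar reward head
reward_model = lambda x, y: head(torch.cat([x, y], dim=-1)).squeeze(-1)
x, y_pos, y_neg = torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 8)
print(bt_reward_loss(reward_model, x, y_pos, y_neg))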
DPO. Direct Preference Optimization (DPO) was proposed by Rafailov et al. (2023) as an alternative RLHF approach that
combines the two-step procedure of PPO into a single objective. It utilizes the closed-form solution $\hat{\pi}$ of Eq. (2), so
that $\hat{\pi}$ is solved directly from
\[
\hat{\pi} \leftarrow \arg\max_{\pi \in \Pi} \; \mathbb{E}_{(x, y^{+}, y^{-}) \sim \mathcal{D}_{\mathrm{pref}}} \log\bigg[ \sigma\bigg( \beta \log \frac{\pi(y^{+} \mid x)}{\pi_{\mathrm{ref}}(y^{+} \mid x)} - \beta \log \frac{\pi(y^{-} \mid x)}{\pi_{\mathrm{ref}}(y^{-} \mid x)} \bigg) \bigg].
\]
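The DPO objective above reduces to a simple pairwise loss once per-response log-probabilities are available. Below is a minimal sketch, assuming the log-probabilities under the policy and the frozen reference policy have already been summed over tokens; the tensor values in the usage line are made up.

import torch
import torch.nn.functional as F

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Negative of the DPO objective for a batch of (y+, y-) pairs.

    logp_*     : log pi(y|x) under the policy being trained (summed over tokens).
    ref_logp_* : log pi_ref(y|x) under the frozen reference policy.
    """
    margin = beta * (logp_pos - ref_logp_pos) - beta * (logp_neg - ref_logp_neg)
    return -F.logsigmoid(margin).mean()

# Toy usage with made-up log-probabilities for two pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-13.0, -10.0]), torch.tensor([-13.5, -9.2]))
print(loss)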
2.2 RLHF with General Preferences
We now introduce the setup for directly optimizing a general preference function, as well as provide an overview of
existing solutions to achieve this goal (mostly by leveraging the symmetry of the preferences), especially those proposed
by2023);2024).
Here we assume that the learner is given query access to a general preference function $P(y \succ y' \mid x) \in [0, 1]$, for any
$(x, y, y') \in \mathcal{X} \times \mathcal{Y} \times \mathcal{Y}$. This function indicates the probability that action $y$ is preferred over $y'$ given the context
$x$. In practice, this setup can be viewed as the theoretical model behind RLAIF, human-in-the-loop RLHF, or distillation fine-tuning.
One common difficulty in optimizing a general preference function is its intransitivity; e.g., it is possible that $P(a \succ b) = P(b \succ c) = P(c \succ a) = 1$ for some options $(a, b, c)$ (for details see, e.g., Swamy et al., 2024). Therefore, the learning goal of optimizing general preferences can be taken to be the Nash equilibrium of
the two-player zero-sum game whose payoffs are given by the general preference function $P$. Such a Nash
equilibrium is formally defined by the Minimax Winner, $\mathrm{MW}$ (see, e.g., Fishburn, 1984), or the von Neumann Winner:
\[
\mathrm{MW}(P) := \arg\max_{\pi \in \Pi}\, \arg\min_{\pi' \in \Pi} P(\pi \succ \pi') = \bigg( \arg\max_{\pi \in \Pi} \min_{\pi' \in \Pi} P(\pi \succ \pi'),\; \arg\min_{\pi' \in \Pi} \max_{\pi \in \Pi} P(\pi \succ \pi') \bigg), \quad (3)
\]
where
\[
P(\pi \succ \pi') := \mathbb{E}_{x \sim \rho,\, y \sim \pi(\cdot \mid x),\, y' \sim \pi'(\cdot \mid x)}\big[ P(y \succ y' \mid x) \big].
\]
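A tiny worked example of why the minimax winner must be a mixed policy: with the intransitive cycle above, no single response beats the others, but the uniform mixture ties every opponent at a win-rate of 0.5 and thus attains the value of the game. This is a self-contained numpy sketch with a made-up preference matrix for a single prompt; it is not the paper's code.

import numpy as np

# P[i, j] = P(y_i > y_j | x) for three responses (a, b, c) with the cycle a > b > c > a.
P = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])

def winrate(pi, pi_prime, P):
    # P(pi > pi') = E_{y ~ pi, y' ~ pi'}[P(y > y' | x)]
    return pi @ P @ pi_prime

uniform = np.ones(3) / 3
# Against every pure opponent, the uniform policy wins with probability exactly 0.5,
# so it achieves the minimax value of this symmetric zero-sum game.
print([winrate(uniform, np.eye(3)[j], P) for j in range(3)])  # [0.5, 0.5, 0.5]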
SPO. To approximate the Nash equilibrium defined in Eq. (3), Swamy et al. (2024) proposed a single-player
algorithm, SPO. This algorithm applies results from no-regret online learning. The SPO
algorithm essentially executes the following two-step iterative process: for each $t = 1, 2, \ldots, T$,
\[
\text{(i)}\quad r_t(x, y) \leftarrow \mathbb{E}_{y' \sim \pi_t(\cdot \mid x)}\big[ P(y \succ y' \mid x) \big], \quad \forall (x, y) \in \mathcal{X} \times \mathcal{Y};
\]
\[
\text{(ii)}\quad \pi_{t+1}(\cdot \mid x) \leftarrow \frac{1}{Z_t(x)}\, \pi_t(\cdot \mid x)\, \exp\!\Big( \frac{r_t(x, \cdot)}{\eta} \Big), \quad \forall x \in \mathcal{X}, \quad (4)
\]
where $\eta$ is the learning rate, $\pi_1$ is the uniform policy, i.e., $\pi_1(\cdot \mid x) \leftarrow \mathrm{unif}(\mathcal{Y})$ for all $x \in \mathcal{X}$, and $Z_t(x) := \sum_{y \in \mathcal{Y}} \pi_t(y \mid x) \exp\big( \frac{r_t(x, y)}{\eta} \big)$ is the partition function for iteration $t$.
Using the no-regret update of soft policy iteration, as shown in Eq. (4), Swamy et al. (2024) proved that the uniform
mixture of $\pi_{1:T}$ from SPO is an approximation of the Nash equilibrium $\mathrm{MW}(P)$, as defined in Eq. (3).
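The two-step update in Eq. (4) is easy to simulate on a small discrete response set. The sketch below runs the multiplicative-weights form of soft policy iteration on the toy intransitive preference matrix used earlier; the starting policy, step size, and iteration count are arbitrary choices for illustration (the algorithm itself starts from the uniform policy), and this is not the authors' implementation.

import numpy as np

P = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])   # same intransitive preference as before

def spo_toy(P, eta=10.0, T=2000):
    n = P.shape[0]
    pi = np.array([0.6, 0.3, 0.1])          # start away from uniform so the dynamics are visible
    mixture = np.zeros(n)
    for _ in range(T):
        r = P @ pi                           # (i)  r_t(y) = E_{y' ~ pi_t}[P(y > y')]
        w = pi * np.exp(r / eta)             # (ii) soft policy iteration step
        pi = w / w.sum()                     #      normalize by the partition function Z_t
        mixture += pi
    return mixture / T                       # uniform mixture of pi_{1:T}

print(spo_toy(P))  # approximately (1/3, 1/3, 1/3), the minimax winner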
Nash-MD. Munos et al. (2023) proposed Nash-MD to approximate the Nash equilibrium of a KL-regularized preference
function,
\[
P^{\pi,\pi'}_{\tau}(y \succ y' \mid x) := P(y \succ y' \mid x) - \tau \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \tau \log \frac{\pi'(y' \mid x)}{\pi_{\mathrm{ref}}(y' \mid x)}, \quad (5)
\]
\[
P_{\tau}(\pi \succ \pi') := \mathbb{E}_{x \sim \rho,\, y \sim \pi(\cdot \mid x),\, y' \sim \pi'(\cdot \mid x)}\Big[ P^{\pi,\pi'}_{\tau}(y \succ y' \mid x) \Big] \quad (6)
\]
\[
= P(\pi \succ \pi') - \tau\, \mathbb{E}_{x \sim \rho}\big[ D_{\mathrm{KL}}\big(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big) \big] + \tau\, \mathbb{E}_{x \sim \rho}\big[ D_{\mathrm{KL}}\big(\pi'(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big) \big].
\]
Following this, Munos et al. (2023) demonstrate that the Nash equilibrium of $\mathrm{MW}(P_{\tau})$ can be approximated using
a mirror-descent-inspired algorithm (Nemirovskij and Yudin, 1983),
Nash-MD, which has a last-iterate guarantee. The Nash-MD algorithm can be viewed as a two-step iterative process:
for each $t = 1, 2, \ldots, T$,
\[
\text{(i)}\quad r_t(x, y) \leftarrow \mathbb{E}_{y' \sim \pi^{\tau}_t(\cdot \mid x)}\big[ P(y \succ y' \mid x) \big], \quad \forall (x, y) \in \mathcal{X} \times \mathcal{Y};
\]
\[
\text{(ii)}\quad \pi_{t+1}(\cdot \mid x) \leftarrow \frac{1}{Z_t(x)}\, \pi^{\tau}_t(\cdot \mid x)\, \exp\!\Big( \frac{r_t(x, \cdot)}{\eta} \Big), \quad \forall x \in \mathcal{X},
\]
where $\eta$ is the learning rate and $\pi^{\tau}_t$ is the geometric mixture between $\pi_t$ and $\pi_{\mathrm{ref}}$,
\[
\pi^{\tau}_t(y \mid x) := \frac{ \pi_t(y \mid x)^{1 - \tau/\eta}\, \pi_{\mathrm{ref}}(y \mid x)^{\tau/\eta} }{ \sum_{y' \in \mathcal{Y}} \pi_t(y' \mid x)^{1 - \tau/\eta}\, \pi_{\mathrm{ref}}(y' \mid x)^{\tau/\eta} }, \quad \forall (x, y) \in \mathcal{X} \times \mathcal{Y}, \quad (7)
\]
and $Z_t(x) := \sum_{y \in \mathcal{Y}} \pi^{\tau}_t(y \mid x) \exp\big( \frac{r_t(x, y)}{\eta} \big)$ is the partition function for iteration $t$.
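The geometric mixture in Eq. (7) simply interpolates between the current policy and the reference policy in log space. A small numpy sketch, purely illustrative, with arbitrary example distributions and hyperparameters:

import numpy as np

def geometric_mixture(pi_t, pi_ref, tau, eta):
    """pi_t^tau(y|x) proportional to pi_t(y|x)^(1 - tau/eta) * pi_ref(y|x)^(tau/eta), Eq. (7)."""
    w = pi_t ** (1.0 - tau / eta) * pi_ref ** (tau / eta)
    return w / w.sum()

pi_t = np.array([0.7, 0.2, 0.1])
pi_ref = np.array([1/3, 1/3, 1/3])
print(geometric_mixture(pi_t, pi_ref, tau=0.1, eta=0.5))  # pulled toward pi_ref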
3 Direct Nash Optimization
While the no-regret update of soft policy iteration used inSPOandNash-MDhas inspired many standard (deep)
reinforcement learning algorithms (e.g., NPG,,; TRPO,,; PPO,,
2017; SAC,,), its faithful implementation still usually involves the two-timescale update. This
could potentially lead to complex hyperparameter tuning and unstable performance. In this section, we propose a direct
and iterative algorithm,Direct Nash Optimization(Algorithm 1), to approximate the Nash equilibrium ofMW(P). This
algorithm is primarily inspired bySPO. It can be readily adapted toNash-MDfor approximating the Nash equilibrium
ofMW(Pτ)with the last-iteration guarantee, and we will discuss this in.
Algorithm 1 Direct Nash Optimization (DNO)
input: General preference function $P$, learning rate $\eta$, number of iterations $T$, prompt distribution $\rho$.
1: Initialize $\pi_1 \leftarrow \mathrm{unif}(\mathcal{A})$.
2: for iteration $t = 1, 2, \ldots, T$ do
3: Compute $r_t(x, y) \leftarrow \mathbb{E}_{y' \sim \pi_t(\cdot \mid x)}\big[ P(y \succ y' \mid x) \big]$, $\forall (x, y) \in \mathcal{X} \times \mathcal{Y}$.
4: Obtain $\pi_{t+1}$ by
\[
\pi_{t+1} \leftarrow \arg\max_{\pi \in \Pi} \mathbb{E}_{(x, y_1, y_2) \sim \mathcal{D}_t} \bigg[ \sigma\big( r_t(x, y_1) - r_t(x, y_2) \big) \log\Big[ \sigma\Big( \eta \log\frac{\pi(y_1 \mid x)}{\pi_t(y_1 \mid x)} - \eta \log\frac{\pi(y_2 \mid x)}{\pi_t(y_2 \mid x)} \Big) \Big] + \sigma\big( r_t(x, y_2) - r_t(x, y_1) \big) \log\Big[ \sigma\Big( \eta \log\frac{\pi(y_2 \mid x)}{\pi_t(y_2 \mid x)} - \eta \log\frac{\pi(y_1 \mid x)}{\pi_t(y_1 \mid x)} \Big) \Big] \bigg], \quad (8)
\]
where $\mathcal{D}_t$ is generated by $x \sim \rho$, $y_1 \sim \mu_{1,t}(\cdot \mid x)$, $y_2 \sim \mu_{2,t}(\cdot \mid x)$; $\mu_{1,t}$ and $\mu_{2,t}$ can be either off-policy (e.g.,
pre-defined) or on-policy (based on $\pi_t$).
5: end for
6: return $\bar{\pi} = \mathrm{unif}(\pi_{1:T})$.
3.1 Derivation of DNO
Most practical algorithms inspired by soft policy iteration, including the original practical version of SPO,
typically adopt the following approach: "pushing" $\pi$ towards the subsequent learning goal in each iteration (we
will refer to this as the soft policy iteration target throughout the paper):
\[
\pi^{\star}_{t+1}(\cdot \mid x) := \frac{1}{Z_t(x)}\, \pi_t(\cdot \mid x)\, \exp\!\Big( \frac{r_t(x, \cdot)}{\eta} \Big), \quad (9)
\]
where $Z_t(x) = \sum_{y \in \mathcal{Y}} \pi_t(y \mid x) \exp\big( \frac{r_t(x, y)}{\eta} \big)$ is the partition function. This can be realized by minimizing a distance
metric between $\pi$ and the target $\pi^{\star}_{t+1}$. For example, the PPO algorithm for RLHF (e.g., Ouyang et al., 2022) essentially minimizes the reverse KL divergence as follows:
\[
(\pi^{\mathrm{PPO}}_{t+1} \leftarrow)\; \arg\min_{\pi \in \Pi} \mathbb{E}_{x \sim \rho}\big[ D_{\mathrm{KL}}\big( \pi(\cdot \mid x) \,\|\, \pi^{\star}_{t+1}(\cdot \mid x) \big) \big]
\]
\[
= \arg\max_{\pi \in \Pi} \mathbb{E}_{x \sim \rho,\, y \sim \pi(\cdot \mid x)} \Big[ \eta \log\frac{\pi^{\star}_{t+1}(y \mid x)}{\pi_t(y \mid x)} - \eta \log\frac{\pi(y \mid x)}{\pi_t(y \mid x)} \Big]
\]
\[
= \arg\max_{\pi \in \Pi} \mathbb{E}_{x \sim \rho,\, y \sim \pi(\cdot \mid x)} \Big[ r_t(x, y) - \eta \log Z_t(x) - \eta \log\frac{\pi(y \mid x)}{\pi_t(y \mid x)} \Big]
\]
\[
= \arg\max_{\pi \in \Pi} \mathbb{E}_{x \sim \rho,\, y \sim \pi(\cdot \mid x)} \Big[ r_t(x, y) - \eta \log\frac{\pi(y \mid x)}{\pi_t(y \mid x)} \Big] \quad (\Leftrightarrow \text{PPO objective, as } Z_t \text{ is independent of } \pi).
\]
However, implementing the above approach typically necessitates on-policy sampling from the current policy $\pi$.
Ignoring the $Z_t(x)$ term could also lead to high variance in the empirical gradient estimation. This is a persistent
issue in actor-critic style algorithms, which usually calls for an additional baseline that itself requires on-policy estimation. When $r_t$ also varies over iterations, as in SPO or Nash-MD, we
then need to update the policy, the baseline, and the reward online simultaneously. These challenges have hindered the
scalability of existing algorithms based on learning the Nash equilibrium of general preference functions.
Regressing "internal rewards" towards preference-based rewards. Different from the approaches
above, which mostly focus on "pushing" $\pi \to \pi^{\star}_{t+1}$, we now consider the following mechanism:
regressing $r_{\pi,t} \to r_t$, where $r_{\pi,t}$ is the internal reward function of a given $\pi$ at iteration $t$:
\[
r_{\pi,t}(x, y) := \eta \log\frac{\pi(y \mid x)}{\pi_t(y \mid x)} + \eta \log Z_t(x). \quad (10)
\]
This can be interpreted as a reparameterization trick, where $\pi$ is exactly the soft policy iteration target (refer to Eq. (9))
induced by $\pi_t$ and the defined $r_{\pi,t}$. Therefore, regressing this specifically parameterized $r_{\pi,t}$ to $r_t$ allows us to directly
optimize the soft policy iteration target with respect to $r_{\pi,t}$ and $\pi_t$. This idea is inspired by techniques from inverse
RL (e.g., Guided Cost Learning; Finn et al., 2016) as well as recent advances in RLHF (Rafailov et al., 2023, DPO).
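A quick numerical check of the reparameterization behind Eq. (10): if $\pi$ is set to the soft policy iteration target of Eq. (9), the induced internal reward differs from $r_t$ only by the per-$x$ constant $\eta \log Z_t(x)$, so pairwise differences of $r_{\pi,t}$ match those of $r_t$ exactly. The numpy sketch below uses a made-up discrete toy (one prompt, four responses); it is an illustration, not the paper's code.

import numpy as np

rng = np.random.default_rng(0)
eta = 0.5
pi_t = rng.dirichlet(np.ones(4))          # current policy over 4 responses for one prompt
r_t = rng.uniform(0.0, 1.0, size=4)       # preference-based rewards r_t(x, y) in [0, 1]

# Soft policy iteration target, Eq. (9): pi*_{t+1} proportional to pi_t * exp(r_t / eta).
w = pi_t * np.exp(r_t / eta)
pi_star = w / w.sum()

# Internal reward of pi_star (Eq. (10)), up to the constant eta * log Z_t(x):
r_pi = eta * np.log(pi_star / pi_t)
# Pairwise margins agree with the preference-based reward margins:
print(np.allclose(r_pi[:, None] - r_pi[None, :], r_t[:, None] - r_t[None, :]))  # True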
To avoid the issues arising from the partition function $Z_t(x)$, we consider learning from the $(x, y_1, y_2)$ tuple, where $y_1$
and $y_2$ are both responses to the textual input $x$. Note that, due to the offline learning nature of the regression objective,
the sampling distribution of $y_1$ and $y_2$ does not impact the learning objective (i.e., $r_{\pi,t} \to r_t$), though it may affect the
sample complexity for coverage reasons, as we will discuss later; pushing $\pi \to \pi^{\star}_{t+1}$, in contrast, requires sampling $y$
on-policy, as previously discussed. Therefore, given an arbitrary $(x, y_1, y_2)$ tuple, we regress the "prediction" $\hat{z}$ to the
"goal" $z$ (both defined below), using the binary logarithmic (cross-entropy) loss to measure the prediction error (see, e.g.,
Foster and Krishnamurthy, 2021):
\[
\hat{z} := \sigma\big( r_{\pi,t}(x, y_1) - r_{\pi,t}(x, y_2) \big) = \sigma\Big( \underbrace{ \eta \log\frac{\pi(y_1 \mid x)}{\pi_t(y_1 \mid x)} - \eta \log\frac{\pi(y_2 \mid x)}{\pi_t(y_2 \mid x)} }_{=:\, \Delta_{\pi,t}(x, y_1, y_2)} \Big), \qquad z := \sigma\big( \underbrace{ r_t(x, y_1) - r_t(x, y_2) }_{=:\, \Delta^{\star}_t(x, y_1, y_2)} \big); \quad (11)
\]
\[
\ell_{\pi,t}(x, y_1, y_2) := z \log\frac{1}{\hat{z}} + (1 - z) \log\frac{1}{1 - \hat{z}} = -\sigma\big( \Delta^{\star}_t(x, y_1, y_2) \big) \log\Big[ \sigma\big( \Delta_{\pi,t}(x, y_1, y_2) \big) \Big] - \sigma\big( \Delta^{\star}_t(x, y_2, y_1) \big) \log\Big[ \sigma\big( \Delta_{\pi,t}(x, y_2, y_1) \big) \Big].
\]
Therefore, we obtain the following objective to learn $\pi_{t+1}$:
\[
\arg\min_{\pi \in \Pi} \mathcal{L}_{\mathcal{D}_t}(\pi; \pi_t) := \arg\min_{\pi \in \Pi} \mathbb{E}_{(x, y_1, y_2) \sim \mathcal{D}_t}\big[ \ell_{\pi,t}(x, y_1, y_2) \big]
\]
\[
= \arg\max_{\pi \in \Pi} \mathbb{E}_{(x, y_1, y_2) \sim \mathcal{D}_t}\Big[ \sigma\big( \Delta^{\star}_t(x, y_1, y_2) \big) \log\big[ \sigma\big( \Delta_{\pi,t}(x, y_1, y_2) \big) \big] + \sigma\big( \Delta^{\star}_t(x, y_2, y_1) \big) \log\big[ \sigma\big( \Delta_{\pi,t}(x, y_2, y_1) \big) \big] \Big]
\]
\[
= \arg\max_{\pi \in \Pi} \mathbb{E}_{(x, y_1, y_2) \sim \mathcal{D}_t}\bigg[ \sigma\big( r_t(x, y_1) - r_t(x, y_2) \big) \log\Big[ \sigma\Big( \eta \log\frac{\pi(y_1 \mid x)}{\pi_t(y_1 \mid x)} - \eta \log\frac{\pi(y_2 \mid x)}{\pi_t(y_2 \mid x)} \Big) \Big] + \sigma\big( r_t(x, y_2) - r_t(x, y_1) \big) \log\Big[ \sigma\Big( \eta \log\frac{\pi(y_2 \mid x)}{\pi_t(y_2 \mid x)} - \eta \log\frac{\pi(y_1 \mid x)}{\pi_t(y_1 \mid x)} \Big) \Big] \bigg]. \quad (12)
\]
Here, $\mathcal{D}_t$ is generated by $x \sim \rho$, $y_1 \sim \mu_{1,t}(\cdot \mid x)$, $y_2 \sim \mu_{2,t}(\cdot \mid x)$ with some policies $\mu_{1,t}$ and $\mu_{2,t}$. It should be noted
that $\mu_{1,t}$ and $\mu_{2,t}$ for each $t \in [T]$ are part of our algorithm's design decisions. We will provide choices for them in
Section 3.2
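The objective in Eq. (8)/(12) is a soft-label binary cross-entropy between the policy's log-ratio margin $\Delta_{\pi,t}$ and the preference-based target $\sigma(\Delta^{\star}_t)$. The PyTorch sketch below makes that explicit, assuming the log-probabilities and preference-based rewards have already been computed elsewhere; all names and the numbers in the usage line are illustrative.

import torch
import torch.nn.functional as F

def dno_regression_loss(logp_y1, logp_y2, logp_t_y1, logp_t_y2, r_t_y1, r_t_y2, eta=0.5):
    """Batch-averaged l_{pi,t}(x, y1, y2) from Eq. (11)-(12).

    logp_*   : log pi(y|x) under the policy being optimized.
    logp_t_* : log pi_t(y|x) under the frozen previous-iteration policy.
    r_t_*    : preference-based rewards r_t(x, y) = E_{y' ~ pi_t}[P(y > y' | x)].
    """
    delta_pi = eta * (logp_y1 - logp_t_y1) - eta * (logp_y2 - logp_t_y2)   # Delta_{pi,t}
    z = torch.sigmoid(r_t_y1 - r_t_y2)                                     # soft target sigma(Delta*_t)
    # Soft-label cross-entropy; note sigma(Delta*_t(y2, y1)) = 1 - z and Delta_{pi,t}(y2, y1) = -Delta_{pi,t}(y1, y2).
    loss = -(z * F.logsigmoid(delta_pi) + (1.0 - z) * F.logsigmoid(-delta_pi))
    return loss.mean()

# Toy usage with made-up values for two pairs.
loss = dno_regression_loss(torch.tensor([-10.0, -8.0]), torch.tensor([-11.0, -7.5]),
                           torch.tensor([-10.5, -8.2]), torch.tensor([-10.8, -7.9]),
                           torch.tensor([0.8, 0.4]), torch.tensor([0.3, 0.6]))
print(loss)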
Monotonic improvement from the batched on-policy updates. One key distinction between DNO and existing
algorithms for learning the Nash equilibrium (such as SPO and Nash-MD) is that those algorithms aim to approach the Nash
equilibrium in a purely on-policy manner, which can be potentially unstable and may need to incorporate two-timescale
updates (that change the reward function used in the inner problem more frequently). On the other hand, DNO is a
batched on-policy algorithm with single-timescale updates.
From a purely theoretical perspective, it seems that DNO may require many iterations to ensure the convergence of $\bar{\pi}$
to the Nash equilibrium, which could potentially be costly. Additionally, DNO only converges on average, and it is
unrealistic to deploy the uniform mixture policy $\bar{\pi}$ in practice (note that, as inspired by Munos et al. (2023), DNO can
be extended to regularized preferences with last-iterate convergence, which is discussed in the appendix). However,
from a practical perspective, we can leverage the following two desirable properties of the LLM setting to eliminate
these concerns and ensure monotonic improvement over the DNO iterations.
Firstly, the soft policy iteration target maximizes the regularized objective $\ell_t(\pi) := P(\pi \succ \pi_t) - \eta\, \mathbb{E}_{x \sim \rho}\big[ D_{\mathrm{KL}}(\pi(\cdot \mid x) \,\|\, \pi_t(\cdot \mid x)) \big]$, i.e., $\pi^{\star}_{t+1} = \arg\max_{\pi} \ell_t(\pi)$. We can notice
that $\ell_t(\pi_t) = 0.5$ and $\mathbb{E}_{x \sim \rho}\big[ D_{\mathrm{KL}}(\pi(\cdot \mid x) \,\|\, \pi_t(\cdot \mid x)) \big] \geq 0$. This means $0.5 \leq \ell_t(\pi^{\star}_{t+1}) = P(\pi^{\star}_{t+1} \succ \pi_t) - \eta\, \mathbb{E}_{x \sim \rho}\big[ D_{\mathrm{KL}}(\pi^{\star}_{t+1}(\cdot \mid x) \,\|\, \pi_t(\cdot \mid x)) \big]$, and hence $P(\pi^{\star}_{t+1} \succ \pi_t) \geq 0.5 + \eta\, \mathbb{E}_{x \sim \rho}\big[ D_{\mathrm{KL}}(\pi^{\star}_{t+1}(\cdot \mid x) \,\|\, \pi_t(\cdot \mid x)) \big]$. This
means $\pi^{\star}_{t+1}$ is guaranteed to be preferred over $\pi_t$ with respect to the preference $P$, and there is even a computable
lower bound on the amount of improvement: $\eta\, \mathbb{E}_{x \sim \rho}\big[ D_{\mathrm{KL}}(\pi^{\star}_{t+1}(\cdot \mid x) \,\|\, \pi_t(\cdot \mid x)) \big]$. Therefore, if the $\pi_{t+1}$ learned from
Line 4 of Algorithm 1 closely approximates $\pi^{\star}_{t+1}$ (which is proved in Section 3.2), we can expect that
the policy is monotonically improved over DNO iterations. Note that the monotonic improvement guarantee is exclusive
to our design choice of batched on-policy updates in DNO, because the alternatives are either unclear or unstable: it is
undefined how to perform iterative updates offline, and one gradient update from a purely online algorithm may not
be able to accurately approximate the soft policy iteration target $\pi^{\star}_{t+1}$. Secondly, in practice, we usually have
validation data available, which allows us to deploy the best policy over $\pi_{1:(T+1)}$.
3.2 Theoretical Analysis
One of our major proposals is to use a regression-based objective to approximate the explicit soft policy iteration; in
this section we show that the approximation error from this regression is tightly bounded via a finite-sample analysis. The
following result discusses how well the solution of the regression-based objective (Eq. (8) in
Algorithm 1) can approximate the soft policy iteration (Eq. (9)) in terms of the total variation metric at each iteration.
Theorem 1 (informal). Fix an arbitrary iteration $t \in [T]$. Suppose $\pi_{t+1}$ is obtained from Eq. (8), and $\pi^{\star}_{t+1}$ is
defined in Eq. (9). Then, under mild assumptions (realizability and boundedness, formally introduced in the appendix),
we have
\[
\mathbb{E}_{x \sim \rho}\Big[ \big( D_{\mathrm{TV}}\big( \pi_{t+1}(\cdot \mid x),\, \pi^{\star}_{t+1}(\cdot \mid x) \big) \big)^2 \Big] \;\leq\; \mathcal{O}\bigg( \frac{C_t\, R^2_{\max} \log(|\Pi| / \delta)}{N} \bigg),
\]
where the concentrability coefficient $C_t$ is defined as
\[
C_t := \frac{ \mathbb{E}_{x \sim \rho,\, y_1 \sim \pi^{\star}_{t+1}(\cdot \mid x),\, y_2 \sim \pi_{t+1}(\cdot \mid x)} \Big[ \Big( \log\frac{\pi^{\star}_{t+1}(y_1 \mid x)}{\pi_{t+1}(y_1 \mid x)} - \log\frac{\pi^{\star}_{t+1}(y_2 \mid x)}{\pi_{t+1}(y_2 \mid x)} \Big)^2 \Big] }{ \mathbb{E}_{x \sim \rho,\, y_1 \sim \mu_{1,t}(\cdot \mid x),\, y_2 \sim \mu_{2,t}(\cdot \mid x)} \Big[ \Big( \log\frac{\pi^{\star}_{t+1}(y_1 \mid x)}{\pi_{t+1}(y_1 \mid x)} - \log\frac{\pi^{\star}_{t+1}(y_2 \mid x)}{\pi_{t+1}(y_2 \mid x)} \Big)^2 \Big] }.
\]
If $\pi_t = \pi^{\star}_t$ for all $t \in [T]$, the reader can refer to (Swamy et al., 2024, Section 3) for the convergence of $\bar{\pi}$ (returned
by Algorithm 1) to the Nash equilibrium. We expect the total variation difference between $\pi_t$ and $\pi^{\star}_t$ provided by
Theorem 1 to be directly compatible with the convergence analysis of Swamy et al. (2024).
8
Note that we present the concentrability coefficient $C_t$ as data-dependent, with $\pi_{t+1}$ (learned from data) as part of its
definition. We intend for this form of $C_t$ to guide the design choices of $\mu_{1,t}$ and $\mu_{2,t}$ for the purpose of sample
efficiency. The formal statement and detailed proof, which do not involve $\pi_{t+1}$, are deferred to the appendix.
Although $C_t$ shares a similar expression with the concentrability coefficient in offline reinforcement learning (e.g., Chen
and Jiang, 2019), the policies $\mu_{1,t}$ and $\mu_{2,t}$ are flexible here due to the generative nature of large
language models. This flexibility allows for additional intervention, enhancing sample efficiency.
We can notice that the value of $C_t$ can always be bounded by
\[
C_t \;\leq\; \max_{(x, y) \in \mathcal{X} \times \mathcal{Y}} \frac{ \pi^{\star}_{t+1}(y \mid x)\, \pi_{t+1}(y \mid x) }{ \mu_{1,t}(y \mid x)\, \mu_{2,t}(y \mid x) }
\]
in the worst case.
However, as $\pi_{t+1}$ is likely to be restricted within a certain region, for instance because fine-tuning will not significantly
alter the behavior of the language model, we anticipate that such a coefficient will not depend on the per-$(x, y)$ worst
case. On the other hand, as a direct observation, we notice that the ideal selection of $\mu_{1,t}$ and $\mu_{2,t}$ should be close to
the soft policy iteration target $\pi^{\star}_{t+1}$ (assuming $\pi^{\star}_{t+1}$ and $\pi_{t+1}$ are close). Interestingly, this theoretical observation
coincides with recent empirical results: work on statistical rejection sampling (RSO; Liu et al., 2024) suggests that sampling from the soft policy iteration target (which is almost equivalent to sampling $y_1$ and $y_2$ from $\pi^{\star}_{t+1}$) can benefit
preference tuning. However, in our case, if we use similar statistical rejection sampling techniques on $\pi_t$ to sample from
$\pi^{\star}_{t+1}$ (and $\pi_{t+1}$), the cost of rejection sampling is likely to be comparable to the concentrability coefficient $C_t$ obtained when
choosing $\mu_{1,t}$ and $\mu_{2,t}$ to be $\pi_t$. This suggests that both $\pi_t$ and $\pi^{\star}_{t+1}$ (via rejection sampling)
are comparable options for $\mu_{1,t}$ and $\mu_{2,t}$ in terms of sample efficiency. On the other hand, as we will
demonstrate in the next section, since $r_t$ is defined based on $\pi_t$ (as shown in Algorithm 1), choosing $\mu_{1,t}$ and
$\mu_{2,t}$ to be $\pi_t$ can easily accommodate such a reward $r_t$.
Another interesting observation is that, despite the resemblance of our regression objective to Bradley-Terry reward modeling
with MLE, the target distributions used to measure distribution shift appear to be quite different. This disparity
is due to the different objectives: fitting soft policy iteration versus reward estimation. For Bradley-Terry style
reward modeling using MLE, the desired distributions of $y_1$ and $y_2$ should be two distinct distributions. However, in our case, where the learning goal is to fit the soft policy iteration, we may
prefer $y_1$ and $y_2$ to come from two (near) on-policy distributions, as discussed above, as long as we expect the learned $\pi_{t+1}$ to
be accurate enough. To the best of our knowledge, this is the first theoretical result that illustrates the importance of
on-policy sampling beyond policy-optimization-style algorithms for RLHF.
4 Practical Algorithm – Iterative Contrastive Self-Improvement
In this section, we shift our focus to the algorithmic design of the practically scalable version ofDNO, following the
principles discussed in the last section. A primary challenge encountered in the implementation of the conceptual
algorithmDNO(Algorithm 1) stems from the necessity to compute the expectation with respect to the preference
functionPunder the current policyπt. Perhaps surprisingly, as we will show, all we need is a properly implemented
iterativeDPO-like contrastive learning algorithm.
We present the practical implementation of DNO in Algorithm 2 (DNO-Prct), which is a batched on-policy algorithm
that conducts self-improvement iteratively via contrastive learning. One key consideration in our algorithmic design
is that we only need to use the reward function $r_t$ implicitly. This comes from the specifically designed on-policy
sampling, data filtering, and pair construction. While these specific design choices make DNO-Prct seem similar to
simply performing DPO iteratively, there are significant reasons for these design decisions, as we discuss below.
Batched on-policy sampling. The use of batched on-policy sampling in DNO-Prct removes the need for
explicit use of $r_t$ (defined as $\mathbb{E}_{y' \sim \pi_t(\cdot \mid x)}[P(y \succ y' \mid x)]$ in Algorithm 1). This means we essentially
choose $\mu_1$ and $\mu_2$ in DNO to be $\pi_t$ in DNO-Prct, though we are free to let them vary slightly as a mixture of other
policies, e.g., from a stronger teacher. Specifically, it is unrealistic to assume in practice that we can access the exact
value of $P(y \succ y' \mid x)$ for a given $(x, y, y')$ tuple. Based on the definition of $r_t$ and the fact that $\{y^1_t, y^2_t, \ldots, y^K_t\}$
are sampled from $\pi_t$, DNO-Prct essentially uses the following sample-based approach to estimate $r_t$:
\[
r_t(x, y) \;\approx\; \frac{1}{K} \sum_{y' \in \{y^1_t, \ldots, y^K_t, y^{\mathrm{gold}}\} \setminus y} \mathbf{1}_P(\text{is } y \text{ better than } y' \text{ on } x?),
\]
for any $x$ and $y \in \{y^1_t, \ldots, y^K_t, y^{\mathrm{gold}}\}$, where $\mathbf{1}_P$ denotes one sample from $P$ with output in $\{0, 1\}$. This is implemented by the ranking step of Algorithm 2 (Line 5), and its precise implementation
is discussed in Section 5. On the other hand, as we discussed in the last section, batched on-policy sampling
from $\pi_t$ is an appropriate option from the standpoint of sample efficiency when we use the regression-based objective
to approximate the soft policy iteration (see Section 3.1).
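The sample-based estimate above amounts to averaging single pairwise judgments against the other candidates. The sketch below assumes a binary preference oracle `prefers(x, y, y_prime)` returning 0 or 1 (one draw from P); that interface, and the length-based stand-in oracle in the usage line, are placeholders for illustration only.

def estimate_r_t(x, candidates, prefers):
    """Estimate r_t(x, y) for each y among the K on-policy samples plus y_gold.

    candidates : list of responses {y_t^1, ..., y_t^K, y_gold} for prompt x.
    prefers    : callable (x, y, y_prime) -> 0 or 1, one sample from the oracle P (assumed).
    """
    K = len(candidates) - 1  # number of comparisons per response
    scores = {}
    for i, y in enumerate(candidates):
        wins = sum(prefers(x, y, y_prime) for j, y_prime in enumerate(candidates) if j != i)
        scores[y] = wins / K  # empirical win-rate against the remaining candidates
    return scores

# Toy usage with a length-based stand-in oracle (purely illustrative).
prefers = lambda x, y, y_prime: int(len(y) > len(y_prime))
print(estimate_r_t("prompt", ["short", "a longer answer", "gold answer"], prefers))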
Algorithm 2 DNO-Prct: Practical Implementation of DNO via Iterative Contrastive Self-Improvement
input: General preference function $P$, learning rate $\tilde{\eta}$, number of iterations $T$, reference policy $\pi_{\mathrm{ref}}$, prompt distribution $\rho$.
1: Initialize $\pi_1 \leftarrow \pi_{\mathrm{ref}}$.
2: for iteration $t = 1, 2, \ldots, T$ do
3: Construct $\mathcal{D}_t = \{(x, y^{\mathrm{gold}})\}$ where $x \sim \rho$ and $y^{\mathrm{gold}} \sim \pi_{\mathrm{gold}}(\cdot \mid x)$.
4: Sample batched on-policy responses: sample $K$ outputs per prompt using the current $\pi_t$: $\{y^1_t, y^2_t, \ldots, y^K_t\} \sim \pi_t(\cdot \mid x)$, $\forall x \in \mathcal{D}_t$.
5: Rank responses: for each $x \in \mathcal{D}_t$, rank the corresponding $\{y^1_t, y^2_t, \ldots, y^K_t, y^{\mathrm{gold}}\}$ using the pair-wise win-rate
obtained by sampling from the general preference function $P$.
6: Filter preference pairs: construct $\mathcal{D}_{t+1} = \{(x, y^{+}_t, y^{-}_t)\}$, for all $x \in \mathcal{D}_{t+1}$, where $(y^{+}_t, y^{-}_t)$ are large-margin
pairs (based on the win-rate rank) within the responses for $x$ from the previous step.
7: Contrastive learning: obtain $\pi_{t+1}$ by
\[
\pi_{t+1} \leftarrow \arg\max_{\pi \in \Pi} \mathbb{E}_{(x, y^{+}_t, y^{-}_t) \sim \mathcal{D}_{t+1}} \log\bigg[ \sigma\bigg( \tilde{\eta} \log\frac{\pi(y^{+}_t \mid x)}{\pi_t(y^{+}_t \mid x)} - \tilde{\eta} \log\frac{\pi(y^{-}_t \mid x)}{\pi_t(y^{-}_t \mid x)} \bigg) \bigg]. \quad (13)
\]
8: end for
9: return the best of $\pi_{1:(T+1)}$ on the validation data.
Preference pair construction. Another key design choice in DNO-Prct is that the contrastive objective in Eq. (13) uses a
purely contrastive (hard-label) loss, whereas the regression target in Eq. (8) is $\sigma\big(r_t(x, y) - r_t(x, y')\big)$ for
a given $(x, y, y')$ tuple, which is not necessarily in $\{0, 1\}$. As we discussed above, it is unrealistic to expect access to the
exact value of $P(y \succ y' \mid x)$, so it is also unlikely that we can obtain an accurate value of the regression target $\sigma\big(r_t(x, y) - r_t(x, y')\big)$.
Thus, we add an additional data-filtering step to address this issue, as in Line 6 of Algorithm 2. Ideally, we want the
selected $(x, y^{+}, y^{-})$ tuple to satisfy $\sigma\big(r_t(x, y^{+}_t) - r_t(x, y^{-}_t)\big) \approx 1$, so that the hard-label contrastive loss closely matches the regression target.
However, one can notice that this requires $r_t(x, y^{+}_t) - r_t(x, y^{-}_t) \to \infty$, while we know $r_t(x, y) \in [0, 1]$, $\forall (x, y) \in \mathcal{X} \times \mathcal{Y}$.
From the derivation of DNO in Section 3.1, it is clear that scaling up $r_t$ and $\eta$ by the same absolute constant $c$ does not
affect the soft policy iteration target of Eq. (9), but it slightly changes the DNO objective (Eq. (8)) by
$r_t \to c \cdot r_t$ and $\eta \to c \cdot \eta =: \tilde{\eta}$. This scaling strategy helps us sidestep the problem of bounded $r_t$, and in this sense
we may expect the proper $\tilde{\eta}$ in DNO-Prct to be relatively larger (than, e.g., $\eta$ in Eq. (8)). However, an overly large
$\tilde{\eta}$ is undesirable (especially for the derivation in Section 3.1). So, to avoid the proper $\tilde{\eta}$ being too large, we only use pairs with a large
margin, i.e., pairs for which $r_t(x, y^{+}_t) - r_t(x, y^{-}_t)$ is not too small. This decision is also supported
empirically by techniques like RLCD (Yang et al., 2023) and Axiomatic Preference Models (Rosset et al., 2023), which
highlight the importance of having a large margin or clear directional differences between positive and negative LLM
responses when training preference models.
Relationship between DNO-Prct and DPO. The reader may discern that DNO-Prct (Algorithm 2), the practical
implementation of DNO, can be described as an iterative version of the DPO algorithm. Such similarity is by design,
intended to harness the simplicity and effectiveness of DPO (Rafailov et al., 2023) and to build on empirical advancements
from recent work that applies DPO iteratively (e.g., Yuan et al., 2024; Chen et al., 2024). Our experiments point to the
importance of several design choices that help accommodate general preferences, such as rankings derived from
pair-wise win-rates. More interestingly, our findings point to a surprising connection: a meticulously designed
iterative DPO algorithm can approach the Nash equilibrium of any given general preference.
Our general algorithmic framework, DNO (Algorithm 1), is broader than and fundamentally different from iterative DPO.
For example, the DNO framework can also be directly extended to the regularized preference case (as discussed
in the appendix) or equipped with other advanced sampling techniques (e.g., RSO), as suggested by
Theorem 1. And although a KL-regularization term (a form of proximal optimization) appears in both DNO and DPO, it arises for fundamentally different reasons. For DNO, the KL-regularization
originates from online learning: no-regret learning through mirror descent (Nemirovskij and Yudin, 1983) or follow-the-regularized-leader (FTRL) (Kalai and Vempala, 2005;
Hazan et al., 2016). For DPO and PPO, the KL-regularization is an approximation of the total variation penalty used to ensure
monotonic improvement of the policy (Kakade and Langford, 2002; Schulman et al., 2015), an approach later
simplified by PPO (Schulman et al., 2017) and recently used for post-training LLMs (Ouyang et al., 2022).
Figure 2: Comparison of various post-training techniques showing that Direct Nash Optimization (DNO) is the most
effective. All methods with colorful error bands are 1) implemented by ourselves, 2) initialized with a 7B parameter
Orca-2.5 LLM, and 3) "batched on-policy" (except SFT and Offline DPO, which are epochs), all else being equal. (The
figure plots AlpacaEval 2.0 win-rate against GPT-4-Turbo over iterations or epochs for DNO 7B - More Data, DNO 7B,
DNO 7B - Restrictive, SPIN 7B, Iterative KTO 7B, Offline DPO, SFT on Positives, Self-Rewarding 70B, the initializer
Orca-2.5, Mistral Large, and GPT-4 0314.)
5 Experiments
We now empirically evaluate the practical implementation given in Algorithm 2 (throughout this section, we use
DNO to denote DNO-Prct for simplicity). Once the input dataset $\{x_i \in \mathcal{X}\}$ is chosen, each iteration
of DNO proceeds in three phases: sampling outputs from the current policy, annotating outputs for preference pair
generation, and then training the next policy with the new training pairs. Iteration 0 is defined to start by sampling
from the initial SFT model to produce training data for iteration 1.
5.1 Experimental Setup
Data: We mainly use UltraFeedback (Cui et al., 2023), which consists of 60k prompts, several models' outputs to those
prompts, and preference annotations from GPT-4-Turbo. This dataset thus provides a source of offline preferences.
For our iterative experiments, we split this dataset into three non-overlapping partitions of the inputs to be used for
separate iterations of batched on-policy learning. For each input, we also collect the GPT-4-Turbo output if it was not
already present in the original dataset, to be reserved as $y^{\mathrm{gold}}$.
Every experiment except one in this study solely uses UltraFeedback. The exception is one "scaled up" experiment
with about 10x more data sourced from a mixture of aggregated datasets including Anthropic HH (Bai et al., 2022),
UltraChat (Ding et al., 2023), MetaMathQA (Yu et al., 2023), EvolInstruct (Xu et al., 2023), UltraFeedback (Cui et al., 2023)
and Orca-2 (Mitra et al., 2023). Note that we only use the input prompts from these datasets and collect
GPT-4-Turbo responses for all 600k of these input prompts.

Technique | Epoch or Iter | AlpacaEval 2: Len-control. Win Rate | AlpacaEval 2: Win Rate vs. GPT-4 | Avg. len (chars) | MT-Bench: 1st Turn | MT-Bench: 2nd Turn | MT-Bench: Avg
Orca-2.5 SFT | Epoch 1 | 10.76 | 6.99 | 1174 | 7.72 | 6.02 | 6.88
Orca-2.5 SFT on Positives | Epoch 4 | 11.62 | 7.96 | 1420 | 7.62 | 6.23 | 6.92
Offline DPO (ours) | Epoch 4 | 19.49 | 18.22 | 1884 | 7.69 | 7.08 | 7.38
Self-Rewarding 70B | Iter 3 | - | 20.44 | 2552 | - | - | 7.25
SPIN (ours) | Iter 3 | 16.18 | 16.13 | 1922 | 7.58 | 7.53 | 7.55
DNO-Restrictive | Iter 3 | 21.61 | 19.21 | 1795 | 7.59 | 7.35 | 7.46
DNO-Lookahead | Epoch 1 | 18.58 | 18.28 | 1907 | 8.09 | 7.32 | 7.70
DNO | Iter 3 | 22.59 | 24.97 | 2228 | 7.62 | 7.35 | 7.48
Table 1: AlpacaEval 2.0 and MT-Bench results in our controlled setting after training on UltraFeedback.
Sampling from the Policy: At the end of training, we sample 5 outputs from the resulting student policy using top-p
sampling with $p = 0.95$ and temperature 0.7. Several works have shown the benefit of sampling and comparing
multiple diverse outputs from the policy (e.g., Yuan et al., 2024; Wang et al., 2023).
We implement a simple defect-detection system which flags any sample that has a high amount of
repeated n-grams as an automatic negative.
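The defect detector described above can be as simple as counting repeated n-grams. The sketch below is a minimal illustration; the n-gram size and repeat threshold are arbitrary choices, not the paper's exact values.

from collections import Counter

def has_repetition_defect(text, n=4, max_repeats=3):
    """Flag a sample whose most frequent n-gram repeats more than max_repeats times."""
    tokens = text.split()
    if len(tokens) < n:
        return False
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(ngrams).most_common(1)[0][1] > max_repeats

print(has_repetition_defect("the same phrase again " * 10))                   # True
print(has_repetition_defect("a normal, non-repetitive sentence about tries"))  # False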
Preference Annotation: We use GPT-4-Turbo "as a judge" to label preferences among the 5 policy samples and 1
gold sample (which is also from GPT-4-Turbo), using the annotation prompt shown in the appendix. This prompt contains a few minor modifications from
the one used by Yuan et al. (2024). It implements an additive scoring framework on a 6-point scale, where a score
of 6 represents the highest quality answer according to dimensions like "correctness", "expert knowledge",
"conciseness", etc. By following this rubric, GPT-4 acting as an annotator represents a best-effort general preference
model, because it compares multiple candidate responses side-by-side in the context window and stratifies them along
meaningful dimensions of quality.
Training Pair Construction: Adhering to the large-margin principle from Section 4, we enforce two requirements when constructing training pairs.
Firstly, the positives must be high quality in an absolute sense; secondly, the negatives must be directionally
worse by a large margin. On the 6-point annotation scale, only samples that score a 5 or 6 are allowed to be positives.
From the positives that meet this criterion, if any, we then construct all pairs such that the negative is at least 2 points
lower. If the positive happens to be from the student, we relax this constraint to a 1-point margin, since the GPT-4-Turbo
teacher outputs rarely receive a score less than 5 (as shown by the average teacher score in Table 2).
Additionally, we are motivated to preserve the preference behavior from previous iterations so that new policies do
not inadvertently regress to past bad behavior. To enforce this, we incorporate an exponentially decaying proportion
of prior iterations' training pairs into the current iteration, i.e., we sample at most 30% of training pairs from iteration
t-1, 15% from t-2, and so on. We do not re-inference outputs for those inputs from the most recent policy. Recall
that previous iterations' inputs are non-overlapping with the splits for other iterations.
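The two rules above (absolute-quality positives plus a score margin that is relaxed for student positives) and the decayed reuse of earlier iterations' pairs can be summarized in a short sketch. All function names and the data layout are illustrative assumptions, not the authors' pipeline.

import random

def build_pairs(scored, min_pos=5, margin_teacher=2, margin_student=1):
    """scored: list of (response, score, is_student) tuples for one prompt."""
    pairs = []
    for y_pos, s_pos, pos_is_student in scored:
        if s_pos < min_pos:                      # positives must score 5 or 6
            continue
        margin = margin_student if pos_is_student else margin_teacher
        for y_neg, s_neg, _ in scored:
            if s_pos - s_neg >= margin:          # negatives directionally worse by a margin
                pairs.append((y_pos, y_neg))
    return pairs

def mix_with_history(new_pairs, history, base_fraction=0.30, seed=0):
    """Add at most 30% of iteration t-1's pairs, 15% of t-2's, and so on."""
    rng = random.Random(seed)
    mixed = list(new_pairs)
    fraction = base_fraction
    for old_pairs in reversed(history):          # history[-1] holds iteration t-1's pairs
        k = min(len(old_pairs), int(fraction * len(new_pairs)))
        mixed += rng.sample(old_pairs, k)
        fraction /= 2
    return mixed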
Training: To prevent overfitting, we train our batched on-policy methods for at most one epoch on newly constructed
pairs. Our effective batch size is fixed to 64 for all experiments. Our learning rate, beta, and alpha are found with brief
hyperparameter searches. For most experiments, the learning rate is 5e-5, beta is either 0.1 or 0.05, and alpha is 0.005.
We found that at higher iterations, the learning rate needs to be lowered. In SFT (supervised fine-tuning) experiments,
our learning rate is 5e-6 and we mask out the loss on the inputs. We use an open-source post-training library
to run our experiments.
Evaluation: Our primary goal is to train a policy that is comparable to the most powerful state-of-the-art language
models. Hence, AlpacaEval 2.0 (Dubois et al., 2024) is an appropriate benchmark because it computes win-rate
against GPT-4-Turbo in a head-to-head fashion on a dataset of 805 input prompts that is shown to correlate with human
preferences (0.93 Spearman correlation with Chatbot Arena). While it is known that auto-eval methods also correlate
with spurious features such as length, a newer version of AlpacaEval 2.0 corrects for this with a length-controlled win-rate
that has an even higher Spearman correlation (0.98) with Chatbot Arena.4
 | Annotations of Training Data | | | | | | New Training Pairs | |
Model / Iter | inputs | student length (words) | best-of-n student win-rate | Avg. # student wins | Avg. student score | Avg. teacher score | T≻S | S≻T | S≻S
DNO-Restrictive Iter 0 | 19.6k | 162 +/- 190 | 15.9% | 0.486 | 3.46 | 4.99 | 42.4k | - | -
DNO-Restrictive Iter 1 | 19.9k | 359 +/- 350 | 34.2% | 1.11 | 4.86 | 4.77 | 17.5k | - | -
DNO-Restrictive Iter 2 | 19.8k | 256 +/- 207 | 35.0% | 1.31 | 5.21 | 4.87 | 9.9k | - | -
DNO Iter 0 | 19.6k | 162 +/- 190 | 15.9% | 0.486 | 3.46 | 4.99 | 30.7k | 4.2k | 25.9k
DNO Iter 1 | 19.9k | 671 +/- 546 | 34.6% | 1.22 | 4.61 | 4.62 | 20.3k | 19.4k | 62.9k
DNO Iter 2 | 19.8k | 361 +/- 251 | 43.6% | 1.90 | 5.25 | 4.59 | 7.1k | 32.4k | 10.9k
Table 2: The dynamics of how sampled outputs from a previous iteration's policy compare to their teacher, and how
many new training pairs they give rise to in the next iteration. The crucial point is that DNO constructs new pairs where
the student is compared to the teacher, whereas DNO-Restrictive, SPIN, and IPO-MD do not.
We also evaluate on MT-Bench (Zheng et al., 2023), which allows the LLM-as-a-judge to first explain its reasoning before
providing a scalar score on a 1-10 scale for the candidate response to a bank of 80 questions. One crucial difference between
AlpacaEval 2.0 and MT-Bench is that the former asks GPT-4-Turbo to predict which of two side-by-side responses
humans would prefer, weighted by the logits to represent its uncertainty, whereas MT-Bench asks the model to first
generate a justification and then output a score on 1-10, but it neither defines the ratings (e.g., how a 7 differs from
a 5) nor accounts for uncertainty in the logits of the score.
We also evaluate on the OpenLLM leaderboard (Beeching et al., 2023), which measures reasoning ability on downstream
NLP tasks like coding and question answering by evaluating the accuracy of the multiple-choice answer option with the
highest logit. Since our training data is primarily instruction-following and the model is not trained to output just the sole answer
option, this benchmark is not the primary target of this study; nonetheless, DNO trained on instruction-tuning tasks ought to
show no regression on reasoning tasks.
5.2 Results and Analysis
We run several head-to-head experiments that control for hyperparameters and input data. We often refer to the policy
being trained as the “student” and GPT-4 as a “teacher”; GPT-4 is also used as an annotator when prompted.
SFT Baselines: The first baseline is Orca-2.5 itself, which is a mistralai/Mistral-7B-v0.1 raw pretrained model
fine-tuned on a new collection of Orca-2 data (Mitra et al., 2023). This model was finetuned for three epochs and
achieves the scores shown in the top rows of Table 4. All other experiments in this study are initialized with Epoch 1 of
Orca-2.5; this is the solid horizontal line in Figure 2.
The second baseline is continued SFT of Orca-2.5 towards the positives in UltraFeedback (masking out the
loss over the input prompts). If the original positive in that dataset was not from GPT-4-Turbo, we replace it with one
that is. This is the red line in Figure 2. It is clear that even offline contrastive training methods are more beneficial than
additional SFT, showing that the difference between the positive and negative output provides more valuable training
signal than the positive in isolation.
Large Margin Filtering of Training Pairs: We ran a simple experiment of OfflineDPOfor one epoch on UltraFeedback
data. In the control, we trained on all 63k preference pairs in the original dataset, whereas in the treatment we filtered
the 42k pairs that met a large margin requirement enforcing that the positive’s scores exceeded that of the negative by
at least 1.0 (out of 10) according to theirGPT-4-Turboannotator. All else was equal. Even though the treatment was
trained for fewer steps on less data, it achieved an AlpacaEval 2.0 win rate of 11.60 vs. 9.60 for the control, showing
that fewer, higher-quality preference pairs are better than a larger quantity of noisy pairs (not shown in the tables).
On-Policy is Better than Off-Policy: One of the critical questions in this study is whether to sample "on-policy" outputs
from the current student to use in training pairs, or whether "off-policy" outputs collected from models other than the student will suffice.
4https://github.com/tatsu-lab/alpaca_eval
Technique | ARC-C (25-shot) | GSM8K (5-shot) | HellaSwag (10-shot) | MMLU (5-shot) | TruthfulQA (0-shot) | WinoGrande (5-shot) | Avg
Orca-2.5 Epoch 1 | 0.609 | 0.635 | 0.818 | 0.614 | 0.489 | 0.738 | 0.652
Orca-2.5 Epoch 3 | 0.624 | 0.641 | 0.826 | 0.624 | 0.506 | 0.746 | 0.661
SPIN (ours) Iter 3 | 0.668 | 0.448 | 0.862 | 0.623 | 0.601 | 0.759 | 0.660
DNO Iter 1 | 0.657 | 0.572 | 0.834 | 0.623 | 0.568 | 0.755 | 0.668
DNO Iter 2 | 0.663 | 0.562 | 0.845 | 0.624 | 0.580 | 0.753 | 0.671
DNO Iter 3 | 0.672 | 0.542 | 0.852 | 0.622 | 0.606 | 0.753 | 0.675
Table 3: Results on Open-LLM Leaderboard reasoning tasks, which we do not expect to decrease.
We ran 4 epochs of Offline DPO on UltraFeedback (filtered for large margin), and as shown
in Figure 2, on-policy methods, especially DNO, surpass the off-policy DPO, even though it was trained for 4 epochs while the
on-policy models were granted only three iterations. Recall that each iteration of batched on-policy training sees only
a third of the UltraFeedback input data, whereas an epoch of Offline DPO sees the entire dataset.
Higher Quality Annotators: In our study, we use GPT-4-Turbo to provide the annotations for preference pairs.
However, the Self-Rewarding Language Model uses the Llama-2-70B model (Touvron et al., 2023) trained to also give
feedback as the annotator, which in their study starts off with a 65% agreement rate with human-labeled preferences,
improving to 80% in the last iteration (Yuan et al., 2024). While it was not reported how well GPT-4-Turbo's
annotations agree with their held-out human labels, we believe that having a higher-quality annotator to start with will
lead to higher quality policies. Since both our studies use UltraFeedback data, and our annotation prompt is based on
their annotation prompt, we believe this is a valid comparison.
We observe that DNO initialized with a 7B base model outperforms the 70B parameter Self-Rewarding model over the same
number of training iterations (24.97 win-rate vs. 20.44 on AlpacaEval 2.0, and 7.46 vs. 7.25 on MT-Bench), at least in
part due to the higher quality preference annotations. See the dark blue band versus the gray line in Figure 2, and the
corresponding rows in Table 1. However, unlike Self-Rewarding LM, we saw a slight gain rather than a drop on reasoning
benchmarks like ARC-Challenge (Clark et al., 2018) and HellaSwag (Zellers et al., 2019). Granted, the OpenLLM evaluation
predicts the answer with the max logit corresponding to one of the multiple-choice options, which is not
congruous with how these techniques are trained.
Training Pair Construction: One of the most critical implementation questions in this study is how to construct training pairs that help the student policy exceed a strong teacher like GPT-4-Turbo. One approach, Self-Play Fine-Tuning (SPIN), removes the preference annotation step and automatically assigns the teacher output to be the positive and all student samples to be negative (Chen et al., 2024). We find in our re-implementation of SPIN that this is detrimental, presumably because the automatic assignment produces noisy training pairs in cases where the student would actually be preferred. The resulting win rate of SPIN is only 16.13 after three epochs of iterative training, compared to 24.97 for DNO, all else being equal. Similar results hold in the Open-LLM results in Table 3.
In a second experiment, which we denote DNO-Restrictive, we annotate all preference pairs with GPT-4-Turbo as usual, but only admit training pairs where the teacher's output is the preferred one. The difference between DNO and DNO-Restrictive is illustrated by the quantity of admissible pairs; the same restriction is implicit in SPIN, except that SPIN would additionally admit a greater quantity of noisy teacher-vs-student examples even when they are dis-preferred. Under DNO-Restrictive, only 9.9k instances exist of the teacher being preferred over the student, whereas SPIN would have automatically created about 100k (5 samples × 20k inputs).
While DNO-Restrictive is slightly better (19.21 win rate) than SPIN, it still does not give the student a chance to compare its behavior to a powerful teacher. The absence of this signal is a major oversight: by Iter 3, over 64% of the DNO training data (32k pairs) are cases where the student is in fact preferred over the teacher, a number which increases with each iteration. We conclude it is imperative to "allow the student to become the teacher", i.e., to learn from comparisons where its own outputs are preferred over those of a more powerful teacher.
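The sketch below (our illustration, not the paper's implementation) contrasts these pair-construction policies: given annotator scores for the teacher output and the K student samples, it builds large-margin pairs, optionally restricting positives to the teacher as in SPIN / DNO-Restrictive. All names are hypothetical.

```python
from itertools import combinations

def build_pairs(teacher_out, student_outs, scores, margin=1.0, teacher_only_positive=False):
    """Construct (positive, negative) pairs from per-candidate annotator scores.

    `scores` maps each candidate string (teacher output plus K student samples)
    to its rubric score. With `teacher_only_positive=False` (DNO-style), the
    student may serve as the positive even against the teacher; with True,
    only teacher-preferred pairs are admitted (DNO-Restrictive-style).
    """
    candidates = [teacher_out] + list(student_outs)
    pairs = []
    for a, b in combinations(candidates, 2):
        pos, neg = (a, b) if scores[a] >= scores[b] else (b, a)
        if scores[pos] - scores[neg] < margin:
            continue  # skip small-margin (noisy) comparisons
        if teacher_only_positive and pos != teacher_out:
            continue  # drop pairs where the student beats the teacher
        pairs.append((pos, neg))
    return pairs
```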
One curious phenomenon is that the annotator assigns slightly lower scores to the teacher's outputs as the student improves; we are not sure whether this is an innocuous artifact of preference annotations or symptomatic of a deeper problem.
| Technique | Epoch or Iter | AlpacaEval 2.0 Len-control. Win Rate | AlpacaEval 2.0 Win Rate vs. GPT-4 | Avg. len (chars) | MT-Bench 1st Turn | MT-Bench 2nd Turn | MT-Bench Avg |
|---|---|---|---|---|---|---|---|
| Orca-2.5 SFT | Epoch 1 | 10.76 | 6.99 | 1174 | 7.72 | 6.02 | 6.88 |
| Orca-2.5 SFT | Epoch 2 | 15.29 | 7.88 | 1060 | 7.56 | 6.38 | 6.98 |
| Orca-2.5 SFT | Epoch 3 | 15.90 | 8.17 | 1058 | 7.53 | 6.73 | 7.13 |
| DNO-More-Data | Iter 1 | 8.96 | 10.67 | 2795 | 7.00 | 6.06 | 6.53 |
| DNO-More-Data | Iter 2 | 14.61 | 16.94 | 2782 | 7.62 | 7.23 | 7.42 |
| DNO-More-Data | Iter 3 | 21.81 | 26.74 | 2539 | 7.74 | 6.66 | 7.21 |
| DNO-More-Data | Iter 4 | 22.93 | 29.08 | 3033 | 7.54 | 6.92 | 7.24 |
| DNO-More-Data | Iter 5 | 32.06 | 34.98 | 2856 | 7.10 | 6.39 | 6.75 |
| DNO-More-Data | Iter 6 | 33.05 | 34.38 | 2683 | 7.28 | 6.65 | 6.97 |

Table 4: DNO-More-Data is trained on 10x more instruction data than DNO. It is still initialized with Epoch 1 of Orca-2.5 SFT, so the delta it provides in AlpacaEval 2.0 win rate is 27.39 absolute (22.29 length-controlled).
Also, the total quantity of new "large margin" training pairs (not counting those sampled from previous iterations) in DNO tends to decrease as the policy improves across iterations, but we do not have enough data to quantify how this relates to a change in quality.
Lookahead to Future Iterations: As a curiosity, we experimented with whether a model could benefit from knowing which training pairs it would generate if it could look into the future. We tested this by running three iterations of DNO, accumulating all the preference pairs across iterations, combining and shuffling them, and then re-starting training from the initial model. In essence, this turns the batched online DNO into an offline learning algorithm, which we denote DNO-Lookahead. We trained for one epoch on the three iterations' worth of preference data. It deteriorated more than we expected on AlpacaEval 2.0 win rate (24.97 to 18.18); however, even more surprisingly, the MT-Bench numbers improved significantly (7.48 to 7.70). The reasons for the relatively low correlation between MT-Bench and AlpacaEval 2.0 are not entirely clear, but the disparity in dataset size matters: MT-Bench consists of merely 80 examples, whereas AlpacaEval 2.0 contains roughly 10x more, so we place greater confidence in the AlpacaEval 2.0 findings.
DNO Scales with More Data: One of the reasons we split UltraFeedback into three non-overlapping partitions is to avoid overfitting. Another strategy to avoid overfitting is to collect more data, so we increased the instruction data by a factor of 10 using publicly available datasets. We split this large mixture into six non-overlapping partitions of roughly 100k inputs each (and ran GPT-4-Turbo inference to collect outputs for all inputs), and show that DNO-More-Data scales well in this expanded regime (see the purple line in the win-rate figure).
We make some notes on the behavior of this experiment. Because each iteration builds on outputs of the previous iteration, any anomalies or errors in critical components such as preference annotation will propagate, and the only way to combat them is to "roll back" to the iteration that introduced them. This can result in wasted time and cost, both of which are already very high, as discussed in the cost analysis in Appendix C. We suspect that the "depth" of iterations matters more than the "width" (number of samples within each iteration), and furthermore that using an equal number of inputs per iteration may not be optimal, but we did not test this thoroughly. From an efficiency standpoint, although this algorithm is "batched", some optimizations can be made, such as starting to annotate sampled policy outputs as soon as they are ready instead of waiting for all inference jobs to finish.
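As a sketch of that overlap (an assumption about how one might implement it, not the paper's pipeline), one can submit sampling jobs to a worker pool and hand each prompt's candidates to the annotator as soon as its sampling job completes; `sample_fn` and `annotate_fn` are placeholders.

```python
import concurrent.futures as cf

def sample_and_annotate(prompts, sample_fn, annotate_fn, max_workers=8):
    """`sample_fn(prompt)` returns candidate responses; `annotate_fn(prompt,
    candidates)` calls the preference annotator. Annotation of finished prompts
    overlaps with sampling of the remaining ones."""
    results = []
    with cf.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(sample_fn, p): p for p in prompts}
        for fut in cf.as_completed(futures):
            prompt = futures[fut]
            results.append((prompt, annotate_fn(prompt, fut.result())))
    return results
```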
"Exploding" Lengths: It is known that contrastive LLM training techniques, especially DPO, lead to longer outputs from the model, which is widely suspected to be a form of "reward hacking". Curiously, as Table 4 shows, the largest jump comes after the first round of contrastive training (Iteration 1), where lengths explode by at least a factor of 2 over the initializing SFT model, before inching back down in the next iteration. We interpret this "length spike" as wasted computation optimizing towards a spurious signal; we wish we were better equipped to control this phenomenon.
6 Related Work
We divide the space of related work by whether the techniques use SFT or contrastive losses, and whether they operate in offline or online update settings.
Online RLHF algorithms: RLHF innovated how to align language models with human preferences (Christiano et al., 2017, among others), but it is unstable to train and memory-intensive, requiring the parameterized policy model, the reward model, and the advantage model to all be on device during training.
Reward-model Augmented SFT: Since the introduction of RLHF, several emergent techniques apply reward models in various ways, such as to filter training data or rank responses. Reward rAnked FineTuning (RAFT) (Dong et al., 2023) and RRHF (Yuan et al., 2023b) offer the conceptually simplest solution for offline preference learning: sample multiple outputs from a policy, rank them with a reward model, and then finetune on the best sampled output using SFT. This resembles the iterative behavior-cloning technique DAgger (Ross et al., 2011).
Offline Contrastive Preference Learning: There exist several loss functions for contrastive preference learning, first introduced in the offline setting, namely Direct Preference Optimization (DPO; Rafailov et al., 2023) and Calibrated Sequence Likelihood Estimation, a.k.a. SLiC (Zhao et al., 2023). Azar et al. (2023) make it clear that point-wise reward estimates are no substitute for pair-wise preferences, and that a policy can easily overfit to deterministic preferences without proper regularization. They derive a more general objective for RLHF, IPO, to directly optimize offline preference probabilities.
Statistical Rejection Sampling Optimization (RSO) generates multiple samples from an initial model, ranks them to create training pairs, and optimizes them under a unified framework encompassing DPO and SLiC (Liu et al., 2024b). Inspired by the learning-to-rank literature, Listwise Preference Optimization (LIPO) extends pair-wise preference learning to list-wise (Liu et al., 2024a). Preference Ranking Optimization (PRO) also learns towards list-wise preferences (Song et al., 2024). The KTO algorithm takes a different approach from DPO: it does not assume that a pair of good-vs-bad outputs exists for the same input, but rather that pools of good and bad outputs exist across inputs, and it optimizes an "unpaired" loss (Ethayarajh et al., 2024).
Iterative Reward-based Finetuning: Reinforced Self-Training (ReST) is one of the first methods to explore iterative self-improving training strategies, framed as a two-stage procedure: a "Grow" step that samples from the current policy, and an "Improve" step that uses a reward model to filter ever-higher-quality samples that are then used to improve the policy with offline RL (Gulcehre et al., 2023). A follow-up work explores the use of AI feedback rather than reward ranking (Singh et al., 2023).
On-policy Contrastive Learning: Self-Rewarding Language Models (Yuan et al., 2024) is in practice very similar to DNO. They study the benefits of batched iterative training on preferences derived from a recent policy's sampled outputs, but in their work they use the policy itself as the annotator, which initially provides only weak preference signals. Self-Play Fine-Tuning (SPIN) (Chen et al., 2024) and Adversarial Preference Optimization (APO) (Cheng et al., 2023) are both iterative LLM training techniques compatible with contrastive losses, but they make the very limiting assumption that the teacher is always better than the student (without regard to any annotator feedback).
The Cringe Loss (Adolphs et al., 2022) is a token-level loss function that contrasts the correct next token with a hard-negative token from the vocabulary that has a high logit weight but is still incorrect. The Pairwise Cringe Loss (Xu et al., 2023b) applies the cringe loss to an iterative self-improving style of training.
On-Policy General Preference Optimization: Wang et al. (2023) consider finding the von Neumann winner of general preferences via multi-agent RL from a theoretical perspective. Nash-MD optimizes a policy towards the Nash equilibrium of a generalized preference model using policy gradients, showing that by sampling from a mixture of policies, one can converge to the Nash equilibrium in the last iteration (Munos et al., 2023). Self-play Preference Optimization (SPO) is another online two-player mini-max game that converges to a Nash equilibrium with no-regret guarantees (Swamy et al., 2024). However, these techniques are not as data-efficient as contrastive losses and are difficult to implement faithfully without cumbersome two-timescale updates (Munos et al., 2023). A concurrent improvement, IPO-MD, mitigates these difficulties by using purely on-policy IPO updates and is empirically evaluated on an article summarization task (Calandriello et al., 2024). Guo et al. (2024) also propose to eliminate rewards in online AI feedback (OAIF) by using another LLM to annotate which of two online-sampled outputs from the current policy is preferred. However, all of the above studies only consider training pairs constructed between self-play "student vs. student" samples, or between the student and the initial πref; there is no concept of a more powerful "teacher" to compare against in their training pairs. We showed in our experiments that omitting such teacher comparisons can hinder performance.
7 Conclusion
In this paper we achieve the dual goals of post-training LLMs against a more general class of preference models while providing a practical and scalable implementation with finite-sample analysis. Our strong empirical results are based on the insight that optimizing general preference functions can be reduced to finding the Nash equilibrium of a two-player game with the preference as the payoff, which can further be solved by a single-play algorithm. Most techniques that optimize this objective use soft policy iteration, which is difficult to implement faithfully and may require unstable on-policy and two-timescale updates. Our contribution, Direct Nash Optimization, addresses these challenges by approximating soft policy iteration updates with a regression-based contrastive objective in a batched manner, which is a much more stable and forgiving learning objective, and we establish a concentration bound of $\tilde{\mathcal{O}}(1/N)$ on the squared total variation error between the learned policy and its target, the soft policy iteration update, at any given iteration $t$. Theoretically, DNO converges to the Nash equilibrium on average, but in practice it enjoys monotonic improvement across iterations. Training a 7B parameter LLM with DNO achieves state-of-the-art performance on AlpacaEval 2.0, exceeding both Mistral Large and older versions of GPT-4. We illuminate many of the practical design choices that will aid future development of iterative self-improving algorithms.
References
Leonard Adolphs, Tianyu Gao, Jing Xu, Kurt Shuster, Sainbayar Sukhbaatar, and Jason Weston. The cringe loss:
Learning what language not to model.arXiv preprint arXiv:2211.05826, 2022.
Riad Akrour, Marc Schoenauer, and Michèle Sebag. April: Active preference learning-based reinforcement learning.
InMachine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2012, Bristol,
UK, September 24-28, 2012. Proceedings, Part II 23, pages 116–131. Springer, 2012.
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in
ai safety.arXiv preprint arXiv:1606.06565, 2016.
Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and
Rémi Munos. A general theoretical paradigm to understand learning from human preferences.arXiv preprint
arXiv:2310.12036, 2023.
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort,
Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk,
Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt,
Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann,
and Jared Kaplan. Training a helpful and harmless assistant with reinforcement learning from human feedback.
arXiv preprint arXiv:2204.05862, 2022a.
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna
Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Her-
nandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey
Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas
Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna
Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan,
Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam Mc-
Candlish, Tom Brown, and Jared Kaplan. Constitutional ai: Harmlessness from ai feedback.arXiv preprint
arXiv:2212.08073, 2022b.
Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.
Quentin Bertrand, Wojciech Marian Czarnecki, and Gauthier Gidel. On the limitations of the elo, real-world games
are transitive, not additive. InInternational Conference on Artificial Intelligence and Statistics, pages 2905–2921.
PMLR, 2023.
Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired
comparisons.Biometrika, 39(3/4):324–345, 1952.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom
Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark
Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish,
Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners.Advances in neural
information processing systems, 33:1877–1901, 2020.
Sébastien Bubeck. Convex optimization: Algorithms and complexity.Foundations and Trends®in Machine Learning,
8(3-4):231–357, 2015.
Daniele Calandriello, Daniel Guo, Remi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey
Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, et al. Human alignment of large language models through
online preference optimisation.arXiv preprint arXiv:2403.08635, 2024.
Nicolo Cesa-Bianchi and Gábor Lugosi.Prediction, learning, and games. Cambridge university press, 2006.
Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. InInternational
Conference on Machine Learning, pages 1042–1051. PMLR, 2019.
Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language
models to strong language models.arXiv preprint arXiv:2401.01335, 2024.
Pengyu Cheng, Yifan Yang, Jian Li, Yong Dai, and Nan Du. Adversarial preference optimization.arXiv preprint
arXiv:2311.08045, 2023.
Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement
learning from human preferences.Advances in neural information processing systems, 30, 2017.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord.
Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv:1803.05457v1, 2018.
Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun.
Ultrafeedback: Boosting language models with high-quality feedback.arXiv preprint arXiv:2310.01377, 2023.
Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen
Zhou. Enhancing chat language models by scaling high-quality instructional conversations.arXiv preprint
arXiv:2305.14233, 2023.
Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, SHUM
KaShun, and Tong Zhang. Raft: Reward ranked finetuning for generative foundation model alignment.Transactions
on Machine Learning Research, 2023.
Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy S.
Liang, and Tatsunori B. Hashimoto. Alpacafarm: A simulation framework for methods that learn from human
feedback.Advances in Neural Information Processing Systems, 36, 2023.
Miroslav Dudík, Katja Hofmann, Robert E. Schapire, Aleksandrs Slivkins, and Masrour Zoghi. Contextual dueling
bandits. InConference on Learning Theory, pages 563–587. PMLR, 2015.
Arpad E. Elo.The rating of chessplayers, past and present. Arco Pub., New York, 1978. ISBN 0668047216
9780668047210. URLhttp://www.amazon.com/Rating-Chess-Players-Past-Present/dp/0668047216.
Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as
prospect theoretic optimization.arXiv preprint arXiv:2402.01306, 2024.
Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial
networks, inverse reinforcement learning, and energy-based models.arXiv preprint arXiv:1611.03852, 2016a.
Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy
optimization. InInternational conference on machine learning, pages 49–58. PMLR, 2016b.
Peter C. Fishburn. Probabilistic social choice based on simple voting comparisons.The Review of Economic Studies,
51(4):683–692, 1984.
Dylan J. Foster and Akshay Krishnamurthy. Efficient first-order contextual bandits: Prediction, allocation, and
triangular discrimination.Advances in Neural Information Processing Systems, 34:18907–18919, 2021.
Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to
boosting.Journal of computer and system sciences, 55(1):119–139, 1997.
Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles L. Isbell, and Andrea L. Thomaz. Policy shaping:
Integrating human feedback with reinforcement learning.Advances in neural information processing systems, 26,
2013.
Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya
Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando
de Freitas. Reinforced self-training (rest) for language modeling.arXiv preprint arXiv:2308.08998, 2023.
Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas
Mesnard, Yao Zhao, Bilal Piot, Johan Ferret, and Mathieu Blondel. Direct language model alignment from online
ai feedback.arXiv preprint arXiv:2402.04792, 2024.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep
reinforcement learning with a stochastic actor. InInternational conference on machine learning, pages 1861–1870.
PMLR, 2018.
Elad Hazan et al. Introduction to online convex optimization.Foundations and Trends®in Optimization, 2(3-4):
157–325, 2016.
Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. InProceedings of the
Nineteenth International Conference on Machine Learning, pages 267–274, 2002.
Sham M. Kakade. A natural policy gradient.Advances in neural information processing systems, 14, 2001.
Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems.Journal of Computer and System
Sciences, 71(3):291–307, 2005.
W Bradley Knox and Peter Stone. Tamer: Training an agent manually via evaluative reinforcement. In2008 7th IEEE
international conference on development and learning, pages 292–297. IEEE, 2008.
Gerald H. Kramer. On a class of equilibrium conditions for majority rule.Econometrica: Journal of the Econometric
Society, pages 285–297, 1973.
Germain Kreweras. Aggregation of preference orderings. InMathematics and Social Sciences I: Proceedings of
the seminars of Menthon-Saint-Bernard, France (1–27 July 1960) and of Gösing, Austria (3–27 July 1962), pages
73–79, 1965.
Tor Lattimore and Csaba Szepesvári.Bandit algorithms. Cambridge University Press, 2020.
Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and
Abhinav Rastogi. Rlaif: Scaling reinforcement learning from human feedback with ai feedback.arXiv preprint
arXiv:2309.00267, 2023.
Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh,
Simon Baumgartner, Jialu Liu, Peter J. Liu, and Xuanhui Wang. Lipo: Listwise preference optimization through
learning-to-rank.arXiv preprint arXiv:2402.01878, 2024a.
Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J. Liu, and Jialu Liu. Statistical rejection
sampling improves preference optimization. InThe Twelfth International Conference on Learning Representations,
2024b.
Arindam Mitra, Luciano Del Corro, Shweti Mahajan, Andres Codas, Clarisse Simoes, Sahaj Agarwal, Xuxi Chen,
Anastasia Razdaibiedina, Erik Jones, Kriti Aggarwal, Hamid Palangi, Guoqing Zheng, Corby Rosset, Hamed
Khanpour, and Ahmed Awadallah. Orca 2: Teaching small language models how to reason.arXiv preprint
arXiv:2311.11045, 2023.
Arindam Mitra, Hamed Khanpour, Corby Rosset, and Ahmed Awadallah. Orca-math: Unlocking the potential of slms
in grade school math.arXiv preprint arXiv:2402.14830, 2024.
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver,
and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. InInternational conference on
machine learning, pages 1928–1937. PMLR, 2016.
Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel
Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev,
Olivier Bachem, Daniel J. Mankowitz, Doina Precup, and Bilal Piot. Nash learning from human feedback.arXiv
preprint arXiv:2312.00886, 2023.
Arkadij Semenovič Nemirovskij and David Borisovich Yudin.Problem complexity and method efficiency in optimiza-
tion. Wiley-Interscience, 1983.
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo
Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie
Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel
Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman,
Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany
Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby
Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah
Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan,
Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix,
Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun
Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua
Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes
Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu,
Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny
Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider,
Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik
Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis,
Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel
Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia
Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer,
Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David
Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan
Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano,
Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki,
Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos,
Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde
de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power,
Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra
Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar,
Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov,
Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina
Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers,
Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng,
Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea
Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann,
Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel
Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin
Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng,
Juntang Zhuang, William Zhuk, and Barret Zoph. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini
Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens,
Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow
instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022.
Art B. Owen.Monte Carlo theory, methods and examples.https://artowen.su.domains/mc/, 2013.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct pref-
erence optimization: Your language model is secretly a reward model.Advances in Neural Information Processing
Systems, 36, 2023.
Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to
no-regret online learning. InProceedings of the fourteenth international conference on artificial intelligence and
statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.
Corby Rosset, Guoqing Zheng, Victor Dibia, Ahmed Awadallah, and Paul Bennett. Axiomatic preference modeling for
longform question answering. InProceedings of the 2023 Conference on Empirical Methods in Natural Language
Processing, pages 11445–11475, 2023.
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization.
InInternational conference on machine learning, pages 1889–1897. PMLR, 2015.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization
algorithms.arXiv preprint arXiv:1707.06347, 2017.
Shai Shalev-Shwartz et al. Online learning and online convex optimization.Foundations and Trends®in Machine
Learning, 4(2):107–194, 2012.
Paul B. Simpson. On defining areas of voter choice: Professor tullock on stable voting.The Quarterly Journal of
Economics, 83(3):478–490, 1969.
Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J. Liu, James
Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, Abhishek Kumar, Alex Alemi, Alex Rizkowsky, Azade Nova, Ben
Adlam, Bernd Bohnet, Gamaleldin Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper
Snoek, Jeffrey Pennington, Jiri Hron, Kathleen Kenealy, Kevin Swersky, Kshiteej Mahajan, Laura Culp, Lechao
Xiao, Maxwell L. Bileschi, Noah Constant, Roman Novak, Rosanne Liu, Tris Warkentin, Yundi Qian, Yamini
Bansal, Ethan Dyer, Behnam Neyshabur, Jascha Sohl-Dickstein, and Noah Fiedel. Beyond human data: Scaling
self-training for problem-solving with language models.arXiv preprint arXiv:2312.06585, 2023.
Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, and Houfeng Wang. Preference ranking
optimization for human alignment. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38,
pages 18990–18998, 2024.
Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei,
and Paul F. Christiano. Learning to summarize with human feedback.Advances in Neural Information Processing
Systems, 33:3008–3021, 2020.
Gokul Swamy, Christoph Dann, Rahul Kidambi, Zhiwei Steven Wu, and Alekh Agarwal. A minimaximalist approach
to reinforcement learning from human feedback.arXiv preprint arXiv:2401.04056, 2024.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov,
Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen,
Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj
Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor
Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut
Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra,
Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan
Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams,
Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan
Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and
fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023.
Hoang Tran, Chris Glaze, and Braden Hancock. snorkelai/snorkel-mistral-pairrm-dpo, 2024. https://huggingface.co/snorkelai/Snorkel-Mistral-PairRM-DPO.
Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang,
Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush,
and Thomas Wolf. Zephyr: Direct distillation of lm alignment.arXiv preprint arXiv:2310.16944, 2023.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny
Zhou. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171,
2022.
Yuanhao Wang, Qinghua Liu, and Chi Jin. Is rlhf more difficult than standard rl?arXiv preprint arXiv:2306.14111,
2023.
Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes Fürnkranz. A survey of preference-based reinforcement
learning methods.Journal of Machine Learning Research, 18(136):1–46, 2017.
Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. Bellman-consistent pessimism for
offline reinforcement learning.Advances in neural information processing systems, 34:6683–6694, 2021.
Tengyang Xie, Dylan J. Foster, Yu Bai, Nan Jiang, and Sham M. Kakade. The role of coverage in online reinforcement
learning. InThe Eleventh International Conference on Learning Representations, 2023.
Wei Xiong, Hanze Dong, Chenlu Ye, Han Zhong, Nan Jiang, and Tong Zhang. Iterative preference learning from
human feedback: Bridging theory and practice for rlhf under kl-constraint.arXiv preprint arXiv:2312.11456, 2023.
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm:
Empowering large language models to follow complex instructions, 2023a.
Jing Xu, Andrew Lee, Sainbayar Sukhbaatar, and Jason Weston. Some things are more cringe than others: Preference
optimization with the pairwise cringe loss.arXiv preprint arXiv:2312.16682, 2023b.
Kevin Yang, Dan Klein, Asli Celikyilmaz, Nanyun Peng, and Yuandong Tian. Rlcd: Reinforcement learning from
contrast distillation for language model alignment.arXiv preprint arXiv:2307.12950, 2023.
Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian
Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models, 2023.
Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-
rewarding language models.arXiv preprint arXiv:2401.10020, 2024.
Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, and Chang Zhou. Scaling relationship on
learning mathematical reasoning with large language models.arXiv preprint arXiv:2308.01825, 2023a.
Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. Rrhf: Rank responses to align
language models with human feedback without tears.arXiv preprint arXiv:2304.05302, 2023b.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish
your sentence?arXiv preprint arXiv:1905.07830, 2019.
Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, and Wen Sun. Provable offline preference-based
reinforcement learning. InThe Twelfth International Conference on Learning Representations, 2024.
Yao Zhao, Misha Khalman, Rishabh Joshi, Shashi Narayan, Mohammad Saleh, and Peter J. Liu. Calibrating se-
quence likelihood improves conditional language generation. InThe Eleventh International Conference on Learning
Representations, 2023.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li,
Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench
and chatbot arena.Advances in Neural Information Processing Systems, 36, 2023.
Appendix
A Extension to Regularized Preferences
In this section, we discuss how to extend the DNO framework to the case of regularized preferences, defined as
$$
P^{\pi,\pi'}_{\tau}(y \succ y' \mid x) = P(y \succ y' \mid x) - \tau \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \tau \log \frac{\pi'(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
$$
which was first introduced and solved by Munos et al. (2023) via the Nash-MD algorithm introduced earlier.
One can notice that the only difference between SPO and Nash-MD is that SPO uses the last-iteration policy $\pi_t$ both for constructing the reward $r_t$ and for performing the soft policy iteration update, whereas Nash-MD uses the smoothed version $\pi^{\tau}_t$
(first defined in Eq. (14)),
$$
\pi^{\tau}_t(y \mid x) := \frac{\pi_t(y \mid x)^{1-\tau/\eta}\, \pi_{\mathrm{ref}}(y \mid x)^{\tau/\eta}}{\sum_{y' \in \mathcal{Y}} \pi_t(y' \mid x)^{1-\tau/\eta}\, \pi_{\mathrm{ref}}(y' \mid x)^{\tau/\eta}}, \quad \forall (x, y) \in \mathcal{X} \times \mathcal{Y}, \tag{14}
$$
for both. This allows Nash-MD to obtain a last-iteration guarantee.
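A minimal sketch of this quantity (our reading of Eq. (14); the function name is ours) computes the unnormalized log-probability of the geometric mixture from the two sequence log-probabilities:

```python
def smoothed_mixture_logprob(logp_current, logp_ref, tau, eta):
    """Unnormalized log pi_t^tau(y|x) per Eq. (14):
    (1 - tau/eta) * log pi_t(y|x) + (tau/eta) * log pi_ref(y|x).
    Normalizing over all responses y is intractable for LLMs, which is why the
    contrastive objectives in Algorithms 3 and 4 use the unnormalized version."""
    beta = tau / eta
    return (1.0 - beta) * logp_current + beta * logp_ref
```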
On the other hand, due to the symmetry of regularized preferences, if we consider the on-average convergence case, SPO can likely be adapted in a simpler way as follows: for each $t = 1, 2, \ldots, T$,
$$
\text{(i)}\quad r_t(x, y) \leftarrow \mathbb{E}_{y' \sim \pi_t(\cdot \mid x)}\big[P(y \succ y' \mid x)\big], \quad \forall (x, y) \in \mathcal{X} \times \mathcal{Y};
$$
$$
\text{(ii)}\quad \pi_{t+1}(\cdot \mid x) \leftarrow \frac{1}{Z_t(x)}\, \pi^{\tau}_t(\cdot \mid x) \exp\!\left(\frac{r_t(x, \cdot)}{\eta}\right), \quad \forall x \in \mathcal{X},
$$
where $Z_t(x) := \sum_{y \in \mathcal{Y}} \pi^{\tau}_t(y \mid x) \exp\!\left(\frac{r_t(x, y)}{\eta}\right)$ is the partition function for iteration $t$. Here, the smoothed policy $\pi^{\tau}_t$ is only used in the soft policy iteration step, and this coincides with the OMD algorithm of Munos et al. (2023).
Algorithm 3 DNO (Regularized Preferences Version)
input: General preference function $P$, learning rate $\eta$, coefficient of KL-regularization $\tau$, number of iterations $T$, prompt distribution $\rho$.
1: Initialize $\pi_1 \leftarrow \mathrm{unif}(\mathcal{A})$.
2: for iteration $t = 1, 2, \ldots, T$ do
3: Compute $r_t(x, y)$ by
   Option I (for on-average convergence):
   $r_t(x, y) \leftarrow \mathbb{E}_{y' \sim \pi_t(\cdot \mid x)}[P(y \succ y' \mid x)]$, $\forall (x, y) \in \mathcal{X} \times \mathcal{Y}$.
   Option II (for last-iteration convergence):
   $r_t(x, y) \leftarrow \mathbb{E}_{y' \sim \pi^{\tau}_t(\cdot \mid x)}[P(y \succ y' \mid x)]$, $\forall (x, y) \in \mathcal{X} \times \mathcal{Y}$, where $\pi^{\tau}_t$ is defined in Eq. (14).
4: Obtain $\pi_{t+1}$ by
$$
\pi_{t+1} \leftarrow \operatorname*{argmax}_{\pi \in \Pi}\; \mathbb{E}_{(x, y_1, y_2) \sim \mathcal{D}_t}\bigg\{ \sigma\big(r_t(x, y_1) - r_t(x, y_2)\big) \log \sigma\bigg(\eta \log \frac{\pi(y_1 \mid x)}{\widetilde{\pi}^{\tau}_t(y_1 \mid x)} - \eta \log \frac{\pi(y_2 \mid x)}{\widetilde{\pi}^{\tau}_t(y_2 \mid x)}\bigg) + \sigma\big(r_t(x, y_2) - r_t(x, y_1)\big) \log \sigma\bigg(\eta \log \frac{\pi(y_2 \mid x)}{\widetilde{\pi}^{\tau}_t(y_2 \mid x)} - \eta \log \frac{\pi(y_1 \mid x)}{\widetilde{\pi}^{\tau}_t(y_1 \mid x)}\bigg) \bigg\},
$$
   where $\mathcal{D}_t$ is generated by $x \sim \rho$, $y_1 \sim \mu_{1,t}(\cdot \mid x)$, $y_2 \sim \mu_{2,t}(\cdot \mid x)$ with some policies $\mu_{1,t}$ and $\mu_{2,t}$, and $\widetilde{\pi}^{\tau}_t(y \mid x) := \pi_t(y \mid x)^{1-\tau/\eta}\, \pi_{\mathrm{ref}}(y \mid x)^{\tau/\eta}$, $\forall (x, y) \in \mathcal{X} \times \mathcal{Y}$ (the unnormalized version of $\pi^{\tau}_t(y \mid x)$ defined in Eq. (14)).
5: end for
6: return $\bar{\pi} = \mathrm{unif}(\pi_{1:T})$.
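The regression step (line 4) of Algorithm 3 can be read as a soft-label binary cross-entropy. Below is a PyTorch sketch of our reading of that step, not official code; all argument names are ours, and inputs are summed sequence log-probabilities.

```python
import torch
import torch.nn.functional as F

def regularized_dno_loss(logp_pi_y1, logp_pi_y2, logp_mix_y1, logp_mix_y2,
                         r_y1, r_y2, eta):
    """Step 4 of Algorithm 3 written as a minimization problem.
    logp_pi_*  : log pi(y_i | x) under the policy being trained
    logp_mix_* : log of the unnormalized smoothed reference pi~_t^tau(y_i | x)
    r_y1, r_y2 : estimated rewards r_t(x, y_1), r_t(x, y_2)
    """
    margin = eta * ((logp_pi_y1 - logp_mix_y1) - (logp_pi_y2 - logp_mix_y2))
    p12 = torch.sigmoid(r_y1 - r_y2)  # soft label sigma(r_t(x,y1) - r_t(x,y2))
    # Maximizing the objective in Algorithm 3 equals minimizing this soft
    # binary cross-entropy between p12 and sigma(margin).
    loss = -(p12 * F.logsigmoid(margin) + (1.0 - p12) * F.logsigmoid(-margin))
    return loss.mean()
```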
Based on the discussion above, we obtain the extension of DNO to regularized preferences in Algorithm 3, and its practical implementation in Algorithm 4. Note that, similar to Nash-MD, the last-iteration convergence option in both algorithms requires sampling from the smoothed policy $\pi^{\tau}_t$ (the mixture between $\pi_t$ and $\pi_{\mathrm{ref}}$ defined in Eq. (14)), which can be challenging in practice. One solution is to instead sample from a token-level mixture between $\pi_t$ and $\pi_{\mathrm{ref}}$, as suggested by Munos et al. (2023).
Algorithm 4 DNO-Prct (Regularized Preferences Version)
input: General preference function $P$, learning rate $\tilde{\eta}$, coefficient of KL-regularization $\tau$, number of iterations $T$, reference policy $\pi_{\mathrm{ref}}$, seed dataset $\mathcal{D}_0 = \{(x, y_{\mathrm{gold}})\}$ where $x \sim \rho$ and $y \sim \pi_{\mathrm{gold}}(\cdot \mid x)$.
1: Initialize $\pi_1 \leftarrow \pi_{\mathrm{ref}}$.
2: for iteration $t = 1, 2, \ldots, T$ do
3: Sample batched on-policy responses:
   Option I (for on-average convergence): sample $K$ outputs per prompt using the current $\pi_t$: $\{y^1_t, y^2_t, \ldots, y^K_t\} \sim \pi_t(\cdot \mid x)$, $\forall x \in \mathcal{D}_0$.
   Option II (for last-iteration convergence): sample $K$ outputs per prompt using the smoothed current policy $\pi^{\tau}_t$: $\{y^1_t, y^2_t, \ldots, y^K_t\} \sim \pi^{\tau}_t(\cdot \mid x)$, $\forall x \in \mathcal{D}_0$, where $\pi^{\tau}_t$ is defined in Eq. (14) with $\eta \to \tilde{\eta}$.
4: Rank responses: for each $x \in \mathcal{D}_0$, rank the corresponding $\{y^1_t, y^2_t, \ldots, y^K_t, y_{\mathrm{gold}}\}$ using the pair-wise win-rate by sampling from the general preference function $P$.
5: Filter and construct preference pairs: construct $\mathcal{D}_t = \{(x, y^+_t, y^-_t)\}$ for all $x \in \mathcal{D}_0$, where $(y^+_t, y^-_t)$ are large-margin pairs (based on the win-rate rank) within the responses for $x$ from the previous step.
6: Contrastive learning: obtain $\pi_{t+1}$ by
$$
\pi_{t+1} \leftarrow \operatorname*{argmax}_{\pi \in \Pi}\; \mathbb{E}_{(x, y^+_t, y^-_t) \sim \mathcal{D}_t} \log \sigma\bigg(\tilde{\eta} \log \frac{\pi(y^+_t \mid x)}{\widetilde{\pi}^{\tau}_t(y^+_t \mid x)} - \tilde{\eta} \log \frac{\pi(y^-_t \mid x)}{\widetilde{\pi}^{\tau}_t(y^-_t \mid x)}\bigg),
$$
   where $\widetilde{\pi}^{\tau}_t(y \mid x) := \pi_t(y \mid x)^{1-\tau/\tilde{\eta}}\, \pi_{\mathrm{ref}}(y \mid x)^{\tau/\tilde{\eta}}$, $\forall (x, y) \in \mathcal{X} \times \mathcal{Y}$ (the unnormalized version of $\pi^{\tau}_t(y \mid x)$ defined in Eq. (14), after substituting $\eta \to \tilde{\eta}$).
7: end for
8: return the best of $\pi_{1:(T+1)}$ on the validation data.
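Steps 4–5 of Algorithm 4 (ranking by pairwise win-rate and keeping large-margin pairs) can be sketched as follows; this is an illustration under our assumptions, not released code, with `pref` standing in for sampling from the general preference function $P$.

```python
def large_margin_pairs(x, candidates, pref, n_samples=3, margin=0.5):
    """Rank `candidates` (K on-policy samples plus the gold/teacher output) by
    empirical pairwise win-rate, then return a large-margin (best, worst) pair.
    `pref(x, y, y_prime)` is a placeholder returning True when y is preferred."""
    def win_rate(y):
        opponents = [o for o in candidates if o is not y]
        trials = n_samples * len(opponents)
        wins = sum(pref(x, y, o) for o in opponents for _ in range(n_samples))
        return wins / max(trials, 1)

    ranked = sorted(candidates, key=win_rate, reverse=True)
    best, worst = ranked[0], ranked[-1]
    if win_rate(best) - win_rate(worst) >= margin:
        return [(x, best, worst)]
    return []  # no sufficiently separated pair for this prompt
```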
B Detailed Proofs
In this section, we provide detailed proofs for our theoretical results. Note that the definitions and assumptions presented here heavily adopt ideas related to version spaces and concentrability from the reinforcement learning theory literature. Nevertheless, the descriptions provided herein are intentionally simplified to elucidate the core insights into the algorithmic design. A full and exhaustive theoretical analysis falls outside the primary scope of this paper. We now make the following definitions and assumptions.
Definition 1 (Feasible solution space). For each iteration $t \in [T]$, we define $\Pi_t \subseteq \Pi$ as the feasible solution space for iteration $t$: the $\pi_t$ obtained by the algorithm is guaranteed to belong to $\Pi_t$, regardless of the randomness of the data sampling procedure. Here, $\Pi_t$ only contains policies that have a small empirical loss, which can be further converted to a small population loss under standard concentration procedures.
Definition 2 (Concentrability coefficient over the feasible solution space). For all $t \in [T]$, suppose $\Pi_t$ is defined in Definition 1, and $\mu_{1,t}$ and $\mu_{2,t}$ are given data-generating policies. For any $t \in [T]$, we define $C_t$ to be the concentrability coefficient at iteration $t$ over its feasible solution space, where
$$
\frac{\mathbb{E}_{x \sim \rho,\, y_1 \sim \pi^{\star}_{t+1}(\cdot \mid x),\, y_2 \sim \pi_{t+1}(\cdot \mid x)}\left[\left(\log \frac{\pi^{\star}_{t+1}(y_1 \mid x)}{\pi_{t+1}(y_1 \mid x)} - \log \frac{\pi^{\star}_{t+1}(y_2 \mid x)}{\pi_{t+1}(y_2 \mid x)}\right)^2\right]}{\mathbb{E}_{x \sim \rho,\, y_1 \sim \mu_{1,t}(\cdot \mid x),\, y_2 \sim \mu_{2,t}(\cdot \mid x)}\left[\left(\log \frac{\pi^{\star}_{t+1}(y_1 \mid x)}{\pi_{t+1}(y_1 \mid x)} - \log \frac{\pi^{\star}_{t+1}(y_2 \mid x)}{\pi_{t+1}(y_2 \mid x)}\right)^2\right]} \le C_t,
$$
for any $\pi_{t+1} \in \Pi_{t+1}$ and any $\pi^{\star}_{t+1} \in \left\{\frac{1}{Z_{\pi}(x)}\, \pi(\cdot \mid x) \exp\!\left(\frac{r_{\pi}(x, \cdot)}{\eta}\right) : \pi \in \Pi_t\right\}$; here we use the definitions $r_{\pi}(x, y) := \mathbb{E}_{y' \sim \pi(\cdot \mid x)}[P(y \succ y' \mid x)]$, $\forall (x, y) \in \mathcal{X} \times \mathcal{Y}$, and $Z_{\pi}(x) = \sum_{y \in \mathcal{Y}} \pi(y \mid x) \exp\!\left(\frac{r_{\pi}(x, y)}{\eta}\right)$, $\forall x \in \mathcal{X}$.
Definition 2 adapts the standard notion of a concentrability coefficient from the reinforcement learning theory literature to our setup.
Assumption 1 (Realizability over the feasible solution space). For any $\pi \in \Pi_t$, where $\Pi_t$ is defined in Definition 1 for all $t \in [T]$, we assume the following soft policy iteration update belongs to the policy class $\Pi$:
$$
\pi^{\mathrm{SPI}}(\cdot \mid x) := \frac{1}{Z_{\pi}(x)}\, \pi(\cdot \mid x) \exp\!\left(\frac{r_{\pi}(x, \cdot)}{\eta}\right),
$$
where $r_{\pi}(x, y) := \mathbb{E}_{y' \sim \pi(\cdot \mid x)}[P(y \succ y' \mid x)]$, $\forall (x, y) \in \mathcal{X} \times \mathcal{Y}$, and $Z_{\pi}(x) = \sum_{y \in \mathcal{Y}} \pi(y \mid x) \exp\!\left(\frac{r_{\pi}(x, y)}{\eta}\right)$, $\forall x \in \mathcal{X}$, is the partition function.
Assumption 2 (Boundedness over the feasible solution space). Suppose $\Pi_t$ is defined in Definition 1 for all $t \in [T]$; then we assume $\log \frac{\pi(y \mid x)}{\pi_t(y \mid x)} \in [-R_{\max}, R_{\max}]$ for all $\pi \in \Pi$, $\pi_t \in \Pi_t$, and $(x, y) \in \mathcal{X} \times \mathcal{Y}$.
Assumption 2 is a boundedness condition on the log density ratio. Nonetheless, it is important to note that the value of $\log \frac{\pi(y \mid x)}{\pi_t(y \mid x)}$ is directly measurable and controllable in practice, which is different from common use cases such as maximum likelihood problems.
Theorem 2 (Formal version of the informal theorem in the main text). Suppose Assumptions 1 and 2 hold, and fix an arbitrary iteration $t \in [T]$. Suppose $\pi_{t+1}$ is obtained from the regression step of DNO at iteration $t$, and $\pi^{\star}_{t+1}$ is the corresponding target of the soft policy iteration update. Then, we have
$$
\mathbb{E}_{x \sim \rho}\Big[\big(D_{\mathrm{TV}}(\pi_{t+1}(\cdot \mid x), \pi^{\star}_{t+1}(\cdot \mid x))\big)^2\Big] \le \mathcal{O}\!\left(\frac{C_t R_{\max}^2 \log(|\Pi|/\delta)}{N}\right),
$$
where $C_t$ is defined in Definition 2.
Proof of Theorem 2. We will now present the proof using the following two-step procedure.
Step 1: From regression with log loss to a squared error bound. By standard results on regression with the logarithmic loss, we know
$$
\mathbb{E}_{(x, y_1, y_2) \sim \mathcal{D}_t}\bigg[ \sigma\big(\Delta^{\star}_t(x, y_1, y_2)\big) \log \frac{\sigma\big(\Delta^{\star}_t(x, y_1, y_2)\big)}{\sigma\big(\Delta_{\pi_{t+1}, t}(x, y_1, y_2)\big)} + \sigma\big(\Delta^{\star}_t(x, y_2, y_1)\big) \log \frac{\sigma\big(\Delta^{\star}_t(x, y_2, y_1)\big)}{\sigma\big(\Delta_{\pi_{t+1}, t}(x, y_2, y_1)\big)} \bigg] \lesssim \frac{\log(|\Pi|/\delta)}{N}, \tag{15}
$$
recalling that $\Delta^{\star}_t(x, y_1, y_2) = r_t(x, y_1) - r_t(x, y_2)$ and $\Delta_{\pi, t}(x, y_1, y_2) = r_{\pi, t}(x, y_1) - r_{\pi, t}(x, y_2)$.
Note that similar results also apply beyond finite $\Pi$; for simplicity, we omit the detailed discussion in our paper. For more in-depth discussion of regression with the logarithmic loss, the reader can refer to, e.g., Foster and Krishnamurthy (2021).
Next, by Pinsker's inequality, we have for any $z, \hat{z} \in [0, 1]$,
$$
\frac{(z - \hat{z})^2}{2} \le z \log\Big(\frac{z}{\hat{z}}\Big) + (1 - z) \log\Big(\frac{1 - z}{1 - \hat{z}}\Big).
$$
Substituting $z$ and $\hat{z}$ with $\sigma(r_t(x, y_1) - r_t(x, y_2))$ and $\sigma(r_{\pi_{t+1}, t}(x, y_1) - r_{\pi_{t+1}, t}(x, y_2))$ in Eq. (15), we obtain
$$
\mathbb{E}_{(x, y_1, y_2) \sim \mathcal{D}_t}\Big[\big(\sigma(r_t(x, y_1) - r_t(x, y_2)) - \sigma(r_{\pi_{t+1}, t}(x, y_1) - r_{\pi_{t+1}, t}(x, y_2))\big)^2\Big] \lesssim \frac{\log(|\Pi|/\delta)}{N}, \tag{16}
$$
where $a \lesssim b$ means $a \le c \cdot b$ for some absolute constant $c$. Then, by standard concentration for the squared loss (e.g., Lemma A.4 of Xie et al. (2021) with $\gamma = 0$),
$$
\mathbb{E}_{(x, y_1, y_2) \sim \rho \times \mu_{1:2, t}}\Big[\big(\sigma(r_t(x, y_1) - r_t(x, y_2)) - \sigma(r_{\pi_{t+1}, t}(x, y_1) - r_{\pi_{t+1}, t}(x, y_2))\big)^2\Big] \lesssim \frac{\log(|\Pi|/\delta)}{N}, \tag{17}
$$
where we use "$\times$" as shorthand for the joint distribution; for example, $(x, y_1, y_2) \sim \rho \times \mu_{1:2, t}$ is shorthand for $x \sim \rho$, $y_1 \sim \mu_{1,t}(\cdot \mid x)$, $y_2 \sim \mu_{2,t}(\cdot \mid x)$.
By the definition of $r_t$, we know $r_t(x, y) \in [0, 1]$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$. Thus, by a variant of the mean value theorem, we know
$$
\big|r_t(x, y_1) - r_t(x, y_2) - r_{\pi_{t+1}, t}(x, y_1) + r_{\pi_{t+1}, t}(x, y_2)\big| \le \frac{\eta R_{\max}}{1 - \sigma(1)} \big|\sigma(r_t(x, y_1) - r_t(x, y_2)) - \sigma(r_{\pi_{t+1}, t}(x, y_1) - r_{\pi_{t+1}, t}(x, y_2))\big|, \tag{18}
$$
for any $(x, y_1, y_2) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{Y}$, where $R_{\max}$ is introduced in Assumption 2. This is because, letting $a := r_t(x, y_1) - r_t(x, y_2) \in [-1, 1]$ and $b := r_{\pi_{t+1}, t}(x, y_1) - r_{\pi_{t+1}, t}(x, y_2) \in [-\eta R_{\max}, \eta R_{\max}]$, we can directly verify that the slope we need to bound, $|a - b| / |\sigma(a) - \sigma(b)|$, reaches its maximum at $a = 1$ and $b = \eta R_{\max}$.
Combining Eqs. (17) and (18), we obtain
$$
\mathbb{E}_{(x, y_1, y_2) \sim \rho \times \mu_{1:2, t}}\Big[\big(r_t(x, y_1) - r_t(x, y_2) - r_{\pi_{t+1}, t}(x, y_1) + r_{\pi_{t+1}, t}(x, y_2)\big)^2\Big] \lesssim \frac{\eta^2 R_{\max}^2 \log(|\Pi|/\delta)}{N}. \tag{19}
$$
Step 2: Concentration in the policy space. We now reason about the concentration of $\pi_{t+1}$ to $\pi^{\star}_{t+1}$ from Eq. (19), where $\pi^{\star}_{t+1}$ is the soft policy iteration target and $\pi_{t+1}$ is the policy corresponding to the learned $r_{\pi_{t+1}, t}$. By the definition of $r_{\pi, t}$, we have
$$
\begin{aligned}
& r_t(x, y_1) - r_t(x, y_2) - r_{\pi_{t+1}, t}(x, y_1) + r_{\pi_{t+1}, t}(x, y_2) \\
&= r_t(x, y_1) - r_t(x, y_2) - \eta \log \frac{\pi_{t+1}(y_1 \mid x)}{\pi_t(y_1 \mid x)} + \eta \log \frac{\pi_{t+1}(y_2 \mid x)}{\pi_t(y_2 \mid x)} \\
&= \eta \log \frac{\pi^{\star}_{t+1}(y_1 \mid x)}{\pi_{t+1}(y_1 \mid x)} - \eta \log \frac{\pi^{\star}_{t+1}(y_2 \mid x)}{\pi_{t+1}(y_2 \mid x)}.
\end{aligned}
$$
This implies
$$
\mathbb{E}_{(x, y_1, y_2) \sim \rho \times \mu_{1:2, t}}\left[\left(\eta \log \frac{\pi^{\star}_{t+1}(y_1 \mid x)}{\pi_{t+1}(y_1 \mid x)} - \eta \log \frac{\pi^{\star}_{t+1}(y_2 \mid x)}{\pi_{t+1}(y_2 \mid x)}\right)^2\right] \lesssim \frac{\eta^2 R_{\max}^2 \log(|\Pi|/\delta)}{N}
$$
$$
\Longrightarrow\quad \mathbb{E}_{(x, y_1, y_2) \sim \rho \times \pi^{\star}_{t+1} \times \pi_{t+1}}\left[\left(\log \frac{\pi^{\star}_{t+1}(y_1 \mid x)}{\pi_{t+1}(y_1 \mid x)} - \log \frac{\pi^{\star}_{t+1}(y_2 \mid x)}{\pi_{t+1}(y_2 \mid x)}\right)^2\right] \lesssim \frac{C_t R_{\max}^2 \log(|\Pi|/\delta)}{N}, \tag{20}
$$
where the last step follows from the definition of $C_t$ (Definition 2).
On the other hand, we have
$$
\begin{aligned}
&\mathbb{E}_{(x, y_1, y_2) \sim \rho \times \pi^{\star}_{t+1} \times \pi_{t+1}}\left[\left(\log \frac{\pi^{\star}_{t+1}(y_1 \mid x)}{\pi_{t+1}(y_1 \mid x)} - \log \frac{\pi^{\star}_{t+1}(y_2 \mid x)}{\pi_{t+1}(y_2 \mid x)}\right)^2\right] \\
&= \mathbb{E}_{(x, y_1, y_2) \sim \rho \times \pi^{\star}_{t+1} \times \pi_{t+1}}\left[\left(\log \frac{\pi^{\star}_{t+1}(y_1 \mid x)}{\pi_{t+1}(y_1 \mid x)}\right)^2 + \left(\log \frac{\pi^{\star}_{t+1}(y_2 \mid x)}{\pi_{t+1}(y_2 \mid x)}\right)^2 - 2\left(\log \frac{\pi^{\star}_{t+1}(y_1 \mid x)}{\pi_{t+1}(y_1 \mid x)}\right)\left(\log \frac{\pi^{\star}_{t+1}(y_2 \mid x)}{\pi_{t+1}(y_2 \mid x)}\right)\right] \\
&= \mathbb{E}_{(x, y) \sim \rho \times \pi^{\star}_{t+1}}\left[\left(\log \frac{\pi^{\star}_{t+1}(y \mid x)}{\pi_{t+1}(y \mid x)}\right)^2\right] + \mathbb{E}_{(x, y) \sim \rho \times \pi_{t+1}}\left[\left(\log \frac{\pi^{\star}_{t+1}(y \mid x)}{\pi_{t+1}(y \mid x)}\right)^2\right] \\
&\quad + 2\, \mathbb{E}_{x \sim \rho}\Bigg[\underbrace{\mathbb{E}_{y \sim \pi^{\star}_{t+1}(\cdot \mid x)}\left[\log \frac{\pi^{\star}_{t+1}(y \mid x)}{\pi_{t+1}(y \mid x)}\right]}_{= D_{\mathrm{KL}}(\pi^{\star}_{t+1}(\cdot \mid x) \,\|\, \pi_{t+1}(\cdot \mid x))} \cdot \underbrace{\mathbb{E}_{y \sim \pi_{t+1}(\cdot \mid x)}\left[\log \frac{\pi_{t+1}(y \mid x)}{\pi^{\star}_{t+1}(y \mid x)}\right]}_{= D_{\mathrm{KL}}(\pi_{t+1}(\cdot \mid x) \,\|\, \pi^{\star}_{t+1}(\cdot \mid x))}\Bigg] \\
&\ge \mathbb{E}_{(x, y) \sim \rho \times \pi^{\star}_{t+1}}\left[\left(\log \frac{\pi^{\star}_{t+1}(y \mid x)}{\pi_{t+1}(y \mid x)}\right)^2\right] + \mathbb{E}_{(x, y) \sim \rho \times \pi_{t+1}}\left[\left(\log \frac{\pi^{\star}_{t+1}(y \mid x)}{\pi_{t+1}(y \mid x)}\right)^2\right]. \tag{21}
\end{aligned}
$$
Next, we fix an arbitrary $\tilde{x} \in \mathcal{X}$, and we have
$$
\begin{aligned}
&\mathbb{E}_{y \sim \pi^{\star}_{t+1}(\cdot \mid \tilde{x})}\left[\left(\log \frac{\pi^{\star}_{t+1}(y \mid \tilde{x})}{\pi_{t+1}(y \mid \tilde{x})}\right)^2\right] + \mathbb{E}_{y \sim \pi_{t+1}(\cdot \mid \tilde{x})}\left[\left(\log \frac{\pi^{\star}_{t+1}(y \mid \tilde{x})}{\pi_{t+1}(y \mid \tilde{x})}\right)^2\right] \\
&\ge \left(\mathbb{E}_{y \sim \pi^{\star}_{t+1}(\cdot \mid \tilde{x})}\left[\left|\log \frac{\pi^{\star}_{t+1}(y \mid \tilde{x})}{\pi_{t+1}(y \mid \tilde{x})}\right|\right]\right)^2 + \left(\mathbb{E}_{y \sim \pi_{t+1}(\cdot \mid \tilde{x})}\left[\left|\log \frac{\pi^{\star}_{t+1}(y \mid \tilde{x})}{\pi_{t+1}(y \mid \tilde{x})}\right|\right]\right)^2 \qquad \text{(by Jensen's inequality)} \\
&\gtrsim \left(\mathbb{E}_{y \sim \pi^{\star}_{t+1}(\cdot \mid \tilde{x})}\left[\left|\log \frac{\pi^{\star}_{t+1}(y \mid \tilde{x})}{\pi_{t+1}(y \mid \tilde{x})}\right|\right] + \mathbb{E}_{y \sim \pi_{t+1}(\cdot \mid \tilde{x})}\left[\left|\log \frac{\pi^{\star}_{t+1}(y \mid \tilde{x})}{\pi_{t+1}(y \mid \tilde{x})}\right|\right]\right)^2, \tag{22}
\end{aligned}
$$
where $a \gtrsim b$ means $a \ge c \cdot b$ for some absolute constant $c$.
We now recall the definition of an $f$-divergence: $D_f(p, q) := \mathbb{E}_{y \sim q}[f(p(y)/q(y))]$ for two distributions $p$ and $q$, where $f: \mathbb{R}^+ \to \mathbb{R}$ is convex with $f(1) = 0$. Thus, we can notice that
$$
\mathbb{E}_{y \sim p}\left[\left|\log \frac{p(y)}{q(y)}\right|\right] + \mathbb{E}_{y \sim q}\left[\left|\log \frac{p(y)}{q(y)}\right|\right] = D_{f_1}(p, q), \quad \text{where } f_1(u) := (1 + u) \cdot |\log(u)|,\ u \in \mathbb{R}^+. \tag{23}
$$
On the other hand, by the definition of the total variation distance, we know
$$
D_{\mathrm{TV}}(p, q) = D_{f_2}(p, q), \quad \text{where } f_2(u) := \tfrac{1}{2}|u - 1|,\ u \in \mathbb{R}^+. \tag{24}
$$
It is easy to verify that
$$
f_1(u) - f_2(u) = (1 + u) \cdot |\log(u)| - \tfrac{1}{2}|u - 1| \ge 0, \quad \text{for all } u \in \mathbb{R}^+, \tag{25}
$$
as it is a convex function on $\mathbb{R}^+$ with a minimum of $0$ at $u = 1$. Therefore, combining Eqs. (22)–(25), we obtain
$$
\mathbb{E}_{y \sim \pi^{\star}_{t+1}(\cdot \mid \tilde{x})}\left[\left(\log \frac{\pi^{\star}_{t+1}(y \mid \tilde{x})}{\pi_{t+1}(y \mid \tilde{x})}\right)^2\right] + \mathbb{E}_{y \sim \pi_{t+1}(\cdot \mid \tilde{x})}\left[\left(\log \frac{\pi^{\star}_{t+1}(y \mid \tilde{x})}{\pi_{t+1}(y \mid \tilde{x})}\right)^2\right] \gtrsim \big[D_{\mathrm{TV}}(\pi_{t+1}(\cdot \mid \tilde{x}), \pi^{\star}_{t+1}(\cdot \mid \tilde{x}))\big]^2. \tag{26}
$$
We then apply Eqs. (21) and (26), followed by Eq. (20):
$$
\begin{aligned}
\mathbb{E}_{x \sim \rho}\Big[\big(D_{\mathrm{TV}}(\pi_{t+1}(\cdot \mid x), \pi^{\star}_{t+1}(\cdot \mid x))\big)^2\Big]
&\lesssim \mathbb{E}_{(x, y_1, y_2) \sim \rho \times \pi^{\star}_{t+1} \times \pi_{t+1}}\left[\left(\log \frac{\pi^{\star}_{t+1}(y_1 \mid x)}{\pi_{t+1}(y_1 \mid x)} - \log \frac{\pi^{\star}_{t+1}(y_2 \mid x)}{\pi_{t+1}(y_2 \mid x)}\right)^2\right] \\
&\lesssim \frac{C_t R_{\max}^2 \log(|\Pi|/\delta)}{N},
\end{aligned}
$$
where the last step follows from Eq. (20). This completes the proof.
C Additional Experimental Details
Batched Prompting: We show in Figure 3 the annotation prompt submitted to GPT-4. For the sake of efficiency, we "batch" requests to GPT-4: instead of sending every pair of candidate responses to be annotated separately, we show all candidates side-by-side and ask GPT-4 to apply the scoring rubric to each one in the same context window.
Cost Analysis: We also give a brief cost analysis for the scaled-up experiment on 600k training inputs. The major line items are the cost of sampling outputs, annotating them with GPT-4 to construct training pairs, and then training the next iteration against those pairs. For each of the six iterations:
1. Sampling outputs from the current policy varies with the average output length and costs about $6,000 based on spot pricing.
2. Annotating the candidates with GPT-4 consumes on the order of 450M prompt tokens, with an average of about 60M completion tokens, amounting to about $34,000 based on the version of the endpoint we were using.
3. Training the next iteration of the policy against the constructed pairs.
Review the user’s question and the corresponding response using the additive 6-point scoring system described below. Points
are accumulated based on the satisfaction of each criterion:
1.
the question or other random question/answer pairs or examples, etc.
2.
incomplete or contains some irrelevant content.
3.
completely resolve the query or provide a direct answer.
4.
direct way, and is largely devoid of errors, extraneous content, or excess wordiness.
5.
aspects of the user’s question in a helpful, well-organized, focused, and succinct style.
6.
without any unnecessary information or wordiness, reflecting expert knowledge, amounting to a superbly high-quality,
focused, and insightful answer.
### User: <instructions>
### Candidate Responses:
<candidate answers>
### Instructions: After examining the user’s instruction and the response:
1.
2.
Remember to assess from the AI Assistant perspective, utilizing web search knowledge as necessary. To evaluate the response
in alignment with this additive scoring model, we’ll systematically attribute points based on the outlined criteria. Remember to
score EACH candidate answer separately. Format your justification and score only as a python dictionary with no extraneous
output e.g. {“id1_justification”: “...”, “id1_score”: 3, “id2_justification” : ...} with “idX” corresponding to the id of the X’th
answer.
{“id1_justification” :
Figure 3: The annotation prompt submitted to GPT-4 to collect multiple side-by-side preference scores for DNO. One of the candidates is also an output from GPT-4 itself, along with several outputs from the current policy.
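The prompt asks for a flat python-dictionary reply; a minimal parsing sketch (ours, assuming the reply is valid JSON with keys like "id1_score") might look like the following.

```python
import json

def parse_batched_scores(reply, num_candidates):
    """Parse a Figure-3-style reply into {candidate_index: score}.
    Assumes the annotator returned valid JSON with keys "id1_score",
    "id2_score", ...; production parsing would need to be more forgiving."""
    data = json.loads(reply)
    scores = {}
    for i in range(1, num_candidates + 1):
        key = f"id{i}_score"
        if key in data:
            scores[i] = float(data[key])
    return scores
```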
Example 1
Question: When will the earth run out of fresh water?
Example 1: Initial SFT model output
The earth will not run out of fresh water. There is a finite amount of fresh water on the planet, but it is constantly being
replenished through the water cycle. The water cycle involves the evaporation of water from the earth’s surface, its condensation
into clouds, and its precipitation back to the earth’s surface as rain, snow, or other forms of precipitation. This cycle ensures
that fresh water is always available, although the distribution of fresh water may vary depending on factors such as climate,
geography, and human use.
Example 1: DNO Iter 1
It is difficult to predict exactly when the Earth will run out of fresh water, as it depends on various factors such as climate
change, population growth, water usage, and technological advancements in water management and conservation. However,
current estimates suggest that we are already facing a global water crisis, with many regions experiencing water scarcity and stress.
According to the United Nations, about 2.2 billion people worldwide lack access to safe drinking water, and this number is
expected to increase due to population growth and climate change. The World Wildlife Fund (WWF) estimates that by 2025,
half of the world’s population will be living in water-stressed areas.
While the Earth’s total water supply remains constant, the availability of fresh water is decreasing due to overuse, pollution,
and climate change. It is important to note that fresh water is a renewable resource, but it takes time for it to replenish naturally.
Therefore, it is crucial to manage and conserve water resources to ensure their sustainability for future generations.
To address the global water crisis, it is necessary to implement sustainable water management practices, improve water
infrastructure, promote water conservation and efficiency, and raise awareness about the importance of water conservation.
In summary, while the Earth will not run out of water, the availability of fresh water is becoming increasingly scarce in many
regions. It is important to take action to manage and conserve water resources to ensure their sustainability for future generations.
It is also important to note that the availability of fresh water varies depending on the location and the specific water sources.
Some regions may face water scarcity sooner than others, and it is important to address these issues on a local level.
In conclusion, while it is difficult to predict exactly when the Earth will run out of fresh water, it is clear that we are already
facing a global water crisis. It is important to take action to manage and conserve water resources to ensure their sustainability
for future generations.
Reference:
United Nations. (2019). World Water Development Report 2019. Retrieved from https://www.unwater.org/publications/world-
water-development-report-2019/
World Wildlife Fund. (2019). Water Scarcity. Retrieved from https://www.worldwildlife.org/issues/water-scarcity
Note: The information provided in this response is based on current scientific knowledge and available data. However, it is
important to note that the field of water resources management is constantly evolving, and new research and findings may
emerge in the future.
Note 2: The response provided is based on the information available at the time of writing and may not reflect the most current
data or research. It is important to consult the latest scientific literature and data sources for the most up-to-date information on
the topic.
Note 3: The response provided is based on the assumption that the question is asking about the availability of fresh water
for human consumption. However, it is important to note that fresh water is also essential for ecosystems and other living
organisms. The availability of fresh water for all purposes is a complex issue that requires a holistic approach to management
and conservation.
Note 4: The response provided is based on the assumption that the question is asking about the availability of fresh water on a
global scale.
Example 1: DNO Iter 2
It is not accurate to say that the Earth will "run out" of fresh water, but rather that we could potentially deplete or
contaminate fresh water resources to a point where they can no longer sustain us. Fresh water is a renewable resource, but
its availability depends on various factors such as climate, geography, and human activities. Here are some key points to consider:
1. Water cycle: The water cycle, which includes evaporation, condensation, precipitation, and runoff, constantly replenishes
fresh water resources. However, the rate of replenishment varies depending on the region and climate.
2. Groundwater: Groundwater is a significant source of fresh water, but it is not infinite. Over-extraction of groundwater can
lead to depletion and contamination. It takes thousands of years for groundwater to replenish naturally.
3. Surface water: Surface water, such as rivers and lakes, is also a renewable resource, but it can be affected by climate change,
deforestation, and other human activities. Droughts and changes in precipitation patterns can reduce the availability of surface
water.
4. Contamination: Contamination from pollution, agricultural runoff, and other sources can render fresh water unusable. This
includes both groundwater and surface water.
5. Population growth: As the global population continues to grow, the demand for fresh water is increasing. This can lead to
over-extraction, contamination, and other issues that can impact the availability of fresh water.
6. Climate change: Climate change is expected to exacerbate water scarcity in many regions due to changes in precipitation
patterns, increased evaporation, and melting of glaciers and snowpacks.
In conclusion, while fresh water is a renewable resource, its availability is not infinite and can be impacted by various factors. It
is essential to manage and conserve fresh water resources sustainably to ensure their long-term availability for future generations.
So, instead of asking when the Earth will "run out" of fresh water, it is more accurate to ask how we can manage and conserve
fresh water resources to ensure their long-term sustainability.
Example 1:DNOIter 3
The Earth as a planet does not "run out" of fresh water, as it continues to cycle through the hydrological cycle, which includes
processes such as evaporation, condensation, precipitation, and infiltration. However, the availability and accessibility of fresh
water for human use are significant concerns.
The freshwater crisis is primarily driven by population growth, urbanization, agricultural expansion, and climate change, which
lead to increased demand, pollution, and variability in water availability. According to the United Nations, approximately 2.2
billion people lack access to safe drinking water services, and 4.2 billion people experience severe water scarcity for at least
one month a year.
While the Earth will not run out of fresh water, the distribution and availability of fresh water for human use are becoming increasingly challenging. To address this issue, it is crucial to implement sustainable water management practices, reduce water waste, and promote equitable access to fresh water resources.
In summary, the Earth will not run out of fresh water, but the availability and accessibility of fresh water for human
use are becoming increasingly limited, and it is essential to address this challenge through sustainable water management
practices and equitable access to resources.
Table 5: Outputs of the various DNO models across iterations for the question: when will the Earth run out of water? The output for Iter 1 is a bit too long, as shown above. We believe this could be addressed with better hyperparameter tuning or preference data. We find the initial SFT model's response to be missing important points about how the premise of the question is best addressed by highlighting access to fresh water. The last response, for Iter 3, is more informative and specific than the initial SFT response.
Example 2
Question: what data structure would you use to design a url lookup system to quickly determine whether a url exists, or
if not, its closest prefix. Explain.
Example 2: Initial SFT model output
To design a URL lookup system that can quickly determine whether a URL exists or find its closest prefix, we can use a Trie
data structure.
A Trie is a tree-like data structure that stores strings in a hierarchical manner. Each node in the Trie represents a character in
the string, and the edges represent the characters themselves. The root node represents an empty string, and each subsequent
node represents a character appended to the previous node’s string.
In our URL lookup system, we can use a Trie to store all the existing URLs. Each node in the Trie will represent a URL prefix,
and the edges will represent the characters in the URL.
To determine whether a URL exists, we can traverse the Trie starting from the root node. If we reach a node that represents a
complete URL, we know that the URL exists. If we reach a node that represents an incomplete URL, we know that the URL
does not exist.
To find the closest prefix of a URL, we can traverse the Trie starting from the root node. We can keep track of the longest prefix
we have encountered so far. Whenever we encounter a node that represents a URL prefix, we compare its length with the length
of the longest prefix we have encountered so far. If the length of the current prefix is greater, we update the longest prefix.
By using a Trie, we can efficiently determine whether a URL exists or find its closest prefix in constant time complexity. This
is because the Trie allows us to traverse the data structure in a hierarchical manner, and we can quickly determine whether a
URL exists or find its closest prefix by following the appropriate edges.
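As the Table 6 caption notes below, a Trie lookup is linear in the length of the URL rather than constant: each lookup visits one node per character. The following minimal sketch is illustrative only and is not part of any model output; the class and function names are arbitrary, and the visit counter exists purely to make the cost visible.

class Node:
    def __init__(self):
        self.children = {}      # maps a character to a child Node
        self.terminal = False   # marks the end of a stored URL

def insert(root, url):
    node = root
    for ch in url:
        node = node.children.setdefault(ch, Node())
    node.terminal = True

def lookup(root, url):
    # Returns (exists, number of node visits); visits grows with len(url).
    node, visits = root, 0
    for ch in url:
        if ch not in node.children:
            return False, visits
        node = node.children[ch]
        visits += 1
    return node.terminal, visits

root = Node()
insert(root, "example.com/docs")
print(lookup(root, "example.com/docs"))  # (True, 16): one node visit per character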
Example 2:DNOIter 1
To design a URL lookup system that quickly determines whether a URL exists or finds its closest prefix, you can use a
combination of a Trie (Prefix Tree) and a Bloom Filter. This approach provides efficient lookup and prefix matching while
maintaining a low memory footprint.
1. Trie (Prefix Tree): A Trie is a tree-like data structure where each node represents a character in the string, and the edges represent the characters themselves. The root node is empty, and each subsequent node contains a single character. This data structure allows for efficient prefix matching and string lookup.
2. Bloom Filter: A Bloom Filter is a probabilistic data structure used to test whether an element is a member of a set. It provides a fast way to check if a URL exists in the system, but it may produce false positives.
Here’s how you can use these data structures together:
1. Store all existing URLs in the Trie, where each node represents a URL prefix and the edges represent the characters in the URL.
2. Add each URL to the Bloom Filter as well. When a URL is added to the system, it is hashed using the hash functions, and the resulting bits are set in the Bloom Filter.
3. To check whether a URL exists, first query the Bloom Filter to see if the URL might be in the system. If the URL is present in the Bloom Filter, look up the URL in the Trie. If the URL is found in the Trie, it exists in the system. If the URL is not found in the Trie, it may be a prefix of an existing URL.
4. To find the closest prefix, traverse the Trie until you find the longest prefix that matches the input URL. This will give you the closest prefix.
This approach provides efficient lookup and prefix matching while maintaining a low memory footprint. The Trie allows for
efficient string lookup, and the Bloom Filter provides a fast way to check if a URL exists in the system. However, the Bloom
Filter may produce false positives, so it should be used in conjunction with the Trie to ensure accuracy.
Note: The Trie and Bloom Filter can be optimized further by using a compressed Trie or a Bloom Filter with a variable-size bit
array. These optimizations can reduce the memory footprint and improve the performance of the system.
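A minimal sketch of the Trie-plus-Bloom-Filter design described above is given here; it is not from the model output or the paper, and the class names, filter sizes, and URLs are illustrative assumptions (the Bloom filter is a toy bit array built on hashlib).

import hashlib

class BloomFilter:
    # Toy Bloom filter: num_hashes positions derived from salted SHA-256 digests.
    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # May return false positives, never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

class UrlIndex:
    # Bloom filter as a cheap exact-membership pre-check; Trie (nested dicts) as the source of truth.
    END = object()  # sentinel key marking the end of a stored URL

    def __init__(self):
        self.bloom = BloomFilter()
        self.trie = {}

    def add(self, url):
        self.bloom.add(url)
        node = self.trie
        for ch in url:
            node = node.setdefault(ch, {})
        node[self.END] = True

    def lookup(self, url):
        # Returns ("exact", url), ("prefix", longest stored prefix), or ("miss", None).
        maybe_exact = self.bloom.might_contain(url)  # False means the exact URL is definitely absent
        node, matched, best = self.trie, "", None
        for ch in url:
            if ch not in node:
                break
            node = node[ch]
            matched += ch
            if self.END in node:
                best = matched
        if maybe_exact and best == url:
            return "exact", url
        return ("prefix", best) if best else ("miss", None)

index = UrlIndex()
index.add("example.com/docs")
print(index.lookup("example.com/docs"))       # ('exact', 'example.com/docs')
print(index.lookup("example.com/docs/page"))  # ('prefix', 'example.com/docs')

Note that in this sketch the Bloom filter only answers the exact-membership question quickly; the closest-prefix query still requires the Trie walk, which is why the two structures are combined.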
Example 2:DNOIter 2
To design a URL lookup system that can quickly determine if a URL exists or find its closest prefix, you can use a data structure
called a Trie (also known as a Prefix Tree). A Trie is a tree-like data structure that stores strings as nodes, where each edge
is labeled with a single character. This data structure is particularly efficient for searching and inserting strings with common
prefixes.
Here’s how you can use a Trie to implement a URL lookup system:
1. Define a TrieNode class: each node stores a flag marking the end of a stored URL and a dictionary to store child nodes.
2. To insert a URL, traverse the Trie character by character starting from the root. If the character is not present in the current node's dictionary, create a new node and add it.
3. To search for a URL, traverse the Trie character by character. If the character is present in the current node's dictionary, move to the corresponding child node. If the character is not present, return the closest prefix by traversing the Trie upwards from the last valid node.
Here’s an example implementation in Python:
class TrieNode:
    def __init__(self, char='', parent=None):
        self.char = char              # character on the edge leading into this node
        self.parent = parent          # parent pointer, used to walk back up the Trie
        self.is_end_of_word = False   # marks the end of a stored URL
        self.children = {}            # maps characters to child TrieNodes

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, url):
        current_node = self.root
        for char in url:
            if char not in current_node.children:
                current_node.children[char] = TrieNode(char, current_node)
            current_node = current_node.children[char]
        current_node.is_end_of_word = True

    def search(self, url):
        current_node = self.root
        for char in url:
            if char in current_node.children:
                current_node = current_node.children[char]
            else:
                return self.find_closest_prefix(current_node)
        return url if current_node.is_end_of_word else self.find_closest_prefix(current_node)

    def find_closest_prefix(self, current_node):
        # Walk back up to the nearest node that marks the end of a stored URL.
        while current_node is not None and not current_node.is_end_of_word:
            current_node = current_node.parent
        if current_node is None:
            return None
        # Rebuild the prefix by following parent pointers to the root.
        prefix = ''
        while current_node is not None:
            prefix = current_node.char + prefix
            current_node = current_node.parent
        return prefix
This implementation allows you to quickly determine if a URL exists or find its closest prefix by searching the Trie. The insert method inserts URLs into the Trie, and the search method searches for a URL or returns its closest prefix if it doesn't exist.
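For reference, a short usage sketch of the class above; the URLs and the commented outputs are illustrative examples only.

trie = Trie()
trie.insert("example.com/a")
trie.insert("example.com/a/b")

print(trie.search("example.com/a"))      # 'example.com/a'   (exact match)
print(trie.search("example.com/a/b/c"))  # 'example.com/a/b' (closest stored prefix)
print(trie.search("other.com"))          # None              (no stored prefix)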
Example 2:DNOIter 3
To design a URL lookup system that can quickly determine if a URL exists or find its closest prefix, you can use a data structure
called the Trie (also known as Prefix Tree). A Trie is a tree-like data structure where each node represents a character in the key
(URL in this case) and each branch corresponds to a possible character in the alphabet.
Here’s how you can implement a URL lookup system using a Trie:
1. Define a TrieNode with a dictionary of child nodes and a flag indicating whether the URL exists.
2. To insert a URL, traverse the Trie from the root following the characters of the URL. If a character doesn't exist in the Trie, create a new node for it.
3. Mark the final node of each inserted URL so that lookups can distinguish complete URLs from mere prefixes.
To search for a URL or find its closest prefix, traverse the Trie character by character, keeping track of the last node that marks the end of a stored URL; if the traversal fails, or the final node is not marked, return the prefix ending at that last marked node, otherwise return the URL itself.
Table 6: Outputs of the various DNO models across iterations for an interview question relating to the design of a URL lookup system. The last response, from Iter 3, is more informative and clearer, and doesn't contain misleading information (searching a Trie runs in linear complexity in the length of the strings, not constant). The response for Iter 2 contains an implementation of a Trie, which may be unnecessary because the user did not ask for it.
Example 3
Question: What factors led to the outbreak of ww1
Example 3: Initial SFT model output
There were several factors that led to the outbreak of World War I:
1. Nationalism: The rise of nationalism in Europe led to a desire for territorial expansion and the desire to unite people of
the same ethnicity or language under one nation. This led to tensions between countries and a desire to assert dominance over
neighboring territories.
2. Imperialism: The scramble for colonies in Africa and Asia led to competition between European powers, which further
fueled tensions and rivalries.
3. Militarism: The arms race between European powers led to a buildup of military forces and weapons, which created a sense
of insecurity and fear among neighboring countries.
4. Alliances: The complex system of alliances between European powers created a web of interdependencies that made it
difficult for any one country to remain neutral in the event of a conflict.
5. Assassination of Archduke Franz Ferdinand: The assassination of Archduke Franz Ferdinand of Austria-Hungary by a
Serbian nationalist in 1914 was the immediate trigger for the war, but it was the culmination of these other factors that led to
the outbreak of the war.
Example 3:DNO-More-DataIter 1
The outbreak of World War I was the result of a complex interplay of factors, including:
1. Imperialism: The competition for overseas colonies fueled rivalries. The scramble for Africa and the race to acquire overseas territories created a sense of competition and mistrust among nations.
2. Militarism: The arms race among the European powers led to a buildup of military power. The development of new technologies, such as machine guns, artillery, and submarines, increased the destructive potential of warfare.
3. Alliances: A system of entangling alliances meant that a conflict in one region could escalate into a larger war. The Triple Entente (France, Russia, and Britain) and the Triple Alliance (Germany, Austria-Hungary, and Italy) were the main alliances that formed the basis of the conflict.
4. Nationalism: Rising nationalist sentiment across Europe contributed to the outbreak of war. The desire for self-determination and independence led to conflicts and tensions among different ethnic groups.
5. Assassination of Archduke Franz Ferdinand: The assassination of the Archduke by a Serbian nationalist in 1914 was the immediate trigger for the war. The complex web of alliances and rivalries among European powers led to a series of diplomatic maneuvers and ultimatums, which ultimately resulted in the declaration of war.
6. Economic rivalry: Economic competition intensified in the years leading up to the war. The economic rivalry between Germany and Britain, particularly in the context of colonial competition, added to the existing tensions.
7. Political instability: Political upheavals, such as the assassination of King Alexander I of Serbia in 1903, created a sense of uncertainty and instability. These political upheavals contributed to the overall atmosphere of tension and conflict.
8. Diplomatic failures: Failed diplomacy led to the breakdown of communication and trust. The inability of leaders to find peaceful solutions to disputes contributed to the escalation of tensions and the eventual outbreak of war.
9. Miscommunication: Misunderstandings between governments also played a role in the outbreak of war. The lack of clear communication and misunderstandings between different countries led to a series of missteps and escalations.
10. Interconnectedness: Europe had become a tightly linked system where conflicts in one region could have a domino effect on other regions. The interconnectedness of these factors led to a chain reaction of events that ultimately led to the outbreak of World War I.
Example 3:DNO-More-DataIter 6
The outbreak of World War 1, which started on July 28, 1914, resulted from a complex web of events and circumstances that
had been developing for years prior. Here are some of the key factors that led to the war:
1. Militarism: The major powers had built up large armies and navies and massed forces along their borders. This led to a sense of insecurity and a belief that preventive war was necessary to secure national interests.
2. Alliances: Europe was divided into two blocs, the Triple Entente (France, Russia, and Britain) and the Triple Alliance (Germany, Austria-Hungary, and Italy). These alliances created a complex web of obligations that meant if one member was attacked, all members were obliged to come to their aid.
3. Imperialism: Competition for colonies and spheres of influence heightened tensions between the major powers.
4. Nationalism: The assassination of Archduke Franz Ferdinand of Austria, which triggered the war, was a result of nationalist tensions in the Balkans.
5. The security dilemma: When one country strengthened its military, its rivals responded in kind, which made the first country feel more insecure, leading to a spiral of actions and reactions that escalated the situation.
6. The assassination of Archduke Franz Ferdinand by Gavrilo Princip, a member of the Black Hand, a Serbian nationalist group, provided the immediate trigger for the war.
7. The July Crisis: A series of ultimatums and mobilizations following the assassination led to Austria-Hungary declaring war on Serbia.
These factors combined to create a powder keg that was waiting for a spark to ignite the war. Once the spark was lit, the
complex web of alliances meant that the conflict quickly escalated into a global war.
Table 7: Outputs of the first and last DNO-More-Data iterations for the third example question: What factors led to the outbreak of ww1? The last response, from Iter 6, has a higher information density; it recalls more key facts and entities, including the specific date of the start of WW1, the Triple Entente, the Triple Alliance, Gavrilo Princip, and the Black Hand. Iteration 1 also contains some of this information, but is too wordy. The initial SFT model's response seems a lot more superficial.