Extracted Text
2201.02373v11.pdf
Mirror Learning: A Unifying Framework of Policy Optimisation
Jakub Grudzien Kuba
1
Christian Schroeder de Witt
1
Jakob Foerster
1
Abstract
Modern deep reinforcement learning (RL) al-
gorithms aremotivatedby either the gener-
alised policy iteration (GPI) or trust-region learn-
ing (TRL) frameworks. However, algorithms
thatstrictly respectthese theoretical frameworks
have proven unscalable. Surprisingly, the only
known scalable algorithms violate the GPI/TRL
assumptions, e.g. due to required regularisa-
tion or other heuristics. The current explana-
tion of their empirical success is essentially “by
analogy”: they are deemed approximate adapta-
tions of theoretically sound methods. Unfortu-
nately, studies have shown that in practice these
algorithms differ greatly from their conceptual
ancestors. In contrast, in this paper we intro-
duce a novel theoretical framework, namedMir-
ror Learning, which provides theoretical guar-
antees to a large class of algorithms, including
TRPO and PPO. While the latter two exploit the
flexibility of our framework, GPI and TRL fit in
merely as pathologically restrictive corner cases
thereof. This suggests that the empirical perfor-
mance of state-of-the-art methods is a direct con-
sequence of their theoretical properties, rather
than of aforementioned approximate analogies.
Mirror learning sets us free to boldly explore
novel, theoretically sound RL algorithms, a thus
far uncharted wonderland.
1. Introduction
The generalised policy iteration (Sutton & Barto,,
GPI) and trust-region learning (Schulman et al.,,
TRL) frameworks lay the foundations for the design of
the most commonly used reinforcement learning (RL) al-
gorithms. At each GPI iteration, an RL agent first eval-
uates itspolicyby computing a scalarvalue function, and
then updates its policy so as to maximise the value function
1
University of Oxford. Correspondence to: Jakub Grudzien
Kuba<jakub.grudzien@new.ox.ac.uk>.
Proceedings of the39
th
International Conference on Machine
Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copy-
right 2022 by the author(s).
at every environment state. This procedure serves well for
Markov Decision Problems (Bellman,, MDPs) with
small state spaces, as it is guaranteed to produce a new pol-
icy whose value at every state improves monotonically over
the old one. However, the sizes of state spaces that many
practical problem settings constitute are intractable to ex-
act implementations of GPI. Instead, large scale settings
employ function approximation and sample based learning.
Unfortunately, such an adoption of GPI has proven unsta-
ble (Mnih et al.,), which has necessitated a number of
adjustments, such as replay buffers, target networks, etc., to
stabilise learning (Van Hasselt et al.,). In their days,
these heuristics have been empirically sucessful but were
not backed up by any theory and required extensive hyper-
parameter tuning (Mnih et al.,).
Figure 1.Known RL frameworks and algorithms as points in the
infinite space oftheoretically soundmirror learning algorithms.
Instead, TRL improves the robustness of deep RL meth-
ods by optimising asurrogateobjective, which restricts the
policy update size at each learning iteration while preserv-
ing monotonic improvement guarantees (Schulman et al.,
2015). To this end it introduces a notion of distance be-
tween policies, e.g. through evaluating the maximal KL-
divergence. Unfortunately, these types of measures do not
scale to large MDPs that TRL was meant to tackle in the
first place (since small problems can be solved by GPI).
Nevertheless, TRL’s theoretical promises and intuitive ac-
cessibility inspired a number of algorithms based on heuris-
tic approximations thereof. This paradigm shift led to sub-
stantial empirical success: First, inTrust-Region Policy Op-
timization(Schulman et al.,, TRPO), a hard constraint
mechanism on the policy update size was introduced. Sec-
ond,Proximal Policy Optimization(Schulman et al.,,
PPO) replaced the hard constraint with theclipping objec-
Mirror Learning
tivewhich, intuitively, disincentivises large policy updates.
Since they were published, these algorithms have both been
applied widely, resulting in state-of-the-art (SOTA) perfor-
mance on a variety of benchmark, and even real-world,
tasks (Schulman et al.,;;,).
There is a stark contrast between the empirical success and
the lack of theoretical understanding, which is largely lim-
ited to “proof by analogy”: For example, PPO is regarded
assoundsince it approximates an algorithm compliant with
TRL. However, recent studies concluded that PPO arbitrar-
ily exceeds policy update size constraints, thus fundamen-
tally violating TRL principles (Wang et al.,;
et al.,), which shows that PPO’s empirical success
does not follow from TRL. On a higher level, this reveals a
concerning lack of theoretical understanding of the perhaps
most widely used RL algorithm.
To reconnect theory with practice, in this paper we intro-
duce a novel theoretical framework, namedMirror Learn-
ing, which provides theoretical guarantees to a large class
of known algorithms, including TRPO (Schulman et al.,
2015), PPO (Schulman et al.,), and MDPO (Tomar
et al.,), to name a few, as well as to myriads of al-
gorithms that are yet to be discovered. Mirror learning is
a general, principled policy-learning framework that pos-
sesses monotonic improvement and optimal-policy conver-
gence guarantees, under arbitrary update-size constraints.
Intuitively, a mirror-learning policy update maximises the
current value function while keeping the, arbitrarily speci-
fied, update cost (calleddrift) small. The update is further
constrained within aneighbourhoodof the current policy,
that can also be specified arbitrarily.
Since TRL and GPI were the only anchor points for theo-
retically sound policy optimisation, prior unsuccessful at-
tempts to proving the soundness of PPO and other algo-
rithms tried to shoe-horn them into the overly narrow con-
fines of these frameworks (Liu et al.,;,
2020;,). Mirror learning shows that
this was a flawed approach: Rather than shoe-horning the
algorithms, we radically expand the space of theoretically
sound methods, naturally covering PPO et al in their cur-
rent, practical formulations. Our work suggests that the
empirical performance of these state-of-the-art methods is
a direct consequence of their theoretical properties, rather
than of aforementioned approximate analogies. Rather than
being limited by the confines of two singular anchor points
(GPI/TRL), mirror learning sets us free to explore an end-
less space of possible algorithms, each of which is already
endowed with theoretical guarantees.
We also illustrate the explanatory power of mirror learning
and use it to explain a number of thus farunexplainedob-
servations concerning the performance of other algorithms
in the literature.
Finally, we show that mirror learning allows us to view pol-
icy optimisation as a search on a directed graph, where the
total path-weight between any two nodes in an optimisa-
tion path is upper bounded by the optimality gap between
them, which we confirm experimentally. We also analyse
proof-of-concept instantiations of mirror learning that use
a variety of different neighbourhood and drift functions.
2. Background
In this section, we introduce the RL problem formulation
and briefly survey state-of-the-art learning protocols.
2.1. Preliminaries
We consider a Markov decision process (MDP) defined by
a tuplexS,A, r, P, γ, dy. Here,Sis a discrete state space,
Ais a discrete action space
1
,r:SˆAÑ r´Rmax, Rmaxs
is the bounded reward function,P:SˆAˆSÑ r0,1s
is the probabilistic transition function,γP r0,1qis the
discount factor, anddPPpSq(herePpXqdenotes the
set of probability distributions over a setX) is the initial
state distribution. At time steptPN, the agent is at a
state st, takes an action ataccording to its stationary policy
πp¨|stq PPpAq, receives a rewardrpst,atq, and moves to
the state st`1, whose probability distribution isPp¨|st,atq.
The whole experience is evaluated by thereturn, which is
the discounted sum of all rewards
r
γ
8fi
8
ÿ
t“0
γ
t
rpst,atq.
The state-action and state value functions that evaluate the
quality of states and actions, by providing a proxy to the
expected return, are given by
Qπps, aqfiEs1:8„P,a1:8„π
“
r
γ
8|s0“s,a0“a
‰
,
VπpsqfiEa0„π,s1:8„P,a1:8„π
“
r
γ
8|s0“s
‰
,
respectively. Theadvantage function, defined as
Aπps, aqfiQπps, aq ´Vπpsq,
estimates the advantage of selecting one action over an-
other. The goal of the agent is to learn a policy that max-
imises theexpected return, defined as a function ofπ,
ηpπqfiEs0„d,a0:8„π,s0:8„P
“
r
γ
8
‰
“Es„ρπ,a„π
“
rps,aq
‰
.
Here,ρπpsqfi
8ř
t“0
γ
t
Prpst“s|πqis the (improper)
marginal state distribution. Interestingly, the set of so-
lutions to this problem always contains theoptimal poli-
cies—policiesπ
˚
, for whichQ
π
˚ps, aqfiQ
˚
ps, aq ě
Qπps, aqholds for anyπPΠfi
Ś
sPS
PpAq,sPS, aP
1
Our results extend to any compact state and action spaces.
However, we work in the discrete setting in this paper for clarity.
Mirror Learning
A. Furthermore, the optimal policies are the only ones that
satisfy theoptimality equation(Sutton & Barto,)
π
˚
p¨|sq “arg max
¯πp¨|sqPPpAq
Ea„¯π
“
Q
π
˚ps,aq
‰
,@sPS.
In the next subsections, we survey the most fundamental
and popular approaches of finding these optimal policies.
2.2. Generalised Policy Iteration
One key advantage of thegeneralised policy iteration(GPI)
framework is its simplicity. Even though the policy influ-
ences both the rewards and state visitations, GPI guarantees
that simply reactinggreedilyto the value function,
πnewp¨|sq “arg max
¯πp¨|sqPPpAq
Ea„¯π
“
Qπoldps,aq
‰
,@sPS(1)
guarantees that the new policy obtains higher expected re-
turns at every state,i.e.,Vπnewpsq ěVπoldpsq,@sPS. More-
over, this procedure converges to the set of optimal policies
(Sutton & Barto), which can be seen intuitively by
substituting a fixed-point policy into Equation (1).
In settings with small, discrete action spaces GPI can be
executed approximately without storing the policy variable
πby responding greedily to the state-action value func-
tion. This gives rise tovalue-based learning(Sutton &
Barto,), where the agent learns a Q-function with the
Bellman-max update
Qnewps, aq “rps, aq `γ¨Es
1
„P
“
max
a
1
PA
Qoldps
1
, a
1
q
‰
,
which is known to converge toQ
˚
(Watkins & Dayan,
1992), and has inspired design of a number of methods
(Mnih et al.,;,).
Another, approximate, implementation of GPI is through
the policy gradient (PG) algorithms (Sutton et al.,).
These are methods which optimise the policyπθby pa-
rameterising it withθ, and updating the parameter in the
direction of the gradient of the expectd return, given by
∇θηpπθq|θ“θold
“Es„ρπ
θ
old
”
∇θEa„πθ
“
Qπθ
old
ps,aq
‰ˇ
ˇ
θ“θold
ı
,
which is the gradient of the optimisation objective of GPI
from Equation (1) weighted by ρπθ
old
. An analogous re-
sult holds for (continuous) deterministic policies (Silver
et al.,;,). Thus, PG based algo-
rithms approximately solve the GPI step, with a policy in
the neighbourhood ofπθold
, provided the step-sizeαą0
in the updateθnew“θ`α∇θηpπθq|θ“θold
is sufficiently
small. PG methods have played a major role in applications
of RL to the real-world settings, and especially those that
involve continuous actions, where some value-based algo-
rithms, like DQN, are intractable (Williams,;
& Bartlett,;,).
2.3. Trust-Region Learning
In practice, policy gradient methods may suffer from the
high variance of the PG estimates and training instabil-
ity (Kakade & Langford,;,). Trust-
region learning(TRL) is a framework that aims to solve
these issues. At its core lies the following policy update
πnew“arg max
¯πPΠ
Es„ρπ
old
,a„¯π
“
Aπoldps,aq
‰
´CD
max
KLpπold,¯πq,
whereC“4γmax
s,a
|Aπold
ps, aq|{p1´γq
2
.(2)
It was shown by2015) that this update
guarantees the monotonic improvement of the return,i.e.,
ηpπnewq ěηpπoldq. Furthermore, the KL-penalty in the
above objective ensures that the new policy stays within
the neighbourhood ofπold, referred to as thetrust region.
This is particularly important with regard to the instability
issue of the PG-based algorithms (Duan et al.,).
Although the exact calculation of the max-KL penalty is in-
tractable in settings with large/continuous state spaces, the
algorithm can be heuristically approximated, e.g. through
Trust Region Policy Optimization(Schulman et al.,,
TRPO), which performs a constrained optimisation update
πnew“arg max
¯πPΠ
Es„ρπ
old
,a„¯π
“
Aπold
ps,aq
‰
subject toEs„ρπ
old
“
DKL
`
πoldp¨|sq, πnewp¨|sq
˘‰
ďδ,
whereδą0. Despite its deviation from the original theory,
empirical results suggest that TRPO approximately main-
tains the TRL properties. However, in order to achieve a
simpler TRL-based heuristic,2017) intro-
ducedProximal Policy Optimization(PPO) which updates
the policy by
πnew“arg max
¯πPΠ
Es„ρπ
old
,a„πold
“
L
clip
‰
, (3)
L
clip
“min
´
rp¯πqAπold
ps, aq,clip
`
rp¯πq,1˘ϵ
˘
Aπold
ps, aq
¯
,
where for givens, a, rp¯πqfi¯πpa|sq{πpa|sq. Theclipoper-
ator truncates rp¯πqto1´ϵ(or1`ϵ), if it is below (or above)
the threshold interval. Despite being motivated from a TRL
perspective, PPO violates its very core principles by fail-
ing to constrain the update size—whether measured either
by KL divergence, or by the likelihood ratios (Wang et al.,
2020;,). Nevertheless, PPO’s ability
to stabilise policy training has been demonstrated on a wide
variety of tasks, frequently resulting in SOTA performance
(Henderson et al.,;,). This begs
the question how PPO’s empirical performance can be jus-
tified theoretically. Mirror learning, which we introduce in
the next section, addresses this issue.
3. Mirror Learning
In this section we introduceMirror Learning, state its the-
oretical properties, and explore its connection to prior RL
theory and methodology.
Mirror Learning
3.1. The Framework
We start from the following definition.
Definition 3.1.Adrift functionalDis a map
D: ΠˆSÑ tDπp¨|sq:PpAq ÑRu,
such that for allsPS, andπ,¯πPΠ, writingDπp¯π|sqfor
Dπ
`
¯πp¨|sq|s
˘
, the following conditions are met
1.Dπp¯π|sq ěDπpπ|sq “0(nonnegativity),
2.Dπp¯π|sqhas zero gradient
2
with respect to¯πp¨|sq,
evaluated at¯πp¨|sq “πp¨|sq(zero gradient).
Letν
¯π
πpsq PPpSqbe a state distribution that can depend
onπand¯π. ThedriftD
ν
of¯πfromπis defined as
D
ν
πp¯πqfiEs„ν
¯π
π
“
Dπp¯π|sq
‰
whereν
π
¯πis such that the expectation is continuous inπand
¯πand monotonically non-decreasing in the total variation
distance between them. The drift ispositiveifD
ν
πp¯πq “0
implies¯π“π, andtrivialifD
ν
πp¯πq “0,@π,¯πPΠ.
We would like to highlight that drift is not the Bregman
distance, associated with mirror descent (Nemirovskij &
Yudin,;,), as we do not re-
quire it to be (strongly) convex, or differentiable every-
where. All we require are, the much milder, continuity
and Gˆateaux-differentiability (Gateaux,;
Nashed,) of the drift functional at¯πp¨|sq “πp¨|sq.
We introduce one more concept whose role is to gener-
ally account for explicit update-size constraints. Such con-
straints reflect a learner’s risk-aversity towards subtantial
changes to its behaviour, like in TRPO, or are dictated by
an algorithm design, like the learning rate in PG methods.
Definition 3.2.We say thatN: ΠÑPpΠqis aneigh-
bourhood operator, wherePpΠqis the power set ofΠ, if
1. (continuity),
2. Npπqis a compact set(compactness),
3. χ: ΠˆΠÑR, such that@πP
Π, there existsζą0, such thatχpπ,¯πq ďζimplies
¯πPNpπq(closed ball).
Thetrivialneighbourhood operator isN”Π.
LetD
ν
be a drift, andπ,¯πbe policies. Suppose that
βπPPpSqis a state distribution, referred to as asam-
pling distribution, such thatβπpsq ą0,@sPS. Then, the
mirror operatortransforms the value function ofπinto the
following functional of¯π,
rM
πnew
D
Qπold
spsq “Ea„πnew
“
Qπold
ps,aq
‰
´Dπold
pπnew|sq
2
More precisely, all its Gˆateaux derivatives are zero.
Note that, in the above definition,Qπcan be equivalently
replaced byAπ, as this only subtracts a constantVπpsqin-
dependent of¯π. We will use this fact later. As it turns out,
despite the appearance of the drift penalty, simply acting to
increase the mirror operator suffices to guarantee the pol-
icy improvement, as summarised by the following lemma,
proved in Appendix.
Lemma 3.3.Letπoldandπnewbe policies. Suppose that
rM
πnew
D
Vπold
spsq ě rM
πold
D
Vπold
spsq,@sPS.(4)
Then,πnewis better thanπold, so that for every states,
Vπnew
psq ěVπold
psq.
Of course, the monotonic improvement property of the ex-
pected return is a natural corollary of this lemma, as
ηpπnewq “Es„d
“
Vπnew
psqs
ěEs„d
“
Vπold
psqs “ηpπoldq.
Hence, optimisation of the mirror operator guarantees the
improvement of the new policy’s performance. Condition
(4), however, is represented by|S|inequalities and solving
them may be intractable in large state spaces. Hence, we
shall design a proxy objective whose solution simply satis-
fies Condition (4), and admits Monte-Carlo estimation (for
practical reasons). For this purpose, we define the update
rule of mirror learning.
Definition 3.4.Letπoldbe a policy. Then,mirror learning
updates the policy by
πnew“arg max
¯πPNpπoldq
Es„βπ
”
“
M
¯π
DVπold
‰
psq
ı
. (5)
This, much simpler and clearer, problem formulation, is
also robust to settings with large state spaces and parame-
terised policies. Being an expectation, it can be approxi-
mated by sampling with the unbiased “batch” estimator
1
|B|
ÿ
s,aPB
”
¯πpa|sq
πoldpa|sq
Qπoldps,aq ´
ν
¯π
πold
psq
βπoldpsq
Dπoldp¯π|sq
ı
.(6)
The distributionν
¯π
πold
can be chosen equal toβπold, in which
case the fraction from the front of the drift disappears, sim-
plifying the estimator. Note also that one may chooseβπold
independently ofπold. For example, it can be some fixed
distribution overS, like the uniform distribution. Hence,
the above update supportsboth on-policy and off-policy
learning. In the latter case, the fraction¯πpa|sq{πoldpa|sqis
replaced by¯πpa|sq{πhistpa|sq, whereπhistis the policy used
to sample the pairps,aqand insert it to a replay buffer (for
more details see Appendix). Such a Monte-Carlo objec-
tive can be optimised with respect to¯πparameters by, for
example, a few steps of gradient ascent. Most importantly,
in its exact form, the update in Equation (5) guarantees that
the resulting policy satisfies the desired Condition (4).
Mirror Learning
Lemma 3.5.Letπoldbe a policy andπnewbe obtained from
the mirror update of Equation (5). Then,
rM
πnew
D
Vπold
spsq ě rM
πold
D
Vπold
spsq,@sPS.
Hence,πnewattains the properties provided by Lemma.
For proof, see Appendix. Next, we use the above results
to characterise the properties of mirror learning. The im-
provement and convegence properties that it exhibits not
only motivate its usage, but may also contribute to ex-
plaining the empirical success of various widely-used al-
gorithms. We provide a proof sketch below, and a detailed
proof in Appendix.
Theorem 3.6(The Fundamental Theorem of Mirror Learn-
ing).LetD
ν
be a drift,Nbe a neighbourhood operator,
and the sampling distributionβπdepend continuously on
π. Letπ0PΠ, and the sequence of policiespπnq
8
n“0
be
obtained by mirror learning induced byD
ν
,N, andβπ.
Then, the learned policies
1.
ηpπn`1q ěηpπnq `Es„d
«
ν
πn`1
πnpsq
βπn
psq
Dπn
pπn`1|sq
ff
,
2.
lim
nÑ8
Vπn
“V
˚
,
3.
lim
nÑ8
ηpπnq “η
˚
,
4. ω-limit set consists of the optimal policies.
Proof sketch.We split the proof into four steps. InStep
1, we use Lemmas
value functionspVπk
qkPNconverges. InStep 2, we show the
existance of limit points¯πofpπkqkPN, and show that they
are fixed points of the mirror learning update, which we do
by contradiction. The most challengingStep 3, is where we
prove (by contradiction) that¯πis a fixed point of GPI. In
the proof, we supposed that¯πis not a fixed point of GPI,
and use the drift’s zero-gradient property (Definition)
to show that one could slightly perturb¯πto obtain a policy
π
1
, which corresponds to a higher mirror learning objective.
This contradicts¯πsimultaneously being a fixed point of
mirror learning.Step 4finalises the proof, by recalling that
fixed points of GPI are optimal policies.
Hence, any algorithm whose update takes the form of
Equation (5) improves the return monotonically, and con-
verges to the set of optimal policies. This result provides
RL algorithm designers with a template. New instances of
it can be obtained by altering the driftD
ν
, the neighbour-
hood operatorN, and the sampling distribution function
βπ. Indeed, in the next section, we show how some of the
most well-known RL algorithms fit this template.
4. Mirror Learning View of RL Phenomena
We provide a list of RL algorithms in their mirror learning
representation in Appendix.
Generalised Policy IterationFor the trivial driftD
ν
”
0, and the trivial neighbourhood operatorN”Π, the mir-
ror learning update from Equation (5) is equivalent to
πnew“arg max
¯πPΠ
Es„βπ
old
“
Ea„¯π
“
Qπoldps,aq
‰‰
.
As in this case the maximisation is unconstrained, and the
expectation over s„βπold
is monotonically increasing in
the individual conditionalEa„¯π
“
Qπold
ps,aq
‰
, the maximi-
sation distributes across states
πnewp¨|sq “arg max
¯πp¨|sqPPpAq
Ea„¯π
“
Qπold
ps,aq
‰
,
which is exactly the GPI update, as in Equation (1). Hence,
all instances of GPI,e.g.,policy iteration (Sutton & Barto,
2018), are special cases of mirror learning, and thus in-
herit its improvement and convergence properties. Fur-
thermore, neural implementations, like A3C (Mnih et al.,
2016), maintain these qualities approximately.
Trust-Region LearningLet us choose the drift opera-
torDto be the scaled KL-divergence, so thatDπp¯π|sq “
CπDKL
`
πp¨|sq,¯πp¨|sq
˘
, whereCπis a constant that de-
pends onπ. Further, in the construction of the driftD
ν
πp¯πq,
let us setν
¯π
πpsq “δps´smaxq, whereδis the Dirac-
delta distribution, andsmaxis the state at which the KL-
divergence betweenπand¯πis largest. For the neighbour-
hood operator, we choose the trivial one. Lastly, for the
sampling distribution, we chooseβπ“¯ρπ, the normalised
version ofρπ. Then, a mirror learning step maximises
Es„¯ρπ
old
”
Ea„πnew
“
Aπold
ps,aq
‰
ı
´Cπold
D
max
KLpπold, πnewq,
which, for appropriately chosenCπold, is proportional (and
thus equivalent) to the trust-region learning update from
Equation (2). Hence, the monotonic improvement of trust-
region learning follows from Theorem, which also im-
plies its convergence to the set of optimal policies.
TRPOThe algorithm is designed to approximate trust-
region learning, and so its mirror-learning representation is
similar. We make the following changes: set the drift oper-
ator toD
ν
”0, and choose the neighbourhood operator to
Mirror Learning
theaverage-KL ball. Precisely,
Npπq “
␣
¯πPΠ|Es„¯ρπ
“
DKL
`
πp¨|sq,¯πp¨|sq
˘‰
ďδ
(
.
The resulting mirror learning update is the learning rule of
the TRPO algorithm. As a result, even though TRPO was
supposed to only approximate the monotonic trust-region
learning update (Schulman et al.,), this analysis shows
that TRPO has monontonic convergence guarantees.
PPOWe analyse PPO through unfolding the clipping ob-
jectiveL
clip
(Equation (3)). For a given sPS, it equals
Ea„πold
“
min
`
rp¯πqAπoldps,aq,clip
`
rp¯πq,1˘ϵ
˘
Aπoldps,aq
˘‰
.
By recalling that rp¯πq “¯πpa|sq{πoldpa|sq, we use impor-
tance sampling (Sutton & Barto,), and the trick of
“adding and subtracting”, to rewriteL
clip
as
Ea„¯π
”
Aπold
ps,aq
ı
´Ea„πold
“
rp¯πqAπold
ps,aq
´min
`
rp¯πqAπold
ps,aq,clip
`
rp¯πq,1˘ϵ
˘
Aπold
ps,aq
˘‰
.
Going forward, we focus on the expectation that is sub-
tracted. First, we can replace theminoperator withmax,
with the identity -minfpxq “max
“
´fpxq
‰
, as follows
Ea„πold
“
rp¯πqAπoldps,aq
`max
`
´rp¯πqAπold
ps,aq,´clip
`
rp¯πq,1˘ϵ
˘
Aπold
ps,aq
˘‰
.
Then, we move rp¯πqAπoldps,aqinside themax, and obtain
Ea„πold
“
max
`
0,
“
rp¯πq ´clip
`
rp¯πq,1˘ϵ
˘‰
Aπoldps,aq
˘‰
,
which can be simplified as
Ea„πold
”
ReLU
´
“
rp¯πq ´clip
`
rp¯πq,1˘ϵ
˘‰
Aπold
ps,aq
¯ı
(7)
Notice now that this expression is always non-negative,
due to the presence of the ReLU function (Fukushima &
Miyake,). Furthermore, for ¯πsufficently close to
πold,i.e.,so that rp¯πq P r1´ϵ,1`ϵs, the clip operator
reduces to the identity function, and so Equation (7) is con-
stant and zero. Hence, it is zero at¯π“πold, and has a
zero gradient. Therefore, the expression in Equation (7)
is a drift functional of¯πp¨|sq. This, together with taking
βπold
”ν
¯π
πold
”¯ρπold
, and the trivial neighbourhood oper-
atorN”Π, shows that PPO, while it was supposed to
be a heuristic approximation of trust-region learning, is by
itself a rigorous instance of mirror learning. Thus, it in-
herits the monotonic improvement and convergence prop-
erties, which helps explain its great performance.
Interestingly, during the developmnent of PPO, a variant
more closely related to TRL was considered (Schulman
et al.,). Known as PPO-KL, the algorithm updates
the policy to¯π“πnewto maximise
Es„ρπ
old
,a„¯π
“
Aπoldps,aq
‰
´τDKLpπold,¯πq,
and then scalesτup or down by a constant (typically1.5),
depending on whether the KL-divergence induced by the
update exceeded or subceeded some target level. Intrigu-
ingly, according to the authors this approach, although
more closely-related to TRL, failed in practice. Mir-
ror learning explains this apparent paradox. Namely, the
rescaling scheme ofτcauses discontinuity of the penalty
term, when viewed as a function ofπoldand¯π. This pre-
vents the penalty from being a valid drift, so unlike PPO
this version, PPO-KL, is not an instance of mirror learn-
ing. Recently,2020) introduced PPO-KL with a
fixedτ, allowing for continuity of the KL-penalty. Such an
algorithm is an instance of mirror learning, independently
of whether theforwardorbackwardKL-divergence is em-
ployed. And indeed, the authors find that both versions of
the algorithms result in strong empirical performance, fur-
ther validating our theory.
5. Related Work
The RL community has long worked on the development
of theory to aid the development of theoretically-sound
techniques. Perhaps, one of the greatest achievements
along this thread is the development of trust-region learn-
ing (Schulman et al.,, TRL)—a framework that al-
lows for stable training of monotonically improving poli-
cies. Unfortunately, despite its great theoeretical guaran-
tees, TRL is intractable in most practical settings. Hence,
the RL community focused on the development of heuris-
tics, like TRPO (Schulman et al.,) and PPO (Schul-
man et al.,), that approximate it while trading off
theoretical guarantees of TRL for practicality. As these
methods established new state-of-the-art performance on a
variety of tasks (Duan et al.,;,),
the conceptual connection to TRL has been considered the
key to success for many algorithms (Arjona-Medina et al.,
2018;,). However, recent works have
shown that PPO is prone to breaking the core TRL princi-
ples of constraining the update size (Wang et al.,;
gstrom et al.,), and thus revealed an existing chasm
between RL practice and understanding—a problem that
mirror learning resolves.
Trust-region algorithms, although original, inspire connec-
tions between RL and optimisation, where the method of
mirror descenthas been studied (Nemirovskij & Yudin,
1983). This idea has recently been extended upon in
(Tomar et al.,), where Mirror Descent Policy Opti-
mization(MDPO) was proposed—the algorithm optimises
the policy with mirror ascent, where the role of the Breg-
man distance is played by the KL-divergence. The method
Mirror Learning
Figure 2.An intuitive view on the policy DAG and initial steps
of mirror learning. A policy vertex has a neighbour, within its
neighbourhood, which improves the return.
has been shown to achieve great empirical performance
(which is also implied by mirror learning). Notably, a new
stream of research in regularised RL is arising, where the
regulariser is subsumed into the Bregman distance. For ex-
ample,2017) have shown that, in the average-
reward setting, a variant of TRPO is an instance of mirror
descent in a regularised MDP, and converges to the optimal
policy.2020) have generalised TRPO to han-
dle regularised problems, and derived convergence rates for
their methods in the tabular case.2021) and
et al.2021) have proposed mirror-descent algorithms that
solve regularised MDPs with convex regularisers, and also
provided their convergence properties. Here, we would like
to highlight that mirror learning is not a method of solving
regularised problems through mirror descent. Instead, it
is a very general class of algorithms that solve the classical
MDP. The termmirror, however, is inspired by the intuition
behind mirror descent, which solves the image of the orig-
inal problem under themirror map(Nemirovskij & Yudin,
1983)—similarly, we defined the mirror operator.
Mirror learning is also related toProbability Functional
Descent(Chu et al.,, PFD). Although PFD is a gen-
eral framework of probability distribution optimisation in
machine learning problems, in the case of RL, it is an in-
stance of mirror learning—the one recovering GPI. Lastly,
the concepts of mirror descent and functional policy rep-
resentation are connected in a concurrent work of
et al.2021), who show how the technique of mirror de-
scent implements the functional descent of parameterised
policies. This approach, although powerful, fails to capture
some algorithms, including PPO, which we prove to be an
instance of mirror learning. The key to the generalisation
power of our theory is its abstractness, as well as simplicity.
6. Graph-theoretical Interpretation
In this section we use mirror learning to make a surprising
connection between RL and graph theory. We begin by
introducing the following definition of a particulardirected
acyclic graph(DAG).
Definition 6.1(Policy DAG).LetD
ν
be a positive drift,
andNbe a neighbourhood operator. Then, thepolicy DAG
GpΠ,D
ν
,Nqis a graph where
• Π,
•pπ1, π2qis an edge ifηpπ1q ăηpπ2qandπ2PNpπ1q,
• pπ1, π2qisD
ν
π1
pπ2q.
This graph is a valid DAG because the transitive, asymmet-
ric “ă” relation that builds edges prevents the graph from
having cycles. We also know that, for every non-optimal
policyπ, there in an outgoing edgepπ, π
1
qfromπ, asπ
1
can be computed with a step of mirror learning (see Figure
2
The above definition allows us to cast mirror learning as
a graph search problem. Namely, letπ0PΠbe a ver-
tex policy at which we initialise the search, which fur-
ther induces a sequencepπnq
8
n“0
. Let us defineUβ“
minπPΠ,sPSdpsq{βpsq. Theorem
the weight of any traversed edge as
ηpπn`1q ´ηpπnq ěUβ¨D
ν
πn
pπn`1q. (8)
Notice thatηpπnq ´ηpπ0q “
ř
n
i“1
“
ηpπiq ´ηpπi´1q
‰
con-
verges toη
˚
´ηpπ0q. Hence, the following series expansion
η
˚
´ηpπ0q “
8
ÿ
n“0
“
ηpπn`1q ´ηpπnq
‰
is valid. Combining it with Inequality (8), we obtain a
bound on the total weight of the path induced by the search
η
˚
´ηpπ0q
Uβ
ě
8
ÿ
n“0
D
ν
πn
pπn`1q, (9)
Hence, mirror learning finds a path fromπ0to the set of
graph sinks (the optimal policies), whose total weight is fi-
nite, and does not depend on the policies that appeared dur-
ing the search. Note that one could also estimate the left-
hand side from above by a boundpη
˚
`Vmaxq{Uβwhich
is uniform for all initial policicesπ0. This finding may be
considered counterintuitive: While we can be decreasing
the number of edges in the graph by shrinking the neigh-
bourhood operatorNpπq, a path toπ
˚
still exists, and de-
spite (perhaps) containing more edges, its total weight re-
mains bounded.
Lastly, we believe that this inequality can be exploited for
practical purposes: the drift functionalDis an abstract
hyper-parameter, and the choice of it is a part of algo-
rithm design. Practicioners can chooseDto describe a
cost that they want to limit throughout training. For ex-
ample, one can setD“risk, whereriskπp¯π|sqquan-
tifies some notion of risk of updatingπp¨|sqto¯πp¨|sq, or
D“memory, to only make updates at a reasonable mem-
ory expense. Inequality (9) guarantees that the total ex-
pense will remain finite, and provides an upper bound for
it. Thereby, we encourage employing mirror learning with
drifts designed to satisfy constraints of interest.
Mirror Learning
7. Numerical Experiments
We verify the correctness of our theory with numerical ex-
periments. Their purpose is not to establish a new state-
of-the-art performance in the most challenging deep RL
benchmarks. It is, instead, to demonstrate that algorithms
that fall into the mirror learning framework obey Theorem
3.69). Hence, to enable a close connec-
tion between the theory and experiments we choose simple
environments and for drift functionals we selected: KL-
divergence, squared L2 distance, squared total variation
distance, and the trivial (zero) drift. For the neighbourhood
operator, in one suite of experiments, we use the expected
drift ball,i.e., the set of policies within some distance from
the old policy, measured by the corresponding drift; in the
second suite, we use the KL ball neighbourhood, like in
TRPO. We represent the policies withn-dimensional ac-
tion spaces assoftmaxdistributions, withn´1parame-
ters. For an exact verification of the theoretical results,
we test each algorithm overonly onerandom seed. In
all experiments, we set the initial-state and sampling dis-
tributions to uniform. The code is available athttps:
//github.com/znowu/mirror-learning .
Single-step Game.In this game, the agent chooses one of
5actions, and receives a reward corresponding to it. These
rewards are10,0,1,0,5, respectively for each action. The
optimal return in this game equals10.
Tabular Game.This game has5states, lined up next to
each other. At each state, an agent can choose to go left, and
receive a reward of`0.1, stay at the current state and re-
ceive0reward, or go right and receive´0.1reward. How-
ever, if the agent goes left at the left-most state, it receives
´10reward, and if it goes right at the right-most state, it
recieves`10reward. In these two cases, the game termi-
nates. We setγ“0.999. Clearly, the optimal policy here
ignores the small intermediate rewards and always chooses
to go right. The corresponding optimal expected return is
approximately9.7.
GridWorld.We consider a5ˆ5grid, with a barrier that
limits the agent’s movement. The goal of the agent is to
reach the top-right corner as quick as possible, which ter-
minates the episode, and to avoid the bottom-left corner
with a bomb. It receives´1reward for every step taken,
and´100for stepping onto the bomb and terminating the
game. Forγ“0.999, the optimal expected return is ap-
proximately´7.
Figure
monotonic improvement property, and converges to the op-
timal returns—this confirms Theorem. In these sim-
ple environments, the learning curves are influenced by the
choice of the drift functional more than by the neighbour-
hood operator, although not significantly. An exception oc-
Figure 3.Mirror learning algorithms with different drifts and
neighbourhood operators tested on simple environments. The
solid lines represent the return, and the dotted ones represent
the total drift. Algorithms in the left column use the drift ball
neighbourhood, while those in the right use the KL ball. In both
columns there are algorithms with each of the aforementioned
drifts. Results are for one seed per environment and algorithm.
currs in the single-step game where, of course, one step of
GPI is sufficient to solve the game. We also see that the
value of the total drift (which we shift byVπ0
on the plots
for comparison) remains well below the return in all envi-
ronments, confirming Inequality (9).
8. Conclusion
In this paper, we introducedmirror learning—a framework
which unifies existing policy iteration algorithms. We have
proved Theorem, which states that any mirror learn-
ing algorithm solves the reinforcement learning problem.
As a corollary to this theorem, we obtained convergence
guarantees of state-of-the-art methods, including PPO and
TRPO. More importantly, it provides a framework for the
future development of theoretically-sound algorithms. We
also proposed an interesting, graph-theoretical perspective
on mirror learning, which establishes a connection between
graph theory and RL. Lastly, we verifed the correctness of
our theoretical results through numerical experiments on a
diverse family of mirror learning instances in three simple
toy settings. Designing and analysing the properties of the
myriads of other possible mirror learning instances is an
exciting avenue for future research.
Mirror Learning
Acknowledgements
I dedicate this work to my little brotherMichał, and thank
him for persistently bringing light to my life.
I have to go now. We won’t see each other for a while, but
one day we’ll get together again. I promise.
Kuba
References
Arjona-Medina, J. A., Gillhofer, M., Widrich, M., Un-
terthiner, T., Brandstetter, J., and Hochreiter, S. Rud-
der: Return decomposition for delayed rewards.arXiv
preprint arXiv:1806.07857, 2018.
Ausubel, L. M. and Deneckere, R. J. A generalized theorem
of the maximum.Economic Theory, 3(1):99–107, 1993.
Baxter, J. and Bartlett, P. L. Infinite-horizon policy-
gradient estimation.Journal of Artificial Intelligence Re-
search, 15:319–350, 2001.
Beck, A. and Teboulle, M. Mirror descent and nonlinear
projected subgradient methods for convex optimization.
Operations Research Letters, 31(3):167–175, 2003.
Bellman, R. A markov decision process. journal of mathe-
matical mechanics. 1957.
Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak,
P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S.,
Hesse, C., et al. Dota 2 with large scale deep reinforce-
ment learning.arXiv preprint arXiv:1912.06680, 2019.
Chu, C., Blanchet, J., and Glynn, P. Probability func-
tional descent: A unifying perspective on gans, varia-
tional inference, and reinforcement learning. InInterna-
tional Conference on Machine Learning, pp. 1213–1222.
PMLR, 2019.
Duan, Y., Chen, X., Houthooft, R., Schulman, J., and
Abbeel, P. Benchmarking deep reinforcement learning
for continuous control. InInternational conference on
machine learning, pp. 1329–1338. PMLR, 2016.
Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Janoos,
F., Rudolph, L., and Madry, A. Implementation mat-
ters in deep policy gradients: A case study on ppo and
trpo. InInternational Conference on Learning Repre-
sentations, 2020.
Fukushima, K. and Miyake, S. Neocognitron: A self-
organizing neural network model for a mechanism of vi-
sual pattern recognition. InCompetition and cooperation
in neural nets, pp. 267–285. Springer, 1982.
Gateaux, R. Sur diverses questions de calcul fonctionnel.
Bulletin de la Soci´et´e Math´ematique de France, 50:1–37,
1922.
Hamilton, E. and Nashed, M. Global and local variational
derivatives and integral representations of gˆateaux differ-
entials.Journal of Functional Analysis, 49(1):128–144,
1982.
Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup,
D., and Meger, D. Deep reinforcement learning that mat-
ters. InProceedings of the AAAI conference on artificial
intelligence, volume 32, 2018.
Hsu, C. C.-Y., Mendler-D¨unner, C., and Hardt, M. Re-
visiting design choices in proximal policy optimization.
arXiv preprint arXiv:2009.10897, 2020.
Kakade, S. and Langford, J. Approximately optimal ap-
proximate reinforcement learning. InIn Proc. 19th In-
ternational Conference on Machine Learning. Citeseer,
2002.
Kuba, J. G., Chen, R., Wen, M., Wen, Y., Sun, F., Wang,
J., and Yang, Y. Trust region policy optimisation
in multi-agent reinforcement learning.arXiv preprint
arXiv:2109.11251, 2021.
Lan, G. Policy mirror descent for reinforcement learn-
ing: Linear convergence, new sampling complex-
ity, and generalized problem classes.arXiv preprint
arXiv:2102.00135, 2021.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez,
T., Tassa, Y., Silver, D., and Wierstra, D. Continuous
control with deep reinforcement learning.arXiv preprint
arXiv:1509.02971, 2015.
Liu, B., Cai, Q., Yang, Z., and Wang, Z. Neural proxi-
mal/trust region policy optimization attains globally op-
timal policy.arXiv preprint arXiv:1906.10306, 2019.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Ve-
ness, J., Bellemare, M. G., Graves, A., Riedmiller, M.,
Fidjeland, A. K., Ostrovski, G., et al. Human-level con-
trol through deep reinforcement learning.nature, 518
(7540):529–533, 2015.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lilli-
crap, T., Harley, T., Silver, D., and Kavukcuoglu, K.
Asynchronous methods for deep reinforcement learning.
InInternational conference on machine learning, pp.
1928–1937. PMLR, 2016.
Nemirovskij, A. S. and Yudin, D. B. Problem complexity
and method efficiency in optimization. 1983.
Neu, G., Jonsson, A., and G´omez, V. A unified view of
entropy-regularized markov decision processes.CoRR,
abs/1705.07798, 2017. URLhttp://arxiv.org/
abs/1705.07798.
Mirror Learning
Queeney, J., Paschalidis, I., and Cassandras, C. Gener-
alized proximal policy optimization with sample reuse.
Advances in Neural Information Processing Systems, 34,
2021.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and
Moritz, P. Trust region policy optimization. InInterna-
tional conference on machine learning, pp. 1889–1897.
PMLR, 2015.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and
Klimov, O. Proximal policy optimization algorithms.
ArXiv, abs/1707.06347, 2017.
Shani, L., Efroni, Y., and Mannor, S. Adaptive trust region
policy optimization: Global convergence and faster rates
for regularized mdps. InProceedings of the AAAI Con-
ference on Artificial Intelligence, volume 34, pp. 5668–
5675, 2020.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D.,
and Riedmiller, M. Deterministic policy gradient algo-
rithms. InInternational conference on machine learning,
pp. 387–395. PMLR, 2014.
Sutton, R. S. and Barto, A. G.Reinforcement learning: An
introduction. 2018.
Sutton, R. S., Mcallester, D., Singh, S., and Mansour, Y.
Policy gradient methods for reinforcement learning with
function approximation. InAdvances in Neural Informa-
tion Processing Systems 12, volume 12, pp. 1057–1063.
MIT Press, 2000.
Tomar, M., Shani, L., Efroni, Y., and Ghavamzadeh, M.
Mirror descent policy optimization.arXiv preprint
arXiv:2005.09814, 2020.
Van Hasselt, H., Guez, A., and Silver, D. Deep reinforce-
ment learning with double q-learning. InProceedings
of the AAAI conference on artificial intelligence, vol-
ume 30, 2016.
Vaswani, S., Bachem, O., Totaro, S., Mueller, R., Geist,
M., Machado, M. C., Castro, P. S., and Roux, N. L.
A functional mirror ascent view of policy gradient
methods with function approximation.arXiv preprint
arXiv:2108.05828, 2021.
Wang, Y., He, H., and Tan, X. Truly proximal policy op-
timization. InUncertainty in Artificial Intelligence, pp.
113–122. PMLR, 2020.
Watkins, C. J. C. H. and Dayan, P. Q-learning.Ma-
chine Learning, 8(3):279–292, May 1992. ISSN 1573-
0565. doi: 10.1007/BF00992698. URLhttps://
doi.org/10.1007/BF00992698 .
Williams, R. J. Simple statistical gradient-following al-
gorithms for connectionist reinforcement learning.Ma-
chine learning, 8(3-4):229–256, 1992.
Zhan, W., Cen, S., Huang, B., Chen, Y., Lee, J. D., and Chi,
Y. Policy mirror descent for regularized reinforcement
learning: A generalized framework with linear conver-
gence.arXiv preprint arXiv:2105.11066, 2021.
Zhao, T., Hachiya, H., Niu, G., and Sugiyama, M. Analysis
and improvement of policy gradient estimation. InNIPS,
pp. 262–270. Citeseer, 2011.
Mirror Learning
A. Definition Details
Definition 3.1.Adrift functionalDis a map
D: ΠˆSÑ tDπp¨|sq:PpAq ÑRu,
such that for allsPS, andπ,¯πPΠ, writingDπp¯π|sqforDπ
`
¯πp¨|sq|s
˘
, the following conditions are met
1.Dπp¯π|sq ěDπpπ|sq “0(nonnegativity),
2.Dπp¯π|sqhas zero gradient
3
with respect to¯πp¨|sq, evaluated at¯πp¨|sq “πp¨|sq(zero gradient).
Letν
¯π
πpsq PPpSqbe a state distribution that can depend onπand¯π. ThedriftD
ν
of¯πfromπis defined as
D
ν
πp¯πqfiEs„ν
¯π
π
“
Dπp¯π|sq
‰
whereν
π
¯πis such that the expectation is continuous inπand¯πand monotonically non-decreasing in the total variation
distance between them. The drift ispositiveifD
ν
πp¯πq “0implies¯π“π, andtrivialifD
ν
πp¯πq “0,@π,¯πPΠ.
In this definition, the notion of the gradient ofDπp¯π|sqwith respect to¯πp¨|sq PPpAqis rather intuitive. As the setPpAq
is not a subset ofR
n
, the gradient is not necessarily defined—however, we do not require its existence. Instead, asPpAq
is a convex statistical manifold, we consider its Gˆateaux derivatives (Gateaux,;,). That is,
for anypPPpAq, writingv“p´πp¨|sq, we consider the Gˆateaux derivative, given as the limit
δDπp¯π|sqrv, πp¨|sqsfilim
ϵÑ0
Dπpπ`ϵ¨v|sq ´Dπpπ|sq
ϵ
(10)
and require them to be zero at¯πp¨|sq “πp¨|sq. That is, we require thatδDπp¯π|sqrv, πp¨|sqs “0.
Definition 3.2.We say thatN: ΠÑPpΠqis aneighbourhood operator, wherePpΠqis the power set ofΠ, if
1. (continuity),
2. Npπqis a compact set(compactness),
3. χ: ΠˆΠÑR, such that@πPΠ, there existsζą0, such thatχpπ,¯πq ďζimplies¯πPNpπq
(closed ball).
Thetrivialneighbourhood operator isN”Π.
We specify that a metricχ: ΠˆΠÑRis a metric between elements of a product of statistical manifolds,i.e.,Π“
Ś
sPS
PpAq. Therefore, we carefully require that it is a non-negative, monotonically non-decreasing function of individual
statistical divergence metricsχs
`
πp¨|sq,¯πp¨|sq
˘
,@sPS, and thatχ
`
π,¯πp¨|sq
˘
“0only ifχs
`
πp¨|sq,¯πp¨|sq
˘
“0,@sPS.
3
More precisely, all its Gˆateaux derivatives are zero.
Mirror Learning
B. Proofs of Lemmas
Lemma 3.3.Letπoldandπnewbe policies. Suppose that
rM
πnew
D
Vπold
spsq ě rM
πold
D
Vπold
spsq,@sPS. (4)
Then,πnewis better thanπold, so that for every states,
Vπnewpsq ěVπold
psq.
Proof.Let us, for brevity, writeVnew“Vπnew
,Vold“Vπold
,β“βπold
,ν“ν
πnew
πold
, andEπr¨s “Ea„π,s
1
„Pr¨s. For every state
sPS, we have
Vnewpsq ´Voldpsq
“Eπnew
“
rps,aq `γVnewps
1
q
‰
´Eπold
“
rps,aq `γVoldps
1
q
‰
“Eπnew
“
rps,aq `γVoldps
1
q
‰
´
νpsq
βpsq
Dπoldpπnew|sq
´Eπold
“
rps,aq `γVoldps
1
q
‰
`γEπnew
“
Vnewps
1
q ´Voldps
1
q
‰
`
νpsq
βpsq
Dπold
pπnew|sq
“Eπnew
“
Qπold
ps,aq
‰
´
νpsq
βpsq
Dπold
pπnew|sq
´Eπold
“
Qπoldps,aq
‰
`γEπnew
“
Vnewps
1
q ´Voldps
1
q
‰
`
νpsq
βpsq
Dπold
pπnew|sq
“
“
M
πnew
D
Vold
‰
psq ´
“
M
πold
D
Vold
‰
psq
`γEπnew
“
Vnewps
1
q ´Voldps
1
q
‰
`
νpsq
βpsq
Dπold
pπnew|sq
ěγEπnew
“
Vnewps
1
q ´Voldps
1
q
‰
`
νpsq
βpsq
Dπold
pπnew|sq,
by Inequality (4). Taking infimum within the expectation,
Vnewpsq ´Voldpsq
ěγ¨inf
s
1
“
Vnewps
1
q ´Voldps
1
q
‰
`
νpsq
βpsq
Dπold
pπnew|sq.
Taking infimum overs
inf
s
“
Vnewpsq ´Voldpsq
‰
ěγinf
s
1
“
Vnewps
1
q ´Voldps
1
q
‰
`inf
s
νpsq
βpsq
Dπoldpπnew|sq.
We conclude that
inf
s
“
Vnewpsq ´Voldpsq
‰
ěinf
s
νpsq
βpsq
Dπoldpπnew|sq{p1´γq,
which is non-negative, and concludes the proof.
Lemma 3.5.Letπoldbe a policy andπnewbe obtained from the mirror update of Equation (5). Then,
rM
πnew
D
Vπold
spsq ě rM
πold
D
Vπold
spsq,@sPS.
Hence,πnewattains the properties provided by Lemma.
Mirror Learning
Proof.We will prove the statement by contradiction. Suppose that
πnew“arg max
¯πPNpπoldq
Es„βπ
old
”
“
M
¯π
DVπold
‰
psq
ı
, (11)
and that there exists a states0, such that
“
M
πnew
D
Vπold
‰
ps0q ă
“
M
πold
D
Vπold
‰
ps0q. (12)
Let us define a policyˆπ, so thatˆπp¨|s0q “πoldp¨|s0q, andˆπp¨|sq “πnewp¨|sqfors‰s0. As the distance between ofˆπfrom
πoldis the same as ofπnewats‰s0, and possibly smaller ats0, we have that¯πis not further fromπoldthanπnew. Hence,
ˆπPNpπoldq. Furthermore, Inequality (12) implies
“
M
ˆπ
DVπold
‰
ps0q ą
“
M
πnew
D
Vπold
‰
ps0q.
Hence,
Es„βπ
old
”
“
M
ˆπ
DVπold
‰
psq
ı
´Es„βπ
old
”
“
M
πnew
D
Vπold
‰
psq
ı
“
`
Es„βπ
old
,a„ˆπ
”
Qπold
ps,aq
ı
´D
v
πold
pˆπq
˘
´
`
Es„βπ
old
,a„πnew
”
Qπold
ps,aq
ı
´D
v
πold
pπnewq
˘
ě
`
Es„βπ
old
,a„ˆπ
”
Qπold
ps,aq
ı
´D
v
πold
pπnewq
˘
´
`
Es„βπ
old
,a„πnew
”
Qπold
ps,aq
ı
´D
v
πold
pπnewq
˘
because of monotonic non-decreasing of the drift in TV-distance
“Es„βπ
old
,a„ˆπ
”
Qπold
ps,aq
ı
´Es„βπ
old
,a„πnew
”
Qπold
ps,aq
ı
“βoldps0q
´
“
M
ˆπ
DVπold
‰
ps0q ´
“
M
πnew
D
Vπold
‰
ps0q
¯
ą0.
This is a contradiction with Equation (11), which finishes the proof.
Mirror Learning
C. The Proof of Theorem
Theorem 3.6(The Fundamental Theorem of Mirror Learning).LetD
ν
be a drift,Nbe a neighbourhood operator, and
the sampling distributionβπdepend continuously onπ. Letπ0PΠ, and the sequence of policiespπnq
8
n“0
be obtained by
mirror learning induced byD
ν
,N, andβπ. Then, the learned policies
1.
ηpπn`1q ěηpπnq `Es„d
«
ν
πn`1
πnpsq
βπn
psq
Dπn
pπn`1|sq
ff
,
2.
lim
nÑ8
Vπn
“V
˚
,
3.
lim
nÑ8
ηpπnq “η
˚
,
4. ω-limit set consists of the optimal policies.
Proof.We split the proof of the whole theorem into two parts, each of which proves different groups of properties stated
in the theorem.
Strict monotonic improvement (Property)
By Lemma, we have that @nPN, sPS,
Vπn`1
psq ěVπn
psq. (13)
Let us writeβ“βπn
andν“ν
πn`1
πn. With this inequality, as well as Lemma, we have
Vπn`1
psq ´Vπn
psq
“Ea„πn`1,s
1
„P
“
rps,aq `γVπn`1
ps
1
q
‰
´Ea„πn,s
1
„P
“
rps,aq `γVπn
ps
1
q
‰
“
νpsq
βpsq
Dπn
pπn`1|sq `Ea„πn`1,s
1
„P
“
rps,aq `γVπn`1
ps
1
q
‰
´
νpsq
βpsq
Dπn
pπn`1|sq ´Ea„πn,s
1
„P
“
rps,aq `γVπn
ps
1
q
‰
.
Now, by Inequality (13) ,
ě
νpsq
βpsq
Dπn
pπn`1|sq `Ea„πn`1
“
rps,aq `γVπn
ps
1
q
‰
´
νpsq
βpsq
Dπn
pπn`1|sq ´Ea„πn
“
rps,aq `γVπn
ps
1
q
‰
“
νpsq
βpsq
Dπn
pπn`1|sq `Ea„πn`1
“
Qπn
ps,aq
‰
´
νpsq
βpsq
Dπn
pπn`1|sq ´Ea„πn
“
Qπn
ps,aq
‰
“
νpsq
βpsq
Dπnpπn`1|sq `
“
M
πn`1
D
Vπn
‰
psq ´
“
M
πn
D
Vπn
‰
psq
ě
νpsq
βpsq
Dπnpπn`1|sq,
Mirror Learning
where the last inequality holds by Lemma. To summarise, we proved that
Vπn`1psq ´Vπnpsq ě
νpsq
βpsq
Dπnpπn`1|sq. (14)
Taking the expectation over s„d, we obtain
ηpπn`1q ´ηpπnq ěEs„d
«
νpsq
βpsq
Dπn
pπn`1|sq
ff
,
which finishes this part of the proof.
Optimality (Properties,, &)
Step 1 (convergence of value functions).
We consider the sequencepVπnq
8
n“0
of associated value functions. By Lemma, we know that @nPN,Vπn`1ěVπn
uniformly. However, as for everynPNandsPS, the uniform boundVπn
psq ďVmaxholds, the sequence of value
functions must converge. We denote its limit byV.
Step 2 (characterisation of policy limit points).
LetLΠbe the set of all limit points ofpπnq
8
n“0
. As the sequencepπnq
8
n“0
is bounded, Bolzano-Weierstrass guarantees
that there exists¯πPLΠ, and a subsequence, saypπni
q
8
i“1
, which converges to it. Writingβk“βπk
, we consider the
optimisation problem that mirror learning induces at every elementπni
of this subsequence,
max
πPNpπn
i
q
Es„βn
i
”
“
M
π
DVπn
i
‰
psq
ı
(15)
Note that by the continuity of the value function (Kuba et al.,, Appendix A), as well as the drift, the neighbourhood
operator, and the sampling distribution, we obtain that the above expression is continuous inπni
. Hence, by Berge’s
Maximum Theorem (Ausubel & Deneckere,), writing
¯
β“β¯π, asiÑ 8, the above expression converges to the
following,
max
πPNp¯πq
E
s„
¯
β
”
“
M
π
DV¯π
‰
psq
ı
. (16)
Note thatV¯π“V, as the sequence of value functions converges (has a unique limit pointV).Furthermore, for alliPN,
πni`1is the argmax (precisely, is an element of the upper hemicontinuous argmax correspondence) of Equation (15).
Hence, there exists a subsequencepπni
k
`1qofpπni`1qiPNthat converges to a policyπ
1
, which is a solution to Equation
(16).
Claim:The solution to Equation(16)isπ
1
“¯π.
We prove the above claim by contradiction. Suppose thatπ
1
‰¯π. Asπ
1
can be obtained from¯πvia mirror learning,
Inequality (13) implies
Qπ
1ps, aq “Es
1
„P
“
rps, aq `γVπ
1ps
1
q
‰
ěEs
1
„P
“
rps, aq `γV¯πps
1
q
‰
“Q¯πps, aq. (17)
If we have
E
s„
¯
β
”
“
M
π
1
DV¯π
‰
psq
ı
ąE
s„
¯
β
”
“
M
¯π
DV¯π
‰
psq
ı
,
then for some states,
“
M
π
1
DV¯π
‰
psq ą
“
M
¯π
DV¯π
‰
psq,
which can be written as
Ea„π
1
“
Q¯πps,aq
‰
´
ν
π
1
¯πpsq
¯
βpsq
D¯πpπ
1
|sq
ąEa„¯π
“
Q¯πps,aq
‰
´
ν
¯π
¯πpsq
¯
βpsq
D¯πp¯π|sq “V¯πpsq “Vpsq.
Mirror Learning
Using Inequality (17), and non-negativity of drifts,
Vπ
1psq “Ea„π
1
“
Qπ
1ps,aq
‰
ěEa„π
1
“
Q¯πps,aq
‰
´
νpsq
¯
βpsq
D¯πpπ
1
|sq ąVpsq.
However, asVπ
1is the limit ofVπn
i
k
`1, we haveVπ
1“V(uniqueness of the value limit) which yields a contradiction,
proving the claim. Hence, the limit point¯πof the sequencepπnq
8
n“0
satisfies
¯π“arg max
πPNp¯πq
E
s„
¯
β
”
“
M
π
DV¯π
‰
psq
ı
. (18)
Step 3 (dropping the drift).
Let¯πbe a limit point ofpπnq
8
n“0
. From Equation (18), and the definition of the mirror operator, we know that
¯π“arg max
πPNp¯πq
E
s„
¯
β,a„π
“
A¯πps,aq
‰
´D
ν
¯πpπq. (19)
Suppose that there exists a policyπ
1
, and a states, such that
Ea„π
1
“
A¯πps,aq
‰
ąEa„¯π
“
A¯πps,aq
‰
“0. (20)
For any policyπ, consider the canonical parametrisationπp¨|sq “ px1, . . . , xm´1,1´
ř
m´1
i“1
xiq, wheremis the size of
the action space. We have that
Ea„π
“
A¯πps,aq
‰
“
m
ÿ
i“1
πpai|sqA¯πps, aiq
“
m´1ÿ
i“1
xiA¯πps, aiq ` p1´
m´1ÿ
j“1
xjqA¯πps, amq
“
m´1
ÿ
i“1
xi
“
A¯πps, aiq ´A¯πps, amq
‰
`A¯πps, amq.
This means thatEa„π
“
A¯πps,aq
‰
is an affine function ofπp¨|sq, and thus, its Gˆateaux derivatives are constant inPpAqfor
fixed directions. Together with Inequality (20), this implies that the G ˆateaux derivative, in the direction from¯πtoπ
1
, is
strictly positive. Furthermore, the Gˆateaux derivatives of
ν
π
¯π
psq
¯
βpsq
D¯πpπ|sqare zero atπp¨|sq “¯πp¨|sq, as
0“
ν
¯π
¯πpsq
¯
βpsq
D¯πp¯π|sq “
1
¯
βpsq
D¯πp¯π|sq,
and
0ď
ν
π
¯πpsq
¯
βpsq
D¯πpπ|sq ď
1
¯
βpsq
D¯πpπ|sq,
and both the lower and upper of the above bounds have zero derivatives. Hence, the Gˆateaux derivative ofEa„π
“
A¯πps,aq
‰
´
ν
π
¯π
psq
¯
βpsq
D¯πpπ|sqis strictly positive. Therefore, for conditional policiesˆπp¨|sqsufficiently close to¯πp¨|sqin the direction
towardsπ
1
p¨|sq, we have (with a slight abuse of notation forν)
Ea„ˆπ
“
A¯πps,aq
‰
´
ν
ˆπ
¯πpsq
¯
βpsq
D¯πpˆπ|sq ą0. (21)
Let us construct a policyrπas follows. For all statesy‰s, we setrπp¨|yq “¯πp¨|yq. Moreover, forrπp¨|sqwe chooseˆπp¨|sq
as in Inequality (21), sufficiently close to ¯πp¨|sq, so thatrπPNp¯πq. Then, we have
E
s„
¯
β,a„rπ
“
A¯πps,aq
‰
´D
ν
¯πprπq
“
¯
βpsq
“
Ea„ˆπ
“
A¯πps,aq
‰
´
ν
ˆπ
¯πpsq
β¯πpsq
D¯πpˆπ|sq
‰
ą0. (22)
Mirror Learning
The above contradicts Equation (19). Hence, the assumption made in Inequality (20) was false. Thus, we have proved that,
for every states,
¯π“arg max
π
Ea„π
“
A¯πps,aq
‰
“arg max
π
Ea„π
“
Q¯πps,aq
‰
. (23)
Step 4 (optimality).
Equation (23) implies that ¯πis an optimal policy (Sutton & Barto,), and so the value function V“V¯πis the optimal
value functionV
˚
(Property). Consequently, the expected return that the policies converge to,η“Es„d
“
Vpsq
‰
“
Es„d
“
V
˚
psq
‰
“η
˚
is optimal (Property). Lastly, as ¯πwas an arbitrary limit point, any element of theω-limit set is an
optimal policy (Property). This finishes the proof.
D. Extension to Continuous State and Action Spaces
The results on Mirror Learning extend to continuous state and action spaces through general versions of our claims. These
require a little more care in their formulation, but their validity holds as a corollary to our proofs.
In general, the state and the action spacesSandAmust be assumed to be compact and measurable. For the state space,
we introduce a reference probability measureµS:PpSq ÑRthat is strictly positive,i.e,µSpsq ą0,@sPS. Under such
setting, a policyπ
˚
is optimal if it satisfies the Bellman optimality equation
π
˚
p¨|sq “arg max
πp¨|sqPPpAq
Ea„p
“
Q
π
˚ps,aq
‰
,
at states that form a set of measure1with respect toµS. In other words, a policy is optimal if it obeys the Bellman
optimality equation almost surely.
As for the results, the inequality provided by Lemma
respect toµSas long as the policyπnewsatisfies Inequality (4), also almost surely with respect to this measure. Of course,
the corollary on the monotonic improvement remains valid. Next, the inequality introduced in Lemma
stated almost surely, again with respect toµS. Lastly, the entire statement of Theorem
same formulation for all compact .
Mirror Learning
E. The Listing of RL Approaches as Instances of Mirror Learning
Generalised Policy Iteration (Sutton & Barto,)
πnewp¨|sq “arg max
¯πp¨|sqPPpAq
Ea„¯π
“
Qπold
ps,aq
‰
,@sPS
• D”0.
• N”Π.
•
Trust-Region Learning (Schulman et al.,)
πnew“arg max
¯πPΠ
Es„ρπ
old
,a„¯π
“
Aπoldps,aq
‰
´CD
max
KLpπold,¯πq,
whereC“
4γmaxs,a|Aπold
ps, aq|
p1´γq
2
•
Dπp¯π|sq “ p1´γqCDKL
`
πp¨|sq,¯πp¨|sq
˘
,withν
¯π
πpsmaxq “1,wheresmax“arg max
sPS
Dπp¯π|sq.
• N”Π.
• βπ“¯ρπ, the normalised marginal discounted state distribution.
TRPO (Schulman et al.,)
πnew“arg max
¯πPΠ
Es„ρπ
old
,a„¯π
“
Aπold
ps,aq
‰
subject toEs„ρπ
old
“
DKL
`
πoldp¨|sq, πnewp¨|sq
˘‰
ďδ.
• D”0.
•
Npπq “ t¯πPΠ :Es„ρπ
“
DKL
`
πp¨|sq,¯πp¨|sq
˘‰
ďδ
(
.
• βπ“¯ρπ.
PPO (Schulman et al., 2017)
$$\pi_{\text{new}} = \arg\max_{\bar\pi\in\Pi}\,\mathbb{E}_{s\sim\bar\rho_{\pi_{\text{old}}},\,a\sim\pi_{\text{old}}}\big[L^{\text{clip}}\big],$$
where, for given $s, a$, and $r(\bar\pi) = \frac{\bar\pi(a|s)}{\pi_{\text{old}}(a|s)}$,
$$L^{\text{clip}} = \min\Big(r(\bar\pi)A_{\pi_{\text{old}}}(s,a),\;\mathrm{clip}\big(r(\bar\pi),1\pm\epsilon\big)A_{\pi_{\text{old}}}(s,a)\Big).$$
• $D_\pi(\bar\pi|s) = \mathbb{E}_{a\sim\pi}\Big[\mathrm{ReLU}\Big(\big[r(\bar\pi) - \mathrm{clip}\big(r(\bar\pi),1\pm\epsilon\big)\big]A_\pi(s,a)\Big)\Big]$, with $\nu^{\bar\pi}_{\pi} = \bar\rho_\pi$.
• $\mathcal{N} \equiv \Pi$. Note, however, that in practical implementations the policy is updated with $\epsilon_{\text{PPO}}$ steps of gradient ascent with gradient-clipping threshold $M$. This corresponds to a neighbourhood of an L2-ball of radius $M\epsilon_{\text{PPO}}$ in the policy parameter space.
• $\beta_\pi = \bar\rho_\pi$.
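The drift listed above can be checked numerically: pointwise, $\min\big(rA,\mathrm{clip}(r,1\pm\epsilon)A\big) = rA - \mathrm{ReLU}\big((r - \mathrm{clip}(r,1\pm\epsilon))A\big)$, which mirrors the decomposition of the clipping objective used to derive this drift in the main text. The short Python sketch below (a hypothetical check, not part of the paper's codebase) verifies this identity on random ratios and advantages.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.2

def l_clip(r, A, eps):
    # L^clip = min( r * A, clip(r, 1 - eps, 1 + eps) * A )
    return np.minimum(r * A, np.clip(r, 1.0 - eps, 1.0 + eps) * A)

def relu_drift_term(r, A, eps):
    # ReLU( (r - clip(r, 1 - eps, 1 + eps)) * A ), the per-sample drift penalty
    return np.maximum(0.0, (r - np.clip(r, 1.0 - eps, 1.0 + eps)) * A)

r = rng.uniform(0.0, 3.0, size=10_000)   # importance ratios pi_bar(a|s) / pi_old(a|s)
A = rng.normal(size=10_000)              # advantage estimates

lhs = l_clip(r, A, eps)
rhs = r * A - relu_drift_term(r, A, eps)
print("identity holds:", np.allclose(lhs, rhs))
```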
PPO-KL (Hsu et al., 2020)
$$\pi_{\text{new}} = \arg\max_{\bar\pi\in\Pi}\,\mathbb{E}_{s\sim\bar\rho_{\pi_{\text{old}}},\,a\sim\bar\pi}\big[A_{\pi_{\text{old}}}(s,a)\big] - \tau D_{\mathrm{KL}}(\pi_{\text{old}},\bar\pi),$$
• $D_\pi(\bar\pi|s) = \tau D_{\mathrm{KL}}\big(\pi(\cdot|s),\bar\pi(\cdot|s)\big)$, with $\nu^{\bar\pi}_{\pi} = \bar\rho_\pi$.
• $\mathcal{N} \equiv \Pi$.
• $\beta_\pi = \bar\rho_\pi$.
MDPO (Tomar et al., 2020)
$$\pi_{\text{new}} = \arg\max_{\bar\pi\in\Pi}\,\mathbb{E}_{s\sim\beta_{\pi_{\text{old}}},\,a\sim\bar\pi}\big[A_{\pi_{\text{old}}}(s,a)\big] - \frac{1}{t_{\pi_{\text{old}}}} D_{\mathrm{KL}}\big(\bar\pi,\pi_{\text{old}}\big).$$
• $D_\pi(\bar\pi|s) = \frac{1}{t_\pi} D_{\mathrm{KL}}\big(\bar\pi(\cdot|s),\pi(\cdot|s)\big)$, with $\nu^{\bar\pi}_{\pi} = \beta_\pi$.
• $\mathcal{N} = \Pi$.
• $\beta_\pi = \bar\rho_\pi$ for on-policy MDPO, and $\beta_{\pi_n}(s) = \frac{1}{n+1}\sum_{i=0}^{n}\bar\rho_{\pi_i}(s)$ for off-policy MDPO.
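To make the listing concrete, the sketch below (a minimal, hypothetical tabular instantiation, not the paper's released implementation) evaluates the batch objective of Equation (6) for one candidate policy with a KL drift, under the assumption $\nu \equiv \beta$ so that the ratio in front of the drift cancels; any of the drifts listed above could be swapped in.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3

# Hypothetical tabular quantities: old policy, candidate policy, Q-estimates.
pi_old = rng.dirichlet(np.ones(n_actions), size=n_states)
pi_bar = rng.dirichlet(np.ones(n_actions), size=n_states)
Q_old = rng.normal(size=(n_states, n_actions))

def kl_drift(p, q):
    # D_pi(pi_bar | s) = KL( pi(.|s) || pi_bar(.|s) ), batched over states
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def batch_objective(batch_s, batch_a, pi_bar, pi_old, Q_old):
    """Monte-Carlo estimate of the Equation (6) objective with nu = beta."""
    ratios = pi_bar[batch_s, batch_a] / pi_old[batch_s, batch_a]
    value_term = ratios * Q_old[batch_s, batch_a]
    drift_term = kl_drift(pi_old[batch_s], pi_bar[batch_s])
    return float(np.mean(value_term - drift_term))

# A batch gathered on-policy: states from the sampling distribution, actions from pi_old.
batch_s = rng.integers(0, n_states, size=256)
batch_a = np.array([rng.choice(n_actions, p=pi_old[s]) for s in batch_s])
print(batch_objective(batch_s, batch_a, pi_bar, pi_old, Q_old))
```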
F. Instructions for Implementation of Off-Policy Mirror Learning
In the case of off-policy learning, estimating $\mathbb{E}_{a\sim\bar\pi}\big[Q_{\pi_{\text{old}}}(s,a)\big]$ is not as straightforward as in Equation (6), since sampling actions from the replay buffer is not equivalent to sampling actions from $\pi_{\text{old}}(\cdot|s)$ anymore. The reason for this is that while sampling an action from the buffer, we also sample a past policy, $\pi_{\text{hist}}$, which was used to insert the action into the buffer. To formalise it, we draw a past policy from some distribution dictated by the buffer, $\pi_{\text{hist}} \sim h \in \mathcal{P}(\Pi)$, and then draw an action $a \sim \pi_{\text{hist}}$. To account for this, we use the following estimator,
$$\frac{\bar\pi(a|s)}{\pi_{\text{hist}}(a|s)}\,Q_{\pi_{\text{old}}}(s,a).$$
Note that this requires that the value $\pi_{\text{hist}}(a|s)$ has also been inserted in the buffer. The expectation of the new estimator can be computed as
$$\mathbb{E}_{\pi_{\text{hist}}\sim h,\,a\sim\pi_{\text{hist}}}\Big[\frac{\bar\pi(a|s)}{\pi_{\text{hist}}(a|s)}\,Q_{\pi_{\text{old}}}(s,a)\Big] = \sum_{\pi_{\text{hist}}} h(\pi_{\text{hist}}) \sum_{a\in A} \pi_{\text{hist}}(a|s)\,\frac{\bar\pi(a|s)}{\pi_{\text{hist}}(a|s)}\,Q_{\pi_{\text{old}}}(s,a)$$
$$= \sum_{a\in A}\bar\pi(a|s)\,Q_{\pi_{\text{old}}}(s,a)\sum_{\pi_{\text{hist}}} h(\pi_{\text{hist}}) = \sum_{a\in A}\bar\pi(a|s)\,Q_{\pi_{\text{old}}}(s,a) = \mathbb{E}_{a\sim\bar\pi}\big[Q_{\pi_{\text{old}}}(s,a)\big].$$
Hence, the estimator has the desired mean.
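As a quick numerical illustration of this argument (a hypothetical single-state sketch, not part of the paper's code), one can sample actions through a mixture of past behaviour policies, reweight by $\bar\pi(a|s)/\pi_{\text{hist}}(a|s)$, and compare the empirical mean against $\mathbb{E}_{a\sim\bar\pi}\big[Q_{\pi_{\text{old}}}(s,a)\big]$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4

# Hypothetical single-state example: Q-values, target policy pi_bar, and a
# buffer distribution h over two past behaviour policies pi_hist.
Q_old = rng.normal(size=n_actions)
pi_bar = rng.dirichlet(np.ones(n_actions))
pi_hist = rng.dirichlet(np.ones(n_actions), size=2)
h = np.array([0.3, 0.7])

exact = float(pi_bar @ Q_old)  # E_{a ~ pi_bar}[ Q_old(s, a) ]

n_samples = 50_000
which = rng.choice(2, size=n_samples, p=h)                    # pi_hist ~ h
actions = np.array([rng.choice(n_actions, p=pi_hist[k]) for k in which])
weights = pi_bar[actions] / pi_hist[which, actions]           # importance weights
estimate = float(np.mean(weights * Q_old[actions]))

print(f"exact = {exact:.4f}, off-policy estimate = {estimate:.4f}")
```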