
Atli Kosson¹²†, Jeremy Welborn¹, Yang Liu¹, Martin Jaggi², Xi Chen¹
¹Amazon FAR (Frontier AI & Robotics), ²EPFL
Abstract

Transferring the optimal learning rate from small to large neural networks can enable efficient training at scales where hyperparameter tuning is otherwise prohibitively expensive. To this end, the Maximal Update Parameterization (µP) proposes a learning rate scaling designed to keep the update dynamics of internal representations stable across different model widths. However, the scaling rules of µP rely on strong assumptions, particularly about the geometric alignment of a layer's inputs with both its weights and gradient updates. In this large-scale empirical investigation, we show that these assumptions hold only briefly at the start of training in the practical setups where learning rate transfer is most valuable, such as LLM training. For the remainder of training it is weight decay rather than µP that correctly stabilizes the update dynamics of internal representations across widths, facilitating learning rate transfer. This suggests µP's scaling primarily acts as a form of implicit learning rate warmup, allowing us to largely replace it with modified warmup schedules. Together these findings fundamentally challenge prevailing beliefs about learning rate transfer and can explain empirical practice such as why µP requires the independent weight decay variant for successful transfer.
1 Introduction

The Maximal Update Parameterization (µP) has become a cornerstone of efficient training at scale, particularly for large language models (LLMs). Many open recipe LLMs have used variants of µP (Dey et al., 2023; Hu et al., 2024; Liu et al., 2023; Falcon-LLM Team, 2025; Cohere et al., 2025), and some commercial models with unknown training details likely use similar learning rate (LR) scaling approaches. For example, Llama4 used an internal technique called MetaP, Grok-2 configuration files suggest µP use, and the GPT-4 report (Achiam et al., 2023) cites a key µP paper coauthored by OpenAI staff (Yang et al., 2021). It has also impacted theory, showing how feature learning can occur in infinitely wide networks, contrasting with the neural tangent kernel regime (Jacot et al., 2018).
[Figure 1 panels: Final Val Loss vs. Base Learning Rate for Independent WD, Standard WD, and No Weight Decay; widths 128, 256, 512, 1024, 2048 with minimum-loss markers.]
Figure 1: µP specifically requires independent weight decay for good learning rate transfer when pre-training LLaMA networks with AdamW (20B tokens, width 2048 and 1B params, see Appendix F.3). Independent weight decay achieves this by counteracting µP's update scaling later in training, which is needed because the core assumptions µP's scaling is based on quickly break down. Standard weight decay does not counteract µP's scaling and consequently results in poor learning rate transfer on longer experiments. No weight decay behaves somewhere in-between, see Appx. B.

† Work done while interning at Amazon FAR.
arXiv:2510.19093v1 [cs.LG] 21 Oct 2025

Despite this success, several works (Wortsman et al., 2024; Blake et al., 2025; Wang & Aitchison,
2025; Bergsma et al., 2025) have empirically found that learning rates only transfer well in practice
when µP is combined with the independent variant of weight decay (WD), see Figure 1. In the conventional AdamW formulation (Algorithm 1), weight decay multiplies the weights by 1 − ηλ at each step, where η is the learning rate and λ the weight decay coefficient. With independent weight decay, the weights are instead decayed by 1 − λ, which is not affected by µP's learning rate scaling. Equivalently, λ can be scaled to keep the product ηλ constant as η is scaled in the formulation above.
In this work, we demonstrate that independent weight decay is not merely a helpful addition but a
central mechanism for enabling learning rate transfer in practice. We begin by developing a unifying
framework for weight decay and µP based on relative updates, the size of updates as a fraction of the current weights or features (Section 2). Within this framework, we show that independent weight decay directly counteracts and ultimately overrides µP's scaling during training (Section 3). We then show why this is crucial: for the majority of training, it is independent weight decay, not µP, that correctly stabilizes feature learning across different model widths (Section 4). This stabilization of feature learning seems to be what facilitates learning rate transfer and is a key objective of µP.
The need to override µP's prescribed update scaling arises because the underlying alignment assumptions fail to hold in practice. We show this can happen when the batch size is large relative to the model width, something that commonly happens in practice unlike the infinite-width limit that inspired µP (Section 5). This analysis fundamentally reframes µP's practical role as an implicit learning rate warmup. We empirically measure this effect (Section 6) and demonstrate that the practical benefit of µP can largely be replicated by using stronger, explicit warmup schedules (Section 7).
Together, our findings challenge prevailing beliefs about learning rate transfer and provide guidance
for achieving robust transfer in practice.
2 A Unified Relative Framework for µP and Weight Decay
2.1 Setup
The core µP and weight decay concepts discussed in this work can be understood from the linear transformation of a fully connected layer in isolation. For an input feature dimension C, output feature dimension K, and batch size B we write:

Y = W X,   X ∈ ℝ^{C×B} (inputs),   W ∈ ℝ^{K×C} (weights),   Y ∈ ℝ^{K×B} (outputs)   (1)

A weight update ∆W causes the layer's outputs to change as ∆Y = ∆W X for the same inputs X. Note that in the context of neural networks, ∆Y refers to the local change from this layer alone, ignoring the change ∆X in the inputs arising from the updates of earlier layers. Local changes can be related to total changes as shown by Yang et al. (2023), but we will focus on local ones in this work for simplicity.
2.2 Alignment
Central to this work and µP in general is connecting the size of the representation changes ∥∆Y∥ to the size of weight updates ∥∆W∥. We measure this in terms of the Frobenius norm ∥·∥_F. It is easy for an optimizer to give a specific weight update size ∥∆W∥: Adam (Kingma & Ba, 2015) does this approximately and sign-based optimizers like Lion (Chen et al., 2023) can do this perfectly. However, it is ultimately the size of the representation changes ∥∆Y∥ that determines the impact of an update. For example, a weight update that does not change Y at all has no immediate impact. However, controlling ∥∆Y∥ is much harder than ∥∆W∥ and is the core objective of µP's LR scaling. We describe the relationship between ∆Y and ∆W through the update alignment α_∆W, defined as:

α_∆W := ∥∆Y∥_F / (∥∆W∥_F ∥X∥_F) = ∥∆W X∥_F / (∥∆W∥_F ∥X∥_F) = √( Σ_{k,b} (∥∆w_k∥² ∥x_b∥²) / (∥∆W∥_F² ∥X∥_F²) · cos² ∠(∆w_k, x_b) )   (2)

This is a weighted root-mean-square average of the cosine similarities between input samples X = [x_1, . . . , x_B] and rows in the weight update ∆W = [∆w_1, . . . , ∆w_K]^⊤. Note that a high update alignment α_∆W means that ∆w_k and x_b tend to point in similar directions and will result in larger representation changes ∥∆Y∥ for a given update size ∥∆W∥. We similarly define the weight alignment α_W := ∥Y∥_F / (∥W∥_F ∥X∥_F), which relates the weight magnitude ∥W∥ to the representation size ∥Y∥.
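As a concrete illustration, both alignment measures can be computed directly from a weight matrix, an update, and a batch of inputs. The following sketch is our own (NumPy, with names we chose), not the paper's instrumentation:

import numpy as np

def update_alignment(dW, X):
    """alpha_dW = ||dW @ X||_F / (||dW||_F * ||X||_F), as in Equation (2)."""
    return np.linalg.norm(dW @ X) / (np.linalg.norm(dW) * np.linalg.norm(X))

def weight_alignment(W, X):
    """alpha_W = ||W @ X||_F / (||W||_F * ||X||_F)."""
    return np.linalg.norm(W @ X) / (np.linalg.norm(W) * np.linalg.norm(X))

# Toy check at initialization: random W and a random (unaligned) update both
# give alignments close to 1/sqrt(C), while real gradient updates are typically
# far more aligned with the inputs.
rng = np.random.default_rng(0)
C, K, B = 1024, 1024, 4096
W = rng.standard_normal((K, C)) / np.sqrt(C)    # variance-preserving init
X = rng.standard_normal((C, B))
dW = rng.standard_normal((K, C))                # random update, uncorrelated with X
print(weight_alignment(W, X), 1 / np.sqrt(C))
print(update_alignment(dW, X), 1 / np.sqrt(C))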

[Figure 2 panels: (a) Normalization Layers: x → κW → LayerNorm → y; (b) Counteracting Gain: x → κW → γ/κ → y; (c) Homogeneous Activation: x → κW₁ → ReLU → W₂/κ → y.]
Figure 2: Neural networks are commonly invariant to certain types of rescaling transformations. Examples where applying an in-place scaling factor κ > 0 does not affect the current block output y. Across such transformations, an update with a fixed absolute size ∥∆W∥ has a varying impact on the output while a given relative change ∥∆W∥/∥W∥ always has the same impact. This makes the relative change a more informative measure of the size of updates that better captures their impact.
2.3 Weight Decay and Relative Updates
Weight decay is widely thought to mainly function as a regularizer despite a long line of work that argues this is not the case in deep learning (Van Laarhoven, 2017; Zhang et al., 2019; Chiley et al., 2019; Li & Arora, 2020; Li et al., 2020; Wan et al., 2021; Li et al., 2022; Kosson et al., 2024b; D'Angelo et al., 2024). Instead it primarily acts as a secondary learning rate hyperparameter that modulates the size of relative weight updates ∥∆W∥/∥W∥. Kosson et al. (2024b) describe these effects in AdamW in detail. In the standard variant (Algorithm 1), weight decay will cause the weight norms to converge to a fixed equilibrium value ∥W∥_F ≈ √(KC · η/λ) for learning rate η, weight decay λ and matrix size K × C. At this value, the shrinking effect of weight decay and the growth effect of gradient updates balance out, keeping the norm constant in expectation. AdamW's normalization of the weight updates makes ∥∆W∥_F ≈ η√(KC), which in turn gives relative updates ∥∆W∥_F/∥W∥_F ≈ √(ηλ). Crucially, note that the learning rate η and weight decay λ affect the relative updates in equilibrium the same way, only their product matters:

∥W∥_F ≈ √(KC · η/λ),      ∥∆W∥_F/∥W∥_F ≈ √(ηλ)   (3)

Due to the structure of neural networks, relative changes are often more meaningful than absolute changes (Bernstein et al., 2020; Kosson et al., 2024b;a). A given relative update size ∥∆W∥/∥W∥ has the same impact across a number of equivalent parameter states obtained via transformations like those shown in Figure 2, unlike a fixed absolute update size ∥∆W∥. A similar argument extends to the representations: a given change ∆Y typically has a greater impact when ∥Y∥ is small than when it is large. Rescaling transformations do not occur in standard training but weight decay can have a similar overall effect by changing the weight norms over time while normalization layers or learnable gains can undo the effects of this on the output. In this work we focus on relative changes rather than absolute ones based on their informativeness and connection to weight decay, deviating from existing work on µP. Note that at the start of training these approaches are equivalent since ∥W∥ and ∥Y∥ are roughly determined by the initialization. However, later in training they can differ significantly.
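The equilibrium relations above can be illustrated with a small toy simulation (our own construction, not the paper's setup): unit-RMS random updates stand in for Adam-normalized updates and the weights are decayed by 1 − ηλ each step as in the PyTorch formulation.

import numpy as np

rng = np.random.default_rng(0)
K = C = 128

def steady_state_norm(eta, lam, steps=30_000):
    W = rng.standard_normal((K, C)) / np.sqrt(C)
    for _ in range(steps):
        update = rng.standard_normal((K, C))      # RMS ~ 1 per element, like Adam
        W = (1 - eta * lam) * W - eta * update    # PyTorch-style decoupled decay
    return np.linalg.norm(W)

for eta, lam in [(1e-3, 0.1), (4e-3, 0.1), (1e-3, 0.4)]:
    n = steady_state_norm(eta, lam)
    rel = eta * np.sqrt(K * C) / n                # ||dW||_F / ||W||_F in equilibrium
    print(f"eta={eta} lam={lam}  ||W||/sqrt(KC*eta/lam)={n/np.sqrt(K*C*eta/lam):.2f}"
          f"  rel/sqrt(eta*lam)={rel/np.sqrt(eta*lam):.2f}")

The printed ratios stay roughly constant across the (η, λ) pairs, confirming the √(η/λ) and √(ηλ) scaling of Equation 3 up to an O(1) factor that depends on the details of the toy update distribution.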
2.4 Feature Learning and µP's Learning Rate Scaling
The primary design goal of µP is to ensure stable and non-trivial feature learning at any width. Yang et al. (2023) formulate this as ∥Y∥_RMS := ∥Y∥_F/√(KB) = Θ(1) and ∥∆Y∥_RMS = Θ(1), i.e., the representations and their changes neither explode nor vanish as the input dimension C grows.¹ In our relative formulation we want to keep ∥∆Y∥/∥Y∥ roughly constant across widths for a given learning rate. The hope is that a fixed relationship between learning rates and representation changes will result in a transfer of the optimal learning rate, but this is ultimately a heuristic without guarantees.

We can express the relative representation change in terms of the relative weight update as follows:²

∥∆Y∥/∥Y∥ = (α_∆W/α_W) · ∥∆W∥/∥W∥   (4)

where we call α_∆W/α_W the alignment ratio. To stabilize the relative representation changes ∥∆Y∥/∥Y∥ across different widths C, µP must perfectly counter changes in the alignment ratio by scaling the weight updates appropriately (via the learning rate). This requires knowing how both the update alignment α_∆W and the weight alignment α_W behave as a function of the network width C.

At the start of training the weights are randomly initialized. If we assume a weight vector w ∈ ℝ^C has independent and identically distributed (IID) zero-mean elements that are also independent of the inputs x ∈ ℝ^C, we can show that E[⟨w, x⟩²] = E[∥w∥²] E[∥x∥²]/C, resulting in α_W ≈ 1/√C. This is analogous to the computation used to derive variance preserving initialization schemes, which call for ∥W∥_F ∝ √K.

The update alignment is slightly more complicated since the update ∆W is a function of the inputs X rather than independent of them, causing correlations that tend to increase the update alignment. Specifically, the weight gradient is an outer product of the inputs X and the gradients with respect to Y. For SGD at batch size one, this makes the update a scaled version of the input, giving perfect alignment α_∆W = 1. This motivates the µP alignment assumptions:

α_∆W = Θ(1),   α_W = Θ(1/√C),   α_∆W/α_W = Θ(√C)   (5)

With this we can scale the learning rate to counteract changes in the alignment ratio in Equation 4. When the network width is multiplied by m (C → mC), we thus get the following µP-prescribed scaling:

∥∆W∥/∥W∥ ∝ (α_∆W/α_W)^{−1} ∝ 1/√m   (relative update size),      η = η_base/m   (6)

The η scaling follows from the relative one based on ∥∆W∥_F ≈ η√(KC) for Adam and ∥W∥_F ∝ √K. The optimal η_base is what we aim to transfer across widths, tuning it at a small scale rather than the target scale we are interested in, saving significant compute in the process.

We emphasize that the prescribed relative update scaling in Equation 6 is fully consistent with prior works on µP and not an artifact of our formulation, see for example Appendix B.3 of Yang et al. (2021) or Adafactor "Full Align" in Table 1 of Everett et al. (2024). Besides scaling the learning rate of hidden layers, µP involves special handling of the final output layer and attention normalization. Everett et al. (2024) show these are not necessary for learning rate transfer; combining standard parameterization with layer-wise scaling of the learning rate works equally well or better in practice. We adopt their simpler version in this work although we refer to the scaling as µP-prescribed or simply µP for brevity. See Appendix F.2 for details.

¹ Technically it is the total representation change across training that matters, but Yang et al. (2023) show this can follow from bounded local changes with assumptions about the spectral norms of weights and updates.
² We assume matrices are non-zero which is typically the case throughout training, barring special initialization.
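For concreteness, a minimal sketch of layer-wise learning rate scaling in this spirit is shown below. It is our own simplified heuristic (scaling every weight matrix by its fan-in), not the exact implementation from Appendix F.2, and the function and parameter names are ours:

import torch

def mup_lr_param_groups(model, base_lr, base_width):
    """Layer-wise LR scaling sketch: weight matrices get lr = base_lr / m with
    m = fan_in / base_width, everything else keeps the base LR. This is a rough
    heuristic; the paper's actual treatment of embeddings etc. may differ."""
    groups = []
    for p in model.parameters():
        if p.ndim >= 2:
            m = p.shape[1] / base_width     # width multiplier from the fan-in
            groups.append({"params": [p], "lr": base_lr / m})
        else:
            groups.append({"params": [p], "lr": base_lr})
    return groups

# usage: torch.optim.AdamW(mup_lr_param_groups(model, 3e-3, 128), weight_decay=0.1)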
3 Independent Weight Decay Overrides µP's Scaling
There are two common formulations of AdamW. In the original work by Loshchilov & Hutter (2019), the rate of weight decay was determined by λ alone, i.e., at each step the weights are scaled by 1 − λ. However, in the default implementation in PyTorch the weights are instead shrunk by 1 − ηλ. These are equivalent since we can scale λ to get one from the other, but they behave differently when η is modified by µP. In this work we base all discussion on the PyTorch variant but consider the following two approaches for scaling weight decay as µP scales the learning rate:

(η, λ) → (η/m, λ)      (standard weight decay scaling)   (7)
(η, λ) → (η/m, mλ)     (independent weight decay scaling)   (8)

Note that independent scaling keeps the product ηλ constant, meaning the relative updates in equilibrium are not scaled at all (see Equation 3). This fundamentally differs from µP's prescribed scaling in Equation 6, meaning that independent weight decay eventually counteracts µP by overriding its update scaling.

Only standard weight decay scaling results in equilibrium behavior that follows the relative update scaling prescribed by µP. This is surprising because multiple works have empirically found that using independent weight decay gives better learning rate transfer (Wortsman et al., 2024; Blake et al., 2025; Wang & Aitchison, 2025; Bergsma et al., 2025). We replicate this finding for LLaMA (Touvron et al., 2023) training on DCLM (Li et al., 2024) next token prediction in Figure 1. Together these results suggest that following the µP-theoretical scaling of the relative update size throughout training does not work well in practice.

To the best of our knowledge, this key mismatch between theory and what empirically works has not been pointed out before. In the following sections we use our unified relative framework to clarify the exact roles of both µP's scaling and weight decay in achieving learning rate transfer in practice.
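In code, the two conventions differ only in how λ is adjusted when µP divides the learning rate by the width multiplier m (a sketch with our own names):

def scale_hparams(eta, lam, m, independent=True):
    """The two weight decay conventions of Equations (7)-(8), assuming the
    PyTorch-style decay by 1 - eta*lambda."""
    if independent:
        return eta / m, lam * m    # keeps eta*lambda (equilibrium relative update) fixed
    return eta / m, lam            # standard: eta*lambda shrinks by 1/m

# e.g. scale_hparams(3e-3, 0.1, m=16) -> (1.875e-4, 1.6)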

[Figure 3 panels (rows: Std. WD Scaling, Ind. WD Scaling; columns: ∥∆Y∥/∥Y∥, α_∆W/α_W, ∥∆W∥/∥W∥) plotted against Step [10³] for 1× and 16× width.]
Figure 3: Standard µP scaling fails to maintain relative representation changes across widths. Comparing the relative representation change (RRC) for µP with standard weight decay scaling (first row) and independent weight decay scaling (second row) for LLaMA training with a 10% linear warmup followed by linear decay. Plots shown for a single representative layer at 1× and 16× width, see F.3.2 for details. Standard scaling reduces the relative weight update of the wider network for a given base learning rate, unlike independent scaling. Recall that the RRC is the product of the alignment ratio and the relative weight update (Equation 4). The alignment ratio quickly drops to approximately 1, so the relative weight update must then be kept constant as achieved by independent weight decay rather than µP's prescribed scaling.
4 Weight Decay Stabilizes Feature Learning Across Widths
Recall that in our formulation µP's core objective becomes mapping a given base learning rate to the same rate of relative representation changes across different network widths. In the left column of Figure 3 we experimentally measure how well this is achieved under both standard and independent weight decay scaling during LLaMA pre-training. We find that independent weight decay scaling successfully maintains the relative representation changes across widths, as indicated by the similarity of the traces in the lower left panel. In contrast, standard weight decay scaling fails to do this: the same base learning rate results in vastly different relative feature updates for the narrow and wide networks later in training. Since learning rates transfer well with independent weight decay (Figure 1), this suggests that our goal of maintaining relative representation changes across widths can be an effective objective for learning rate transfer, but µP relies on independent weight decay to achieve this later in training.

The reason independent weight decay helps maintain relative representation changes can be understood from the middle and right columns of Figure 3. Recall from Equation 4 that the relative representation change is the product of the alignment ratio shown in the middle column and the relative weight change in the right column. We see that the alignment ratio quickly falls from the width-dependent initial value to approximately one, violating µP's assumptions. Once this happens, relative weight changes should be the same across both network widths in order to produce the same relative representation changes, which is exactly what independent weight decay achieves in practice, matching theoretical models. Standard weight decay scaling results in smaller relative updates for the larger network, as prescribed by µP. This fails to maintain the relative representation change with the lower alignment ratios that dominate training, preventing learning rate transfer.

In Appendix B we explore the effects of not using weight decay at all. Somewhat counterintuitively, this also results in similarly sized relative weight updates across widths like with independent weight decay. The reason is that scale-invariant weights grow perpetually when updated without weight decay. Once the accumulated updates outweigh the initialization, the weight norm becomes proportional to the (peak) learning rate used, making both the weights and their updates proportional to it. This means their ratio, the relative update size, becomes roughly independent of the learning rate without weight decay. We believe this is a key reason learning rates somewhat transfer without weight decay despite µP's broken assumptions. The ever decreasing relative changes, and the lack of control over them, likely also contributes to the loss of performance without weight decay.
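The relative weight update, relative representation change and their ratio can be measured with a few lines of instrumentation. The sketch below (our own measurement code, not the paper's) does this for a single torch.nn.Linear layer and one optimizer step:

import torch

def relative_changes(linear, X, optimizer_step):
    """Measure ||dW||/||W|| and ||dY||/||Y|| (Equation 4) for one nn.Linear layer
    around a single optimizer step. `X` is a cached input batch to the layer and
    `optimizer_step` is a callable running backward + step for the full model."""
    with torch.no_grad():
        W_before = linear.weight.detach().clone()
        Y_before = X @ W_before.T
    optimizer_step()                               # computes gradients and updates weights
    with torch.no_grad():
        dW = linear.weight.detach() - W_before
        dY = X @ dW.T                              # local output change for fixed X
        rel_w = (dW.norm() / W_before.norm()).item()
        rel_y = (dY.norm() / Y_before.norm()).item()
    return rel_w, rel_y, rel_y / rel_w             # last value equals the alignment ratio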

[Figure 4 panels: Weight Align. α_W, Update Align. α_∆W, and Align. Ratio α_∆W/α_W vs. Step [10³]; legend: fan_in=128, fan_in=2048, µP assumption.]
Figure 4: The alignment assumptions of µP do not hold in practice. Measurements of the weight alignment, the update alignment and the alignment ratio for LLaMA training at different widths. The update alignment varies significantly over time, becoming width-dependent. As a result the alignment ratio loses its width dependence. This violates µP's core scaling assumptions which only hold very early in training. Plots shown for one representative MLP layer, see F.3.3 for details.
5 Why µP's Alignment Assumptions Break Down
Figure 4 shows how µP's alignment assumptions only hold very briefly at the start of training. Both the update and weight alignment can deviate significantly; for the weight alignment this is more prominent in our ResNet experiments (see Figure 13, Appendix E). In this section we investigate why and when µP's assumptions about each alignment component break down in practice.
5.1 Update Alignment
In Appendix C we show how the update alignment becomes width-dependent when the batch size B is large compared to the input dimension C, violating µP's assumptions. This is frequently the case in practice, unlike in the infinite width setting µP was originally developed for.
To understand the conceptual reason for the width-dependence, let us look at the simplified case of SGD for a single neuron, i.e., K = 1 in Equation 1. Assume the input samples x_b ∈ ℝ^C and the output gradients y'_b ∈ ℝ are all zero-mean IID random and mutually independent. The gradient for each sample is then given by g_b = x_b y'_b and is also IID and zero-mean. The weight update is given by ∆w = −η (1/B) Σ_b g_b, allowing us to write the i-th element of the output change ∆y as:

∆y_i = ⟨x_i, ∆w⟩ = −η (1/B) Σ_b y'_b ⟨x_b, x_i⟩ = −η (1/B) y'_i ⟨x_i, x_i⟩ − η (1/B) Σ_{b≠i} y'_b ⟨x_b, x_i⟩   (9)

The output change for each input involves one aligned self-contribution and B − 1 interference terms from other samples. Although random alignment weakens each interference term by a factor of 1/√C relative to the fully aligned self-contribution, the random sum across all B − 1 terms amplifies their impact by roughly √B. As a result, interference dominates when B ≫ C, giving an overall dependence on C. Conversely, for B ≪ C, the interference becomes negligible making the update alignment invariant to C. In models like Transformers and CNNs, the spatial or sequence dimensions act similarly to the batch size. For our LLaMA experiments, the effective batch size is the number of tokens per batch (1,048,576), which far exceeds the network width C.

In practice the gradients of different samples are not uncorrelated. High initial correlation results in a large update alignment that resembles the small batch-to-width setting that µP targets. The correlation decreases over time, resulting in width-dependence that violates µP's assumptions, see Appendix C.
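This batch-size effect is easy to reproduce numerically. The following Monte Carlo sketch (ours, using the fully uncorrelated setting of this subsection) estimates the update alignment of the single-neuron SGD update for different B and C:

import numpy as np

rng = np.random.default_rng(0)

def update_alignment(C, B, trials=20):
    vals = []
    for _ in range(trials):
        X = rng.standard_normal((C, B))
        dy = rng.standard_normal(B)                  # output gradients y'_b
        dw = -(X * dy).mean(axis=1)                  # SGD update (learning rate drops out)
        vals.append(np.linalg.norm(dw @ X) /
                    (np.linalg.norm(dw) * np.linalg.norm(X)))
    return np.mean(vals)

for C in (128, 2048):
    for B in (1, 16, 256, 4096):
        print(f"C={C:5d} B={B:5d} alpha_dW={update_alignment(C, B):.3f}")

For B = 1 the alignment is exactly one, it stays roughly constant while B ≪ C, and it falls towards roughly 1/√C once B ≫ C, matching the interference argument above.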
5.2 Weight Alignment
The initial weight alignment matches µP's assumptions, but analyzing it later in training is hard. Intuitively, we note that we can decompose the weights W_t at time t into the initial weights W_0 and the weight updates ∆W_τ from prior time steps:

W_t = (1 − ηλ)^t · W_0 + Σ_{τ=0}^{t−1} (1 − ηλ)^{t−τ} · ∆W_τ   (update composition of weights)   (10)

If the decay rate ηλ is large or ∥W_0∥ is small compared to the updates, the W_0 term quickly becomes insignificant. Afterwards, the weight alignment at time t is determined primarily by the alignment of X_t and recent update terms ∆W_τ. If these recent updates are themselves well aligned with the inputs, the overall weight alignment may approximate the update alignment. This would explain the alignment ratios α_∆W/α_W ≈ 1 that we often observe in practice (see Figure 16).
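The decomposition in Equation 10 can be checked numerically. The toy below (our own, with arbitrary constants) iterates w ← (1 − ηλ)(w + ∆w_t), one convention for interleaving decay and update, and verifies that the result equals the decay-weighted sum of updates, so that W_0 and old updates are forgotten geometrically:

import numpy as np

rng = np.random.default_rng(0)
eta, lam, T = 1e-3, 0.1, 5000
w0 = rng.standard_normal(64)
updates = [1e-3 * rng.standard_normal(64) for _ in range(T)]

w = w0.copy()
for dw in updates:
    w = (1 - eta * lam) * (w + dw)

a = 1 - eta * lam
reconstructed = a**T * w0 + sum(a**(T - t) * updates[t] for t in range(T))
print(np.allclose(w, reconstructed))          # True: matches Equation (10)
print(a**T)                                   # weight left on W_0 after T steps (~0.61 here)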

[Figure 5 panels: ∥∆W∥/∥W∥ (left), ∥∆Y∥/∥Y∥ (middle), and Scaling Effect s_t (right) vs. Step [10³]; runs: base LR and WD vs. LR/16 with 16× WD.]
Figure 5: µP with independent WD scaling has a warmup-like effect on relative updates. Measurements of the relative weight update (left) and relative representation change (middle) for different learning rate (LR) and weight decay (WD) combinations in a LLaMA layer with 2048 input features. Independent weight decay scaling achieves a given LR-WD product using a lower LR and higher WD (orange) compared to not performing µP scaling (blue). This results in smaller relative updates early in training, similar to an additional learning rate warmup factor. The right panel shows this effect through the ratio of the relative weight updates for the two runs, which goes from 1/m = 1/16 to asymptotically approaching 1 in a roughly exponential manner. See F.3.4 for experimental details.
6 The Implicit Warmup Effect of µP with Independent Weight Decay
Based on the order-one alignment ratios that we find to dominate practical training, the relative weight updates should be kept roughly constant across widths. Kosson et al. (2024b) described how the magnitude of the relative updates in the steady-state only depends on the product of the learning rate η and weight decay λ rather than their individual values. However, (η, λ) pairs with the same product generally affect the initial phases of training differently. Specifically, high weight decay pairs like those obtained from independent µP scaling via (η, λ) → (η/m, mλ) have a warmup-like effect, where the relative updates at the start of training are comparatively smaller. Therefore, µP acts as a special form of additional learning rate warmup when used with independent WD scaling. We measure this effect empirically for LLaMA training in Figure 5, confirming the general effect is still present despite the lack of the perfect scale-invariance typically assumed by works on weight decay. Note that since µP does not scale the updates appropriately for the later phases of training, this warmup effect is the only practical benefit of µP's learning rate scaling for longer runs.
The shape of the induced warmup effect is generally quite complex. The initial downscaling of the relative update is always 1/m where m is the width ratio used for µP's scaling. The scaling approaches one over time as the weight norms close in on their equilibrium values. The effective length of the warmup depends on the initialization value, learning rate and weight decay. In Appendix D we analyze a simplified setting with a constant learning rate where the scaling ratio becomes:

s_t = √( (1 + (ρ_0²/ρ_∞² − 1) a^{2t}) / (1 + (m² ρ_0²/ρ_∞² − 1) a^{2t}) ) = 1/√(1 + (m² − 1) a^{2t})  if ρ_0 = ρ_∞   (simple warmup effect)   (11)

where a = 1 − ηλ is the decay multiplier at each step, ρ_t is the weight RMS, ρ_0 is the initialization RMS value, and ρ_∞ = √(η/λ) is the predicted weight RMS in equilibrium. In practice this model does not accurately predict the shape in Figure 5, overestimating the time required to reach equilibrium. The reason is likely the interaction of momentum with temporal gradient correlations that do not appear in the simplified fully random and uncorrelated setting the model is derived for.
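For reference, the simplified warmup factor of Equation 11 in code (our own transcription; the numerical values in the example are merely illustrative):

import numpy as np

def warmup_scaling(t, m, eta, lam, rho0, rho_inf=None):
    """Ratio of relative update sizes between an independently scaled muP run
    (eta/m, m*lam) and an unscaled run, under the simplified constant-LR model."""
    a = 1 - eta * lam                      # per-step decay multiplier
    rho_inf = np.sqrt(eta / lam) if rho_inf is None else rho_inf
    r = (rho0 / rho_inf) ** 2
    return np.sqrt((1 + (r - 1) * a ** (2 * t)) /
                   (1 + (m ** 2 * r - 1) * a ** (2 * t)))

print(warmup_scaling(0, 16, 4e-3, 0.1, rho0=0.02))       # starts at 1/m = 0.0625
print(warmup_scaling(20_000, 16, 4e-3, 0.1, rho0=0.02))  # approaches 1 over training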
Finally we note that µP's learning rate scaling in AdamW is used to both counteract assumed changes in the alignment ratio and the size of the relative updates. Recall that the learning rate is scaled as η → η/m whereas the relative updates should only scale with 1/√m. Even without any assumptions about the alignment ratio, it would make sense to scale the learning rate and weight decay as (η, λ) → (η/√m, √m λ). This scales ρ_∞ ∝ 1/√m, matching how variance preserving initialization scales ρ_0 ∝ 1/√m. Such scaling preserves both a and ρ_0/ρ_∞ in Equation 11, and is thus more likely to preserve the relative weight update behavior across training. In comparison, independent µP scaling has an additional warmup effect of 1/√m in the relative updates (which is desirable), but also decreases ρ_∞ ∝ 1/m. These smaller weight norms could potentially cause issues (signal propagation etc.) if there are no learnable gains or normalization layers to counteract them.

[Figure 6 panels: Final Loss vs. Learning Rate at 1×, 4× and 16× width for (left) µP with independent WD scaling, (middle) 10% and 50% linear warmup, and (right) exp50 and decay-away warmup.]
Figure 6: Stronger warmup can replicate the benefit of µP with independent WD scaling. Learning rate transfer plots for LLaMA training using different warmup approaches. (Left) µP with independent scaling transfers well. (Middle) No µP scaling with either the base 10% linear warmup or a longer 50% warmup. These give a poor learning rate transfer in comparison. (Right) No µP scaling with additional warmup incorporating the width scaling factor m can give good transfer, but can require tuning and is less stable at higher learning rates. Additional 50% exponential warmup (Equation 12) compared to a decay-away warmup (Equation 13). See F.3.5 for experimental details.
7 Can Explicit Warmup Replace µP?

[Figure 7 panel: Extra Warmup Factor vs. Step [10³] for Exp-Increasing 50% and Decay-Away.]
Figure 7: The extra warmup factors applied on top of a linear decay schedule with 10% linear warmup (width ratio m, η = 0.004, λ = 0.1), matching the minimum in the right panel of Figure 6.
Learning rate warmup is already commonly used in practice, making it unclear whether the additional warmup effect from µP is really needed. The first two panels of Figure 6 explore this for LLaMA training, comparing the quality of learning rate transfer with and without µP's scaling. Somewhat surprisingly, we find that simply using a longer linear warmup does not necessarily give the same benefits for learning rate transfer or stability as µP with independent weight decay. In Appendix E we conduct a similar experiment, finding that the additional warmup effect is not needed for ResNet training. This could be a part of the reason µP-like learning rate scaling did not appear until transformers became popular; well-normalized convolutional networks may not need it at reasonable scales.

We conjecture that linear warmup schedules may increase too fast early on, not allowing enough time for the alignment ratio to fall. Taking inspiration from the more exponential warmup shape from Equation 11, we experiment with two types of additional multiplicative learning rate warmup factors:

ŝ_t = m^{min(0, t/T_W − 1)}   (exp-increasing warmup)   (12)

ŝ_t = ( 1 + (m² − 1) Π_{τ=0}^{t−1} (1 − η_τ λ)² )^{−1/2}   (decay-away warmup)   (13)

Both scale the initial learning rate by 1/m, similar to µP, and eventually become one. The exponentially increasing one explicitly defines the length of the warmup through a hyperparameter T_W which can differ from the length of the linear warmup. The decay-away warmup comes directly from the simplified case of Equation 11. In this case the length of the warmup effect varies depending on ηλ.
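Both factors are straightforward to implement as multiplicative modifiers on top of an existing schedule. A sketch (ours; the schedule and hyperparameter values are illustrative, not the exact experimental configuration):

import numpy as np

def exp_increasing_factor(t, m, T_w):
    return m ** min(0.0, t / T_w - 1.0)            # 1/m at t=0, reaches 1 at t=T_w

def decay_away_factor(t, m, eta_schedule, lam):
    prod = np.prod([(1 - eta_schedule[tau] * lam) ** 2 for tau in range(t)])
    return (1 + (m ** 2 - 1) * prod) ** -0.5       # 1/m at t=0, approaches 1 as the
                                                   # accumulated decay removes it

T = 20_000
etas = np.concatenate([np.linspace(0, 4e-3, T // 10),        # 10% linear warmup
                       np.linspace(4e-3, 0, T - T // 10)])   # linear decay
print(exp_increasing_factor(0, 16, T // 2), decay_away_factor(0, 16, etas, 0.1))
print(exp_increasing_factor(T, 16, T // 2), decay_away_factor(T, 16, etas, 0.1))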
The final panel of Figure 6 shows the effect of applying these additional warmup factors on top of the existing 10% linear warmup. This works well, offering better stability and learning rate transfer than with the linear schedules, suggesting that stronger explicit warmup can largely replace µP's learning rate scaling in practice. Although µP with independent weight decay scaling seems to stabilize higher learning rates slightly better, these results further point to the warmup effect being µP's primary benefit. We note that in our experiments we applied the warmup factors to all parameters. This differs from the warmup effect of the independent scaling which only affects parameters whose learning rate is scaled by µP, i.e., the weight matrices of the hidden and final output layers.

8 Related Work
We discuss related work in more detail in Appendix A, only briefly mentioning the strongest influences
and most closely related works here. Our work adapts the layer-wise LR scaling approach of Everett
et al. (2024) and uses a similar alignment definition, although they only consider weight alignment.
We also take inspiration from the simpler more accessible approach to µP introduced by Yang et al. (2023) to isolate the key alignment assumptions of µP and work with local changes. Finally, we build
closely upon the framework for weight decay introduced by Kosson et al. (2024b) and the additional
observations about relative updates and learning rate warmup from Kosson et al. (2024a).
The question of why µP requires independent weight decay was explored before by Wang & Aitchison (2025) who also empirically demonstrated its necessity for learning rate transfer. Their analysis revolves around approximating AdamW as an exponential moving average (EMA) over past updates, which has a form similar to the geometric sum in Equation 10. From this perspective the decay rate defines a time scale for the EMA (analogous to a half-life) which is an alternative characterization of the relative update size in equilibrium. They argue the time scale should intuitively remain constant across network sizes in order for the final network parameters to form a similar average over different data points, but do not justify this formally. We note that although the timescale interpretation can be a useful mental model and approximation, it does not account for the dependency of later updates upon earlier ones. This dependency is exactly what prevents optimization from being performed in a single step by simply taking an appropriate weighted average over all the data. We take a different approach by relating weight decay directly to the goals of µP through the rate of feature changes.
9 Discussion

Originally µP was proposed as a theoretical framework that aims to control feature learning in the infinite width limit at initialization. This work expands upon this by studying feature learning throughout training in practical finite-width networks through a more empirical approach. Our relative formulation provides a unified view of µP and weight decay, explaining how both play a role in stabilizing feature learning at different times during training. Crucially, this reveals that µP's downscaling of the relative weight updates for wider networks is only desirable early in training. As training goes on, µP's scaling must be counteracted by independent weight decay or similar mechanisms in order to stabilize feature learning and thereby facilitate learning rate transfer.

In this work we have focused exclusively on AdamW for two key reasons. First, AdamW is still the most common optimizer for LLM training, a key application area for µP and learning rate transfer. Second, the original µP formulation for SGD does not scale the learning rate for hidden layers, using special multipliers instead. This can make standard weight decay behave like independent weight decay under µP's scaling, making it harder to show the discrepancy. However, we believe the alignment dynamics we have described and µP's broken alignment assumptions occur in SGD and similar optimizers as well. Matrix-level optimizers on the other hand could significantly change these dynamics. For example, Muon (Jordan et al., 2024) can result in a constant low update alignment that does not vary over time, allowing it to bypass much of the complexities we have discussed in this work. This could potentially explain its reduced need for learning rate warmup and makes it a promising alternative to µP for effectively controlling feature learning in practice.
In Appendix E we show that our findings mostly generalize to ResNet training on ImageNet with
AdamW. The main difference there seems to be a potentially confounding regularization benefit
from higher learning rates. As the network widths increase, smaller relative representation changes
minimize the training loss while larger changes minimize the validation loss. Despite this, independent
weight decay still gives much better transfer than standard weight decay.
The complex effects of weight decay are unfortunately often overlooked in optimization research for
deep learning. The use of weight decay allows specifying the size of relative updates twice, once
at the start of training through the learning rate and again in equilibrium through the product of the
learning rate and weight decay. Interestingly, µP with independent weight decay scaling actually makes use of this effect to create the beneficial implicit learning rate warmup. Although we showed a similar effect can be achieved via modified learning rate schedules, µP with independent weight
decay scaling is a perfectly valid and practical way of achieving this. We hope our work can inspire
confidence in this approach in practice, provide useful conceptual insights into learning rate transfer
and weight decay, and finally highlight a promising way forward via matrix-level optimizers.

References
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman,
Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.
arXiv preprint arXiv:2303.08774, 2023.
Sanjeev Arora, Zhiyuan Li, and Kaifeng Lyu. Theoretical analysis of auto rate-tuning by batch
normalization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rkxQ-nA9FX. arXiv:1812.03981.
Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, and Joel Hestness.
Power lines: Scaling laws for weight decay and batch size in LLM pre-training. arXiv preprint arXiv:2505.13738, 2025.
Jeremy Bernstein, Arash Vahdat, Yisong Yue, and Ming-Yu Liu. On the distance between two neural
networks and the stability of learning. Advances in Neural Information Processing Systems, 33:21370–21381, 2020. arXiv:2002.03432.
Charlie Blake, Constantin Eichenberg, Josef Dean, Lukas Balles, Luke Yuri Prince, Björn Deiseroth, Andres Felipe Cruz-Salinas, Carlo Luschi, Samuel Weinbach, and Douglas Orr. u-µP: The unit-scaled maximal update parametrization. In International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=P7KRIiLM8T. arXiv:2407.17465.
Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong,
Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and Quoc V Le. Symbolic discovery of optimization
algorithms. In Advances in Neural Information Processing Systems, 2023. URL
https://openreview.net/forum?id=ne6zeqLFCZ. arXiv:2302.06675.
Vitaliy Chiley, Ilya Sharapov, Atli Kosson, Urs Koster, Ryan Reece, Sofia Samaniego de la Fuente,
Vishal Subbiah, and Michael James. Online normalization for training neural networks. Advances in Neural Information Processing Systems, 32, 2019. arXiv:1905.05894.
Team Cohere, Arash Ahmadian, Marwan Ahmed, Jay Alammar, Milad Alizadeh, Yazeed Alnumay,
Sophia Althammer, Arkady Arkhangorodsky, Viraat Aryabumi, Dennis Aumiller, et al. Command
A: An enterprise-ready large language model. arXiv preprint, 2025.
Francesco D’Angelo, Maksym Andriushchenko, Aditya Varre, and Nicolas Flammarion. Why do
we need weight decay in modern deep learning? In Advances in Neural Information Processing Systems, 2024. URL https://openreview.net/forum?
id=YrAxxscKM2. arXiv:2310.04415.
Nolan Dey, Gurpreet Gosal, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, Joel
Hestness, et al. Cerebras-gpt: Open compute-optimal language models trained on the cerebras
wafer-scale cluster. arXiv preprint, 2023.
Nolan Dey, Quentin Anthony, and Joel Hestness. The practitioner’s guide to the
maximal update parameterization, 2024. URLhttps://www.cerebras.ai/blog/
the-practitioners-guide-to-the-maximal-update-parameterization.
Nolan Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz
Pehlevan, Boris Hanin, and Joel Hestness. Don't be lazy: CompleteP enables compute-efficient deep transformers. arXiv preprint, 2025.
Katie E Everett, Lechao Xiao, Mitchell Wortsman, Alexander A Alemi, Roman Novak, Peter J Liu,
Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, and Jeffrey Pennington.
Scaling exponents across parameterizations and optimizers. In International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=0ksNeD1SJT.
arXiv:2407.05872.
Falcon-LLM Team. Falcon-h1: A family of hybrid-head language models redefining efficiency and
performance, May 2025.

Moritz Haas, Sebastian Bordt, Ulrike von Luxburg, and Leena Chennuru Vankadara. On the surprising
effectiveness of large learning rates under standard width scaling. arXiv preprint, 2025.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. arXiv:1512.03385.
Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang,
Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models
with scalable training strategies. arXiv preprint, 2024.
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.
arXiv:1806.07572.
Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy
Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URLhttps:
//kellerjordan.github.io/posts/muon/.
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015. arXiv:1412.6980.
Atli Kosson, Bettina Messmer, and Martin Jaggi. Analyzing & reducing the need for learning
rate warmup in GPT training. In Advances in Neural Information Processing Systems, 2024a. URL https://openreview.net/forum?id=ZgDNrpS46k.
arXiv:2410.23922.
Atli Kosson, Bettina Messmer, and Martin Jaggi. Rotational equilibrium: How weight decay balances
learning across neural networks. In International Conference on Machine Learning, 2024b. arXiv:2305.17212.
Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Yitzhak Gadre, Hritik Bansal,
Etash Kumar Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff,
Reinhard Heckel, Jean Mercat, Mayee F Chen, Suchin Gururangan, Mitchell Wortsman, Alon
Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Kamal Mohamed Abbas, Cheng-Yu Hsieh,
Dhruba Ghosh, Joshua P Gardner, Maciej Kilian, Hanlin Zhang, Rulin Shao, Sarah M Pratt,
Sunny Sanyal, Gabriel Ilharco, Giannis Daras, Kalyani Marathe, Aaron Gokaslan, Jieyu Zhang,
Khyathi Chandu, Thao Nguyen, Igor Vasiljevic, Sham M. Kakade, Shuran Song, Sujay Sanghavi,
Fartash Faghri, Sewoong Oh, Luke Zettlemoyer, Kyle Lo, Alaaeldin El-Nouby, Hadi Pouransari,
Alexander T Toshev, Stephanie Wang, Dirk Groeneveld, Luca Soldaini, Pang Wei Koh, Jenia Jitsev,
Thomas Kollar, Alex Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, and Vaishaal Shankar.
Datacomp-LM: In search of the next generation of training sets for language models. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. arXiv:2406.11794.
Zhiyuan Li and Sanjeev Arora. An exponential learning rate schedule for deep learning. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rJg8TeSFDH. arXiv:1910.07454.
Zhiyuan Li, Kaifeng Lyu, and Sanjeev Arora. Reconciling modern deep learning with traditional
optimization analyses: The intrinsic learning rate. Advances in Neural Information Processing Systems, 33:14544–14555, 2020. arXiv:2010.02916.
Zhiyuan Li, Srinadh Bhojanapalli, Manzil Zaheer, Sashank Reddi, and Sanjiv Kumar. Robust training
of neural networks using scale invariant architectures. In International Conference on Machine Learning, pp. 12656–12684. PMLR, 2022. arXiv:2202.00980.
Zhengzhong Liu, Aurick Qiao, Willie Neiswanger, Hongyi Wang, Bowen Tan, Tianhua Tao, Junbo
Li, Yuqi Wang, Suqi Sun, Omkar Pangarkar, et al. Llm360: Towards fully transparent open-source
llms. arXiv preprint, 2023.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7. arXiv:1711.05101.
Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of
large-batch training. arXiv preprint, 2018.
Christian H.X. Ali Mehmeti-Göpel and Michael Wand. On the weight dynamics of deep normalized networks, 2024. URL https://openreview.net/forum?id=AzUCfhJ9Bs. arXiv:2306.00700.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor
Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style,
high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019. arXiv:1912.01703.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang,
Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet
Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115
(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y. arXiv:1409.0575.
Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost.
In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4596–4604. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/shazeer18a.html. arXiv:1804.04235.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023.
Twan Van Laarhoven. L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350, 2017.
Mathurin Videau, Badr Youbi Idrissi, Daniel Haziza, Luca Wehrstedt, Jade Copet, Olivier Teytaud,
and David Lopez-Paz. Meta Lingua: A minimal PyTorch LLM training library, 2024. URL
https://github.com/facebookresearch/lingua.
Ruosi Wan, Zhanxing Zhu, Xiangyu Zhang, and Jian Sun. Spherical motion dynam-
ics: Learning dynamics of normalized neural network using sgd and weight decay. In
M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 6380–6391. Curran Asso-
ciates, Inc., 2021. URLhttps://proceedings.neurips.cc/paper/2021/file/
326a8c055c0d04f5b06544665d8bb3ea-Paper.pdf. arXiv:2006.08419.
Xi Wang and Laurence Aitchison. How to set AdamW's weight decay as you scale model and dataset size, 2025. URL https:
//openreview.net/forum?id=IszVnczhfz. arXiv:2405.13698.
Ross Wightman. Pytorch image models. https://github.com/rwightman/
pytorch-image-models, 2019.
Mitchell Wortsman, Peter J Liu, Lechao Xiao, Katie E Everett, Alexander A Alemi, Ben Adlam,
John D Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha
Sohl-Dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, and Simon Kornblith. Small-scale proxies
for large-scale transformer training instabilities. In International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=d8w0pmvXbZ.
arXiv:2309.14322.
Ge Yang, Edward Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi,
Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tuning large neu-
ral networks via zero-shot hyperparameter transfer. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 17084–17097. Curran Associates, Inc.,
2021. URLhttps://proceedings.neurips.cc/paper_files/paper/2021/
file/8df7c2e3c3c3be098ef7b382bd2c37ba-Paper.pdf. arXiv:2203.03466.
Greg Yang and Edward J. Hu. Tensor programs IV: Feature learning in infinite-width neural networks. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 11727–11737. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/yang21c.html. arXiv:2011.14522.
Greg Yang, James B. Simon, and Jeremy Bernstein. A spectral condition for feature learning.
ArXiv, abs/2310.17813, 2023. URLhttps://api.semanticscholar.org/CorpusID:
264555180. arXiv:2310.17813.
Greg Yang, Dingli Yu, Chen Zhu, and Soufiane Hayou. Tensor programs VI: Feature learning in infi-
nite depth neural networks. In International Conference on Learning Representations, 2024. arXiv:2310.02244.
Guodong Zhang, Chaoqi Wang, Bowen Xu, and Roger Grosse. Three mechanisms of weight
decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1lz-3Rct7. arXiv:1810.12281.

A Extended Related Work
Adam (Kingma & Ba, 2015) is a widely used optimizer in deep learning. It was originally used with
ℓ2-regularization but Loshchilov & Hutter (2019) empirically found that decoupled weight decay
works better, resulting in AdamW. The version of AdamW we use is shown in Algorithm 1 and
corresponds to the most popular variant, the default in PyTorch (Paszke et al., 2019).
Weight decay is unfortunately still widely believed to act primarily as an explicit regularizer despite a
long line of work that argues this is not the case in deep learning (Van Laarhoven, 2017; Zhang et al.,
2019; Chiley et al., 2019; Li & Arora, 2020; Li et al., 2020; Wan et al., 2021; Li et al., 2022; Kosson
et al., 2024b; D’Angelo et al., 2024). Instead weight decay modulates some notion of an “effective”
learning rate, which can be seen as measures of the relative update size. Our work builds closely
upon Kosson et al. (2024b), who described the equilibrium dynamics of weight decay in detail for
AdamW. This included originally noting the scheduling effects, how weight decay can be replaced
by controlling the relative weight update, and how the balanced equilibrium behavior of AdamW
explains its effectiveness over Adam with ℓ2-regularization.
Yang & Hu (2021) proposed the Maximal Update Parameterization µP as a way to ensure stable feature learning in infinitely wide networks. Yang et al. (2021) demonstrated how µP allows transferring the learning rates across different network widths and how this is beneficial for hyperparameter tuning. Later works have expanded µP to network depth rather than only width (Yang et al., 2024; Dey et al., 2025). The original formulation of µP is quite theoretical but Yang et al. (2023) show a simplified and more accessible derivation based on spectral norms which we base our discussion on. Dey et al. (2024) provide a useful overview of µP.
Everett et al. (2024) show hyperparameter transfer is possible for other parameterizations than µP, including standard initialization schemes without the special multipliers used in µP, using analogous
learning rate scaling. We adopt this approach in our experiments as it is simpler and results in better
performance according to Everett et al. (2024). They also describe how the learning rate scaling
depends on alignment metrics similar to those we define in Equation 4, but use the total weight change
from the start of training rather than a single optimization step. Our findings fit particularly well with
one case they consider, the no-alignment setting. Their analysis of AdaFactor (Shazeer & Stern, 2018)
with weight-proportional updates in this case predicts that learning rates should transfer without
any scaling. This is similar to what we observe in practice once weight decay makes the updates
proportional to the weights. We expand upon their work with measurements of the update-alignment,
describing the role of weight decay, and how the batch size affects alignment.
Haas et al. (2025) is very recent work that investigates why higher learning rates than µP prescribes can be used. They investigate alignment changes as a potential cause but dismiss them in favor of the role of the loss function, showing that typical cross entropy losses help stabilize the final layer compared to ℓ2-based losses. Our alignment measure is based on the current update rather than the
total update across training, which might explain the different conclusions.

[Figure 8 panels: Weight Align. α_W, Update Align. α_∆W, and Align. Ratio α_∆W/α_W vs. Step [10³]; legend: fan_in=128, fan_in=2048, µP assumption.]
Figure 8: The alignment assumptions of µP break down even without weight decay. Measurements of the weight alignment, the update alignment and the alignment ratio for LLaMA training at different widths. The update alignment varies significantly over time, becoming width-dependent. This violates µP's assumption which only holds very early in training. A base learning rate on the order of 10⁻² is used for the narrower network and scaled down 16× for the wider network via µP's learning rate scaling. Plots shown for one representative MLP layer.
B Training Without Weight Decay

In Figure 1 we observed that no weight decay seems to provide better learning rate transfer than the use of standard weight decay, although it is still not as good as with independent weight decay. The final loss is also slightly worse than the best configuration with either weight decay approach. In this section we explore whether the assumptions of µP hold without weight decay.
In Figure 8 we see that this is not the case. The alignment ratio still becomes roughly one, independent of the width, meaning that the alignment assumptions of µP break down exactly like in the weight decay case. Removing weight decay therefore does not restore µP's violated assumptions. This is consistent with our analysis of the update alignment (Section 5) where the width dependence at large batch sizes does not depend on weight decay directly. With an alignment ratio that does not vary across width, the relative updates should be preserved in order to keep the relative representation changes comparable.
In Figure 9 we measure how the relative updates behave across different learning rates. Notably,
without weight decay the peak learning rate has little effect on the size of relative updates
later in training. This could explain why the no weight decay setting seems to be less sensitive
to the specific value of the learning rate in Figure 1. Without weight decay the size of the relative
updates falls over time compared to when weight decay is used, similar to an extra decay factor
in the learning rate schedule. The way that this decay occurs seems to cause the relative weight
updates to lose their dependence on the peak learning rate. In Appendix D.2 we show this can
happen in a simplified setting for constant learning rates. The conceptual reason is that if the total
weight growth dominates the initial value, both the update and the total weight norm have the same
proportional dependency on the learning rate later in training. These factors cancel out resulting
in roughly identical relative weight updates across peak learning rates. This assumes the network
behaves similarly to a scale-invariant network, i.e., that there is no strong gradient signal that affects the weight norms. A similar effect of the peak learning rate losing its relevance over time was observed for SGD and normalization layers before (Arora et al., 2019; Mehmeti-Göpel & Wand, 2024).
The optimal learning rate likely becomes a balance between two factors, meaningful learning and stability. Higher learning rates decrease the contribution of the initial weights to the final model, which may allow the model to better learn the distribution. However, high learning rates can also cause large relative updates early in training that can have detrimental effects as explored by Kosson et al. (2024a), perhaps by saturating non-linearities or causing instability. If these detrimental effects occur sufficiently early, the alignment assumptions of µP may still roughly hold. In this case the optimal learning rate may be decided primarily by early behavior. This may aid optimal learning rate transfer under µP's scaling, despite its core assumptions only holding briefly at the start of training. We suspect the length of the learning rate warmup may also impact whether the key limiting stability period occurs when µP's alignment assumptions still hold or not.
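The norm-growth argument above can be illustrated with a small random-walk toy model (our own, not the paper's experiment): the weight norm grows with the accumulated updates, so late in training the relative update size depends only weakly on the peak learning rate.

import numpy as np

rng = np.random.default_rng(0)
D, T = 4096, 20_000

for eta in (8e-3, 3.2e-2, 1.28e-1):                         # a 16x range of peak LRs
    w = rng.standard_normal(D) / np.sqrt(D)                 # unit-norm initialization
    for _ in range(T):
        w = w + eta * rng.standard_normal(D) / np.sqrt(D)   # unit-norm random updates
    rel_update = eta / np.linalg.norm(w)                    # ||dW|| / ||W|| late in training
    print(f"eta={eta:.4g}  ||w||={np.linalg.norm(w):7.2f}  rel update={rel_update:.2e}")

Despite the 16× range of learning rates, the final relative updates differ by a far smaller factor, mirroring the behavior in Figure 9.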

[Figure 9 panels: ∥∆W∥/∥W∥ and ∥∆Y∥/∥Y∥ vs. Step [10³]; Independent WD row with η = 2×10⁻³, 8×10⁻³, 3.2×10⁻²; No WD row with η = 8×10⁻³, 3.2×10⁻², 1.28×10⁻¹.]
Figure 9: Without weight decay relative updates become similar across learning rates over time. Measurements of the relative weight update size and relative representation change for LLaMA training. Plots shown for one representative MLP layer.
Overall the reason for learning rate transfer without weight decay, to the extent it occurs, would then be similar as in the independent weight decay case. Early in training µP correctly scales down the relative updates which helps transfer relative representation changes under the high, width-dependent, initial alignment ratios. Later in training the relative updates become approximately the same across widths which helps stabilize the rate of feature learning when the alignment ratios are identical across widths. Only standard weight decay follows µP's prescribed relative weight update scaling from Equation 6 throughout training, which is exactly why learning rates fail to transfer with it. With either no weight decay or independent weight decay, the relative updates deviate from the prescribed scaling which helps counteract the violated alignment assumptions. However, independent weight decay seems to control the relative updates later in training more tightly, resulting in better transfer overall.

[Figure 10 panels: Update Alignment α_∆W vs. Batch Size B for Fan-in C=256 and C=4096; legend: φ = 1, 10⁻¹, 10⁻², 10⁻³, 0, with 1/√B and 1/√C reference lines.]
Figure 10: Predicted update alignment α_∆W in the simple noise model based on Equation 25. The overall behavior is well approximated as α_∆W ≈ √(C⁻¹ + B⁻¹ + φ/(1+φ)) shown by the dashed color lines. Note how the alignment for a low signal-to-noise ratio φ is roughly 1/√B (no width-dependence as µP assumes) until B > C when it becomes 1/√C (µP's assumption fails). Higher values of φ always increase the alignment regardless of B and C. This is most prominent when φ ≫ 1/C in which case we get α_∆W ≈ 1/√(φ⁻¹ + 1) once B ≫ φ⁻¹. This can be viewed as the batch size being effectively capped at φ⁻¹ + 1, an estimate of the critical batch size (McCandlish et al., 2018). Early in training this can give a high approximately width-independent alignment, allowing µP's assumptions to hold initially even when B ≫ C.
C Analysis of the Update Alignment

In this section we analyze the update alignment for a simple probabilistic model of a single neuron with a fan-in of C:

y^⊤ = [y_1, . . . , y_B] = w^⊤ X = w^⊤ [x_1, . . . , x_B]   (14)

where w ∈ ℝ^C are the weights of the neuron, X ∈ ℝ^{C×B} is a batch of B input vectors with dimension C each, and y ∈ ℝ^B is the output for each input. The gradient of the final average sample loss L = (1/B) Σ_{b=1}^B L_b with respect to the weights is:

g := ∂L/∂w = (1/B) Σ_{b=1}^B (∂L_b/∂y_b) x_b = (1/B) Σ_{b=1}^B y'_b x_b   (15)

where we have defined y'_b := ∂L_b/∂y_b. Note how the gradient for each sample g_b := y'_b x_b is simply a scaled version of the input vector x_b and is thus fully aligned with it.

We want to be able to introduce analytically tractable correlations between the random variables in our probabilistic model. Taking inspiration from Kosson et al. (2024a), we use the Signal-to-Noise ratio φ to capture the correlation. It is defined as the square ratio of the norms of the expected gradient (signal) to the expected deviation (noise):

φ := ∥ḡ∥² / E_b[∥g_b − ḡ∥²],      ḡ := E_b[g_b]   (16)

We can now define distributions for x_b and y'_b that result in a specific φ while respecting the mathematical form of the gradient and being simple enough to analyze. We chose the form of the input samples as:

x_b = ( x̃_b + √φ (1/y'_b) s ) / √(1 + φ)   (17)

where the noise component x̃_b and signal component s are independent random vectors drawn from a standard multivariate normal distribution, i.e., x̃_b, s ∼ N(0, I_C). We choose the distribution of the output gradient to be y'_b = ±1 with equal probability to avoid complexities in modeling the inverse.
We could scale the variance of the distributions but this would not affect the alignment ratio. In this model the gradient becomes:
\[
\mathbf{g}_b = y'_b\, \mathbf{x}_b = \left(\tilde{\mathbf{x}}_b\, y'_b + \sqrt{\phi}\,\mathbf{s}\right) \Big/ \sqrt{1+\phi} \tag{18}
\]
Assuming $\|\mathbf{s}\|^2 = C$ (or alternatively in expectation over $\mathbf{s}$), this results in the targeted signal-to-noise ratio:
\[
\frac{\|\bar{\mathbf{g}}\|^2}{\mathbb{E}_b\!\left[\|\mathbf{g}_b - \bar{\mathbf{g}}\|^2\right]}
= \frac{\phi \|\mathbf{s}\|^2 / (1+\phi)}{\mathbb{E}_b\!\left[\|y'_b\, \tilde{\mathbf{x}}_b / \sqrt{1+\phi}\|^2\right]}
= \frac{\phi C / (1+\phi)}{C / (1+\phi)} = \phi \tag{19}
\]
Note that there are other forms for $\mathbf{x}_b$ that result in the desired signal-to-noise ratio, including some that give a different $\bar{\mathbf{g}}$, but our choice results in bounded input and gradient norms in expectation for both small and large $\phi$. We will also assume that the update is obtained via simple gradient descent:
\[
\Delta \mathbf{w} = -\eta\, \mathbf{g} \tag{20}
\]
The update alignment involves a ratio that is hard to analyze in expectation. Instead, we approximate it as the ratio of the square roots of the expectations:
\[
\mathbb{E}[\alpha_{\Delta W}] = \mathbb{E}\!\left[\frac{\|\Delta \mathbf{w}^\top X\|}{\|\Delta \mathbf{w}\|\, \|X\|}\right]
\approx \sqrt{\frac{\mathbb{E}[\|\Delta \mathbf{w}^\top X\|^2]}{\mathbb{E}[\|\Delta \mathbf{w}\|^2]\, \mathbb{E}[\|X\|^2]}} \tag{21}
\]
We thus need to compute $\mathbb{E}[\|\Delta \mathbf{w}^\top X\|^2]$, $\mathbb{E}[\|\Delta \mathbf{w}\|^2]$, and $\mathbb{E}[\|X\|^2]$.
Starting with the input size we get:
\begin{align}
\mathbb{E}[\|X\|^2] &= B\, \mathbb{E}_b[\|\mathbf{x}_b\|^2] \tag{22a}\\
&= B\, \mathbb{E}_b\!\left[\frac{1}{1+\phi}\left\langle \tilde{\mathbf{x}}_b + \sqrt{\phi}\,\frac{\mathbf{s}}{y'_b},\; \tilde{\mathbf{x}}_b + \sqrt{\phi}\,\frac{\mathbf{s}}{y'_b} \right\rangle\right] \tag{22b}\\
&= \frac{B}{1+\phi}\left\{ \mathbb{E}_b[\|\tilde{\mathbf{x}}_b\|^2] + 2\sqrt{\phi}\, \mathbb{E}_b\!\left[\frac{1}{y'_b}\langle \tilde{\mathbf{x}}_b, \mathbf{s}\rangle\right] + \phi\, \mathbb{E}_b\!\left[\frac{1}{(y'_b)^2}\|\mathbf{s}\|^2\right] \right\} \tag{22c}\\
&= \frac{B}{1+\phi}\,(C + 0 + \phi C) = BC \tag{22d}
\end{align}
For the update size we get:
\begin{align}
\mathbb{E}[\|\Delta \mathbf{w}\|^2] &= \eta^2\, \mathbb{E}[\|\mathbf{g}\|^2] \tag{23a}\\
&= \eta^2\, \mathbb{E}\!\left[\left\| \frac{1}{B}\sum_{b=1}^{B} \frac{y'_b\, \tilde{\mathbf{x}}_b + \sqrt{\phi}\,\mathbf{s}}{\sqrt{1+\phi}} \right\|^2\right] \tag{23b}\\
&= \frac{\eta^2}{B^2(1+\phi)}\, \mathbb{E}\!\left[\left\| \left(\sum_{b=1}^{B} y'_b\, \tilde{\mathbf{x}}_b\right) + B\sqrt{\phi}\,\mathbf{s} \right\|^2\right] \tag{23c}\\
&= \frac{\eta^2}{B^2(1+\phi)} \left( \mathbb{E}\!\left[\left\|\sum_{b=1}^{B} y'_b\, \tilde{\mathbf{x}}_b\right\|^2\right] + 2\,\mathbb{E}\!\left[\left\langle \sum_{b=1}^{B} y'_b\, \tilde{\mathbf{x}}_b,\; B\sqrt{\phi}\,\mathbf{s} \right\rangle\right] + \mathbb{E}\!\left[\|B\sqrt{\phi}\,\mathbf{s}\|^2\right] \right) \tag{23d}\\
&= \frac{\eta^2}{B^2(1+\phi)} \left( \sum_{b=1}^{B} \mathbb{E}[\|\tilde{\mathbf{x}}_b\|^2] + 0 + B^2\phi\, \mathbb{E}[\|\mathbf{s}\|^2] \right) \tag{23e}\\
&= \frac{\eta^2}{B^2(1+\phi)} \left( BC + B^2\phi C \right) \tag{23f}\\
&= \frac{\eta^2 C}{B}\cdot \frac{1 + B\phi}{1+\phi} \tag{23g}
\end{align}
Finally we can compute the expected output change as:
\begin{align}
\mathbb{E}[\|\Delta \mathbf{w}^\top X\|^2] &= B\,\mathbb{E}[\langle \mathbf{x}_b, \Delta \mathbf{w}\rangle^2] = \eta^2 B\,\mathbb{E}[\langle \mathbf{x}_b, \mathbf{g}\rangle^2] \tag{24a}\\
&= \frac{\eta^2}{(1+\phi)^2}\, B\,\mathbb{E}\!\left[\left\langle \tilde{\mathbf{x}}_b + \sqrt{\phi}\,\frac{\mathbf{s}}{y'_b},\; \frac{1}{B}\sum_{i=1}^{B}\left(\tilde{\mathbf{x}}_i\, y'_i + \sqrt{\phi}\,\mathbf{s}\right) \right\rangle^2\right] \tag{24b}\\
&= \frac{\eta^2}{(1+\phi)^2 B}\, \mathbb{E}\!\left[\left( y'_b\|\tilde{\mathbf{x}}_b\|^2 + \sum_{i\neq b} y'_i\langle \tilde{\mathbf{x}}_b, \tilde{\mathbf{x}}_i\rangle + (1+B)\sqrt{\phi}\,\langle \tilde{\mathbf{x}}_b, \mathbf{s}\rangle + \frac{\sqrt{\phi}}{y'_b}\sum_{i\neq b}\langle \mathbf{s}, \tilde{\mathbf{x}}_i\rangle\, y'_i + \frac{B\phi}{y'_b}\|\mathbf{s}\|^2 \right)^2\right] \tag{24c}\\
&= \frac{\eta^2}{(1+\phi)^2 B}\left[ \mathbb{E}[\|\tilde{\mathbf{x}}_b\|^4] + \mathbb{E}\!\left[\Big(\sum_{i\neq b} y'_i\langle \tilde{\mathbf{x}}_b, \tilde{\mathbf{x}}_i\rangle\Big)^2\right] + (1+B)^2\phi\,\mathbb{E}[\langle \tilde{\mathbf{x}}_b, \mathbf{s}\rangle^2] + \phi\,\mathbb{E}\!\left[\Big(\sum_{i\neq b}\langle \mathbf{s}, \tilde{\mathbf{x}}_i\rangle\, y'_i\Big)^2\right] + B^2\phi^2\,\mathbb{E}[\|\mathbf{s}\|^4] + 2B\phi\,\mathbb{E}[\|\tilde{\mathbf{x}}_b\|^2\|\mathbf{s}\|^2] \right] \tag{24d}\\
&= \frac{\eta^2}{(1+\phi)^2 B}\left[ \mathbb{E}[\|\tilde{\mathbf{x}}_b\|^4] + \sum_{i\neq b}\mathbb{E}[\langle \tilde{\mathbf{x}}_b, \tilde{\mathbf{x}}_i\rangle^2] + (1+B)^2\phi\,\mathbb{E}[\langle \tilde{\mathbf{x}}_b, \mathbf{s}\rangle^2] + \phi\sum_{i\neq b}\mathbb{E}[\langle \mathbf{s}, \tilde{\mathbf{x}}_i\rangle^2] + B^2\phi^2\,\mathbb{E}[\|\mathbf{s}\|^4] + 2B\phi\,\mathbb{E}[\|\tilde{\mathbf{x}}_b\|^2\|\mathbf{s}\|^2] \right] \tag{24e}\\
&= \frac{\eta^2}{(1+\phi)^2 B}\Big[ C(C+2) + (B-1)C + (1+B)^2\phi C + (B-1)\phi C + B^2\phi^2 C(C+2) + 2B\phi C^2 \Big] \tag{24f}
\end{align}
Now we have all the factors needed to approximate the update alignment via Equation 21, giving:
\[
\mathbb{E}[\alpha_{\Delta W}] \approx \sqrt{\frac{C + B + 1 + B(B+3)\phi + 2B\phi C + B^2\phi^2 (C+2)}{(1+\phi)(1+B\phi)\, C B}} \tag{25}
\]
Figure 10 plots this prediction, showing $C$-dependent behavior for large batch-to-width ratios as expected. We find the overall behavior in Equation 25 well approximated by:
\[
\mathbb{E}[\alpha_{\Delta W}] \approx \sqrt{C^{-1} + B^{-1} + \frac{\phi}{1+\phi}} \tag{26}
\]
The last term is the inverse of $\phi^{-1}+1$, which is an estimate of the critical batch size (McCandlish et al., 2018) where the overall gradient starts to become dominated by the signal component and the benefit of increasing the batch size further diminishes quickly. In the fully uncorrelated case $\phi = 0$ we roughly get $\alpha_{\Delta W} \approx 1/\sqrt{B}$ when $B < C$, which is the regime targeted by µP: infinitely wide networks trained with finite batch sizes. In practical training settings we often have $B > C$, which results in $\alpha_{\Delta W} \approx 1/\sqrt{C}$. This $C$ dependence breaks the scaling behavior of µP. When the critical batch size is small compared to $C$, this can instead become $\alpha_{\Delta W} \approx \sqrt{\frac{\phi}{1+\phi}}$, which eliminates the $C$ dependence again. The overall alignment behavior is thus highly dependent on the specific settings of $B$, $C$, and $\phi$.
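As a quick check of this analysis, the following minimal Monte Carlo sketch (ours, not from the paper) samples the noise model of Equation 17, forms the gradient-descent update, and compares the measured update alignment against the approximation in Equation 26. The function name and example settings are illustrative only.

import numpy as np

def simulate_alignment(C, B, phi, trials=200, eta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(trials):
        s = rng.standard_normal(C)                  # shared signal vector
        y_prime = rng.choice([-1.0, 1.0], size=B)   # output gradients y'_b = +/-1
        x_tilde = rng.standard_normal((B, C))       # per-sample noise
        X = (x_tilde + np.sqrt(phi) * s / y_prime[:, None]) / np.sqrt(1 + phi)
        g = (y_prime[:, None] * X).mean(axis=0)     # average gradient, Equation 15
        dw = -eta * g                               # plain gradient descent update, Equation 20
        vals.append(np.linalg.norm(X @ dw) / (np.linalg.norm(dw) * np.linalg.norm(X)))
    return float(np.mean(vals))

for C, B, phi in [(256, 64, 0.0), (256, 4096, 0.0), (4096, 4096, 1e-2)]:
    measured = simulate_alignment(C, B, phi)
    predicted = np.sqrt(1 / C + 1 / B + phi / (1 + phi))   # Equation 26
    print(f"C={C:5d} B={B:5d} phi={phi:.0e}  measured={measured:.3f}  predicted={predicted:.3f}")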
D A Simple Model of Weight Norm and Relative Update Dynamics
D.1 The Evolution of the Weight Norm
Here we construct a simple model of how the weight norms evolve over time, similar to the one used in prior work (Kosson et al., 2024b). In this section we use $\Delta W$ to refer to this "gradient component" of the update, treating the weight decay component separately. We will assume that the weight update (excluding weight decay) at each timestep is orthogonal to the current weights. This results in a simple recurrence relation for the weight norms:
\[
\|W_{t+1}\|^2 = (1 - \eta_t\lambda)^2\,\|W_t\|^2 + \|\Delta W_t\|^2 \tag{27}
\]
We will assume that the elementwise RMS size of the weight update is roughly one times the learning rate $\eta_t$, for example due to the second moment normalization in Adam. Defining $\rho_t := \|W_t\|/\sqrt{KC}$ as the RMS value of the elements in $W_t$, and $a_t := 1 - \eta_t\lambda$, we get:
\[
\rho^2_{t+1} = a_t^2\,\rho_t^2 + \eta_t^2 \tag{28}
\]
This model involves several significant simplifications:
• The assumption that the update is orthogonal to the current weights can come either from randomness or from scale-invariance due to normalization layers. The gradient of a scale-invariant weight is always orthogonal to the weight, and a zero-mean uncorrelated random gradient is orthogonal to the weights on average. We note that with momentum the update is not necessarily orthogonal on average, even in these special cases.
• This model only holds for momentum when the gradients are uncorrelated over time. In this case we can replace the update term by the total contribution of the current gradient over time (the sum of the impulse response), as done by Kosson et al. (2024b). This also assumes β1 is not too large, so that the impact of a single gradient happens over a short timeframe compared to the t values we are interested in. If there is correlation, the behavior of the system changes in a way we cannot predict without additional information. See Wan et al. (2021) for an analysis that partially accounts for this in SGDM.
• The assumed size of the update requires that the second moment estimate controlled by β2 can accurately track the expected magnitude of the gradient over time. This requires a sufficiently small β2 in comparison to how fast the gradient magnitude changes. We note that at the start of training the gradient magnitude often changes very rapidly.
This means that in practice Equation 28 does not accurately predict the behavior of the weight norm in neural network training. However, we believe it is the best we can do based on the hyperparameters alone, without measurements of the temporal gradient correlation and the gradient magnitude over time (which we expect to depend strongly on the other hyperparameters too). We still find the predictions of this model informative for real training, particularly those based on the equilibrium value that satisfies $\rho_{t+1} = \rho_t$. The RMS values seen in real training often stabilize near this value (Kosson et al., 2024b), but how fast this happens depends strongly on the complex temporal effects described above.
With this disclaimer, we proceed with the analysis of Equation 28. Rolling out the recurrence relation:
\[
\rho^2_{t+1} = \rho_0^2 \prod_{\tau=0}^{t} a_\tau^2 \;+\; \sum_{\tau=0}^{t} \eta_\tau^2 \prod_{i=\tau+1}^{t} a_i^2 \tag{29}
\]
In the case where the learning rate is constant ($\eta_t = \eta$, $a_t = a$) this gives:
\[
\rho_t^2 = \rho_0^2\, a^{2t} + \eta^2 \sum_{\tau=0}^{t-1} a^{2\tau}
        = \rho_0^2\, a^{2t} + \eta^2\, \frac{1 - a^{2t}}{1 - a^2} \tag{30}
\]
assuming $0 \le a < 1$. If $\eta_t$ is not constant, we can still easily compute the behavior of the weight norm over time using the recurrence relation or the rolled-out form directly.
In the constant learning rate case the RMS value approaches the equilibrium value $\rho_\infty$ exponentially:
\[
\rho_t^2 = \rho_\infty^2 + (\rho_0^2 - \rho_\infty^2)\, a^{2t}, \qquad
\rho_\infty^2 = \frac{\eta^2}{1 - a^2} \approx \frac{\eta}{2\lambda} \tag{31}
\]
We can use this to estimate how long it takes for the weights to approximately reach equilibrium. If we define this by the timestep $T_{\text{EQ}}$ at which the RMS value falls within a factor $\sqrt{2}$ of the equilibrium value, we get:
\[
T_{\text{EQ}} \approx
\begin{cases}
\dfrac{1}{2\eta\lambda}\,\ln\!\left(\dfrac{1 - \rho_0^2/\rho_\infty^2}{1 - 1/2}\right) & \text{if } 0 < \rho_0 < \tfrac{1}{\sqrt{2}}\,\rho_\infty\\[2ex]
\dfrac{1}{2\eta\lambda}\,\ln\!\left(\dfrac{\rho_0^2/\rho_\infty^2 - 1}{2 - 1}\right) & \text{if } \rho_0 > \sqrt{2}\,\rho_\infty\\[2ex]
0 & \text{otherwise}
\end{cases} \tag{32}
\]
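As a quick illustration of Equations 28–32, the following sketch (our own, with arbitrary example hyperparameters rather than the paper's settings) iterates the recurrence and compares the resulting RMS value against the equilibrium value and the time-to-equilibrium estimate.

import math

eta, lam = 1e-3, 0.1          # example values only
rho0 = 0.02                   # example initial weight RMS, e.g. from a 1/sqrt(C) init
a = 1 - eta * lam

rho_inf = math.sqrt(eta**2 / (1 - a**2))        # Equation 31, roughly sqrt(eta / (2 * lam))
if rho0 < rho_inf / math.sqrt(2):
    t_eq = 1 / (2 * eta * lam) * math.log((1 - rho0**2 / rho_inf**2) / (1 - 1 / 2))
elif rho0 > math.sqrt(2) * rho_inf:
    t_eq = 1 / (2 * eta * lam) * math.log((rho0**2 / rho_inf**2 - 1) / (2 - 1))
else:
    t_eq = 0.0                                  # already within a factor sqrt(2)

rho2 = rho0**2
for _ in range(20_000):
    rho2 = a**2 * rho2 + eta**2                 # Equation 28
print(f"rho_inf={rho_inf:.4f}  rho_after_20k={math.sqrt(rho2):.4f}  T_EQ~{t_eq:.0f} steps")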
D.2 The Relative Weight Update Size
In this simplified model the relative weight update becomes:
\[
\frac{\|\Delta W_t\|}{\|W_t\|} = \frac{\eta_t}{\rho_t} \tag{33}
\]
which we can compute exactly given the expressions for $\rho_t$ from the previous subsection.
In the constant learning rate case we get:
\[
\frac{\|\Delta W_t\|}{\|W_t\|} = \sqrt{\frac{\eta^2}{\rho_t^2}}
= \sqrt{\frac{\eta^2}{\rho_0^2\, a^{2t} + \eta^2\, \frac{1-a^{2t}}{1-a^2}}}
\;\xrightarrow{\;t\to\infty\;}\; \sqrt{2\eta\lambda - \eta^2\lambda^2} \;\approx\; \sqrt{2\eta\lambda} \tag{34}
\]
showing that the product $\eta\lambda$ determines the size of the relative weight updates in the steady state. However, at the start of training, e.g., at the first step, the relative weight update (excluding the weight decay component) is determined by just the learning rate and the initialization norm.
It is also interesting to see what happens without weight decay, $\lambda = 0$. For a constant learning rate this gives:
\[
\rho_t^2 = \rho_0^2 + \eta^2 t, \qquad
\frac{\eta}{\rho_t} = \frac{\eta}{\sqrt{\rho_0^2 + \eta^2 t}} = \frac{1}{\sqrt{(\rho_0^2/\eta^2) + t}} \tag{35}
\]
which means that the relative updates decay towards zero over time roughly as $t^{-0.5}$, rather than stabilizing at a fixed value. For this reason it is important to use learning rate schedules when training with weight decay; otherwise the effective learning rate, measured in terms of relative updates, never decays, which typically leads to suboptimal outcomes. This decay in the relative update size can also be detrimental: if the relative update size decreases too quickly, the network stops learning (sometimes referred to as a loss of plasticity). An example of this was seen in the AdamW paper (Loshchilov & Hutter, 2019), where the optimal weight decay value in an experiment was zero for a constant learning rate but non-zero when a cosine schedule was used. It is better to vary the effective learning rate over time explicitly according to our desired schedule, ideally exactly through optimizers like LionAR (Kosson et al., 2024a) or approximately via weight decay.
D.3 The Warmup Effect of Scaled Weight Decay
Let us now consider two configurations for the learning rate and weight decay, $(\eta, \lambda)$ vs. $(\eta/m, m\lambda)$. These correspond to applying no scaling versus independent weight decay scaling (or standard scaling at a higher base learning rate). We first note that their product is identical, so the relative weight update will converge to the same equilibrium value. However, the very first update will be $m$ times smaller for the high weight decay configuration. A learning rate warmup starting at $\eta/m$ and reaching $\eta$ over time has an identical effect on the initial update and the steady-state updates. However, the exact shape of this warmup effect is not trivial, i.e., how it affects the relative update size over time.
We use the previous symbols to denote quantities for the base hyperparameters $(\eta, \lambda)$ and a hat over the symbol to denote the corresponding quantities for the scaled configuration. We are interested in the ratio of the relative weight updates between the two configurations:
\[
s_t := \frac{\hat\eta_t / \hat\rho_t}{\eta_t / \rho_t} \tag{36}
\]
which captures the size of the warmup effect. For the simple weight norm model we can always compute this using Equation 28, even when the learning rate varies over time. For constant learning rates we would get:
\begin{align}
s_t &= \frac{1}{m}\cdot\frac{\rho_t}{\hat\rho_t} \tag{37a}\\
&= \frac{1}{m}\sqrt{\frac{\rho_0^2\, a^{2t} + \eta^2\,\frac{1-a^{2t}}{1-a^2}}{\rho_0^2\, a^{2t} + \hat\eta^2\,\frac{1-a^{2t}}{1-a^2}}} \tag{37b}\\
&= \sqrt{\frac{\rho_\infty^2 + (\rho_0^2 - \rho_\infty^2)\, a^{2t}}{\rho_\infty^2 + (m^2\rho_0^2 - \rho_\infty^2)\, a^{2t}}} \tag{37c}\\
&= \sqrt{\frac{1 + (\rho_0^2/\rho_\infty^2 - 1)\, a^{2t}}{1 + (m^2\rho_0^2/\rho_\infty^2 - 1)\, a^{2t}}} \tag{37d}
\end{align}
This is already a relatively complicated expression, with behavior that depends on the initialization magnitude $\rho_0$ as well as the exact values of $\eta$ and $\lambda$ used. The effective length of the additional warmup can vary from a single step, e.g., when $\rho_0 \to 0$, to the whole of training, for example when $\rho_0$ is large or decays slowly. We note it will also depend significantly on $\eta\lambda$, so when sweeping learning rates the length of the warmup effect will differ between points.
Finally, we note that simply scaling the learning rate schedule of the base configuration by $s_t$ will generally not give relative updates matching the high weight decay configuration. This is because modifying the learning rate schedule also changes how the norms evolve over time. A scaling factor that would make the profiles match exactly could still be computed using the recurrence relations by keeping track of how the modifications affect the norms over time. However, exactly matching the warmup effect from higher weight decay values may not be that important, since as far as we know it is largely arbitrary in the sense that it does not depend on the alignment ratio. Applying the scaling factor from Equation 36 to a relative or rotational optimizer like LionAR (Kosson et al., 2024a) would result in the desired warmup effect.
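To illustrate Equation 37, this short sketch (ours, with example values rather than the paper's exact settings) computes the implicit warmup factor $s_t$ relating the base configuration $(\eta, \lambda)$ and the scaled configuration $(\eta/m, m\lambda)$:

import math

eta, lam, m = 1e-3, 0.1, 16   # example values only
rho0 = 0.02
a = 1 - eta * lam             # identical for both configurations since eta*lam is shared
rho_inf2 = eta**2 / (1 - a**2)

for t in [0, 100, 1_000, 10_000]:
    decay = a ** (2 * t)
    s_t = math.sqrt((1 + (rho0**2 / rho_inf2 - 1) * decay)
                    / (1 + (m**2 * rho0**2 / rho_inf2 - 1) * decay))   # Equation 37d
    print(f"t={t:6d}  s_t={s_t:.3f}")   # starts near 1/m and approaches 1 over time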
E ResNet Experiments
In this section we repeat some of our experiments for ResNet training to see if they generalize to other architectures and settings. The training setup is described in Appendix F.4. Overall, our main conclusions from the LLaMA experiments hold well. The main differences are that the learning rate transfer is worse, likely due to regularization effects, and that the warmup effects seem to matter less. The results are shown in the following figures.
• Figure 11 compares learning rate transfer between standard and independent weight decay scaling. Independent weight decay gives better transfer, but there is a shift in the optimal learning rates for both scaling approaches.
• Figure 12 shows that independent weight decay scaling helps preserve relative representation changes. The behavior is essentially identical to the LLaMA setting.
• Figure 13 measures the weight alignment, update alignment and alignment ratio. Just like in the LLaMA setting, µP's assumptions only hold for a very brief period at the start of training. The weight alignment changes more than we observed with LLaMA.
• Figure 14 empirically measures the warmup effect from the use of high weight decay. The conclusions are identical to the LLaMA setting.
• Figure 15 compares learning rate transfer between µP with independent weight decay scaling and not applying µP scaling. We find that both learning rate transfer quality and stability are similar. The µP configuration results in slightly higher accuracy, which may at least in part be due to the warmup effect.
[Figure 11: Train Loss, Val Loss, and Val Accuracy vs. Learning Rate for Standard and Independent weight decay scaling at widths 0.25×, 0.5×, 1×, and 4×, with min loss / max acc markers.]
Figure 11: Comparison of standard vs. independent µP scaling for ResNet training on ImageNet. Learning rates generally transfer much worse than for LLaMA. Independent scaling still gives better results than standard scaling overall. Note how the optimal learning rates in terms of training loss shift downwards with independent scaling while they shift upwards for the validation loss and accuracy, a strong indication of a confounding regularization effect. Widths are specified relative to a standard ResNet-50 but the base width for scaling is the 0.25× variant. Learning rates for parameters that do not scale with width are frozen before the sweep at the optimal value for the 0.25× width.
[Figure 12: relative representation change ΔY/Y and relative weight update ΔW/W vs. step (×10³) for standard and independent weight decay scaling, at the 0.25× and 4× widths.]
Figure 12: Standard weight decay scaling fails to preserve the relative representation change across widths. The alignment ratio is similar across widths, contrary to µP's assumptions, requiring that the relative weight updates be maintained, as achieved by independent weight decay scaling. Plots show layer3.2.conv1, which is a 1×1 convolutional layer with fan_in=4096 for the 4× variant. The learning rate used here is of order 10⁻³ with a 1 epoch warmup (1%) followed by cosine decay to zero.
[Figure 13: weight alignment α_W, update alignment α_ΔW, and alignment ratio α_W/α_ΔW vs. step (×10³) for fan_in=256 and fan_in=4096, with µP's assumed scaling shown for reference.]
Figure 13: The alignment assumptions of µP do not hold for ResNet training either. Plots show layer3.2.conv1, which is a 1×1 convolutional layer. Both the weight alignment and update alignment deviate significantly from µP's assumptions. Note the larger changes in the weight alignment over time compared to the LLaMA experiment. The learning rate is of order 10⁻³, with no µP scaling applied.
[Figure 14: ΔW/W, ΔY/Y, and the scaling effect s_t vs. step (×10³), comparing the base configuration (LR=η, WD=λ) with the scaled configuration (LR=η/16, WD=16λ).]
Figure 14: The warmup effect from scaled weight decay also appears for ResNet configurations. Plots show layer3.2.conv1, which is a 1×1 convolutional layer. The hyperparameters used are a learning rate of order 10⁻³ and λ = 0.1, corresponding to the best standard WD configuration in Figure 11 (after µP scaling). The additional warmup effect here is comparable in absolute length to the LLaMA setting but proportionally shorter.
[Figure 15: Val Top-1 Accuracy vs. Learning Rate for µP + Ind WD, no-µP (1% WU), and no-µP (10% WU), at widths 0.25×, 0.5×, 1×, and 4×, with max acc markers.]
Figure 15: Comparison of learning rate transfer between µP with independent weight decay scaling (left) and no µP scaling (middle/right) for ResNet training. The left and middle panels use a 1% linear warmup; the right panel uses a 10% linear warmup (WU). The additional warmup effect of µP with independent weight decay scaling is not needed for stability or transfer, but may contribute to the slightly higher accuracy achieved on the left. Other differences, such as the resulting weight norms, could also matter, particularly for the final FC layer. For normalized layers the weight norms also inversely scale the gradient norms, which may affect the result through interactions with the epsilon value of Adam. A similar gap also appears in Figure 11 between standard and independent weight decay scaling.
Algorithm 1 AdamW (PyTorch variant)
Require: Learning rate η_t, weight decay λ, momentum β1, magnitude smoothing β2, ϵ for numerical stability
1: t ← 0, parameter vector θ_0, momentum vector m_0 ← 0, magnitude vector v_0 ← 0
2: while stopping criterion not met do
3:   t ← t + 1
4:   g_t ← ∇f_t(θ_{t−1})
5:   m_t ← β1 m_{t−1} + (1 − β1) g_t
6:   v_t ← β2 v_{t−1} + (1 − β2) g_t²
7:   m̂_t ← m_t / (1 − β1^t)
8:   v̂_t ← v_t / (1 − β2^t)
9:   θ_t ← (1 − η_t λ) θ_{t−1} − η_t m̂_t / (√(v̂_t) + ϵ)
10: end while
F Experimental Details
F.1 AdamW
The AdamW variant we use is shown in Algorithm 1. Note how the weight decay hyperparameter in line 9 is scaled by the learning rate. This differs from the original variant described in Loshchilov & Hutter (2019), which is often referred to as "independent" or "fully-decoupled". That variant only has λ instead of η_t λ in line 9, but the weight decay hyperparameter is still varied according to the same schedule as the learning rate.
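For concreteness, here is a minimal sketch (our own illustration of the difference described above, not the training code) of the decay factor applied to the weights in line 9 for the two variants; the function name and example values are placeholders.

def decay_factor(eta_t: float, lam: float, independent: bool, schedule_t: float = 1.0) -> float:
    # schedule_t is the current value of the shared schedule, in [0, 1]
    if independent:
        # fully-decoupled: weight decay is not multiplied by the learning rate,
        # but still follows the same schedule shape as the learning rate
        return 1.0 - lam * schedule_t
    # coupled (PyTorch AdamW): weight decay is scaled by the current learning rate
    return 1.0 - eta_t * lam

# Example with eta_t = 3e-4 and lam = 0.1:
print(decay_factor(3e-4, 0.1, independent=False))  # 0.99997 (coupled)
print(decay_factor(3e-4, 0.1, independent=True))   # 0.9 (fully-decoupled)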
F.2 Learning Rate Transfer Setup
Our learning rate transfer experiments consist of two phases. We start from an existing hyperparameter
configuration and first sweep the global learning rate used across all parameter types, i.e., the input
layer, hidden layers, output layer, biases and gains. This is done for the 1× scale model. We then
freeze the learning rate of the input layer, biases and gains, changing only the learning rate for the
layers that scale with the width, i.e., the hidden layer and output layer. This second base learning rate
is scaled via µP's rules before applying it in the optimizer, but the plots show the base rate before
scaling (which is what transfers across scale). This second sweep gives the final performance shown
in the learning rate transfer plots.
We adopt this approach because it better captures how well the learning rate scaling works by leaving
the parameters it is not applied to unaffected. Compared to sweeping a single global learning rate
across scales, this typically gives better performance with standard scaling but highlights the shift
more. The reason is that without freezing the learning rate for the other layers, they can become
unstable at the shifted (higher) base learning rates needed for the hidden layers to learn well. This
gives worse overall results since no single global learning rate works well for all parameters anymore,
but the optimal learning rates may shift less as a result.
Instead of using the full Maximal Update Parameterization (µP) for training, we combine standard parameterization with the learning rate scaling of µP as proposed by Everett et al. (2024). They find this simpler approach empirically works just as well or better. We note there are many variants of µP that achieve the same effects slightly differently, see for example Tables 3, 8 and 9 in Tensor Programs V (Yang et al., 2021). However, these are generally not equivalent with standard weight decay, which would complicate the use of full µP for our experiments.
For transformer training the main differences between full µP and our experiments are the handling of the final output layer and the normalization factor applied in attention. The output layer is initialized using a smaller standard deviation that scales with $1/C$ rather than $1/\sqrt{C}$ as in standard parameterization. This results in an initial relative update size that is not scaled with the width, compared to the $1/\sqrt{m}$ scaling that occurs in the approach from Everett et al. (2024). The other difference is that the scaling factor in the attention heads is $1/d_{\text{head}}$ rather than $1/\sqrt{d_{\text{head}}}$. However, we keep $d_{\text{head}}$ constant in our scaling experiments, opting to scale the number of heads instead. This difference could thus be absorbed into a tunable factor that does not change across scales.
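As a sketch of how the learning rate scaling described above could be wired up (our own illustration, not the paper's code): the base learning rate for width-scaled layers is divided by the width multiplier m, while the frozen parameter groups keep their tuned rate; with independent weight decay scaling, λ for the scaled groups is multiplied by m. The selection rule, function name, and example values below are hypothetical placeholders.

import torch

def build_param_groups(model, base_lr, frozen_lr, base_wd, width_mult, independent_wd=True):
    hidden, frozen = [], []
    for name, p in model.named_parameters():
        # placeholder rule: treat 2D weights outside the embedding as width-scaled layers;
        # embeddings, biases and gains keep their frozen learning rate
        (hidden if p.ndim >= 2 and "embed" not in name else frozen).append(p)
    scaled_wd = base_wd * width_mult if independent_wd else base_wd
    return [
        {"params": hidden, "lr": base_lr / width_mult, "weight_decay": scaled_wd},
        {"params": frozen, "lr": frozen_lr, "weight_decay": base_wd},
    ]

# Usage sketch (hypothetical values):
# optimizer = torch.optim.AdamW(
#     build_param_groups(model, base_lr=8e-3, frozen_lr=1.6e-2, base_wd=0.1, width_mult=16),
#     betas=(0.9, 0.95), eps=1e-8)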
F.3 LLaMA Training Setup
Our LLaMA (Touvron et al., 2023) experiments are performed using a modified version of Lingua (Videau et al., 2024). The networks have 20 transformer blocks, each consisting of one attention and one MLP sub-block. We vary the embedding dimension $d_{\text{emb}}$ between 128 and 2048, which we sometimes refer to as $1\times$ and $16\times$. The MLP block has an expansion factor of exactly $3\times$, giving exact scaling ratios for all layers. We use a fixed attention head size of $d_{\text{head}} = 128$, varying the number of heads across scale. All linear layers are initialized using a standard deviation of $1/\sqrt{C}$. We do not use embedding tying, but the embeddings are still initialized with a standard deviation of $1/\sqrt{d_{\text{emb}}}$ like the final FC layer (the default behavior in Lingua).
The task is standard next-token prediction pre-training using a cross-entropy loss. The dataset used is DCLM (Li et al., 2024), specifically a 10% subset obtained by using the first global shard from the Hugging Face dataset. We use the llama3 tokenizer³ with a vocabulary size of 128,256.
Unless otherwise stated we use the PyTorch variant of the AdamW optimizer shown in Algorithm 1. We vary the learning rate between runs but other hyperparameters are fixed at $\beta_1 = 0.9$, $\beta_2 = 0.95$, $\epsilon = 10^{-8}$, and $\lambda = 0.1$ (except when scaled through independent weight decay scaling). Weight decay is applied to all parameters except the gains. We train for 20,000 steps at a global batch size of 256 and a sequence length of 4096, giving 1,048,576 ($2^{20}$) tokens in each batch and 20.97 billion tokens in total. This results in 19.23 tokens per non-embedding parameter in the largest configuration we use (embedding dim 2048). Unless otherwise noted, the learning rate schedule consists of a 10% linear warmup from 0 to the specified peak learning rate, followed by a linear decay down to 0. The losses reported are the average of the last 100 training steps unless otherwise specified, due to initial issues we had with the evaluation part of the code base.
For an embedding dim of 2048 it takes roughly 14 hours to complete a single run using 8 H200 GPUs. This was not optimized and is slowed down by heavy logging of a wide range of alignment-related metrics. Training was performed in bfloat16 with mixed precision.
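As a quick sanity check of the token accounting quoted above (our own arithmetic, not part of the original setup description):

# Sanity check of the token counts quoted above.
batch_size, seq_len, steps = 256, 4096, 20_000
tokens_per_batch = batch_size * seq_len            # 1,048,576 = 2**20
total_tokens = tokens_per_batch * steps            # ~20.97 billion
non_embedding_params = total_tokens / 19.23        # ~1.09e9 for the width-2048 model
print(f"{tokens_per_batch:,} tokens/batch, {total_tokens/1e9:.2f}B tokens total, "
      f"~{non_embedding_params/1e9:.2f}B non-embedding parameters")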
F.3.1 Details for Figure 1
This experiment follows our base LLaMA training setup (F.3) and learning rate transfer setup (F.2). However, these are more recent runs with working validation losses. For the weight decay runs, the learning rate for the input layer and gains is frozen at $\eta = 1.6\times10^{-2}$ based on an initial sweep of a global learning rate at the $1\times$ scale. For the no weight decay runs, this secondary learning rate is set to $6.4\times10^{-2}$. The best loss without weight decay is 2.815 at $\eta_{\text{base}} = 6.4\times10^{-2}$, for independent weight decay it is 2.801 at $\eta_{\text{base}} = 8\times10^{-3}$, and for standard weight decay it is 2.798 at $\eta_{\text{base}} = 6.4\times10^{-2}$.
F.3.2 Details for the Relative Update and Representation Change Measurements
This experiment shows layers.13.feed_forward.w1 for a base learning rate of order $10^{-3}$, the optimal value for the $1\times$ width in Figure 1. The learning rate for the gains and the input layer is $6.4\times10^{-2}$. The metrics are logged and plotted every 10 steps; the final 0.5% of training is excluded for numerical reasons (the changes go to zero with the learning rate). The representation-based metrics are computed for 1/64th of a batch (corresponding to one micro-batch on the master rank). Other details follow the base setup described in F.3.
F.3.3 Details for the Alignment Measurements
This experiment shows layers.13.feed_forward.w1 for a learning rate of order $10^{-3}$ and weight decay $\lambda = 0.1$. Note that this is the LR-WD product that gives the best results for standard scaling of the $16\times$ configuration in Figure 1, after applying the µP scaling to the LR. No µP scaling is performed between the two configurations; the hyperparameters are exactly the same and the specified learning rate is applied to all parameters. The metrics are logged and plotted every 10 steps, and the final 0.5% of training is excluded for numerical reasons. The representation-based metrics are computed for 1/64th of a batch (corresponding to one micro-batch on the master rank). See Figure 16 for the alignment ratio of other layers. Other details follow the base setup described in F.3.
³ https://huggingface.co/meta-llama/Meta-Llama-3-8B
F.3.4 Details for the Warmup Effect Measurements
This experiment shows layers.13.feed_forward.w1 for a learning rate of order $10^{-3}$ (after µP scaling) and weight decay $\lambda = 0.1$. This corresponds to the best configuration for standard weight decay scaling in Figure 1 and the independent scaling run with the same LR-WD product. Metrics are logged and displayed every 10 steps, the final 0.5% of training is excluded for numerical reasons, and representation changes are computed on 1/64th of a batch. Other details follow the base setup described in F.3.
F.3.5 Details for the Modified Warmup Schedule Experiments
This experiment follows our base setup for LLaMA training (F.3) and learning rate transfer experiments (F.2). The learning rate for the gains and input layer is fixed at $\eta = 1.6\times10^{-2}$. The modified learning rate schedules are applied to all parameters, not only the width-dependent ones. The scaling factors for exp50 and decayaway vary between widths as in µP; they have no effect at the $1\times$ scale. The additional warmup factors are applied on top of the existing 10% warmup and linear decay. For the longer 50% linear warmup, the run corresponding to a learning rate of order $10^{-3}$ was repeated due to a sudden divergence halfway through, which did not occur in the repeated run with exactly the same seed and configuration.
F.4 ResNet Training Setup
Our ResNet (He et al., 2016) experiments are performed using a modified version of PyTorch Image Models (Wightman, 2019). All networks are 50 layers deep with a bottleneck block configuration, specifically the original post-norm variant but with the downsampling on the residual stream performed on the 3x3 convolutions ("ResNet v1.5"). When scaling the width we scale all hidden dimensions of all layers evenly relative to the original network, going from $0.25\times$ width up to $4\times$. The $1\times$ width corresponds to a standard ResNet-50, while the base width used for scaling is the $0.25\times$ variant.
The training task is image classification using a cross-entropy loss. We train on ImageNet-1k (Russakovsky et al., 2015) for 100 epochs. Data augmentation consists of random cropping and random horizontal flips.
Optimization is performed using the PyTorch variant of AdamW. The learning rate varies between experiments. Other hyperparameters are $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, and $\lambda = 0.1$ (before independent weight decay scaling). Unless otherwise stated we use a 1 epoch linear warmup from 0 followed by a cosine decay to 0. The batch size is 1024, spread out over 8 GPUs without the use of synchronized batch normalization during training.
The training of the $4\times$ variant takes roughly 10 hours to complete using 8 H200 GPUs. This was not optimized and is slowed down by heavy logging of a wide range of alignment-related metrics. Training was performed in bfloat16 with mixed precision.
G Supplementary Results
[Figure 16: alignment ratio α_W/α_ΔW per layer type (attn.wq, attn.wk, attn.wv, attn.wo, ff.w1, ff.w2, ff.w3, output) across transformer blocks 1–20 and the output layer, for embedding dimensions 128 and 2048.]
Figure 16: Halfway through training the alignment ratio of all layers is approximately one. The alignment ratio for each layer in the LLaMA experiments shown in Figure 4, measured at step 10,000. The ratio is similar across widths, contrary to µP's assumptions.