arXiv:2405.15071

Grokked Transformers are Implicit Reasoners:
A Mechanistic Journey to the Edge of Generalization
Boshi Wang   Xiang Yue³*   Yu Su   Huan Sun
The Ohio State University   ³Carnegie Mellon University
{wang.13930,yue.149,su.809,sun.397}@osu.edu
Abstract
We study whether transformers can learn to implicitly reason over parametric knowledge, a skill that even the most capable language models struggle with. Focusing on two representative reasoning types, composition and comparison, we consistently find that transformers can learn implicit reasoning, but only through grokking, i.e., extended training far beyond overfitting. The levels of generalization also vary across reasoning types: when faced with out-of-distribution examples, transformers fail to systematically generalize for composition but succeed for comparison. We delve into the model's internals throughout training, conducting analytical experiments that reveal: 1) the mechanism behind grokking, such as the formation of the generalizing circuit and its relation to the relative efficiency of generalizing and memorizing circuits, and 2) the connection between systematicity and the configuration of the generalizing circuit. Our findings guide data and training setup to better induce implicit reasoning and suggest potential improvements to the transformer architecture, such as encouraging cross-layer knowledge sharing. Furthermore, we demonstrate that for a challenging reasoning task with a large search space, GPT-4-Turbo and Gemini-1.5-Pro based on non-parametric memory fail badly regardless of prompting styles or retrieval augmentation, while a fully grokked transformer can achieve near-perfect accuracy, showcasing the power of parametric memory for complex reasoning.
1 Introduction
Large language models (LLMs) have been shown deficient in implicit reasoning with their parametric memory of knowledge and rules. For example, a range of LLMs are found to be incapable of robustly composing internalized facts [48,71], and even GPT-4 [42] cannot adequately compare entities' attributes despite knowing them [1].
Deficiency in implicit reasoning has profound impacts. It implies the models' limitations in inducing structured and compressed representations of facts and rules, which lead to redundant knowledge storage and difficulty in propagating knowledge updates [76] and, importantly, fundamentally impede the model's systematic generalization over knowledge [25]. While explicit verbalizations of reasoning steps (e.g., chain-of-thought rationales) can improve task performance [67,64,73,55,31], they are not available during large-scale (pre-)training where the model's core capabilities are acquired [77, 29].
Is implicit reasoning doomed given that even the most capable models struggle? Can it be resolved
by further scaling data and compute, or are there fundamental limitations of the transformer [62]
that prohibit robust acquisition of this skill?

* Project started when at OSU.
Code and data: https://github.com/OSU-NLP-Group/GrokkedTransformer
Preprint. Under review.

[Figure 1: example two-hop composition (Barack –wife→ Michelle –born in→ [1964]) and comparison (ages of Trump and Biden → [younger]), together with accuracy curves for both tasks.]
Figure 1: We find that transformers can learn to reason implicitly, but this skill is only robustly acquired through grokking, i.e., an extended period of training far beyond overfitting. Moreover, the transformer fails to systematically generalize for composition, yet succeeds for comparison. We conduct a mechanistic study into the model internals throughout grokking, which reveals distinct generalizing circuits across the two tasks (Figures 4, 5) that explain the variations in systematicity.
In this paper, we rigorously study these questions by constructing synthetic training and evaluation datasets, training transformers from scratch, and examining their generalization. We conceptualize reasoning as the induction and application of inference rules, and expose the model to a mixture of "atomic facts" and "inferred facts" (which are deduced from the atomic facts via a set of latent rules), resembling "axioms" and "theorems" in a formal system. To evaluate how well the model learns the rules, we test its ability to make novel deductions (i.e., completing unseen inferred facts) in both in-distribution (ID) and out-of-distribution (OOD) scenarios.³ This approach allows us to control the training data and perform clean evaluations, which would be challenging when studying existing LLMs trained on uncontrolled data.
Our experiments reveal that transformers can learn to perform implicit reasoning, but this skill is only robustly acquired through extended training far beyond overfitting (Figure 1), a phenomenon known as grokking [47]. We find that the speed of improvement in generalization correlates with the ratio between inferred and atomic facts in training, and depends little on the absolute size of the training data (Figure 2). This suggests a correction of prior explanations of grokking based on critical data size [33,61,78,21], in that it should instead be the critical data distribution that decides the characteristics of grokking. Our findings extend prior observations of the grokking phenomenon, primarily in algorithmic and linguistic tasks [47,38], to the domain of knowledge-based reasoning, and deepen our understanding of the grokking phenomenon.
Moreover, we find that the transformer exhibits different levels of systematicity across reasoning types. While ID generalization is consistently observed, in the OOD setting the model fails to systematically generalize for composition but succeeds for comparison (Figure 1). To understand why this happens, we conduct a mechanistic analysis of the internal mechanisms of the model. The analysis uncovers the gradual formation of the generalizing circuit throughout grokking and establishes the connection between systematicity and its configuration, specifically, the way atomic knowledge and rules are stored and applied within the circuit. Our findings imply that proper cross-layer memory-sharing mechanisms for transformers, such as memory augmentation [54,17] and explicit recurrence [7, 22, 57], are needed to further unlock the transformer's generalization.
Finally, to demonstrate the power and potential of parametric memory for complex reasoning, we show that for a reasoning task with a large latent search space, a fully grokked transformer can achieve near-perfect accuracy, while state-of-the-art LLMs like GPT-4-Turbo [43] and Gemini-1.5-Pro [16] based on non-parametric memory fail badly regardless of prompting styles or retrieval augmentation.
³ Definitions of ID/OOD are introduced in §2.

2 General Setup
Training data & ID/OOD evaluation. As stated in §1, we are interested in whether transformers can induce and apply latent rules over knowledge implicitly in a generalizable way. We create a data-generating process consisting of 1) sampling a set of basic atomic facts, and 2) using the atomic facts and latent rules to deduce inferred facts. To better characterize the level of generalization acquired by the model, we evaluate the model's in-distribution (ID) and out-of-distribution (OOD) performance. We prepare two separate sets of atomic facts: atomic_ID and atomic_OOD. Our training set includes all the atomic facts and a uniformly random portion of the inferred facts deduced from atomic_ID, which we call train_inferred_ID. For evaluation, (1) ID generalization aims to evaluate whether the model learns the latent rules correctly, by testing its ability to complete unseen inferred facts also deduced from atomic_ID, which we denote by test_inferred_ID. (2) OOD generalization aims to evaluate the systematicity [25] acquired by the model, namely, the ability to apply rules over knowledge regardless of its distribution. To do this, we test the model on the facts deduced from atomic_OOD, denoted by test_inferred_OOD.
Model & optimization. The model we use is a standard decoder-only transformer as in GPT-2 [50] with 8 layers, 768 hidden dimensions and 12 attention heads (we explore the impact of different model scales in Appendix B). Optimization is done by AdamW [34] with learning rate 10⁻⁴, batch size 512, weight decay 0.1 and 2000 warm-up steps with a linear schedule. More details are included in Appendix A.
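For concreteness, below is a minimal sketch of this setup using Hugging Face Transformers (the library used in the implementation, Appendix A); the vocabulary size, sequence length, and total step count are placeholders rather than values taken from the paper.

```python
# Minimal sketch of the model/optimizer configuration described above:
# an 8-layer, 768-dim, 12-head GPT-2-style decoder trained with AdamW
# (lr 1e-4, weight decay 0.1) and 2000 linear warm-up steps.
import torch
from transformers import GPT2Config, GPT2LMHeadModel, get_linear_schedule_with_warmup

VOCAB_SIZE, MAX_LEN, TOTAL_STEPS = 2500, 16, 300_000  # placeholders, not paper values

config = GPT2Config(
    vocab_size=VOCAB_SIZE,
    n_positions=MAX_LEN,
    n_layer=8,    # number of transformer blocks
    n_embd=768,   # hidden dimension
    n_head=12,    # attention heads
)
model = GPT2LMHeadModel(config)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=2000, num_training_steps=TOTAL_STEPS
)
```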
3 Composition—Delayed Generalization without Systematicity
We begin our investigation with composition, where a model needs to "chain" different pieces of facts, e.g., "Barack's wife is Michelle" and "Michelle is born in 1964", to successfully complete a compositional sentence, e.g., "Barack's wife is born in [1964]". Prior work has extensively studied whether transformer-based language models can perform implicit composition, and negative results are consistently reported [48,1,71]. Specifically, there exists a "compositionality gap" [48], i.e., the frequency at which the model knows all the underlying basic facts but fails to compose them, which is considerable across different LLMs and does not decrease as models scale. Are transformers doomed to fail on this kind of reasoning, and if so, why?
3.1 Setup
We focus on two-hop composition in this work. For atomic facts, we generate a random knowledge graph G consisting of |E| entities and |R| = 200 relations, where each entity (as the subject) has 20 outgoing edges that connect through a random relation to another random entity (as the object). The atomic facts are then the edges, i.e., (subject, relation, object) triplets in G, which we partition disjointly into atomic_ID and atomic_OOD (95%:5%). The rule of (two-hop) composition is
∀h, b, t ∈ E, ∀r1, r2 ∈ R: (h, r1, b) ∧ (b, r2, t) ⇒ (h, r1, r2, t),   (1)
which is used to deduce the ID and OOD inferred facts from atomic_ID and atomic_OOD, respectively. For convenience, in the above rule, we will call h the head entity, b the bridge entity, and t the tail entity. For both atomic and inferred facts, training/testing is done by having the model predict the final tail entity. We assign a unique token to each relation/entity by default, and also find that the results are robust to different tokenizations (details in Appendix C).
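As an illustration, here is a minimal sketch of this data-generating process, assuming integer IDs for entities and relations (it mirrors the description above but is not the released code):

```python
# Sample a random knowledge graph, split the atomic facts into ID/OOD (95%:5%),
# and deduce two-hop inferred facts via rule (1): (h, r1, b) ∧ (b, r2, t) ⇒ (h, r1, r2, t).
import random
from collections import defaultdict

NUM_ENTITIES, NUM_RELATIONS, OUT_DEGREE = 2000, 200, 20

# Atomic facts: (subject, relation, object) triplets; each subject has 20 outgoing edges.
atomic = [(h, random.randrange(NUM_RELATIONS), random.randrange(NUM_ENTITIES))
          for h in range(NUM_ENTITIES) for _ in range(OUT_DEGREE)]

random.shuffle(atomic)
split = int(0.95 * len(atomic))
atomic_id, atomic_ood = atomic[:split], atomic[split:]

def infer_two_hop(facts):
    """Deduce all (h, r1, r2, t) facts whose two hops are both in `facts`."""
    by_head = defaultdict(list)
    for h, r, t in facts:
        by_head[h].append((r, t))
    return [(h, r1, r2, t) for h, r1, b in facts for r2, t in by_head[b]]

inferred_id = infer_two_hop(atomic_id)    # a random portion becomes train_inferred_ID,
                                          # the rest is test_inferred_ID
inferred_ood = infer_two_hop(atomic_ood)  # test_inferred_OOD (never trained on)
```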
We study the influence of the following two aspects on the model's learned behaviors:
• Ratio between inferred and atomic facts. By varying the amount of inferred facts included in training, we study the effect of the ratio ϕ = |train_inferred_ID| / |atomic_ID| on the model.
• Training data size. We study the impact of the training data size by varying |E|, the total number of entities, while controlling the ratio ϕ. Note that the size of the training data (both atomic/inferred facts) scales linearly with |E|.
3.2 Results
Grokking observed in ID generalization but not in OOD generalization. Figure 1(left) shows the model's accuracy on the train and test facts throughout optimization, with |E| = 2000 and ϕ = 7.2. We find that the model can generalize to ID test examples, but high performance is only achieved through extended training far beyond overfitting, a phenomenon called grokking [47]. Specifically, the training performance saturates (over 99% accuracy on both atomic and inferred facts) at around 14K optimization steps, before which the highest ID generalization accuracy is merely 9.2%. However, generalization keeps improving by simply training for longer, and approaches almost perfect accuracy after extended optimization lasting around 50 times the steps taken to fit the training data. On the other hand, OOD generalization is never observed. We extend the training to 2 million optimization steps, and there is still no sign of OOD generalization.
Inferred/atomic ratio ϕ correlates with generalization speed. Figure 2(a) shows the ID accuracy across different ϕ. We omit the other splits since for all settings, the training performance saturates quickly and the OOD accuracy remains at zero as before.⁴ It can be seen that the ratio ϕ strongly correlates with the speed of generalization. A very large ratio can push generalization to improve at a similar pace as the model fits the training data, reducing the need for extended training.⁵
[Figure 2: accuracy vs. optimization step (log scale) / epoch. (a) Effect of the inferred/atomic ratio ϕ ∈ {3.6, 5.4, 7.2, 9.0, 12.6, 18.0}. (b) Effect of changing |E| ∈ {2K, 5K, 10K} with ϕ = 9.0.]
Figure 2: The speed of grokking on test_inferred_ID (a) correlates with the ratio between inferred and atomic facts, and (b) is not influenced by the size of the training data.
Training data distribution, instead of training data size, qualitatively influences generalization behavior. When ϕ increases and |E| holds constant, the size of the training data also gets larger. Prior studies hypothesize that training data size plays a central role in order for grokking to happen. In particular, previous work connects grokking with the notion of critical data size (CDS) [33,61,78,21], where it is hypothesized that CDS marks the shift from memorization to generalization (via grokking), and that the speed of generalization improves as the training data further scales. However, results from our controlled experiments seem to contradict such a hypothesis. Figure 2(b) shows the results of varying |E| with a fixed ϕ = 9.0, where we change the horizontal axis from optimization step to epoch for better visualization.⁶ When fixing the ratio ϕ, the training data size does not qualitatively affect the model's generalization. Specifically, scaling the data affects neither the relative speed of ID generalization and training improvement (as seen by the rather constant "gap" between the train_inferred_ID and test_inferred_ID curves), nor the systematicity level (OOD performance stays zero). We also run the experiments across different ϕ and find the results to be consistent. This suggests that critical data "distribution", not size, may be the actual deciding factor behind grokking and generalization. In addition, we find that scaling up the model size also does not qualitatively change the generalization behaviors observed here (Appendix B); the main pattern is that larger models converge in fewer optimization steps, in line with prior findings [60,28].
Summary. We have shown that transformers are capable of acquiring the rule of composition through grokking, with controlled experiments suggesting the crucial role of data distribution (e.g., the inferred/atomic ratio ϕ) in characterizing the model's generalization.
⁴ The training performances of all settings saturate within 25K steps, where a larger ϕ takes more steps.
⁵ When ϕ = 18.0, the model achieves 96.7% accuracy before training performance saturates.
⁶ The optimization steps for each epoch scale linearly with the training size since we use a fixed batch size.

However, important questions still remain: what happens during grokking, why does it happen, and why do transformers struggle with OOD examples? Answering these questions requires a deeper understanding of (the changes in) the model's inner workings, which we investigate next.
3.3 Analyzing the inner workings of the model throughout grokking
We analyze the internal mechanisms within the model via a combination of two prevalent approaches: logit lens and causal tracing. We apply our analysis to the setting with |E| = 2000, ϕ = 9.0 on 300 random examples from train_inferred_ID.



[Figure 3: schematic of the normal run, the perturbed run, and the intervention, with logit lens applied to individual hidden states.]
Figure 3: Illustration of our circuit analysis approach (on the composition task). We use logit lens to interpret individual states, and use causal tracing to measure the strength of connections between states. Details are in the main text.
Logit lens. We interpret individual hidden states via logit lens [40,15,71], where the activation is
converted into a set of logits for each vocabulary token by multiplying with the output embedding
matrix. We follow the recent practice [71] where the activation first goes through the transformer’s
final normalization layer before multiplying with the output embedding (Figure 3, top right).
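A minimal sketch of this projection for a GPT-2-style model (attribute names follow the Hugging Face GPT-2 implementation; the helper names are ours):

```python
import torch

def logit_lens(model, hidden_state):
    """Project a (d_model,) hidden state onto the vocabulary via the output embedding."""
    normed = model.transformer.ln_f(hidden_state)   # final normalization layer first
    return normed @ model.lm_head.weight.T          # one logit per vocabulary token

def reciprocal_rank(logits, token_id):
    """1/rank of `token_id` under the lens (averaged over examples to get MRR)."""
    rank = int((logits > logits[token_id]).sum().item()) + 1
    return 1.0 / rank
```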
Causal tracing. The transformer could be viewed as a causal graph [46] that propagates information from the input to the output through a grid of intermediate states, which allows for a variety of causal analyses on its internal computations [63,35,19,65,12]. For convenience, we will refer to a hidden state by S[i, a], where i is the layer index and a is the input token at the same position as the state (one of {h, r1, r2}). We illustrate our method in Figure 3, where the hidden state of interest is S[4, r1] and the target is the model's final prediction state S[8, r2]. There are in total three steps:
1. The normal run records the model's hidden state activations on a regular input (h, r1, r2). Note that since the model maintains perfect training performance throughout grokking, the final prediction is always the ground-truth tail entity t.⁷
2. In the perturbed run, a slightly perturbed input that changes the prediction is fed to the model, and the hidden state activations are again recorded. For the perturbation, prior work has explored adding noise to the input [35] and replacing key tokens with semantically close ones [63,12]. We adopt token replacement, which avoids unnecessary distribution shifts [74]. Specifically, for the hidden state of interest, we replace the input token at the same position as the state with a random alternative of the same type (e.g., r1 → r1′) that leads to a different target prediction (e.g., t → t′).
3. Intervention. During the normal run, we intervene on the state of interest by replacing its activation with its activation in the perturbed run. We then run the remaining computations and measure whether the target state (top-1 token through logit lens) is altered. The ratio of such alterations (between 0 and 1) quantitatively characterizes the causal strength between the state of interest and the target.
⁷ For convenience, when we refer to a state as a token, we mean the top token of the state via logit lens.
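A minimal sketch of this procedure, assuming the Hugging Face GPT-2 module layout and a batch of one example (helper names are ours). For simplicity the target here is the final prediction state; the same patching logic applies when the target is an intermediate state (e.g., S[5, e2] in Appendix D.2).

```python
import torch

def run_with_patch(model, input_ids, layer, pos, patch=None):
    """Run the model; cache (and optionally overwrite) the hidden state at (layer, pos)."""
    cache = {}

    def hook(module, inputs, output):
        hidden = output[0]                       # (batch, seq_len, d_model)
        if patch is not None:
            hidden[:, pos] = patch               # step 3: splice in the perturbed activation
        cache["state"] = hidden[:, pos].detach().clone()
        return (hidden,) + output[1:]

    handle = model.transformer.h[layer].register_forward_hook(hook)
    with torch.no_grad():
        logits = model(input_ids).logits[:, -1]  # prediction logits at the final position
    handle.remove()
    return logits, cache["state"]

def top1_altered(model, normal_ids, perturbed_ids, layer, pos):
    """1 if patching (layer, pos) with the perturbed-run activation flips the top-1 prediction."""
    normal_logits, _ = run_with_patch(model, normal_ids, layer, pos)       # step 1: normal run
    _, perturbed_state = run_with_patch(model, perturbed_ids, layer, pos)  # step 2: perturbed run
    patched_logits, _ = run_with_patch(model, normal_ids, layer, pos,
                                       patch=perturbed_state)              # step 3: intervention
    return int(patched_logits.argmax(-1).item() != normal_logits.argmax(-1).item())

# The causal strength is the fraction of examples (here, 300) for which the
# top-1 prediction is altered.
```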
[Figure 4: (a) circuit diagram over input positions h, r1, r2 and layers 0, 5, 8; (b) heatmap of changes in causal strength over layers 1-7; (c) MRR(b) at S[5, r1], MRR(r2) at S[5, r2], and ID/OOD test accuracy vs. optimization step, with the start of grokking marked.]
Figure 4: The (evolution of the) generalizing circuit for composition. (a) The generalizing circuit. (b) The change in causal strengths during grokking, where the target is the prediction state. (c) Mean reciprocal rank (via logit lens) of the bridge entity b at S[5, r1] and of the second relation r2 at S[5, r2].
The generalizing circuit. We run a set of causal tracing and logit lens experiments across different model checkpoints throughout training. The discovered generalizing circuit (i.e., the causal computational pathways after grokking) is illustrated in Figure 4(a). Specifically, we locate a highly interpretable causal graph consisting of states in layers 0, 5, and 8, where we have pruned away the weak nodes/connections (details in Appendix D). Layer 5 splits the circuit into lower and upper layers, where 1) the lower layers retrieve the first-hop fact (h, r1, b) from the input h, r1, store the bridge entity b in S[5, r1], and "delay" the processing of r2 to S[5, r2]; 2) the upper layers retrieve the second-hop fact (b, r2, t) from S[5, r1] and S[5, r2], and store the tail t in the output state S[8, r2].
What happens during grokking? To understand the underlying mechanism behind grokking, we track the strengths of causal connections and the results from logit lens across different model checkpoints during grokking (the "start" of grokking is the point when training performance saturates). We observe two notable amplifications (within the identified graph) that happen during grokking. The first is the causal connection between S[5, r1] and the final prediction t, which is very weak before grokking (Appendix D) and grows significantly during grokking (Figure 4(b)). The second is the r2 component of S[5, r2] via logit lens, for which we plot its mean reciprocal rank (MRR) (Figure 4(c)). Additionally, we find that the state S[5, r1] has a large component of the bridge entity b throughout grokking (Figure 4(c)). These observations strongly suggest that the model is gradually forming the second hop in the upper layers (5-8) during grokking. This also indicates that, before grokking, the model is very likely mostly memorizing the examples in train_inferred_ID by directly associating (h, r1, r2) with t, without going through the first hop.
Why does grokking happen? These observations suggest a natural explanation of why grokking happens through the lens of circuit efficiency [61]. Specifically, as illustrated above, there exist both a memorizing circuit C_mem and a generalizing circuit C_gen that can fit the training data. While C_mem is learned first (which causes training performance to saturate quickly), C_gen is relatively more efficient, in the sense that it can fit the data with a lower complexity. To see this, we can compare the amount of facts C_mem and C_gen need to store (denoted as N_mem and N_gen) as a proxy for their complexity.⁸ C_mem stores both atomic facts and inferred facts in the weights. C_gen (Figure 4(a)) stores the atomic facts in the lower layers, and another copy of the atomic facts that appear as the second hop in the inferred facts in the upper layers. As the inferred/atomic ratio ϕ increases, N_mem increases rapidly while N_gen increases slowly and is always bounded by two times the total amount of atomic facts; hence, the relative efficiency of C_gen increases. In the long run, the model will be incentivized to transition from C_mem to C_gen due to the implicit bias of the optimization [53] and explicit regularization such as weight decay, which prefer more efficient circuits, and the transition happens faster as ϕ increases. This also explains why the training data size does not affect the speed of grokking, since solely increasing the size does not change the relative efficiency of C_mem and C_gen. The explanation also implies that a larger regularization factor should accelerate grokking (and vice versa), which we confirm by varying the degree of weight decay (Appendix E.1).
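As a rough counting version of this argument (a paraphrase of the proxy above, not a formal bound), writing ϕ = |train_inferred_ID| / |atomic_ID|:

```latex
N_{\mathrm{mem}} \;\approx\; |\text{atomic}_{\text{ID}}| + |\text{train\_inferred}_{\text{ID}}|
                 \;=\; (1+\phi)\,|\text{atomic}_{\text{ID}}|,
\qquad
N_{\mathrm{gen}} \;\le\; 2\,|\text{atomic}_{\text{ID}}|,
\qquad
\frac{N_{\mathrm{gen}}}{N_{\mathrm{mem}}} \;\lesssim\; \frac{2}{1+\phi},
```

so the relative complexity of C_gen shrinks as ϕ grows, while rescaling |E| with ϕ fixed leaves the ratio, and hence the qualitative behavior, unchanged.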
⁸ While the circuits also consist of other components, they pale in comparison as the number of facts scales.

Explaining and mitigating the deficiency in OOD generalization. The configuration of C_gen also has another important implication: while the model does acquire compositionality through grokking, it does not have any incentive to store atomic facts in the upper layers that do not appear as the second hop during training. This explains why the model fails in the OOD setting where facts are only observed in the atomic form, not in the compositional form: the OOD atomic facts are simply not stored in the upper layers when queried during the second hop.⁹ This issue originates from the non-recurrent design of the transformer architecture, which forbids memory sharing across different layers. Our study provides a mechanistic understanding of existing findings that transformers seem to reduce compositional reasoning to linearized pattern matching [10], and also provides a potential explanation for recent observations that LLMs show substantial positive evidence of performing the first hop of reasoning but not the second [71]. Our findings imply that proper cross-layer memory-sharing mechanisms for transformers, such as memory augmentation [54,17] and explicit recurrence [7,22,57], are needed to improve their generalization. We also show that a variant of the parameter-sharing scheme in Universal Transformer [7] can improve OOD generalization in composition (Appendix E.2).
4 Comparison—Systematic Generalization via Parallel Circuit
We have just shown that the vanilla transformer fails to achieve OOD generalization for composition, but is the vanilla transformer generally incapable of acquiring systematic implicit reasoning skills? We show that for comparison, a task where SoTA LLMs such as GPT-4 also struggle [1], the vanilla transformer does have the capability to acquire systematic generalization, again through grokking. On the surface, it seems that the comparison task is no different from the composition task: both require retrieving and reasoning over two pieces of facts. However, as it turns out through our analysis, the comparison task admits a "parallel circuit" that is learned by the transformer during grokking, which allows atomic facts to be stored and retrieved in the same region and enables systematicity to happen.
Setup. The comparison task involves comparing the attribute values of entities. We assume there are |E| = 1000 entities, |A| = 20 attributes and |V| = 20 ordinal values for the attributes. Each attribute a ∈ A has a label space {a<, a=, a>}, a set of relations specifying its comparative forms. For example, the attribute age would have a<, a=, a> be younger, contemporary, older, respectively. The atomic facts are (entity, attribute, value) triplets, where we assign a random value v ∈ V to each (e, a) ∈ E × A. Again, we randomly partition the atomic facts into atomic_ID and atomic_OOD (90%:10%). The rules of comparison are:
∀e1, e2 ∈ E, ∀a ∈ A, ∀v1, v2 ∈ V: (e1, a, v1) ∧ (e2, a, v2) ∧ v1 < v2 ⇒ (a, e1, e2, a<),
∀e1, e2 ∈ E, ∀a ∈ A, ∀v1, v2 ∈ V: (e1, a, v1) ∧ (e2, a, v2) ∧ v1 = v2 ⇒ (a, e1, e2, a=),
∀e1, e2 ∈ E, ∀a ∈ A, ∀v1, v2 ∈ V: (e1, a, v1) ∧ (e2, a, v2) ∧ v1 > v2 ⇒ (a, e1, e2, a>).   (2)
Take the attribute age as an example: the first rule means that if the age of e1 is smaller than the age of e2, then we can infer "In terms of age, the relation between e1 and e2 is younger". Each entity/attribute/value/label is assigned a unique token, and training/testing is done by having the model predict the last token (the attribute value for atomic facts; the comparative relation for inferred facts).
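A minimal sketch of this data-generating process, assuming integer IDs for entities, attributes, and values (not the released code):

```python
# Assign a random value to every (entity, attribute) pair and apply the rules in
# Eq. (2) to deduce the comparative label for a pair of entities under an attribute.
import random

NUM_ENTITIES, NUM_ATTRIBUTES, NUM_VALUES = 1000, 20, 20

# Atomic facts: (entity, attribute, value) triplets, stored as a lookup table.
value = {(e, a): random.randrange(NUM_VALUES)
         for e in range(NUM_ENTITIES) for a in range(NUM_ATTRIBUTES)}

def compare(a, e1, e2):
    """Deduce the inferred fact (a, e1, e2, label) from the two atomic facts."""
    v1, v2 = value[(e1, a)], value[(e2, a)]
    if v1 < v2:
        return (a, e1, e2, "a<")  # e.g. "younger" for the attribute age
    if v1 > v2:
        return (a, e1, e2, "a>")  # e.g. "older"
    return (a, e1, e2, "a=")      # e.g. "contemporary"
```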
Results & analysis. Figure 1(right) shows the results for ϕ = 7.2, and we include more results in Appendix E.3. It can be seen that 1) the model again acquires robust generalization only through grokking; 2) surprisingly, the model also achieves systematicity in generalization, different from the case of composition.
Analyzing the model's internals as in §3.3 (details in Appendix D), we find the generalizing circuit for comparison illustrated in Figure 5(a). On a separate stream, the model prepares the label space {a<, a=, a>} from a and stores it in S[7, a]. In the lower layers (0-5), the model retrieves the two atomic facts and stores the attribute values v1 and v2 at S[5, e1] and S[5, e2]. Then, the upper layers (5-8) compare v1 and v2 and fetch the label from S[7, a] based on the comparison result. Importantly, there is a major difference compared with the circuit for composition: the two atomic facts are retrieved in parallel, which suggests that the atomic facts are stored solely in the lower layers, without separate copies across different regions as in the circuit for composition. This explains why systematicity can happen: OOD facts are now stored and accessed in the same way as ID facts.
⁹ On the other hand, over 97% (100% for ϕ ≥ 7.2) of ID atomic facts do appear as the second hop in training.

Tracking the changes in the model throughout grokking, we observe significantly strengthened causal connections from S[7, a] and S[5, e1] to the final prediction (Figure 5(b)). We also find that throughout grokking, S[7, a] always encodes the label space, and S[5, e1], S[5, e2] gradually encode the two attribute values (Figure 5(c)). This confirms that a similar transition from C_mem to C_gen happens during grokking.
[Figure 5: (a) circuit diagram over input positions a, e1, e2; (b) heatmap of changes in causal strength; (c) MRR of the attribute values vs. optimization step.]
Figure 5: The (evolution of the) generalizing circuit for comparison. (a) The generalizing circuit. (b) The change in causal strengths during grokking, where the target is the prediction state. (c) Mean reciprocal rank (via logit lens) of the two attribute values (v1, v2) at S[5, e1] and S[5, e2].
The findings here showcase the transformer's ability to learn parallel solutions to seemingly sequential problems, akin to the findings in Liu et al. [30], where it is shown that transformers can learn "shortcuts" to automata. The difference in the acquired generalization across the two tasks that we study also emphasizes the need for controlled and mechanistic studies of the transformer's reasoning before making general claims about its limitations.
5 The Power of Parametric Memory for Complex Reasoning
At a high level, our study so far paves the way towards better understanding and improving the transformer's reasoning with parametric representations of knowledge and rules. But why is parametric memory practically important? Can we not simply enhance LLMs with non-parametric memory, e.g., by using their long-context modes and/or doing explicit retrieval, to solve the tasks at hand?
We believe parametric memory has a unique capability to perform deep compression and integration of information for complex reasoning. To showcase the potential of parametric memory for complex reasoning, we create a difficult reasoning task with a large search space, and show that 1) it is far out of reach even for current SoTA models (e.g., GPT-4-Turbo [43] and Gemini-1.5-Pro [16]) based on non-parametric memory; 2) a fully grokked transformer can solve the task with near-perfect accuracy.
Our task is a variation of the comparison task above where we use a simple way to massively expand
the search space, based on an additional set of rules that are already contained within the task itself,
namely, the (anti-)symmetry and transitivity of comparison:
∀e1, e2 ∈ E, ∀a ∈ A: (a, e1, e2, a</a=/a>) ⇒ (a, e2, e1, a>/a=/a<),
∀e1, e2, e3 ∈ E, ∀a ∈ A, ∀y ∈ {a<, a=, a>}: (a, e1, e2, y) ∧ (a, e2, e3, y) ⇒ (a, e1, e3, y).   (3)
In the original setting (§4) of the task, for the OOD test set, one can simply retrieve the two OOD facts and compare the attribute values, which requires no further search. We change the task setting as follows. For each attribute, 1) we do not add the OOD atomic facts into training, and 2) we add a random portion of the comparisons between ID entities and OOD entities into training. We test the models on queries consisting of derivable (from all training facts) comparisons between OOD entities where any possible proof would involve rules from both Eq. (2) and Eq. (3). Consequently, answering a test query would require the model to successfully locate two ID bridge entities which can connect the two query entities into a proof (Figure 6). We select a balanced (by a<, a=, a>) subset from these queries for evaluation. More details are included in Appendix F.
The difficulty of such a task is two-fold. First, the search space is large. For example, on average, each query entity connects with more than 50 facts, and each bridge entity in the ground truth proof [...]

[...] sure that the results are robust to different setups closer to practice through additional experiments (Appendix B, C, E). Still, there remains some distance between our settings and those in practice. However, we believe that it is far more important to build solid understanding, even at some distance from practice, than to draw conclusions or make claims that are closer to practice but are questionable due to insufficient control over data and evaluations.
Acknowledgement
The authors would like to thank colleagues from the OSU NLP group and CMU NeuLab for
their thoughtful comments. This research was supported in part by NSF CAREER #1942980,
ARL W911NF2220144, NSF OAC 2112606, and Ohio Supercomputer Center [5]. The views and
conclusions contained herein are those of the authors and should not be interpreted as representing
the official policies, either expressed or implied, of the U.S. government. The U.S. Government
is authorized to reproduce and distribute reprints for Government purposes notwithstanding any
copyright notice herein.
References
[1]
Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.2, knowledge manipula-
tion. InarXiv preprint: abs/2309.14402, 2023.
[2]
Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney,
Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the
tuned lens. InarXiv preprint: abs/2303.08112, 2023.
[3]
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie
Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark,
et al. Improving language models by retrieving from trillions of tokens. InInternational
conference on machine learning, pages 2206–2240. PMLR, 2022.
[4]
Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri
Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. In
arXiv preprint: abs/2401.10774, 2024.
[5] Ohio Supercomputer Center. Ohio Supercomputer Center, 1987. http://osc.edu/ark:/19495/f5s1ph73.
[6]
Xander Davies, Lauro Langosco, and David Krueger. Unifying grokking and double descent.
InNeurIPS ML Safety Workshop, 2022.
[7]
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Uni-
versal transformers. InInternational Conference on Learning Representations, 2019.
[8]
Yuntian Deng, Yejin Choi, and Stuart Shieber. From explicit cot to implicit cot: Learning to
internalize cot step by step. InarXiv preprint: abs/2405.14838, 2024.
[9]
Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, and
Stuart Shieber. Implicit chain of thought reasoning via knowledge distillation. InarXiv preprint:
abs/2311.01460, 2023.
[10]
Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin,
Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya
Sanyal, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi. Faith and fate: Limits
of transformers on compositionality. InThirty-seventh Conference on Neural Information
Processing Systems, 2023.
[11]
Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards
revealing the mystery behind chain of thought: A theoretical perspective. InThirty-seventh
Conference on Neural Information Processing Systems, 2023.
[12]
Jiahai Feng and Jacob Steinhardt. How do language models bind entities in context? InThe
Twelfth International Conference on Learning Representations, 2024.

[13]
Javier Ferrando, Gabriele Sarti, Arianna Bisazza, and Marta R. Costa-jussà. A primer on the
inner workings of transformer-based language models. InarXiv preprint: abs/2405.00208,
2024.
[14]
Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, and Hao
Peng. Data engineering for scaling language models to 128k context, 2024.
[15]
Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. Transformer feed-forward layers
build predictions by promoting concepts in the vocabulary space. InProceedings of the 2022
Conference on Empirical Methods in Natural Language Processing, pages 30–45, Abu Dhabi,
United Arab Emirates, December 2022. Association for Computational Linguistics.
[16]
Google. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.
2024.
[17]
Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-
Barwi´nska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou,
et al. Hybrid computing using a neural network with dynamic external memory.Nature,
538(7626):471–476, 2016.
[18]
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval
augmented language model pre-training. InInternational conference on machine learning,
pages 3929–3938. PMLR, 2020.
[19]
Michael Hanna, Ollie Liu, and Alexandre Variengien. How does GPT-2 compute greater-
than?: Interpreting mathematical abilities in a pre-trained language model. InThirty-seventh
Conference on Neural Information Processing Systems, 2023.
[20]
Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A
survey. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Findings of the
Association for Computational Linguistics: ACL 2023, pages 1049–1065, Toronto, Canada, July
2023. Association for Computational Linguistics.
[21]
Yufei Huang, Shengding Hu, Xu Han, Zhiyuan Liu, and Maosong Sun. Unified view of
grokking, double descent and emergent abilities: A perspective from circuits competition. In
arXiv preprint: abs/2402.15175, 2024.
[22]
DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, and Behnam Neyshabur. Block-
recurrent transformers. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun
Cho, editors,Advances in Neural Information Processing Systems, 2022.
[23]
Nora Kassner, Benno Krojer, and Hinrich Schütze. Are pretrained language models symbolic
reasoners over knowledge? InProceedings of the 24th Conference on Computational Natural
Language Learning, pages 552–564, Online, November 2020. Association for Computational
Linguistics.
[24]
Nora Kassner, Benno Krojer, and Hinrich Schütze. Are pretrained language models symbolic
reasoners over knowledge? InProceedings of the 24th Conference on Computational Natural
Language Learning, pages 552–564, Online, November 2020. Association for Computational
Linguistics.
[25]
Brenden Lake and Marco Baroni. Generalization without systematicity: On the compositional
skills of sequence-to-sequence recurrent networks. InInternational conference on machine
learning, pages 2873–2882. PMLR, 2018.
[26]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman
Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented
generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing
Systems, 33:9459–9474, 2020.
[27]
Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers
to solve inherently serial problems. InThe Twelfth International Conference on Learning
Representations, 2024.

[28]
Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, and Joey Gon-
zalez. Train big, then compress: Rethinking model size for efficient training and inference of
transformers. In Hal Daumé III and Aarti Singh, editors,Proceedings of the 37th International
Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research,
pages 5958–5968. PMLR, 13–18 Jul 2020.
[29]
Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi
Chandu, Chandra Bhagavatula, and Yejin Choi. The unlocking spell on base LLMs: Rethinking
alignment via in-context learning. InThe Twelfth International Conference on Learning
Representations, 2024.
[30]
Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. Trans-
formers learn shortcuts to automata. InThe Eleventh International Conference on Learning
Representations, 2023.
[31]
Jiacheng Liu, Ramakanth Pasunuru, Hannaneh Hajishirzi, Yejin Choi, and Asli Celikyilmaz.
Crystal: Introspective reasoners reinforced with self-feedback. In Houda Bouamor, Juan
Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in
Natural Language Processing, pages 11557–11572, Singapore, December 2023. Association
for Computational Linguistics.
[32]
Ziming Liu, Ouail Kitouni, Niklas Nolte, Eric J Michaud, Max Tegmark, and Mike Williams.
Towards understanding grokking: An effective theory of representation learning. In Alice H.
Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural
Information Processing Systems, 2022.
[33]
Ziming Liu, Eric J Michaud, and Max Tegmark. Omnigrok: Grokking beyond algorithmic data.
InThe Eleventh International Conference on Learning Representations, 2023.
[34]
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational
Conference on Learning Representations, 2019.
[35]
Kevin Meng, David Bau, Alex J Andonian, and Yonatan Belinkov. Locating and editing factual
associations in GPT. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho,
editors,Advances in Neural Information Processing Systems, 2022.
[36]
William Merrill, Nikolaos Tsilivis, and Aman Shukla. A tale of two circuits: Grokking as
competition of sparse and dense subnetworks. InICLR 2023 Workshop on Mathematical and
Empirical Understanding of Foundation Models, 2023.
[37]
Sewon Min, Weijia Shi, Mike Lewis, Xilun Chen, Wen-tau Yih, Hannaneh Hajishirzi, and
Luke Zettlemoyer. Nonparametric masked language modeling. In Anna Rogers, Jordan Boyd-
Graber, and Naoaki Okazaki, editors,Findings of the Association for Computational Linguistics:
ACL 2023, pages 2097–2118, Toronto, Canada, July 2023. Association for Computational
Linguistics.
[38]
Shikhar Murty, Pratyusha Sharma, Jacob Andreas, and Christopher Manning. Grokking of
hierarchical structure in vanilla transformers. In Anna Rogers, Jordan Boyd-Graber, and Naoaki
Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers), pages 439–448, Toronto, Canada, July 2023. Association
for Computational Linguistics.
[39]
Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress
measures for grokking via mechanistic interpretability. InThe Eleventh International Conference
on Learning Representations, 2023.
[40]
nostalgebraist. interpreting GPT: the logit lens. LessWrong, 2020.
[41]
Pascal Jr. Tikeng Notsawo, Hattie Zhou, Mohammad Pezeshki, Irina Rish, and Guillaume
Dumas. Predicting grokking long before it happens: A look into the loss landscape of models
which grok. InarXiv preprint: abs/2306.13253, 2023.
[42] OpenAI. GPT-4 technical report. In arXiv preprint: abs/2303.08774, 2023.

[43]
OpenAI. New models and developer products announced at DevDay, 2023.
[44]
Koyena Pal, Jiuding Sun, Andrew Yuan, Byron Wallace, and David Bau. Future lens: Antici-
pating subsequent tokens from a single hidden state. In Jing Jiang, David Reitter, and Shumin
Deng, editors,Proceedings of the 27th Conference on Computational Natural Language Learn-
ing (CoNLL), pages 548–560, Singapore, December 2023. Association for Computational
Linguistics.
[45]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative
style, high-performance deep learning library.Advances in neural information processing
systems, 32, 2019.
[46] Judea Pearl. Causality. Cambridge University Press, 2009.
[47]
Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Gen-
eralization beyond overfitting on small algorithmic datasets. InarXiv preprint: abs/2201.02177,
2022.
[48]
Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah Smith, and Mike Lewis. Measuring
and narrowing the compositionality gap in language models. In Houda Bouamor, Juan Pino, and
Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023,
pages 5687–5711, Singapore, December 2023. Association for Computational Linguistics.
[49]
Ben Prystawski, Michael Y. Li, and Noah Goodman. Why think step by step? reasoning
emerges from the locality of experience. InThirty-seventh Conference on Neural Information
Processing Systems, 2023.
[50]
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al.
Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019.
[51]
Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song,
John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language
models: Methods, analysis & insights from training gopher.arXiv preprint arXiv:2112.11446,
2021.
[52]
Anna Rogers, Olga Kovaleva, and Anna Rumshisky. A primer in BERTology: What we
know about how BERT works.Transactions of the Association for Computational Linguistics,
8:842–866, 2020.
[53]
Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on
separable data. InInternational Conference on Learning Representations, 2018.
[54]
Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks.Advances
in neural information processing systems, 28, 2015.
[55]
Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, and Denny Zhou. Recitation-augmented
language models. InThe Eleventh International Conference on Learning Representations, 2023.
[56]
Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. oLMpics - on what language model pre-training captures. Transactions of the Association for Computational Linguistics, 8:743-758, 2020.
[57]
Shawn Tan, Yikang Shen, Zhenfang Chen, Aaron Courville, and Chuang Gan. Sparse universal
transformer. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023
Conference on Empirical Methods in Natural Language Processing, pages 169–179, Singapore,
December 2023. Association for Computational Linguistics.
[58]
Yi Tay, Vinh Q. Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai
Hui, Zhe Zhao, Jai Gupta, Tal Schuster, William W. Cohen, and Donald Metzler. Transformer
memory as a differentiable search index. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave,
and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems, 2022.

[59]
Vimal Thilak, Etai Littwin, Shuangfei Zhai, Omid Saremi, Roni Paiss, and Joshua Susskind. The
slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon.
InarXiv preprint: abs/2206.04817, 2022.
[60]
Kushal Tirumala, Aram Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. Memorization
without overfitting: Analyzing the training dynamics of large language models.Advances in
Neural Information Processing Systems, 35:38274–38290, 2022.
[61]
Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, and Ramana Kumar. Explaining
grokking through circuit efficiency. InarXiv preprint: abs/2309.02390, 2023.
[62]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information
processing systems, 30, 2017.
[63]
Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer,
and Stuart Shieber. Investigating gender bias in language models using causal mediation
analysis.Advances in neural information processing systems, 33:12388–12401, 2020.
[64]
Boshi Wang, Xiang Deng, and Huan Sun. Iteratively prompt pre-trained language models for
chain of thought. InProceedings of the 2022 Conference on Empirical Methods in Natural
Language Processing, pages 2714–2730, Abu Dhabi, United Arab Emirates, December 2022.
Association for Computational Linguistics.
[65]
Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt.
Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. InThe
Eleventh International Conference on Learning Representations, 2023.
[66]
Xinyi Wang, Alfonso Amayuelas, Kexun Zhang, Liangming Pan, Wenhu Chen, and
William Yang Wang. Understanding the reasoning ability of language models from the perspec-
tive of reasoning paths aggregation. InarXiv preprint: abs/2402.03268, 2024.
[67]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi,
Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language
models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,
Advances in Neural Information Processing Systems, 2022.
[68]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony
Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer,
Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain
Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art
natural language processing. InProceedings of the 2020 Conference on Empirical Methods in
Natural Language Processing: System Demonstrations, pages 38–45, Online, October 2020.
Association for Computational Linguistics.
[69]
Wilson Wu, John X. Morris, and Lionel Levine. Do language models plan ahead for future
tokens? InarXiv preprint: abs/2404.00859, 2024.
[70]
Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis
Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang,
Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov,
Mike Lewis, Sinong Wang, and Hao Ma. Effective long-context scaling of foundation models,
2023.
[71]
Sohee Yang, Elena Gribovskaya, Nora Kassner, Mor Geva, and Sebastian Riedel. Do large
language models latently perform multi-hop reasoning? InarXiv preprint: abs/2402.16837,
2024.
[72]
Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D. Goodman.
Quiet-star: Language models can teach themselves to think before speaking. InarXiv preprint:
abs/2403.09629, 2024.

[73]
Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STar: Bootstrapping reasoning with
reasoning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,
Advances in Neural Information Processing Systems, 2022.
[74]
Fred Zhang and Neel Nanda. Towards best practices of activation patching in language models:
Metrics and methods. InarXiv preprint: abs/2309.16042, 2024.
[75]
Zexuan Zhong, Tao Lei, and Danqi Chen. Training language models with memory augmen-
tation. InProceedings of the 2022 Conference on Empirical Methods in Natural Language
Processing, pages 5657–5673, Abu Dhabi, United Arab Emirates, December 2022. Association
for Computational Linguistics.
[76]
Zexuan Zhong, Zhengxuan Wu, Christopher Manning, Christopher Potts, and Danqi Chen.
MQuAKE: Assessing knowledge editing in language models via multi-hop questions. In Houda
Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical
Methods in Natural Language Processing, pages 15686–15702, Singapore, December 2023.
Association for Computational Linguistics.
[77]
Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia
Efrat, Ping Yu, LILI YU, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer
Levy. LIMA: Less is more for alignment. InThirty-seventh Conference on Neural Information
Processing Systems, 2023.
[78]
Xuekai Zhu, Yao Fu, Bowen Zhou, and Zhouhan Lin. Critical data size of language models
from a grokking perspective. InarXiv preprint: abs/2401.10463, 2024.

A Training Details
All implementations were done with PyTorch [45] and Huggingface Transformers [68]. All model
training runs are done on NVIDIA A6000 and A100 GPUs and last at most 96 hours.
B Effect of Model Scale
We run the experiments on composition with larger model scales, with |E| = 2000 and ϕ ∈ {5.4, 9.0, 18.0}. The results are shown in Figure 7. Overall, scaling up the model does not qualitatively change the model's generalization behaviors; the main pattern is that larger models converge in fewer optimization steps, in line with prior findings [60, 28].
[Figure 7: accuracy vs. optimization step for models with 8 layers/768 dim, 24 layers/1024 dim, and 36 layers/1280 dim, at (a) ϕ = 18.0, (b) ϕ = 5.4, (c) ϕ = 9.0.]
Figure 7: Results for different model scales across different ϕ. Larger models converge in fewer optimization steps, but show no qualitative changes in the learned behaviors.
C Effect of Tokenizations
In the experiments in our main content, tokenization is done by having a unique token for each
entity.
11
This is different from how real-world entities are typically tokenized—in practice, entities
are usually multi-token, and different entities could share tokens at different positions. We investigate
11
Preliminary experiments show that tokenization of the relations does not exhibit notable impacts, which is
expected since relations are always explicitly given.
17

We investigate the effect of tokenizations on the composition task by having two tokens for each entity (resembling the first name and last name of a person) in the setting with |E| = 2000 and ϕ = 12.6. We generate a set of unique first names and a set of unique last names of equal size, from which the two tokens for each entity are randomly chosen (we make sure each entity gets a unique ordered pair of tokens). We define token multiplicity to be the number of entities that share the same first/last name. For example, when the size of the set of first/last names is 50, the token multiplicity would be 2000/50 = 40. Figure 8(a) shows the ID test results, where the training and OOD results are the same as earlier (training performance saturates quickly, the OOD result remains zero). It can be seen that a larger token multiplicity delays generalization, which is expected to a certain degree since the scale of the model is effectively smaller due to having fewer tokens in the vocabulary. Nevertheless, ID generalization always happens. We also run linear probing on S[5, r1] throughout training to predict the second token of the bridge entity b in the setting with token multiplicity 40,¹² where the results are shown in Figure 8(b). It can be seen that the second token of b can be perfectly decoded from S[5, r1] after grokking, and the decodability improves throughout grokking. This suggests that in the multi-token case, the model additionally stores the second token of b into S[5, r1] throughout grokking, which may be another factor that further delays the speed of grokking. These results are also in line with recent findings that, in many cases, tokens beyond the immediate next token are linearly encoded in the hidden states [69, 44, 2, 4].
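A minimal sketch of such a probe, assuming the S[5, r1] activations and the corresponding second-token labels of b have already been cached (helper names are ours):

```python
import torch
import torch.nn as nn

def linear_probe_accuracy(states, labels, num_classes, steps=1000, lr=1e-3):
    """states: (N, d_model) cached S[5, r1] activations; labels: (N,) second token of b."""
    probe = nn.Linear(states.shape[1], num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.cross_entropy(probe(states), labels).backward()
        opt.step()
    with torch.no_grad():  # in practice, accuracy would be measured on a held-out split
        return (probe(states).argmax(-1) == labels).float().mean().item()
```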
In summary, different tokenizations affect the results in rather expected ways, and do not influence our main findings and conclusions.
[Figure 8: accuracy vs. optimization step. (a) ID generalization for token multiplicities 1, 4, 8, 16, 40. (b) ID test accuracy and probing accuracy.]
Figure 8: (a) ID generalization across different token multiplicities. (b) Probing accuracy on the second token of b at S[5, r1].
D More Details on Circuit Analysis
D.1 Composition
We run causal tracing on hidden states in layers 1-7 and at every position, where the target is the final prediction state S[8, r2]. The changes in the strengths are monotone and smooth, and we show in Figure 9 the strengths for the model checkpoints at the start and end of grokking, together with their difference (same as Figure 4(b)). We also find that after grokking, the state S[5, r2] (which encodes r2) is not affected by perturbing the input h or r1.
Deriving the generalizing circuit. Starting from the 9 states in layers 0, 5, 8, we can directly eliminate S[8, h] and S[8, r1] since they have no computational paths connecting to S[8, r2]. S[5, h] can be eliminated as seen in Figure 9(c). The connections from S[0, h] and S[0, r1] to S[5, r2] can be eliminated as mentioned earlier.
¹² Recall that S[5, r1] is the state that encodes the bridge entity in the generalizing circuit (Figure 4(a)).

[Figure 9: heatmaps of causal strengths over input positions h, r1, r2 and layers 1-7.]
Figure 9: Causal strengths on the composition task, with the final prediction S[8, r2] as the target. (a) Start of grokking. (b) Change during grokking. (c) End of grokking.
D.2 Comparison
Figure 10 includes the causal tracing results where the target is the prediction state S[8, e2], and Figure 11 includes the results with the target state S[5, e2]. It can be seen that after grokking, S[5, e2] does not depend on e1, which gives the generalizing circuit in Figure 5(a).
Figure 12 shows the rank (via logit lens) of the three relations {a<, a=, a>} in the label space at state S[7, a], where we use Recall@3 as the measure. It can be seen that S[7, a] encodes the label space throughout grokking.
[Figure 10: heatmaps of causal strengths over input positions a, e1, e2 and layers 1-7.]
Figure 10: Causal strengths on the comparison task, with the final prediction S[8, e2] as the target. (a) Start of grokking. (b) Change during grokking. (c) End of grokking.
E Additional Results
E.1 Weight decay
Figure 13 shows the ID generalization performance when varying the degree of weight decay (|E| = 2000 and ϕ = 9.0). It can be seen that a larger weight decay can improve the speed of grokking, and vice versa.

[Figure 11: heatmaps of causal strengths over input positions a, e1, e2 and layers 0-4.]
Figure 11: Causal strengths on the comparison task, with S[5, e2] as the target. (a) Start of grokking. (b) Change during grokking. (c) End of grokking.
[Figure 12: Recall@3 vs. optimization step for the ID and OOD test sets.]
Figure 12: For the comparison task, S[7, a] encodes the label space throughout grokking.
E.2 Transformer with parameter sharing
We share the parameters of the first 4 layers and the last 4 layers, similar to the Universal Transformer [7]. This allows the model to share the knowledge in the upper and lower layers. The results for the setting with |E| = 2000 and ϕ = 12.6 are shown in Figure 14. It can be seen that the parameter-sharing scheme can unlock OOD generalization, even though it is gained much more slowly than ID generalization during grokking.
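A minimal sketch of one way to realize this variant (our reading of the scheme, in which block i and block i + 4 share weights so the same 4-layer stack is applied twice; the exact implementation may differ):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Placeholders for vocabulary size and sequence length.
config = GPT2Config(vocab_size=2500, n_positions=16, n_layer=8, n_embd=768, n_head=12)
model = GPT2LMHeadModel(config)

# Tie block i+4 to block i: the same module object is reused, so parameters
# (and gradients) are shared between the lower and upper halves of the stack.
for i in range(4):
    model.transformer.h[i + 4] = model.transformer.h[i]
```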
E.3 Comparison task across different inferred/atomic ratio
Figure 15 includes the results for ϕ ∈ {3.6, 7.2, 9.0, 12.6} for the comparison task. It can be seen that a higher ratio ϕ gives a higher generalization speed, consistent with the results on the composition task.
F Complex Reasoning Task
For the complex reasoning task, for each attribute, we include 3% random facts from the comparisons between ID and OOD entities and the comparisons between ID and ID entities. In total, the training set contains 18K ID atomic facts, 437K (ID, ID) comparisons, and 108K (ID, OOD) comparisons, altogether 563K facts (28K on average for each attribute).

[Figure 13: accuracy vs. optimization step for weight decay 0.03, 0.1, 0.3.]
Figure 13: Effect of weight decay. A larger weight decay can improve the speed of grokking, and vice versa.
[Figure 14: ID and OOD test accuracy vs. optimization step for the shared-parameter transformer.]
Figure 14: OOD accuracy of the shared-parameter transformer model.
For translating the facts into natural language for testing LLMs with non-parametric memory, we always use the attribute age¹³ (we find that the choice of attribute does not exhibit a notable impact) and the templates "The age of {entity} is {attribute value}." and "{entity1} is {younger than/older than/in the same age as} {entity2}." for atomic facts and comparisons. The entities are mapped to distinct random names generated by a random generator.¹⁴ We also try different mappings (e.g., unique IDs) and templates, and find the results to be consistent. All (retrieved) facts are randomly permuted and concatenated before being loaded into the LLM context.
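A minimal sketch of this verbalization step, assuming string entity names and the three comparative labels (the template strings are quoted from above):

```python
LABELS = {"a<": "younger than", "a=": "in the same age as", "a>": "older than"}

def verbalize_atomic(entity, value):
    return f"The age of {entity} is {value}."

def verbalize_comparison(entity1, entity2, label):
    return f"{entity1} is {LABELS[label]} {entity2}."

# e.g. verbalize_comparison("Alice Smith", "Bob Jones", "a<")
#   -> "Alice Smith is younger than Bob Jones."
```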
The train/test accuracy, and also the accuracy of inferring the attribute values of the query entities (which we test using the same format as the atomic facts in training), are included in Figure 16. It can be seen that, during grokking, the model gradually locates the ground-truth attribute values of the query entities (note that the model is not explicitly encouraged or trained to do this), allowing it to solve the problem efficiently with near-perfect accuracy.
¹³ Recall that we test each attribute separately by grouping the facts/queries.
¹⁴ https://pypi.org/project/names/

[Figure 15: accuracy vs. optimization step for ϕ ∈ {3.6, 7.2, 9.0, 12.6}. (a) ID accuracy. (b) OOD accuracy.]
Figure 15: Results for the comparison task across different ratios ϕ.
[Figure 16: train accuracy, test accuracy, and Atomic (OOD) accuracy vs. optimization step.]
Figure 16: Accuracy on the train and test splits, and also the accuracy of inferring the attribute values of the query entities (Atomic (OOD)) for the complex reasoning task in §5.