REVS: Unlearning Sensitive Information in Language
Models via Rank Editing in the Vocabulary Space
Tomer Ashuach Martin Tutek Yonatan Belinkov
Technion – Israel Institute of Technology
{tomerashuach,martin.tutek,belinkov}@campus.technion.ac.il
Abstract
Large language models (LLMs) risk inadvertently memorizing and divulging
sensitive or personally identifiable information (PII) seen in training data, causing
privacy concerns. Current approaches to address this issue involve costly dataset
scrubbing, or model filtering through unlearning and model editing, which can be
bypassed through extraction attacks. We propose REVS, a novel model editing
method for unlearning sensitive information from LLMs. REVS identifies and
modifies a small subset of neurons relevant for each piece of sensitive information.
By projecting these neurons to the vocabulary space (unembedding), we pinpoint
the components driving its generation. We then compute a model edit based on the
pseudo-inverse of the unembedding matrix, and apply it to de-promote generation
of the targeted sensitive data. To adequately evaluate our method on truly sensitive
information, we curate two datasets: an email dataset inherently memorized by GPT-
J, and a synthetic social security number dataset that we tune the model to memorize.
Compared to other state-of-the-art model editing methods, REVS demonstrates
superior performance in both eliminating sensitive information and robustness to
extraction attacks, while retaining integrity of the underlying model. The code and a
demo notebook are available at https://technion-cs-nlp.github.io/REVS.
1 Introduction
Large language models (LLMs) exhibit a concerning tendency to memorize information from their
training data [11,12,46]. While factual recall is a desirable property when dealing with general
knowledge, memorization and regurgitation of sensitive private information, such as personal and
contact details, is a security concern regulated by laws like the general data protection regulation
[GDPR;21]. To ensure such information does not get inadvertently leaked, it is paramount to develop
techniques that detect and erase sensitive information from the LLMs or their training data.
Current approaches for handling sensitive information in LLMs can be broadly categorized into two
groups. Exact unlearning approaches tackle the problem from the data perspective, either erasing
information from datasets [30,32] or applying differential privacy [1,27,34,47] to the training data.
Such approaches are costly and time-consuming, as each iteration of scrubbing requires retraining of
the entire model. Moreover, they reduce but do not entirely prevent the risk of sensitive information
leakage [9,10]. Machine unlearning approaches [35] aim to discourage models from generating
sensitive information through techniques such as in-context unlearning [ICL; 37,45,58] or gradient
ascent [29,55,56]. These approaches do not ensure that the sensitive information is erased from
model parameters, rendering them vulnerable to extraction attacks [33]. Model editing, a variant of
localization-based unlearning, localizes and changes a subset of model parameters to erase or overwrite
the sensitive information itself [15,38,39]. In contrast to ICL and optimization-based methods,
which may prevent models from generating sensitive content but do not fundamentally erase the
underlying knowledge, editing methods hold a greater potential to resist extraction attacks [11] by
surgically removing the sensitive data from model parameters.
Figure 1: Overview of the REVS unlearning process: (1) The original model memorizes a sensitive
email address and (2) generates it exactly given a related prompt using greedy decoding. (3) After
applying REVS, the target email token(s) are demoted to a specified lower rank R in the model's
output, preventing the model from generating the unlearned email.
We introduce Rank Editing in the Vocabulary Space (REVS), a novel model editing approach that
enables robust unlearning of sensitive information from LLMs while maintaining model performance
and offering strong robustness against extraction attacks. Adopting the view that transformer MLP
layers construct predictions by promoting specific tokens in the output vocabulary space [24], REVS
locates the layers and particular subsets of neurons that promote the tokens corresponding to the
targeted sensitive information. By modifying these neurons with respect to the relevant tokens, REVS
surgically removes the model’s encoded tendency to generate that sensitive data while preserving its
broader knowledge and remaining robust to extraction attacks. See Figure 1 for an illustration.
As private data is infrequent in training data, to adequately evaluate REVS, we curate two benchmark
datasets containing sensitive data: Email, containing real private email addresses inadvertently
memorized by the model during pretraining, and SSN, a synthetic dataset where we deliberately
instilled the model with sensitive social security number (SSN) information through finetuning.
Unlike prior work [44], which evaluated editing methods on non-sensitive data, our benchmarks
contain actual private information, allowing us to rigorously assess the unlearning efficacy and
robustness against potential extraction attacks. We analyze the GPT-J-6B model [4,51] as it was
shown to memorize at least 1% of its training data [9]. We compare REVS against two strong
baselines for unlearning: constrained fine-tuning with gradient ascent [59] and an adapted version of
the MEMIT editing method [39]. On both real and synthetic data, we show that REVS outperforms
the baselines in unlearning the sensitive information while maintaining overall performance and
being robust to extraction attacks.
To summarize, our contributions are:
• We introduce REVS, a novel method for unlearning sensitive information from LLM parameters.
• We curate two datasets containing sensitive information: an email dataset organically memorized
by GPT-J and a synthetic SSN dataset that we fine-tune the model to memorize.
• We extensively evaluate REVS on unlearning as well as robustness to extraction attacks, and
show that it outperforms state-of-the-art model editing methods.
2 Problem setup and preliminaries
Consider the case where an LLM is prompted with text from its training dataset that contains sensitive
information, e.g., “Contact David Lewis at ”. If the model has memorized the completion of this
prompt, it will generate the relevant email address, such as “lewis.david@email.com”, illustrated in
Figure 1. This scenario highlights a prevalent issue with LLMs: the models’ ability to memorize can
lead to unintended divulgence of sensitive data, in particular personally identifiable information (PII).
Formally, let D be a dataset, and let M_{D,Θ} be an LLM with parameters Θ trained on D. We consider
the model M_{D,Θ} to be a privacy risk when the model reproduces a sequence of tokens containing
PII S from D when prompted with a sequence of tokens P. Exact unlearning approaches tackle this
issue from the data perspective, preventing sensitive information from being stored in the model
during training. Such approaches yield M_{D̂,Θ̂}, as the learning algorithm will inadvertently converge
to a different set of parameters due to a modified dataset. Machine unlearning approaches take
a post-hoc approach, modifying either all, or just a subset of, the parameters that contain sensitive
information, yielding M_{D,Θ̂}. The goal of these approaches is to efficiently and effectively prevent the
model from generating each sequence containing sensitive information S in a way that is robust to extraction
attacks, while at the same time preserving model performance on unrelated tasks [35].
Background on Transformer LMs. We briefly describe the main components in Transformer LMs;
more details are found elsewhere [4,50,51]. Omitting some technical details, a Transformer LM is
made of a sequence of blocks. Each block contains self-attention and multi-layer perceptrons (MLPs)
that write and read to the residual stream [20]. The MLP is defined as MLP(x) = FF_2 σ(FF_1 x), and
is added to the residual stream. The final layer hidden state is projected to the vocabulary space via
an unembedding matrix W, followed by a softmax to obtain the probability of the next token.
Due to the residual stream tying hidden spaces across layers, applying the unembedding matrix to
residual states at intermediate layers results in meaningful token logits [42], while applying it to MLP
weights reveals stored knowledge [14]. In fact, the second MLP layer, FF_2, acts as a memory store
[25,38], and model editing methods often target this layer to update a model's knowledge [38,39].
We adopt this view, and focus on the FF_2 neurons (columns) for unlearning.
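To make the unembedding view concrete, the following minimal sketch (not the authors' released code) projects a single FF_2 neuron of GPT-J into the vocabulary space using the Hugging Face transformers API. The module paths (transformer.h[l].mlp.fc_out, lm_head), the model identifier, and the layer/neuron indices are assumptions for illustration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "EleutherAI/gpt-j-6B"  # assumed Hugging Face identifier for GPT-J-6B
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16)

layer, neuron_idx = 20, 123                           # hypothetical layer / neuron indices
W_U = model.lm_head.weight                            # (vocab, hidden): unembedding matrix
ff2 = model.transformer.h[layer].mlp.fc_out.weight    # (hidden, intermediate): FF_2 matrix

neuron = ff2[:, neuron_idx].float()                   # one FF_2 column, lives in hidden space
logits = W_U.float() @ neuron                         # project the neuron to vocabulary logits
print(tok.convert_ids_to_tokens(logits.topk(10).indices.tolist()))  # tokens this neuron promotes most

Tokens ranked at the top of this projection are the ones the neuron promotes when it fires; these FF_2 columns are exactly what REVS edits.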
3 Methodology
In this section, we first provide a high-level overview of REVS (Section 3.1), then delve into each of
its modules: selecting which tokens to unlearn (Section 3.2), which layers (Section 3.3) and which
neurons (Section 3.4) to update, and finally, how to compute and apply the actual update (Section 3.5).
3.1 REVS unlearning process
To unlearn each target piece of sensitive information, REVS first localizes the model parameters that
contribute the most towards its generation. It then adjusts selected neuron representations, aiming
to reduce those layers' promotion of the sensitive information by reducing the rank of the target
tokens in each layer's residual, while minimizing total model edits. The updated neuron values reflect
the modified ranks, effectively demoting the associated sensitive information while retaining the
model's broader knowledge. The core process of REVS involves the following steps:
1. Given a sequence of sensitive information S, select a subset of target tokens T = {t_1, t_2, ..., t_t} from S (3.2).
2. For each target sensitive token t_i ∈ T, identify the layers where the rank of t_i in the residual
hidden state vector h is above a desired threshold rank r_h (3.3).
3. Within each identified layer, select the neurons to edit based on their activation for the prompt and their association with t_i (3.4).
4. Edit each selected neuron to demote t_i below a neuron threshold rank r_n (3.5).
5. Update the model with the edited neurons, and repeat until the rank of t_i in the layer's residual reaches r_h or a maximum of N_max neurons have been edited.
A detailed description of this process can be found in Algorithm 1.
Algorithm 1 Unlearning Sensitive Information from LLM
1: procedure UNLEARN(M, P, S, r_h, r_n, N_max)
2:   V ← vocabulary size of M
3:   L ← number of layers in M
4:   d_h ← hidden dimension of M
5:   W ∈ R^{V×d_h}, W† ∈ R^{d_h×V}    ▷ Unembedding matrix and its pseudo-inverse
6:   T ← SelectTargetTokens(S, t)    ▷ Select t tokens in S
7:   for t_i ∈ T do
8:     for l ∈ 1, ..., L do
9:       N_edit ← ∅    ▷ Neurons to edit
10:      repeat
11:        n⃗ ← SelectNeuron(M, P, t_i, l)    ▷ Select neuron to edit
12:        v⃗ ← n⃗ W    ▷ Project neuron to vocabulary logits
13:        v⃗ ← AdjustRank(v⃗, t_i, r_n)    ▷ Adjust logits to de-promote rank (Appendix B)
14:        n⃗ ← W† v⃗    ▷ Project adjusted logits back to neuron
15:        N_edit ← N_edit ∪ n⃗
16:        M ← UpdateModel(M, N_edit, l)    ▷ Update model with edited neurons
17:        h^l ← ResidualHiddenState(M, P, l)    ▷ Get residual hidden state for layer l
18:      until Rank(t_i, h^l) ≥ r_h or |N_edit| = N_max
19:    end for
20:  end for
21: end procedure
Figure 2: Editing one neuron with REVS: (1) The neuron is projected from hidden space to vocabulary
logit space. (2) The logit is adjusted to demote the target token rank to a desired lower rank R. (3)
The adjusted logits vector is projected back to hidden space, yielding the updated neuron value.
3.2 Target token selection
While sensitive information often comprises multiple tokens, unlearning all tokens is unnecessary,
since accurately extracting the original data requires recovering the full token sequence. Therefore,
our approach focuses on unlearning the rarest t tokens from each target sequence.
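As a hedged illustration of this selection step (the paper does not specify how token rarity is estimated, so the frequency table below is a placeholder assumption):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")  # assumed model identifier

def select_target_tokens(sequence: str, token_freq: dict, t: int = 2) -> list:
    """Return the t rarest token ids of the target sequence (lower frequency = rarer)."""
    ids = tok.encode(sequence)
    return sorted(set(ids), key=lambda i: token_freq.get(i, 0))[:t]

# token_freq would map token id -> corpus count; unseen ids are treated as rarest
targets = select_target_tokens("lewis.david@email.com", token_freq={}, t=2)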
3.3 Selecting layers for editing
For each target token t_i to unlearn, REVS identifies all layers l where the rank of t_i in the residual
hidden state vector h^l, after applying the unembedding matrix, is higher than the desired threshold
rank r_h. By targeting the layers that significantly contribute to generating t_i, REVS aims to surgically
edit only the relevant model parameters, minimizing overall modifications to the model.
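A minimal sketch of this layer-selection step, assuming the model and tok objects from the earlier snippet; the final layer norm that a full logit-lens projection would apply is omitted for brevity, and the threshold direction follows Algorithm 1's stopping condition.

import torch

def layers_to_edit(prompt: str, target_id: int, rank_threshold: int = 200) -> list:
    """Return layers where the target token is still promoted in the residual stream."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    W_U = model.lm_head.weight.float()
    selected = []
    for l, h in enumerate(out.hidden_states[1:]):     # residual state after each block
        logits = W_U @ h[0, -1].float()               # logit lens at the last prompt position
        rank = int((logits > logits[target_id]).sum()) + 1
        if rank < rank_threshold:                     # token ranked too prominently in this layer
            selected.append(l)
    return selected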
3.4 Selecting neurons for editing
To select which neurons to edit within the identified layers, we consider both the neuron's association
with the target token t_i and its activation level in response to the given prompt P. First, we select
the top k neurons with the highest activations for prompt P. We refer to the activation of a neuron as
the input value of that neuron when the network is run on the given prompt, where higher activations
indicate that the neuron is more relevant for the given input. From this subset, we iteratively edit up
to N_max neurons that exhibit the highest rank for t_i when projected to the vocabulary logit space. A
higher rank indicates a strong association between the neuron and t_i, suggesting it encodes significant
information relevant to generating t_i. This hybrid criterion aims to prioritize editing the neurons
most relevant to the undesired token t_i in the context of the specific prompt P. Empirically, we find
this approach outperforms alternatives like editing only the most activated neurons, only the most
associated neurons, neurons with the highest gradients w.r.t. the prompt [13], or random neuron selection.
See Appendix C for ablation experiments and Section 3.5 for implementation details.
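The hybrid criterion can be sketched as follows, reusing the model and tok objects from above. The forward pre-hook on fc_out captures the FF_2 inputs (the neuron activations); the hook point and helper names are assumptions, not the authors' implementation.

import torch

def select_neurons(prompt: str, target_id: int, layer: int, k: int = 100, n_edit: int = 30):
    acts = {}
    mlp = model.transformer.h[layer].mlp
    handle = mlp.fc_out.register_forward_pre_hook(
        lambda mod, inp: acts.update(a=inp[0][0, -1].detach())  # FF_2 input = neuron activations
    )
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()

    top_k = torch.topk(acts["a"].float(), k).indices            # most activated neurons for P
    W_U = model.lm_head.weight.float()
    ff2 = mlp.fc_out.weight.float()                             # (hidden, intermediate)

    def target_rank(j: int) -> int:
        logits = W_U @ ff2[:, j]
        return int((logits > logits[target_id]).sum()) + 1

    # smallest rank value = the token is most promoted = strongest association with t_i
    return sorted(top_k.tolist(), key=target_rank)[:n_edit]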
3.5 Neuron editing
As illustrated in Figure 2, REVS iteratively selects neurons from the feed-forward (FF_2) layer,
projects them to the vocabulary logit space, decreases the logit value of the target token to reduce
its rank in the logits vector, and projects the adjusted logits back to the neuron's space using the
pseudo-inverse of the unembedding matrix. This edited neuron is accumulated and used to update the
model weights. The process comprises these four steps (lines 12–15 in Algorithm 1):
1. Project n⃗ to the logit space as v⃗ using the unembedding matrix W.
2. Iteratively decrease v_{t_i} to reduce the rank of t_i below r_n, yielding ṽ⃗. See Appendix B for details.
3. Project ṽ⃗ back to the hidden space as ñ⃗ using the pseudo-inverse of the unembedding matrix W†.
4. Store ñ⃗ in the edited set N_edit.
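A condensed sketch of these four steps is given below, again reusing the model object from the earlier snippets; the iterative AdjustRank search of Appendix B is replaced here by a crude "set the target logit to the minimum" placeholder.

import torch

def edit_neuron(layer: int, neuron_idx: int, target_id: int):
    W_U = model.lm_head.weight.float()             # (vocab, hidden)
    W_pinv = torch.linalg.pinv(W_U)                # (hidden, vocab); expensive, compute once in practice
    ff2 = model.transformer.h[layer].mlp.fc_out.weight

    n = ff2[:, neuron_idx].float()                 # neuron in hidden space
    v = W_U @ n                                    # (1) project to vocabulary logits
    v[target_id] = v.min()                         # (2) crude demotion of the target token
    n_new = W_pinv @ v                             # (3) project adjusted logits back to hidden space
    with torch.no_grad():
        ff2[:, neuron_idx] = n_new.to(ff2.dtype)   # (4) write the edited neuron back into FF_2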
4 Experimental setup
4.1 Model
We experiment with GPT-J-6B [52], which was trained on the Pile [23], a publicly accessible corpus
containing sensitive information such as email addresses. Our choice of GPT-J-6B is motivated by
previous work showing it has memorized a significant portion (at least 1%) of its training data [9].
4.2 Datasets
We evaluate our method on three datasets, each evaluating different aspects of our approach:
Emails from The Pile: The dataset comprises 288 sentences from The Pile [23] containing
memorized email addresses. To curate it, we identified sentences with email addresses using Presidio
[40], prompted the model with the prefix before the email, and verified that it generates the exact
email in the next 20 tokens with greedy decoding. The target token sequence is the string before
@email.com. This dataset facilitates evaluation on data that the model has naturally memorized from
its training corpus. See Appendix D.2 for privacy considerations.
Synthetic SSN dataset: The dataset consists of 200 sentences containing social security numbers
(SSNs). The dataset was generated by prompting Claude 3 Sonnet [3] to create template sentences on
various topics that could contain SSNs (employment records, tax documents, finance, government
records, and medical records), with placeholders for names, dates, and SSNs. We then randomly
assigned fake sensitive information to each of the placeholders. See Appendix D.1 for more information
on dataset construction. The model is fine-tuned to memorize all instances. When constructing
the dataset, we assign 20 unique SSN targets across 100 templates, such that each target SSN has
5 different prefixes. This allows evaluating generalization: we unlearn using only one prompt that
generates the target token sequence, and evaluate on four additional prompts that also generate it.
The remaining 100 sentences were used to evaluate the method's specificity, that is, whether it affects
memorization of SSN targets that were not unlearned. In this dataset, the target tokens to unlearn
were chosen only from numeric tokens, as the SSN structure contains '-' (e.g., 123-45-6789).
Wiki 10k: 10k sentences from Wikipedia (https://huggingface.co/datasets/NeelNanda/wiki-10k), which
the model has seen during pretraining. We use this dataset to measure the effect of unlearning on the
model's general language modeling capability by comparing perplexity before and after the unlearning
procedure.
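One possible way to run this perplexity check with the datasets library is sketched below; the split name, the "text" field, and the subsample size are assumptions, and the model/tokenizer come from the earlier snippets.

import math
import torch
from datasets import load_dataset

def perplexity(model, tok, texts, max_len: int = 512) -> float:
    total_loss, total_tokens = 0.0, 0
    for text in texts:
        ids = tok(text, return_tensors="pt", truncation=True, max_length=max_len).input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss        # mean cross-entropy per token
        total_loss += loss.item() * ids.numel()
        total_tokens += ids.numel()
    return math.exp(total_loss / total_tokens)

wiki = load_dataset("NeelNanda/wiki-10k", split="train")   # field/split names assumed
ppl_before = perplexity(model, tok, wiki["text"][:100])    # subsample for illustration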
4.3 Baseline methods
We compare our approach to two strong knowledge editing methods, modified to allow unlearning:
Constrained fine-tuning (FT-L) [59]: Fine-tuning the second feed-forward layer (FF_2) in the MLPs,
aiming to minimize the probability of generating the target tokens t_i, instead of maximizing it as in
the original work, in order to allow for unlearning. FT-L imposes an L∞ norm constraint to limit the
overall impact on the model's knowledge.
MEMIT [39]: We use a modified version of MEMIT, originally proposed for inserting new knowledge
into LMs by editing the FF_2 matrices in model MLPs. Here, we optimize the objective to decrease
the probability of generating the target tokens t_i. See Appendix A.2 for details on our changes.
4.4 Evaluation metrics
To evaluate the effect of unlearning on the model, we employ a suite of metrics, which can be
categorized as: (1) assessing unlearning effectiveness (Section 4.4.1), (2) estimating the integrity of
the model after unlearning (Section 4.4.2), and (3) robustness to extraction attacks (Section 4.4.3).
4.4.1 Unlearning effectiveness
Let Rank(t_i) denote the rank of token t_i in the model's logits. A naive measure would evaluate
whether a model generates the target token after unlearning it, that is, whether Rank(t_i) = 1. Instead,
we define a stricter metric. Define Score@K(t_i) as the normalized rank of token t_i within a set of
candidate tokens C. A score of 1 means the token is not in the most likely K−1 tokens:

Score@K(t_i) = Rank(t_i)/K if Rank(t_i) < K, and 1 otherwise.    (1)
Efficacy: since extracting the sensitive information requires recovering every token in the sequence
correctly, we define the unlearning efficacy as the maximum Score@K across the subset of target
tokens T, averaged over the N target prompts:

Efficacy@K = (1/N) Σ_{i=1}^{N} max_{t_i ∈ T} Score@K(t_i)    (2)
Generalization: similarly, we evaluate the unlearning generalization by computing the same measure
on prompts that were not seen by the unlearning method, but would have yielded the same piece of
sensitive information from the original model prior to unlearning it.
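Equations (1) and (2) translate directly into code; rank_fn below is a stand-in for whatever routine returns a token's rank in the model's next-token logits.

def score_at_k(rank: int, k: int) -> float:
    """Eq. (1): normalized rank, saturating at 1 outside the top K-1 tokens."""
    return rank / k if rank < k else 1.0

def efficacy_at_k(rank_fn, prompts, target_tokens, k: int = 100) -> float:
    """Eq. (2): average over prompts of the worst-case (max) score across target tokens."""
    scores = [max(score_at_k(rank_fn(p, t), k) for t in targets)
              for p, targets in zip(prompts, target_tokens)]
    return sum(scores) / len(scores)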
4.4.2 Model integrity
Specificity: assesses the model's ability to continue generating memorized non-target sensitive
information by measuring the percentage of prompts P_i where the edited model's output M_edited(P_i)
matches the original sensitive token sequence S_i:

Specificity = (1/N) Σ_{i=1}^{N} 1[M_edited(P_i) = S_i]    (3)
Perplexity: measures the perplexity before and after applying unlearning on a dataset the model was
pre-trained on (Section 4.2) to evaluate the effect on its general capabilities.
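A sketch of the specificity check (Eq. 3) via greedy decoding, assuming the model/tokenizer objects introduced earlier; the 20-token budget mirrors the memorization check of Section 4.2.

import torch

def specificity(model, tok, prompts, sequences, max_new_tokens: int = 20) -> float:
    hits = 0
    for prompt, seq in zip(prompts, sequences):
        ids = tok(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)  # greedy
        completion = tok.decode(out[0, ids.shape[1]:])
        hits += int(seq in completion)             # original sensitive sequence still produced?
    return hits / len(prompts)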
4.4.3 Extraction resistance
This category quantifies the difficulty for an adversary to extract the unlearned sensitive token
sequence, given different attacks. The attacks differ by the set of tokens they consider for recovering
the unlearned targets. Given such a candidate setC(where|C|=K ), define the resistance to each
attack as the minimum of the average efficacy score (Eq. 2) across all model layersL. The intuition
is that it is sufficient for an attacker to extract the target in one layer for the attack to be successful.
Attack Resistance@K= min
ℓ∈L
Efficacy@K (4)
where the setCofKtokens passed to the Efficacy function is defined next.
Logit-lens attack (LLA): Considers the top-k and bottom-k tokens in the vocabulary logit vector v⃗
as candidates, obtained by projecting each layer's residual hidden state to the vocabulary logits.

C_LLA = ∪_{ℓ ∈ L} top-k(v⃗^ℓ) ∪ bottom-k(v⃗^ℓ)    (5)
Delta attack (DA): Considers the top-k tokens with the largest changes in the vocabulary logit vector
v⃗ between consecutive layers as candidates.

C_DA = ∪_{ℓ ∈ L} top-k(v⃗^{ℓ+1} − v⃗^ℓ) ∪ bottom-k(v⃗^{ℓ+1} − v⃗^ℓ)    (6)
Perturbation attack (PA): A white-box attack that inserts random characters into the original prompts.
The candidate set C_PA is obtained as C_LLA, but using the perturbed prompts. This attack can be
adapted to a black-box setting by using only the last hidden state. Appendix E provides more details.
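For illustration, the candidate sets of Eq. (5) and (6) can be assembled from per-layer logit-lens vectors as follows (a sketch, not the evaluation code used in the paper).

import torch

def candidate_sets(layer_logits, k: int = 50):
    """layer_logits: list of 1-D vocabulary logit vectors, one per layer (logit lens)."""
    c_lla, c_da = set(), set()
    for l, v in enumerate(layer_logits):
        c_lla |= set(v.topk(k).indices.tolist())           # top-k tokens at this layer
        c_lla |= set((-v).topk(k).indices.tolist())        # bottom-k tokens at this layer
        if l + 1 < len(layer_logits):
            d = layer_logits[l + 1] - v                    # change between consecutive layers
            c_da |= set(d.topk(k).indices.tolist())
            c_da |= set((-d).topk(k).indices.tolist())
    return c_lla, c_da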
4.5 Implementation details
The target tokens to unlearn were chosen as the rarest t = 2 tokens in the target token sequence.
Targeting the rarest tokens was found in preliminary experiments to provide effective prevention of
extraction attacks while preserving the model's general capabilities. Hyper-parameters for all methods
were chosen by optimizing the harmonic mean of the metrics on the Emails and SSN datasets, using
one seed ('0'). Each method was then evaluated with cross-validation over six random 50/50 splits of
each dataset (using seeds 1–6). The first half of the dataset was selected as targets for unlearning and
evaluated for efficacy, generalization (for the SSN dataset), and extraction resistance, while the second
half was used as the retained set to evaluate specificity. The results presented in the next section
employ the same hyper-parameter configurations (the best one for each method) across all experiments,
ensuring a fair comparison between methods without any scenario-specific optimization.
Table 1: Unlearning effectiveness and model integrity results. REVS is superior in almost all cases.

Dataset  Method            Harmonic Mean ↑  Efficacy@100 ↑  General.@100 ↑  Specificity ↑  Perplexity ↓
SSN      Unedited          0±0.0            0±0.0           0±0.0           100±0          11.148±0
         FT-L              4.78±3.66        55.78±12.37     1.75±1.44       61.67±15.17    11.27±0.054
         MEMIT (modified)  78.07±2.2        98.5±2.33       61.15±3.25      84.17±3.07     11.156±0.011
         REVS (ours)       81.45±3.56       99.95±0.07      80.17±3.22      70.33±7.84     11.165±0.01
Emails   Unedited          0±0.0            0±0.0           −               100±0          8.129±0
         FT-L              3.33±1.77        55.57±4.39      −               1.73±0.96      13.63±0.34
         MEMIT (modified)  70.05±1.16       88.23±1.64      −               58.1±1.63      8.13±0
         REVS (ours)       80.65±2.41       97.22±1.04      −               68.98±3.6      8.148±0.002
Table 2: Average results for extraction resistance. REVS is more robust to extraction attacks.

Dataset  Method            Harmonic Mean ↑  Logit Lens@100 ↑  Delta@100 ↑  Perturb@100 ↑
SSN      Unedited          0±0.0            0±0.0             95.12±0.82   26.5±5.26
         FT-L              65.35±7.57       55.63±12.64       97.03±0.8    58.62±3.49
         MEMIT (modified)  93.52±1.76       97.48±2.01        97.88±0.64   90.93±3.53
         REVS (ours)       99.12±3.56       99.95±0.07        98.55±0.2    98.97±1.46
Emails   Unedited          0±0.0            0±0.0             83.8±0.67    44.2±4.11
         FT-L              50.15±3.08       28.47±2.73        85.83±1.11   78.08±5.15
         MEMIT (modified)  80.73±1.7        79.62±2.31        86.17±0.39   77.12±3.86
         REVS (ours)       83.48±1.14       81.05±1.17        87.08±0.25   82.63±2.63
5 Results
5.1 Main results
Superior unlearning effectiveness with preserved model integrity. As shown in Table 1, both
REVS and MEMIT achieve strong results in terms of unlearning effectiveness and model integrity.
However, REVS consistently outperforms MEMIT in the harmonic mean score and individual metrics
on both datasets, except for a single case of specificity on the SSN dataset. The difference in
specificity between datasets can be attributed to the nature of the target token sequences. While the
Emails dataset targets consist of unique tokens with low overlap, the SSN targets comprise only
numbers, leading to a high overlap with other instances within the specificity calculation. The superior
generalization score of REVS indicates its ability to comprehensively unlearn generating the target
tokens across different prompts, even when applied to a single prompt-target pair. Despite extensive
hyper-parameter tuning, the FT-L method performed poorly compared to REVS and MEMIT across
most metrics. Further details on the experiments and attempts to improve the performance of FT-L
are found in Appendix A.3. Perplexity remains largely unaffected, indicating minimal impact on the
models’ general capabilities, with the exception of degraded perplexity for FT-L in Emails.
Robust resistance to extraction attacks. As shown in Table 2, both REVS and MEMIT achieve
excellent results in preventing the retrieval of unlearned sensitive information across various adversar-
ial attacks. However, REVS maintains an edge over MEMIT, surpassing it across all tested extraction
attacks on both datasets.
Organic vs. synthetic memorization. The organically memorized Emails dataset proved more
challenging for unlearning while preserving model integrity, as evident from lower specificity
(Table 1). This suggests there are inherent differences between memorization during pre-training
and fine-tuning. Concretely, it is easier to extract the unlearned information from the organic Emails
dataset compared to the synthetic SSN dataset (Table 2), implying it is harder to defend against attacks
when the information is organically memorized and likely deeper ingrained in model parameters.
5.2 Analysis
Robustness to candidate token size. Here we evaluate the effectiveness of REVS compared to
MEMIT while varying the size of the candidate token set. In Figures 3a and 3d we observe that
REVS consistently outperforms MEMIT as we increase the size of the candidate token set C.
Figure 3: Top row: Emails dataset; bottom row: SSN dataset. From left to right: (a+d) efficacy vs.
candidate set size, (b+e) efficacy vs. specificity trade-off, (c+f) harmonic mean vs. number of edits for
Emails (c) and number of targets for SSN (f). Mean results with confidence intervals.
Efficacy and specificity trade-off. As expected, there is a clear trade-off between efficacy and
specificity, as illustrated in Figures 3b and 3e. While both REVS and MEMIT exhibit this trade-off,
REVS can achieve near-perfect specificity with low efficacy in both datasets, which is not the case for
MEMIT, as its specificity saturates around 0.65 even at low efficacy for the Emails dataset. However,
MEMIT presents more consistent specificity scores across different configurations, while REVS is
prone to a larger trade-off between efficacy and specificity.
Impact of the number of edits. Figures 3c and 3f depict a similar overall downward trend in
harmonic mean as the number of edits increases for both REVS and MEMIT. However, there are
notable differences. In the Emails dataset, REVS achieves an almost perfect score with a low number
of edits, while MEMIT's performance saturates earlier. However, in the SSN dataset, both methods
exhibit a large variance for a small number of targets, but MEMIT seems more robust as the number
of targets increases. Potentially, higher scores could be achieved by optimizing hyper-parameters for
a larger number of edits, as the current hyper-parameters were optimized for 100 targets.
6 Related work
Privacy is defined as “the ability to control the extent of personal matters an individual wants to
reveal” [53]. With the recent developments in neural foundation models in language and vision
domains, which are trained on data scraped from the internet, preserving confidentiality of user data
has become a concern [5,7]. This concern is further exacerbated by legislation such as the EU
general data protection regulation [GDPR;21] and the California consumer privacy act (CCPA),
which govern processing and storage of personal identifiable information (PII).
Memorization of training data in neural models. It has been shown that foundation models
of language [10,11,32,57], vision [8], and code [28] are able to reproduce sensitive personal
information such as credit card numbers, email addresses and API keys. The extent of input data
memorization was unclear until Carlini et al. [9] demonstrated that the GPT-J model trained on the
Pile memorizes at least 1% of its training data. Memorization is shown to increase with model scale
[9], and specialized prompts can bypass guardrails in order to retrieve even more information than
previously estimated [8,28,48], highlighting the need for removing memorized PII from models.
Defenses against leakage of sensitive information. Naturally, privacy concerns have given rise to
the development of methods preventing regurgitation of PII. Differential privacy [DP; 18,27,34,47]
and dataset scrubbing through deduplication [9,30,32] aim to prevent sensitive information from
ever being stored in the model. However, using DP-SGD frequently results in worse generative
models [2], and LLMs are able to memorize training instances even if they are not duplicated [16].
Scrubbing approaches are also expensive, requiring retraining of the foundation model for each newly
identified leak of PII. To alleviate the high cost incurred by retraining models from scratch, post-hoc
model unlearning [35] techniques such as gradient ascent [6,19,41], reinforcement learning from
human feedback [RLHF; 43] and model editing [15,38,39,54] were devised. Most similar to ours
is the work of Wu et al. [54], where the authors identify and zero out the neurons most responsible
for memorizing the sensitive information. REVS operates in a similar manner, except that instead of
completely zeroing out the neurons it modifies the selected neurons to demote the relevant tokens
in the output vocabulary space. This allows REVS to remove the tendency to generate sensitive
information while preserving model integrity.
Extraction attacks. There are three categories of attacks aimed at retrieving information from ML
models. Membership inference attacks [17,49] have the goal of verifying whether a specific datum
was in a model's training set, while attribute inference attacks [22] retrieve information on some
attributes of the training instances. Data extraction attacks, on the other hand, aim to retrieve full
training instances. Generative LMs are especially susceptible to such attacks given their propensity to
memorize training data [8,10]. Patil et al. [44] show that despite efforts to scrub sensitive information
from models, different attacks can still locate such information in LMs. Black-box attacks assume
no access to model internals and aim to retrieve information through conversation [26], designing
prompts [36] or paraphrasing [31]. White-box attacks use model parameters and, through representation
probing [24,42], can even detect information that has been erased by model editing techniques [44].
REVS is specifically designed with these attacks in mind and removes sensitive information in a way
that is robust to possible attacks as well as effective in terms of erasing information.
7 Conclusion
We introduce REVS, a novel unlearning technique which demonstrates superior performance to
state-of-the-art baselines in effectively unlearning sensitive information from LLMs while preserving
their general capabilities and remaining robust to extraction attacks. REVS achieves near-perfect
efficacy in preventing the generation of unlearned targets and maintains high specificity, indicating
minimal disruption to the model's desired behavior. We perform detailed analyses assessing the
methods' robustness to varying the size of the candidate token sets, the trade-off between efficacy and
specificity, and the number of unlearned targets. Notably, our experiments revealed inherent
differences in unlearning performance between synthetic and organically memorized information,
with the organically memorized Emails dataset proving more challenging, suggesting that information
memorized during pre-training may be more deeply ingrained in the model's parameters.
8 Limitations
While REVS shows promising results, several areas need further work to resolve current limitations:
(1) identifying other models that have memorized sensitive information to enable broader evaluation,
(2) generalization to unlearning other types of organically memorized sensitive information beyond
emails, (3) scalable unlearning of larger amounts of target information, (4) evaluation against more
fine-grained extraction attacks, including on specific sub-modules like MLP or Attention outputs.
Issues (1) and (2) are especially important, as our work has shown that organically memorized
sensitive information is harder to unlearn from models than data memorized through fine-tuning.
However, finding encoded sensitive data is both time-consuming and difficult, due to both the
scale and the lack of public availability of training data.
9 Broader impact statement
Our work aims to erase sensitive or personally identifiable information from LMs. As such, we expect
it to be mainly used for positive goals. That said, it is possible that unlearning could be applied to
erase factually correct information or perhaps stances, such as political ones, which a malicious actor
does not agree with. Such a model might then be primed to echo potentially harmful views.
Acknowledgements
This research has been supported by an AI Alignment grant from Open Philanthropy, the Israel
Science Foundation (grant No. 448/20), and an Azrieli Foundation Early Career Faculty Fellowship.
We would also like to express our gratitude to the Technion computer science NLP group for their
invaluable consultation and assistance in improving this work.
References
[1]
Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang,
L. (2016). Deep learning with differential privacy. InProceedings of the 2016 ACM SIGSAC
conference on computer and communications security, pages 308–318.
[2]
Anil, R., Ghazi, B., Gupta, V., Kumar, R., and Manurangsi, P. (2022). Large-scale differentially
private bert. InFindings of the Association for Computational Linguistics: EMNLP 2022, pages
6481–6491.
[3]
Anthropic (2024). The Claude 3 model family: Opus, Sonnet, Haiku. Technical report, Anthropic.
[4]
Black, S., Gao, L., Wang, P., Leahy, C., and Biderman, S. (2021). GPT-Neo: Large scale
autoregressive language modeling with Mesh-Tensorflow.
[5]
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S.,
Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji,
N. S., Chen, A. S., Creel, K. A., Davis, J., Demszky, D., Donahue, C., Doumbouya, M., Durmus,
E., Ermon, S., Etchemendy, J., Ethayarajh, K., Fei-Fei, L., Finn, C., Gale, T., Gillespie, L. E.,
Goel, K., Goodman, N. D., Grossman, S., Guha, N., Hashimoto, T., Henderson, P., Hewitt, J., Ho,
D. E., Hong, J., Hsu, K., Huang, J., Icard, T. F., Jain, S., Jurafsky, D., Kalluri, P., Karamcheti,
S., Keeling, G., Khani, F., Khattab, O., Koh, P. W., Krass, M. S., Krishna, R., Kuditipudi, R.,
Kumar, A., Ladhak, F., Lee, M., Lee, T., Leskovec, J., Levent, I., Li, X. L., Li, X., Ma, T.,
Malik, A., Manning, C. D., Mirchandani, S. P., Mitchell, E., Munyikwa, Z., Nair, S., Narayan, A.,
Narayanan, D., Newman, B., Nie, A., Niebles, J. C., Nilforoshan, H., Nyarko, J. F., Ogut, G., Orr,
L., Papadimitriou, I., Park, J. S., Piech, C., Portelance, E., Potts, C., Raghunathan, A., Reich, R.,
Ren, H., Rong, F., Roohani, Y. H., Ruiz, C., Ryan, J., R’e, C., Sadigh, D., Sagawa, S., Santhanam,
K., Shih, A., Srinivasan, K. P., Tamkin, A., Taori, R., Thomas, A. W., Tramèr, F., Wang, R. E.,
Wang, W., Wu, B., Wu, J., Wu, Y., Xie, S. M., Yasunaga, M., You, J., Zaharia, M. A., Zhang, M.,
Zhang, T., Zhang, X., Zhang, Y., Zheng, L., Zhou, K., and Liang, P. (2021). On the opportunities
and risks of foundation models.ArXiv.
[6]
Bourtoule, L., Chandrasekaran, V., Choquette-Choo, C. A., Jia, H., Travers, A., Zhang, B., Lie,
D., and Papernot, N. (2021). Machine unlearning. In2021 IEEE Symposium on Security and
Privacy (SP), pages 141–159. IEEE.
[7]
Brown, H., Lee, K., Mireshghallah, F., Shokri, R., and Tramèr, F. (2022). What does it mean for
a language model to preserve privacy? InProceedings of the 2022 ACM Conference on Fairness,
Accountability, and Transparency, pages 2280–2292.
[8]
Carlini, N., Hayes, J., Nasr, M., Jagielski, M., Sehwag, V., Tramer, F., Balle, B., Ippolito, D., and
Wallace, E. (2023). Extracting training data from diffusion models. In32nd USENIX Security
Symposium (USENIX Security 23), pages 5253–5270.
[9]
Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., and Zhang, C. (2022). Quantifying
memorization across neural language models.arXiv preprint arXiv:2202.07646.
[10]
Carlini, N., Liu, C., Erlingsson, Ú., Kos, J., and Song, D. (2019). The secret sharer: Evaluating
and testing unintended memorization in neural networks. In28th USENIX security symposium
(USENIX security 19), pages 267–284.
[11]
Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown,
T., Song, D., Erlingsson, U., et al. (2021). Extracting training data from large language models. In
30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650.
[12]
Chang, K., Cramer, M., Soni, S., and Bamman, D. (2023). Speak, memory: An archaeology of
books known to ChatGPT/GPT-4. In Bouamor, H., Pino, J., and Bali, K., editors,Proceedings of
the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7312–7327,
Singapore. Association for Computational Linguistics.
[13]
Dai, D., Dong, L., Hao, Y., Sui, Z., Chang, B., and Wei, F. (2021). Knowledge neurons in
pretrained transformers.arXiv preprint arXiv:2104.08696.
[14]
Dar, G., Geva, M., Gupta, A., and Berant, J. (2023). Analyzing transformers in embedding space.
In Rogers, A., Boyd-Graber, J., and Okazaki, N., editors,Proceedings of the 61st Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16124–16170,
Toronto, Canada. Association for Computational Linguistics.
[15]
De Cao, N., Aziz, W., and Titov, I. (2021). Editing factual knowledge in language models.
arXiv preprint arXiv:2104.08164.
[16]
Debenedetti, E., Severi, G., Carlini, N., Choquette-Choo, C. A., Jagielski, M., Nasr, M., Wallace,
E., and Tramèr, F. (2023). Privacy side channels in machine learning systems.arXiv preprint
arXiv:2309.05610.
[17]
Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006). Calibrating noise to sensitivity in
private data analysis. InTheory of Cryptography: Third Theory of Cryptography Conference, TCC
2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pages 265–284. Springer.
[18]
Dwork, C., Roth, A., et al. (2014). The algorithmic foundations of differential privacy.Founda-
tions and Trends® in Theoretical Computer Science, 9(3–4):211–407.
[19]
Eldan, R. and Russinovich, M. (2023). Who’s harry potter? approximate unlearning in llms.
arXiv preprint arXiv:2310.02238.
[20]
Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen,
A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones,
A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish,
S., and Olah, C. (2021). A mathematical framework for transformer circuits.Transformer Circuits
Thread.
[21]
European Union (2016). Regulation (EU) 2016/679 of the European Parliament and of the
Council of 27 april 2016 on the protection of natural persons with regard to the processing of
personal data and on the free movement of such data, and repealing Directive 95/46/EC (General
Data Protection Regulation).Official Journal, L 110:1–88.
[22]
Fredrikson, M., Lantz, E., Jha, S., Lin, S., Page, D., and Ristenpart, T. (2014). Privacy in
pharmacogenetics: An{End-to-End}case study of personalized warfarin dosing. In23rd USENIX
security symposium (USENIX Security 14), pages 17–32.
[23]
Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A.,
Nabeshima, N., et al. (2020). The pile: An 800gb dataset of diverse text for language modeling.
arXiv preprint arXiv:2101.00027.
[24]
Geva, M., Caciularu, A., Wang, K. R., and Goldberg, Y. (2022). Transformer feed-forward layers
build predictions by promoting concepts in the vocabulary space.arXiv preprint arXiv:2203.14680.
[25]Geva, M., Schuster, R., Berant, J., and Levy, O. (2020). Transformer feed-forward layers are
key-value memories.arXiv preprint arXiv:2012.14913.
[26]
Henderson, P., Sinha, K., Angelard-Gontier, N., Ke, N. R., Fried, G., Lowe, R., and Pineau, J.
(2018). Ethical challenges in data-driven dialogue systems. InProceedings of the 2018 AAAI/ACM
Conference on AI, Ethics, and Society, pages 123–129.
[27]
Hoory, S., Feder, A., Tendler, A., Erell, S., Peled-Cohen, A., Laish, I., Nakhost, H., Stemmer, U.,
Benjamini, A., Hassidim, A., and Matias, Y. (2021). Learning and evaluating a differentially private
pre-trained language model. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t., editors,
Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1178–1189,
Punta Cana, Dominican Republic. Association for Computational Linguistics.
[28]
Ippolito, D., Tramèr, F., Nasr, M., Zhang, C., Jagielski, M., Lee, K., Choquette-Choo, C. A.,
and Carlini, N. (2023). Preventing generation of verbatim memorization in language models gives
a false sense of privacy. InProceedings of the 16th International Natural Language Generation
Conference, pages 28–53. Association for Computational Linguistics.
[29]
Jang, J., Yoon, D., Yang, S., Cha, S., Lee, M., Logeswaran, L., and Seo, M. (2022). Knowledge
unlearning for mitigating privacy risks in language models.arXiv preprint arXiv:2210.01504.
[30]
Kandpal, N., Wallace, E., and Raffel, C. (2022). Deduplicating training data mitigates privacy
risks in language models. InInternational Conference on Machine Learning, pages 10697–10707.
PMLR.
[31]
Krishna, K., Song, Y., Karpinska, M., Wieting, J., and Iyyer, M. (2024). Paraphrasing evades
detectors of ai-generated text, but retrieval is an effective defense.Advances in Neural Information
Processing Systems, 36.
[32]
Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., and Car-
lini, N. (2021). Deduplicating training data makes language models better.arXiv preprint
arXiv:2107.06499.
[33]
Li, H., Guo, D., Fan, W., Xu, M., Huang, J., Meng, F., and Song, Y. (2023). Multi-step
jailbreaking privacy attacks on chatgpt.arXiv preprint arXiv:2304.05197.
[34]
Li, X., Tramer, F., Liang, P., and Hashimoto, T. (2021). Large language models can be strong
differentially private learners.arXiv preprint arXiv:2110.05679.
[35]
Liu, S., Yao, Y., Jia, J., Casper, S., Baracaldo, N., Hase, P., Xu, X., Yao, Y., Li, H., Varshney,
K. R., et al. (2024). Rethinking machine unlearning for large language models.arXiv preprint
arXiv:2402.08787.
[36]
Lukas, N., Salem, A., Sim, R., Tople, S., Wutschitz, L., and Zanella-Béguelin, S. (2023).
Analyzing leakage of personally identifiable information in language models. In2023 IEEE
Symposium on Security and Privacy (SP), pages 346–363. IEEE.
[37]
Madaan, A., Tandon, N., Clark, P., and Yang, Y. (2022). Memory-assisted prompt editing to
improve gpt-3 after deployment.arXiv preprint arXiv:2201.06009.
[38]
Meng, K., Bau, D., Andonian, A., and Belinkov, Y. (2022a). Locating and editing factual
associations in gpt.Advances in Neural Information Processing Systems, 35:17359–17372.
[39]
Meng, K., Sharma, A. S., Andonian, A., Belinkov, Y., and Bau, D. (2022b). Mass-editing
memory in a transformer.arXiv preprint arXiv:2210.07229.
[40]
Microsoft (2021). Presidio: Data protection and de-identification SDK. https://microsoft.github.io/presidio/.
[41]
Neel, S., Roth, A., and Sharifi-Malvajerdi, S. (2021). Descent-to-delete: Gradient-based
methods for machine unlearning. InAlgorithmic Learning Theory, pages 931–962. PMLR.
[42]
nostalgebraist (2020). interpreting GPT: the logit lens. LessWrong blog post.
[43]
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal,
S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human
feedback.Advances in neural information processing systems, 35:27730–27744.
[44]
Patil, V., Hase, P., and Bansal, M. (2023). Can sensitive information be deleted from llms?
objectives for defending against extraction attacks.arXiv preprint arXiv:2309.17410.
[45]
Pawelczyk, M., Neel, S., and Lakkaraju, H. (2023). In-context unlearning: Language models as
few shot unlearners.arXiv preprint arXiv:2310.07579.
[46]
Petroni, F., Rocktäschel, T., Riedel, S., Lewis, P., Bakhtin, A., Wu, Y., and Miller, A. (2019).
Language models as knowledge bases? In Inui, K., Jiang, J., Ng, V., and Wan, X., editors,
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and
the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages
2463–2473, Hong Kong, China. Association for Computational Linguistics.
[47]
Ramaswamy, S., Thakkar, O., Mathews, R., Andrew, G., McMahan, H. B., and Beaufays, F.
(2020). Training production language models without memorizing user data.arXiv preprint
arXiv:2009.10031.
[48]
Rao, A., Vashistha, S., Naik, A., Aditya, S., and Choudhury, M. (2023). Tricking llms into dis-
obedience: Understanding, analyzing, and preventing jailbreaks.arXiv preprint arXiv:2305.14965.
[49]Shokri, R., Stronati, M., Song, C., and Shmatikov, V. (2017). Membership inference attacks
against machine learning models. In2017 IEEE symposium on security and privacy (SP), pages
3–18. IEEE.
[50]
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u.,
and Polosukhin, I. (2017). Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio,
S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors,Advances in Neural
Information Processing Systems, volume 30. Curran Associates, Inc.
[51]
Wang, B. and Komatsuzaki, A. (2021a). GPT-J-6B: A 6 billion parameter autoregressive language
model.
[52]
Wang, B. and Komatsuzaki, A. (2021b). GPT-J-6B: A 6 Billion Parameter Autoregressive
Language Model.https://github.com/kingoflolz/mesh-transformer-jax.
[53] Westin, A. F. (1968). Privacy and freedom. Washington and Lee Law Review, 25(1):166.
[54]
Wu, X., Li, J., Xu, M., Dong, W., Wu, S., Bian, C., and Xiong, D. (2023). Depn: Detecting and
editing privacy neurons in pretrained language models.arXiv preprint arXiv:2310.20138.
[55]
Yao, Y., Xu, X., and Liu, Y. (2023). Large language model unlearning.arXiv preprint
arXiv:2310.10683.
[56]
Yu, C., Jeoung, S., Kasi, A., Yu, P., and Ji, H. (2023). Unlearning bias in language models by
partitioning gradients. InFindings of the Association for Computational Linguistics: ACL 2023,
pages 6032–6048.
[57]
Zhang, C., Ippolito, D., Lee, K., Jagielski, M., Tramer, F., and Carlini, N. (2023). Counterfactual
memorization in neural language models. In Oh, A., Naumann, T., Globerson, A., Saenko, K.,
Hardt, M., and Levine, S., editors,Advances in Neural Information Processing Systems, volume 36,
pages 39321–39362. Curran Associates, Inc.
[58]
Zheng, C., Li, L., Dong, Q., Fan, Y., Wu, Z., Xu, J., and Chang, B. (2023). Can we edit factual
knowledge by in-context learning? In Bouamor, H., Pino, J., and Bali, K., editors,Proceedings of
the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4862–4876,
Singapore. Association for Computational Linguistics.
[59]
Zhu, C., Rawat, A. S., Zaheer, M., Bhojanapalli, S., Li, D., Yu, F., and Kumar, S. (2020).
Modifying memories in transformer models.arXiv preprint arXiv:2012.00363.
A Unlearning methods: implementation and experimental details
This section provides implementation details for each unlearning method, describes the hyper-
parameter search and experiments conducted for REVS and the MEMIT baseline, and outlines the
attempts to make the FT-L baseline work effectively.
A.1 REVS
An extensive hyper-parameter search was performed, focusing on the maximum number of neurons,
the neuron choosing strategy, the desired rank of the target token in the edited neurons, and the desired
rank of the target token within the residual hidden state layer. The best configuration found through this
tuning process was as follows: a maximum of N_max = 30 neurons were edited per layer. The
neurons were chosen by first selecting the top k = 100 neurons with the highest activation values,
and then, from within these neurons, selecting the ones with the highest rank for the target token when
projected to the vocabulary logit space. For the target token, the desired rank r_h within the
residual hidden layer was set to r_h = 200. The desired rank r_n within the edited neurons was set to
r_n = 30000, where the maximum possible rank for any token in the GPT-J vocabulary is 50400.
A.2 MEMIT Baseline
We used a modified version of MEMIT [39], originally proposed for inserting new knowledge into
language models by editing the FF_2 matrices in model MLPs. In our case, we optimized the objective
to decrease the probability of generating the target tokens instead of increasing it, to facilitate
unlearning. The original MEMIT method was proposed for editing memorized facts of the form
(subject s, relation r, object o), e.g., (s = Michael Jordan, r = plays sport, o = basketball), where
the subject is used for calculating the updated weights. However, the targeted sensitive information
might not have an explicit subject (e.g., "Send me an email with your decision soon as you get this;
david.lewis@email.com"), which is the case for many samples in the Email dataset (Table 5). Furthermore,
the targeted sensitive information may be memorized by various inherently different prompts,
where each sentence might have a unique subject (e.g., "David's email is: david.lewis@email.com"
and "Contact Lewis at: david.lewis@email.com"). To address this challenge, we introduced a custom
hyper-parameter that sets the subject as the last n characters of the prompt, which, when fed to the
model, generate the memorized sensitive information. We optimized this hyper-parameter and found
that the best results were achieved when n = 50. Concretely, the following tuple was used as the
input to the modified method: (s = the last n characters of the prompt, r = <empty>, o = target token to
unlearn).
The method's hyper-parameters include the loss threshold for early stopping, the learning rate,
and the loss probability coefficient. We added two new hyper-parameters to the adjusted method: the
token's probability loss coefficient, which determines the trade-off between the Kullback-Leibler
(KL) divergence loss that retains the current weight values and the term representing the probability of
generating the targeted token, and the loss break, which is an early stopping criterion that determines
when to stop the optimization process. We conducted an exhaustive experiment to find the best configuration
of hyper-parameters for MEMIT in our unlearning setting. The tuned hyper-parameters were the
learning rate, the token's probability loss coefficient, and the loss break. The optimal configuration was a
learning rate of 0.05, a token's probability loss coefficient of 20, and a loss break of 0.1. The rest of the
hyper-parameters were left as suggested in the original paper [39].
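Conceptually, the modified objective can be written as below: the target token's log-probability is pushed down while a KL term, traded off against the probability coefficient, keeps the edited prediction distribution close to the original one. This is a schematic rendering of the description above, not MEMIT's actual optimization code; the coefficient of 20 mirrors the reported setting.

import torch
import torch.nn.functional as F

def unlearning_objective(edited_logits: torch.Tensor,
                         original_logits: torch.Tensor,
                         target_id: int,
                         prob_coef: float = 20.0) -> torch.Tensor:
    log_probs = F.log_softmax(edited_logits, dim=-1)
    target_term = log_probs[target_id]                     # minimized -> target probability decreases
    kl_term = F.kl_div(log_probs, F.softmax(original_logits, dim=-1),
                       reduction="sum")                    # keep the edited model close to the original
    return prob_coef * target_term + kl_term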
A.3 FT-L baseline
We performed an extensive hyper-parameter search for the FT-L baseline. The hyper-parameters
tuned were the learning rate, targeted layers for fine-tuning, and regularization techniques. Despite
this extensive tuning, FT-L performed poorly compared to REVS and MEMIT across most metrics,
as reported in the main results. The best configuration for FT-L involved targeting all feed-forward
layers, with a learning rate of 2e−07 and a norm constraint of 1e−06.
We explored various strategies for improving FT-L, including fine-tuning only a single layer, multiple
layers, and all feed-forward layers, with and without norm constraint and weight decay. However,
none of these strategies yielded better results than the reported configuration.
A key challenge with FT-L is the extremely low specificity achieved on the Email dataset across all
hyper-parameter configurations. While perfect efficacy could theoretically be achieved by making the
unlearning process stronger, even at decent efficacy levels, the specificity remained very low. This
trade-off between efficacy and specificity resulted in a consistently low overall harmonic mean score
for FT-L on the Email dataset.
A.4 Hardware details
The experiments described in this work were conducted on a computing system equipped with 32
Intel(R) Xeon(R) Gold 6430 CPUs and 1.0TB of RAM. For running the REVS, MEMIT, and
FT-L methods on the GPT-J model, we used NVIDIA RTX 6000 Ada Generation GPUs with 49GB of
VRAM. A single GPU was enough for applying REVS and MEMIT, while a setup of four
GPUs was required for the FT-L method.
B Adjusting neuron rank
The AdjustRank step is designed to find the logit value for the edited neurons that yields the desired
rank r_n for the target token t_i in the vector of logits v⃗. To achieve this, the algorithm iteratively
adjusts the logit value of t_i and projects the updated logits back and forth between the vocabulary
space and the hidden state space until the desired rank is obtained. The process begins by initializing
the logit value of t_i to the mean of the k lowest logit values in v⃗. This initial value is then iteratively
adjusted by increasing or decreasing it based on the difference between the current rank of t_i and
the desired rank r_n. The adjustment factors, δ_h and δ_l, control the rate at which the logit value is
increased or decreased, respectively. After each adjustment, the updated logits are projected back to
the hidden state space and then to the vocabulary space to obtain a new rank for t_i. This iterative
process continues until the rank of t_i falls within the allowed margin of error ϵ from the desired rank
r_n. The iterative approach is necessary because projecting from the vocabulary space to the hidden
state space using the pseudo-inverse of the unembedding matrix W† is an approximation, making it
infeasible to deterministically calculate the exact logit value that yields the desired rank.
The algorithm, shown in Algorithm 2, takes the vector of logits v⃗, the target token t_i, the desired rank
r_n, the number of lowest logits k to consider for initialization, the maximum allowed rank difference
ϵ, and the factors δ_h and δ_l for increasing or decreasing the logit value, respectively. In practice, we
use AdjustRank with k = 10,000, ϵ = 7500, δ_h = 1.3, and δ_l = 0.8.
Algorithm 2 AdjustRank Algorithm
1: procedure ADJUSTRANK(v⃗, t_i, r_n, k, ϵ, δ_h, δ_l)
2:   v_{t_i} ← MeanKLowestLogits(v⃗, k)    ▷ Initialize as mean of k lowest logits in v⃗
3:   repeat
4:     v⃗[t_i] ← v_{t_i}    ▷ Update logit value of t_i in v⃗
5:     n⃗ ← W† v⃗    ▷ Project updated logits to neuron
6:     v⃗ ← n⃗ W    ▷ Project updated neuron to logits
7:     r ← Rank(t_i, v⃗)    ▷ Get rank of t_i in projected logits
8:     if |r − r_n| ≤ ϵ then
9:       break    ▷ Stop if rank is within ϵ of desired rank
10:    else if r > r_n then
11:      v_{t_i} ← v_{t_i} · δ_l    ▷ Decrease v_{t_i} if rank is too high
12:    else
13:      v_{t_i} ← v_{t_i} · δ_h    ▷ Increase v_{t_i} if rank is too low
14:    end if
15:  until true
16:  return v⃗    ▷ Return the updated logits vector
17: end procedure
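A compact Python rendering of Algorithm 2 under the same assumptions as the earlier sketches (W_U is the unembedding matrix, W_pinv its pseudo-inverse); this is a sketch of the search loop, not the released implementation.

import torch

def rank_of(v: torch.Tensor, token_id: int) -> int:
    """Rank of token_id in logit vector v (1 = most likely)."""
    return int((v > v[token_id]).sum()) + 1

def adjust_rank(v, token_id, r_n, W_U, W_pinv,
                k=10_000, eps=7_500, delta_h=1.3, delta_l=0.8):
    v = v.clone()
    val = v.topk(k, largest=False).values.mean()       # initialize to mean of the k lowest logits
    while True:
        v[token_id] = val
        v_proj = W_U @ (W_pinv @ v)                    # round-trip through hidden space (approximate)
        r = rank_of(v_proj, token_id)
        if abs(r - r_n) <= eps:                        # rank is within eps of the desired rank
            return v
        val = val * (delta_l if r > r_n else delta_h)  # shrink / grow the candidate logit value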
C Alternative neuron selection strategies
We explored several alternative methods for neuron selection, besides our hybrid approach that
considers both neuron activation and association with the target token (Section 3.4). These alternatives
include: selecting the most activated neurons given the prompt, selecting neurons solely based on
their association with the target token (by choosing the top-k neurons with the highest rank in their
projected logits vector for the target token), selecting neurons with the highest gradients with respect to
the prompt [13], and random neuron selection as a baseline. As shown in Table 3, our evaluation on the
Emails dataset, using the same seed that was used for finding the best hyper-parameters, demonstrated
that the hybrid approach (rank + activation) outperformed these alternatives, achieving superior results
on both metrics (unlearning effectiveness and model integrity, and extraction resistance).

Table 3: Comparison of neuron selection strategies for unlearning on the Emails dataset.

Method                       Harmonic Mean:                Harmonic Mean:
                             Unlearning Effectiveness      Extraction Resistance
                             and Model Integrity
Random                       0                             0
Gradient-based               71.1                          78.2
Rank                         22.4                          29.2
Activations                  68.8                          81.8
Rank + Activations (hybrid)  75.2                          84.2
D Datasets curation
D.1 Synthetic SSN dataset
We curated the synthetic SSN dataset by using the following prompt with Claude 3 Sonnet:
You are a synthetic data generation tool.
•I need you to create template sentences across different domains
that could potentially contain sensitive personal information like
names, social security numbers, and dates. However, instead of
using real data, I want you to use placeholder tags like [NAME],
[SSN], and [DATE] within the sentences.
•For each domain listed below, provide 20 full sentences following
this format:
•The sentences should be written in a way that reads naturally, as if
they could appear in real documents from that domain.
•The placeholders [NAME] and [SSN] should appear mid-sentence or
towards the end, not at the beginning.
•Use a variety of different sentence structures and phrasings for the
sentences within each domain to make them as diverse as possible.
•Do not include any actual personal names, numbers, or dates - only
use the placeholder tags.
Example sentences from the medical domain:
•Adjusting the medication dosage on [DATE] to better manage a thyroid
condition, the endocrinologist treated [NAME], bearer of SSN [SSN].
•Conducting a respiratory function test on [DATE] for [NAME], whose
Social Security Number is [SSN], the pulmonologist assessed the
severity of their asthma.
The domains I need diverse template sentences for are:
•Employment Records
•Tax Documents
•Financial Records
•Government Records
•Medical Records
Please proceed by providing 20 unique and diverse sentences per domain in
the specified format using the placeholder tags. Let me know if you need
any clarification on the instructions.
We kept asking the model to generate examples until we had a total of 200 sentences. We then randomly assigned fake sensitive information to each placeholder. To populate the placeholders, we used Claude 3 Sonnet to generate a list of 40 unique names (several sentences are assigned the same name and SSN, which allows evaluating generalization). Additionally, we generated 200 random dates and 40 random Social Security Numbers (SSNs). To get the model to memorize the sentences containing the sensitive information, we fine-tuned the model with a low learning rate of 2e−6 to minimize unintended impact. The fine-tuning process ran for 4 epochs with a batch size of 1.
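The exact script used to fill the templates is not included in the paper; the following is a small hypothetical sketch of the assignment scheme, where consecutive groups of templates share one (name, SSN) pair so that generalization can later be evaluated on held-out prompts.

```python
import random

def populate_templates(templates, names, ssns, dates, prompts_per_target=5, seed=0):
    """Fill [NAME], [SSN], and [DATE] placeholders with fake values.

    Consecutive blocks of `prompts_per_target` templates share the same (name, SSN)
    pair, so unlearning can use one prompt per target and evaluate on the rest.
    The date assignment below is illustrative.
    """
    rng = random.Random(seed)
    sentences = []
    for i, template in enumerate(templates):
        name = names[i // prompts_per_target]
        ssn = ssns[i // prompts_per_target]
        sentence = (template.replace("[NAME]", name)
                            .replace("[SSN]", ssn)
                            .replace("[DATE]", rng.choice(dates)))
        sentences.append(sentence)
    return sentences
```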
Table 4 shows 10 example sentences from this dataset.
Table 4: Example sentences containing personal information from the synthetic dataset.

1. During a routine check-up on 23-3-2021, Michael Choi, with the Social Security Number 135-54-3368, discussed their recent test results and any necessary follow-up steps with the attending physician.
2. On 2-2-2021, Michael Choi, whose Social Security Number is 135-54-3368, participated in a group therapy session focused on coping mechanisms for managing chronic pain.
3. At a scheduled appointment on 19-6-2016, Michael Choi, with an SSN of 135-54-3368, received guidance from a physical therapist on exercises to improve mobility and strengthen core muscles.
4. During a visit on 22-10-2018, Michael Choi, whose Social Security Number is 135-54-3368, consulted with a dermatologist about a concerning mole and underwent a biopsy procedure.
5. On 22-10-2019, Michael Choi, with the Social Security Number 135-54-3368, attended a seminar on the importance of regular preventive screenings and early detection of various health conditions.
6. The patient, a 45-year-old Mark Evans, with the Social Security Number 404-61-5777, visited the clinic on 28-11-2015 for a routine physical examination.
7. During a follow-up appointment on 8-1-2014, the physician reviewed the recent test results with Mark Evans, whose Social Security Number is 404-61-5777.
8. On 12-12-2014, the physical therapist designed a customized exercise plan for Mark Evans, with an SSN of 404-61-5777, to aid in their recovery from a knee injury.
9. The nutritionist provided dietary recommendations on 7-5-2021 to Mark Evans, whose Social Security Number is 404-61-5777, to help manage their high cholesterol levels.
10. At the mental health clinic on 13-6-2013, Mark Evans, with the Social Security Number 404-61-5777, participated in a group therapy session focused on stress management techniques.
D.2 Organic Email dataset
Handling of sensitive data. While The Pile is a publicly available dataset, appropriate measures were taken to handle the sensitive private information present in it, which includes real email addresses. The raw dataset was accessed only by the authors. At no point were any individual email addresses or other PII exposed or included in the research outputs or publicly released datasets. We will share the email addresses memorized by GPT-J upon request for research purposes, provided the requesting researchers state that they will adhere to the same standards.
Table 5: Example sentences containing obscured email addresses from the dataset.

1. Copyright 2004-2006, 2009-2010 Colin Green (sharp***@gmail.com)
2. Let Gus know by sending an email to g***@flyin***.com.
3. If you need assistance with this program, you may contact: m***@laptop.org.
4. You can contact SugarCRM, Inc. headquarters at c***@sugarcrm.com.
5. Authors: # Mike Auty <mike.***@gmail.com>
6. Please send bug reports and support requests to l***@saillard.org.
E Perturbation attack (PA)
The Perturbation Attack (PA) is a white-box attack that aims to fool the language model by introducing
subtle perturbations to the input prompts. In this attack, the original prompts are modified by randomly
inserting characters at various positions. We experimented with different perturbation strategies,
including varying the types of characters inserted and the number of insertion points. Our findings
indicate that the most effective approach is to insert spaces into the original prompt at 10 different random indices, as well as to insert a space immediately after the prompt. Similar to the Logit-Lens Attack (LLA), the candidate set C_PA for the PA is obtained by projecting each layer's residual hidden state to the vocabulary space and taking the k highest- and k lowest-ranked tokens as candidates.
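Below is a minimal sketch of this perturbation, under the assumption that insertion positions are sampled uniformly; the function name and seed handling are illustrative.

```python
import random

def perturb_prompt(prompt, num_insertions=10, seed=None):
    """Insert spaces at `num_insertions` random positions of the prompt,
    then append one space immediately after it."""
    rng = random.Random(seed)
    chars = list(prompt)
    for _ in range(num_insertions):
        pos = rng.randrange(len(chars) + 1)  # any position, including the ends
        chars.insert(pos, " ")
    return "".join(chars) + " "

# e.g. perturb_prompt("Contact David Lewis at ") returns the prompt with ten extra
# spaces scattered through it plus a trailing space; the perturbed prompt is then fed
# to the model and each layer's residual stream is inspected as in the LLA.
```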
18
Models via Rank Editing in the Vocabulary Space
Tomer Ashuach Martin Tutek Yonatan Belinkov
Technion – Israel Institute of Technology
{tomerashuach,martin.tutek,belinkov}@campus.technion.ac.il
Abstract
Large language models (LLMs) risk inadvertently memorizing and divulging
sensitive or personally identifiable information (PII) seen in training data, causing
privacy concerns. Current approaches to address this issue involve costly dataset
scrubbing, or model filtering through unlearning and model editing, which can be
bypassed through extraction attacks. We propose REVS, a novel model editing
method for unlearning sensitive information from LLMs. REVS identifies and
modifies a small subset of neurons relevant for each piece of sensitive information.
By projecting these neurons to the vocabulary space (unembedding), we pinpoint
the components driving its generation. We then compute a model edit based on the
pseudo-inverse of the unembedding matrix, and apply it to de-promote generation
of the targeted sensitive data. To adequately evaluate our method on truly sensitive
information, we curate two datasets: an email dataset inherently memorized by GPT-
J, and a synthetic social security number dataset that we tune the model to memorize.
Compared to other state-of-the-art model editing methods, REVS demonstrates
superior performance in both eliminating sensitive information and robustness to
extraction attacks, while retaining integrity of the underlying model. The code and a
demo notebook are available athttps://technion-cs-nlp.github.io/REVS.
1 Introduction
Large language models (LLMs) exhibit a concerning tendency to memorize information from their
training data [11,12,46]. While factual recall is a desirable property when dealing with general
knowledge, memorization and regurgitation of sensitive private information, such as personal and
contact details, is a security concern regulated by laws like the general data protection regulation
[GDPR;21]. To ensure such information does not get inadvertently leaked, it is paramount to develop
techniques that detect and erase sensitive information from the LLMs or their training data.
Current approaches for handling sensitive information in LLMs can be broadly categorized into two
groups.Exact unlearningapproaches tackle the problem from the data perspective, either erasing
information from datasets [30,32] or applying differential privacy [1,27,34,47] to the training data.
Such approaches are costly and time-consuming, as each iteration of scrubbing requires retraining of
the entire model. Moreover, they reduce but do not entirely prevent the risk of sensitive information
leakage [9,10].Machine unlearningapproaches [35] aim to discourage models from generating
sensitive information through techniques such as in-context unlearning [ICL;37,45,58] or gradient
ascent [29,55,56]. These approaches do not ensure that the sensitive information is erased from
model parameters, rendering them vulnerable to extraction attacks [33].Model editing, a variant of
localization-based unlearning, localizes and change a subset model parameters to erase or overwrite
the sensitive information itself [15,38,39]. In contrast to ICL and optimization-based methods,
which may prevent models from generating sensitive content but do not fundamentally erase the
underlying knowledge, editing methods hold a greater potential to resist extraction attacks [11] by
surgically removing the sensitive data from model parameters.
��������
“Contact David
Lewis at ”
�� ���
TokenRank
dle1
……
lewR
……
Prompt Memorized Email
“Contact David Lewis at ”lewis.david@email.com
… …
(��) (��) (��)
“Contact David
Lewis at “
TokenRank
lew1
dle2
th3
…… Figure 1: Overview of the REVS unlearning process: (1) The original model memorizes a sensitive
email address and (2) generates it exactly given a related prompt using greedy decoding. (3) After
applying REVS, the target email token(s) are demoted to a specified lower rankRin the model’s
output, preventing the model from generating the unlearned email.
We introduceRankEditing in theVocabularySpace (REVS), a novel model editing approach that
enables robust unlearning of sensitive information from LLMs while maintaining model performance
and offering strong robustness against extraction attacks. Adopting the view that transformer MLP
layers construct predictions by promoting specific tokens in the output vocabulary space [24], REVS
locates the layers and particular subsets of neurons that promote the tokens corresponding to the
targeted sensitive information. By modifying these neurons with respect to the relevant tokens, REVS
surgically removes the model’s encoded tendency to generate that sensitive data while preserving its
broader knowledge and remaining robust to extraction attacks. See Figure 1 for an illustration.
As private data is infrequent in training data, to adequately evaluate REVS, we curate two benchmark
datasets containing sensitive data: Email, containing real private email addresses inadvertently
memorized by the model during pretraining, and SSN, a synthetic dataset where we deliberately
instilled the model with sensitive social security number (SSN) information through finetuning.
Unlike prior work [44], which evaluated editing methods on non-sensitive data, our benchmarks
contain actual private information, allowing us to rigorously assess the unlearning efficacy and
robustness against potential extraction attacks. We analyze the GPT-J-6B model [4,51] as it was
shown to memorize at least1%of its training data [9]. We compare REVS against two strong
baselines for unlearning: constrained fine-tuning with gradient ascent [59] and an adapted version of
the MEMIT editing method [39]. On both real and synthetic data, we show that REVS outperforms
the baselines in unlearning the sensitive information while maintaining overall performance and
being robust to extraction attacks.
To summarize, our contributions are:
•
We introduce REVS, a novel method for unlearning sensitive information from LLM parameters.
•We curate two datasets containing sensitive information: an email dataset organically memorized
by GPT-J and a synthetic SSN dataset that we fine-tune the model to memorize.
•
We extensively evaluate REVS on unlearning as well as robustness to extraction attacks, and
show that it outperforms state-of-the-art model editing methods.
2 Problem setup and preliminaries
Consider the case where a LLM is prompted with text from its training dataset that contains sensitive
information, e.g., “Contact David Lewis at ”. If the model has memorized the completion of this
prompt, it will generate the relevant email address, such as “lewis.david@email.com”, illustrated in
Figure 1. This scenario highlights a prevalent issue with LLMs: the models’ ability to memorize can
lead to unintended divulgence of sensitive data, in particular personally identifiable information (PII).
Formally, letDbe a dataset, and letMD,Θ be a LLM with parametersΘtrained onD. We consider
the modelMD,Θ to be aprivacy riskwhen the model reproduces a sequence of tokens containing
PIISfromDwhen prompted with a sequence of tokensP. Exact unlearning approaches tackle this
issue from thedata perspective, preventing sensitive information from being stored in the model
during training. Such approaches yieldMˆ
D,
ˆ
Θ , as the learning algorithm will inadvertently converge
to a different set of parameters due to a modified dataset. Machine unlearning approaches take
on a post-hoc approach, modifying either all, or just a subset of parameters that contain sensitive
information, yieldingM
D,
ˆ
Θ . The goal of these approaches is to efficiently and effectively prevent the
model from generating each sequence containing sensitive informationSin a way robust to extraction
attacks, while at the same time preserving model performance on unrelated tasks [35].
2
Background on Transformer LMs.We briefly describe the main components in Transformer LMs;
more details are found elsewhere [4,50,51]. Omitting some technical details, a Transformer LM is
made of a sequence of blocks. Each block contains self-attention and multi-layer perceptrons (MLPs)
that write and read to the residual stream [20]. The MLP is defined asMLP(x) =FF2σ(FF1x) , and
is added to the residual stream. The final layer hidden state is projected to the vocabulary space via
an unembedding matrixW, followed by a softmax to obtain the probability of the next token.
Due to the residual stream tying hidden spaces across layers, applying the unembedding matrix to
residual states at intermediate layers results in meaningful token logits [42], while applying it to MLP
weights reveals stored knowledge [14]. In fact, the second MLP layer,FF2, acts as a memory store
[25,38], and model editing methods often target this layer to update a model’s knowledge [38,39].
We adopt this view, and focus on the FF2neurons (columns) for unlearning.
3 Methodology
In this section, we first provide a high-level overview of REVS (Section 3.1), then delve into each of
its modules: selecting which tokens to unlearn (Section 3.2), which layers (Section 3.3) and which
neurons (Section 3.4) to update, and finally, how to compute and apply the actual update (Section 3.5).
3.1 REVS unlearning process
To unlearn each target piece of sensitive information, REVS firstlocalizesmodel parameters that
contribute the most towards their generation. It then adjusts selected neuron representations, aiming
to reduce those layers’ promotion of the sensitive information by reducing the rank of the target
tokens in each layer’s residual, while minimizing total model edits. The updated neuron values reflect
the modified ranks, effectively demoting the associated sensitive information while retaining the
model’s broader knowledge. The core process of REVS involves the following steps:
1. S, select a subset of target tokensT=t1, t2, ..., ttfromS(3.2).
2.
For each target sensitive tokenti∈T , identify the layers where the rank oftiin the residual
hidden state vectorhis above a desired threshold rankrh(3.3).
3.
4. tibelow neuron threshold rankrn(3.5).
5.
A detailed description of this process can be found in Algorithm 1.
Algorithm 1Unlearning Sensitive Information from LLM
1:procedureUNLEARN(M, P, S, rh, rn, Nmax)
2:V←vocabulary size ofM
3:L←number of layers inM
4:dh←hidden dimension ofM
5:W∈R
V×d
h
,W
†
∈R
d
h×V
▷Unembedding matrix and its pseudo-inverse
6:T←SelectTargetTokens(S, t) ▷Selectttokens inS
7:forti∈Tdo
8: forl∈1, . . . , Ldo
9: Nedit← ∅ ▷Neurons to edit
10: repeat
11: ⃗n←SelectNeuron(M, P, ti, l) ▷Select neuron to edit
12: ⃗v←⃗nW ▷ Project neuron to vocabulary logits
13: ⃗v←AdjustRank(⃗v, ti, rn) ▷Adjust logits to de-promote rank (Appendix B)
14: ⃗n←W
†
⃗v ▷ Project adjusted logits back to neuron
15: Nedit←Nedit∪⃗n
16: M←UpdateModel(M, Nedit, l) ▷Update model with edited neurons
17: h
l
←ResidualHiddenState(M, P, l) ▷Get residual hidden state for layerl
18: untilRank(ti, h
l
)≥rhor|Nedit|=Nmax
19: end for
20:end for
21:end procedure
3
(��)
(��)
(��)
ℎ
�
�?5
��������
�
ℎ
�
�
������
�
��
�� ��
��
…
…
��
�� ��
��
Token Rank
lew 1
dle 2
th 3
… …
Token Rank
dle 1
… …
lew R
… … Figure 2: Editing one neuron with REVS: (1) The neuron is projected from hidden space to vocabulary
logit space. (2) The logit is adjusted to demote the target token rank to a desired lower rankR. (3)
The adjusted logits vector is projected back to hidden space, yielding the updated neuron value.
3.2 Target token selection
While sensitive information often comprises multiple tokens, unlearning all tokens is unnecessary
since accurately extracting the original data requires recovering the full token sequence. Therefore,
our approach focuses on unlearning the rarestttokens from each target sequence.
3.3 Selecting layers for editing
For each target tokentito unlearn, REVS identifies all layerslwhere the rank oftiin the residual
hidden state vectorh
lafter applying the unembedding matrix is higher than the desired threshold
rankr. By targeting the layers that significantly contribute to generatingti, REVS aims to surgically
edit only the relevant model parameters for minimizing overall modifications to the model.
3.4 Selecting neurons for editing
To select which neurons to edit within the identified layers, we consider both the neuron’s association
with the target tokentiand its activation level in response to the given promptP. First, we select
the topkneurons with the highest activations for promptP. We refer to an activation of a neuron as
the input value of that neuron when the network is run on the given prompt, where higher activations
indicate that the neuron is more relevant for the given input. From this subset, we iteratively edit up
toNmaxneurons that exhibit the highest rank fortiwhen projected to the vocabulary logit space. A
higher rank indicates a strong association between the neuron andti, suggesting it encodes significant
information relevant to generatingti. This hybrid criterion aims to prioritize editing the neurons
most relevant to the undesired tokentiin the context of the specific promptP. Empirically, we find
this approach outperforms alternatives like editing only the most activated neurons, only the most
associated neurons, neurons with highest gradients w.r.t the prompt [13], or random neuron selection.
See Appendix C for ablation experiments and Section 3.5 for implementation details.
3.5 Neuron editing
As illustrated in Figure 2, REVS iteratively selects neurons from the feed-forward (F F2) layer,
projects them to the vocabulary logit space, decreases the logit value of the target token to reduce
its rank in the logits vector, and projects the adjusted logits back to the neuron’s space using the
pseudo-inverse of the unembedding matrix. This edited neuron is accumulated and used to update the
model weights. The process comprises these four steps (lines 12–15 in Algorithm 1):
1. ⃗nto the logit space as⃗vusing the unembedding matrixW.
2.
Iteratively decreasevtito reduce the rank oftibelowrn, yielding⃗˜v. See Appendix B for details.
3. ⃗˜vto the hidden space as⃗˜nusing the pseudo-inverse of the unembedding matrixW
†
.
4. ⃗˜nin the edited setNedit.
4
4 Experimental setup
4.1 Model
We experiment with GPT-J-6B [52], which was trained on the Pile [23], a publicly accessible corpus
containing sensitive information such as email addresses. Our choice of GPT-J-6B is motivated by
previous work showing it has memorized a significant portion (at least1%) of its training data [9].
4.2 Datasets
We evaluate our method on three datasets, each evaluating different aspects of our approach:
Emails from The Pile: The dataset comprises of288sentences from The Pile [23] containing
memorized email addresses. To curate it, we identified sentences with email addresses using Presidio
[40], prompted the model with the prefix before the email, and verified that it generates the exact
email in the next 20 tokens with greedy decoding. The target token sequence is the string before
@email.com. This dataset facilitates evaluation on data that the model has naturally memorized from
its training corpus. See Appendix D.2 for privacy considerations.
Synthetic SSN dataset: The dataset comprises of200sentences containing social security numbers
(SSNs). The dataset was generated by prompting Claude 3 Sonnet [3] to create template sentences on
various topics that could contain SSNs (employment records, tax documents, finance, government
records, and medical records), with placeholders for names, dates, and SSNs. We then randomly
assigned fake sensitive information to each of the placeholders. See Appendix D.1 for more informa-
tion on dataset construction. The model is fine-tuned to memorize all instances. When constructing
the dataset, we assign 20 unique SSN targets across 100 templates, such that each target SSN has
5 different prefixes. This allows evaluatinggeneralization: we unlearn using only one prompt that
generates the target token sequence, and evaluate on four additional prompts that also generate it.
The remaining 100 sentences were used to evaluate the method’s specificity, that is, whether it affects
memorization of SSN targets that were not unlearned. In this dataset, the target tokens to unlearn
were chosen only from numeric tokens, as the SSN structure contains ’-’ (e.g., 123-45-6789).
Wiki 10k
1
: 10k sentences from Wikipedia, which the model has seen during pretraining. We use
this dataset to measure the effect of unlearning on the model’s general language modeling capability
by comparing perplexity before and after the unlearning procedure.
4.3 Baseline methods
We compare our approach to two strong knowledge editing methodsmodifiedto allow unlearning:
Constrained fine-tuning (FT-L)[59]: Fine-tuning the second feed-forward layer (F F2) in the MLPs,
aiming to minimize the probability of generating the target tokensti, instead of maximizing it as in
the original work in order to allow for unlearning. FT-L imposes anL∞norm constraint to limit the
overall impact on the model’s knowledge.
MEMIT[39]: We use a modified version of MEMIT, originally proposed for inserting new knowledge
into LMs by editing theF F2matrices in model MLPs. Here, we optimize the objective to decrease
the probability of generating the target tokensti. See Appendix A.2 for details on our changes.
4.4 Evaluation metrics
To evaluate the effect of unlearning on the model, we employ a suite of metrics, which can be
categorized as: (1) assessing unlearning effectiveness (Section 4.4.1), (2) estimating the integrity of
the model after unlearning (Section 4.4.2), and (3) robustness to extraction attacks (Section 4.4.3).
4.4.1 Unlearning effectiveness
LetRank(ti) denote the rank of tokentiin the model’s logits. A naive measure would evaluate
whether a model generates the target token after unlearning it, that is, whetherRank(ti) = 1 . Instead,
we define a stricter metric. DefineScore@K(ti) as the normalized rank of tokentiwithin a set of
1
https://huggingface.co/datasets/NeelNanda/wiki-10k
5
candidate tokensC. A score of1means the token is not in the most likelyK−1tokens:
Score@K(ti) =
(
Rank(ti)
K
if Rank(ti)< K
1 otherwise
(1)
Efficacy:since extracting the sensitive information requires recovering every token in the sequence
correctly, we define the unlearning efficacy as the maximumScore@K across the subset of target
tokensT, averaged over theNtarget prompts:
Efficacy@K=
1
N
N
X
i=1
max
ti∈T
Score@K(ti) (2)
Generalization:similarly, we evaluate the unlearning generalization by computing the same measure
on prompts that were not seen by the unlearning method, but would have yielded the same piece of
sensitive information from the original model prior to unlearning it.
4.4.2 Model integrity
Specificity:assesses the model’s ability to continue generating memorizednon-targetsensitive
information by measuring the percentage of promptsPiwhere the edited model’s outputMedited(Pi)
matches the original sensitive token sequenceSi:
Specificity=
1
N
N
X
i=1
1[Medited(Pi) =Si] (3)
Perplexity:measures the perplexity before and after applying unlearning on a dataset the model was
pre-trained on (4.2) to evaluate the effect on its general capabilities.
4.4.3 Extraction resistance
This category quantifies the difficulty for an adversary to extract the unlearned sensitive token
sequence, given different attacks. The attacks differ by the set of tokens they consider for recovering
the unlearned targets. Given such a candidate setC(where|C|=K ), define the resistance to each
attack as the minimum of the average efficacy score (Eq. 2) across all model layersL. The intuition
is that it is sufficient for an attacker to extract the target in one layer for the attack to be successful.
Attack Resistance@K= min
ℓ∈L
Efficacy@K (4)
where the setCofKtokens passed to the Efficacy function is defined next.
Logit-lens attack (LLA):Considers the top-kand bottom-ktokens in the vocabulary logit vector⃗v
as candidates, obtained by projecting each layer’s residual hidden state to the vocabulary logits.
CLLA=
[
ℓ∈L
top-k(⃗v
ℓ
)∪bottom-k(⃗v
ℓ
) (5)
Delta attack (DA):Considers the top-ktokens with the largest changes in the vocabulary logit vector
⃗vbetween consecutive layers as candidates.
CDA=
[
ℓ∈L
top-k(⃗v
ℓ+1
−⃗v
ℓ
)∪bottom-k(⃗v
ℓ+1
−⃗v
ℓ
) (6)
Perturbation attack (PA):A white-box attack that inserts random characters to the original prompts.
The candidate setCPAis obtained as inCLLA, but using the perturbed prompts. This attack can be
adapted to a black-box setting by using only the last hidden state. Appendix E provides more details.
4.5 Implementation details
The target tokens to unlearn were chosen as the rarestt= 2tokens in the target token sequence.
Targeting the rarest tokens was found in preliminary experiments to provide effective prevention of
extraction attacks while preserving the model’s general capabilities. Hyper-parameters for all methods
were chosen by optimizing the harmonic mean of the metrics on the Emails and SSN datasets, using
one seed (‘0’). Each method was then evaluated with cross-validation over six random50/50splits of
each dataset (using seeds1–6). The first half of the dataset was selected as targets for unlearning and
evaluated for efficacy, generalization (for the SSN dataset), and extraction resistance, while the second
half was used as the retained set to evaluate specificity. The results presented in the next section
employ the same hyper-parameter configurations (best one for each method) across all experiments,
ensuring a fair comparison between methods without any scenario-specific optimization.
6
Table 1: Unlearning effectiveness and model integrity results. REVS is superior in almost all cases.
Dataset Method Harmonic Mean ↑Efficacy@100↑General.@100↑Specificity↑ Perplexity↓
SSN
Unedited 0±0.0 0 ±0.0 0 ±0.0 100 ±0 11.148±0
FT-L 4.78±3.66 55.78±12.37 1 .75±1.44 61.67±15.17 11.27±0.054
MEMIT (modified) 78.07±2.2 98 .5±2.33 61 .15±3.2584.17±3.0711.156±0.011
REVS (ours) 81.45±3.56 99.95±0.07 80 .17±3.22 70.33±7.84 11.165±0.01
Emails
Unedited 0±0.0 0 ±0.0 − 100±0 8.129±0
FT-L 3.33±1.77 55 .57±4.39 − 1.73±0.96 13.63±0.34
MEMIT (modified) 70.05±1.16 88 .23±1.64 − 58.1±1.63 8 .13±0
REVS (ours) 80.65±2.41 97.22±1.04 − 68.98±3.68.148±0.002
Table 2: Average results for extraction resistance. REVS is more robust to extraction attacks.
Dataset Method Harmonic Mean ↑Logit Lens@100↑Delta@100↑Perturb@100↑
SSN
Unedited 0±0.0 0 ±0.0 95.12±0.82 26 .5±5.26
FT-L 65.35±7.57 55 .63±12.64 97.03±0.8 58.62±3.49
MEMIT (modified) 93.52±1.76 97 .48±2.01 97.88±0.64 90.93±3.53
REVS (ours) 99.12±3.56 99 .95±0.07 98.55±0.2 98.97±1.46
Emails
Unedited 0±0.0 0 ±0.0 83.8±0.67 44 .2±4.11
FT-L 50.15±3.08 28 .47±2.73 85.83±1.11 78.08±5.15
MEMIT (modified) 80.73±1.7 79 .62±2.31 86.17±0.39 77.12±3.86
REVS (ours) 83.48±1.14 81 .05±1.17 87.08±0.25 82.63±2.63
5 Results
5.1 Main results
Superior unlearning effectiveness with preserved model integrity.As shown in Table 1, both
REVS and MEMIT achieve strong results in terms of unlearning effectiveness and model integrity.
However, REVS consistently outperforms MEMIT in the harmonic mean score and individual metrics
on both datasets, except for a single case of specificity on the SSN dataset. The difference in
specificity between datasets can be attributed to the nature of the target token sequences. While the
Emails dataset targets consist of unique tokens with low overlap, the SSN targets comprise only
numbers, leading to a high overlap with other instances within the specificity calculation. The superior
generalization score of REVS indicates its ability to comprehensively unlearn generating the target
tokens across different prompts, even when applied to a single prompt-target pair. Despite extensive
hyper-parameter tuning, the FT-L method performed poorly compared to REVS and MEMIT across
most metrics. Further details on the experiments and attempts to improve the performance of FT-L
are found in Appendix A.3. Perplexity remains largely unaffected, indicating minimal impact on the
models’ general capabilities, with the exception of degraded perplexity for FT-L in Emails.
Robust resistance to extraction attacks.As shown in Table 2, both REVS and MEMIT achieve
excellent results in preventing the retrieval of unlearned sensitive information across various adversar-
ial attacks. However, REVS maintains an edge over MEMIT, surpassing it across all tested extraction
attacks on both datasets.
Organic vs. synthetic memorization.The organically memorized Emails dataset proved more
challenging for unlearning while preserving model integrity, as evident from lower specificity
(Table 1). This suggests there are inherent differences between memorization during pre-training
and fine-tuning. Concretely, it is easier to extract the unlearned information from the organic Emails
dataset compared to the synthetic SSN dataset (Table 2), implying it is harder to defend against attacks
when the information is organically memorized and likely deeper ingrained in model parameters.
5.2 Analysis
Robustness to candidate token size.Here we evaluate the effectiveness of REVS compared to
MEMIT while varying the size of the candidate token set. In Figures 3a and 3d we observe that
REVS consistently outperforms MEMIT as we increase the size of the candidate token set (C).
7
50 100 150 200
Size of Candidate Tokens
85
90
95
100
Efficacy
Efficacy vs Size of Candidate Tokens
REVS (ours)
MEMIT (a)0.0 0.2 0.4 0.6 0.8 1.0
Efficacy
0.6
0.7
0.8
0.9
Specificity
Specificity vs Efficacy
REVS (ours)
MEMIT (b)50 100 150 200
Number of Edits
60
70
80
90
100
Harmonic Mean
Harmonic Mean vs Number of Edits
REVS (ours)
MEMIT (c)50 100 150 200
Size of Candidate Tokens
96
98
100
Efficacy
Efficacy vs Size of Candidate Tokens
REVS (ours)
MEMIT (d)0.0 0.2 0.4 0.6 0.8 1.0
Efficacy
0.8
0.9
1.0
Specificity
Specificity vs Efficacy
REVS (ours)
MEMIT (e)25 50 75 100125
Number of Targets
70
80
90
Harmonic Mean
Harmonic Mean vs Number of Targets
REVS (ours)
MEMIT (f)
Figure 3: Top row: Email dataset; Bottom row: SSN dataset. From left to right: (a+d) efficacy vs
candidates size, (b+e) efficacy vs. specificity trade-off, (c+f) harmonic mean vs. number of edits for
Emails (c) and number of targets for SSN (f). Mean results with confidence intervals.
Efficacy and specificity trade-off.As expected, there is a clear trade-off between efficacy and
specificity, as illustrated in Figures 3b and 3e. While both REVS and MEMIT exhibit this trade-off,
REVS can achieve near-perfect specificity with low efficacy in both datasets, which is not the case for
MEMIT, as its specificity saturates around0.65even at low efficacy for the Emails dataset. However,
MEMIT presents more consistent specificity scores across different configurations, while REVS is
prone to a larger trade-off between efficacy and specificity.
Impact of the number of edits.Figures 3c and 3f depict a similar overall downward trend in
harmonic mean as the number of edits increases for both REVS and MEMIT. However, there are
notable differences. In the Emails dataset, REVS achieves an almost perfect score with a low number
of edits, while MEMIT’s performance saturates earlier. However, in the SSN dataset, both methods
exhibit a large variance for a small number of targets, but MEMIT seems more robust as the number
of targets increases. Potentially, higher scores could be achieved by optimizing hyper-parameters for
a larger number of edits, as current hyper-parameters were optimized for 100 targets.
6 Related work
Privacy is defined as “the ability to control the extent of personal matters an individual wants to
reveal” [53]. With the recent developments in neural foundation models in language and vision
domains, which are trained on data scraped from the internet, preserving confidentiality of user data
has become a concern [5,7]. This concern is further exacerbated by legislation such as the EU
general data protection regulation [GDPR;21] and the California consumer privacy act (CCPA),
which govern processing and storage of personal identifiable information (PII).
Memorization of training data in neural models.It has been shown that foundation models
of language [10,11,32,57], vision [8], and code [28] are able to reproduce sensitive personal
information such as credit card numbers, email addresses and API keys. The extent of input data
memorization was unclear until Carlini et al.[9]demonstrated that the GPT-J model trained on the
Pile memorizesat least1%of its training data. Memorization is shown to increase with model scale
[9] and specialized prompts can bypass guardrails in order to retrieve even more information than
previously estimated [8, 28, 48], highlighting the need for removing memorized PII from models.
Defenses against leakage of sensitive information.Naturally, privacy concerns have given way to
development of methods preventing regurgitation of PII. Differential privacy [DP;18,27,34,47]
and dataset scrubbing through deduplication [9,30,32] aim to prevent sensitive information from
ever being stored in the model. However, using DP-SGD frequently results in worse generative
8
models [2], and LLMs are able to memorize training instances even if they are not duplicated [16].
Scrubbing approaches are also expensive, requiring retraining of the foundation model for each new
identified leak of PII. To alleviate high cost incurred by retraining models from scratch, post-hoc
model unlearning [35] techniques such as gradient ascent [6,19,41], reinforcement learning through
human feedback [RLHF;43] and model editing [15,38,39,54] were devised. Most similar to ours
is the work of Wu et al.[54], where the authors identify and zero out neurons most responsible
for memorizing the sensitive information. REVS operates in a similar manner, except instead of
completely zeroing out the neurons it modifies the selected neurons to demote the relevant tokens
in the output vocabulary space. This allows REVS to remove the tendency to generate sensitive
information while preserving model integrity.
Extraction attacks.There are three categories of attacks aimed to retrieve information from ML
models. Membership inference attacks [17,49] have the goal of verifying whether a specific datum
was in a model’s training set, while attribute inference attacks [22] retrieve information on some
attributes of the training instances. Data extraction attacks, on the other hand, aim to retrieve full
training instances. Generative LMs are especially susceptible to such attacks given their propensity to
memorize training data [8,10]. Patil et al.[44]show that despite efforts to scrub sensitive information
from models, different attacks can still locate such information in LMs. Black-box attacks assume
no access to model internals and aim to retrieve information through conversation [26], designing
prompts [36] or paraphrasing [31]. White-box use model parameters and through representation
probing [24,42] can even detect information that has been erased by model editing techniques [44].
REVS is specifically designed with these attacks in mind and removes sensitive information in a way
that is robust with respect to possible attacks as well as effective in terms of erasing information.
7 Conclusion
We introduce REVS, a novel unlearning technique which demonstrates superior performance to
state-of-the-art baselines in effectively unlearning sensitive information from LLMs while preserving
their general capabilities and remaining robust to extraction attacks. REVS achieves near-perfect
efficacy in preventing the generation of unlearned targets and maintains high specificity, indicating
minimal disruption to the model’s desired behavior. We perform detailed analyses assessing the
methods’ robustness to varying the size of the candidate token sets, the trade-off between efficacy and
specificity and number of performed unlearned targets. Notably, our experiments revealed inherent
differences in unlearning performance between synthetic and organically memorized information,
with the organically memorized Email dataset proving more challenging, suggesting that information
memorized during pre-training may be more deeply ingrained in the model’s parameters.
8 Limitations
While REVS shows promising results, several areas need further work to resolve current limitations:
(1) identifying other models that have memorized sensitive information to enable broader evaluation,
(2) generalization to unlearning other types of organically memorized sensitive information beyond
emails, (3) scalable unlearning of larger amounts of target information, (4) evaluation against more
fine-grained extraction attacks, including on specific sub-modules like MLP or Attention outputs.
Issues (1) and (2) are especially important, as our work has shown that organically memorized
sensitive information is harder to unlearn from models when compared to data memorized by fine-
tuning. However, finding encoded sensitive data is both time consuming and difficult due to both the
scale and lack of public availability of training data.
9 Broader impact statement
Our work aims to erase sensitive or personally identifiable information from LMs. As such, we expect
it to be mainly used for positive goals. That said, it is possible that unlearning could be applied to
erase factually correct information or perhaps stances, such as political ones, which a malicious actor
does not agree with. Such a model might then be primed to echo potentially harmful views.
9
Acknowledgements
This research has been supported by an AI Alignment grant from Open Philanthropy, the Israel
Science Foundation (grant No. 448/20), and an Azrieli Foundation Early Career Faculty Fellowship.
We would also like to express our gratitude to the Technion computer science NLP group for their
invaluable consultation and assistance in improving this work.
References
[1]
Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang,
L. (2016). Deep learning with differential privacy. InProceedings of the 2016 ACM SIGSAC
conference on computer and communications security, pages 308–318.
[2]
Anil, R., Ghazi, B., Gupta, V., Kumar, R., and Manurangsi, P. (2022). Large-scale differentially
private bert. InFindings of the Association for Computational Linguistics: EMNLP 2022, pages
6481–6491.
[3]
[4]
Black, S., Gao, L., Wang, P., Leahy, C., and Biderman, S. (2021). Gpt-neo: Large scale
autoregressive language modeling with mesh-tensorflow.If you use this software, please cite it
using these metadata, 58:2.
[5]
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S.,
Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji,
N. S., Chen, A. S., Creel, K. A., Davis, J., Demszky, D., Donahue, C., Doumbouya, M., Durmus,
E., Ermon, S., Etchemendy, J., Ethayarajh, K., Fei-Fei, L., Finn, C., Gale, T., Gillespie, L. E.,
Goel, K., Goodman, N. D., Grossman, S., Guha, N., Hashimoto, T., Henderson, P., Hewitt, J., Ho,
D. E., Hong, J., Hsu, K., Huang, J., Icard, T. F., Jain, S., Jurafsky, D., Kalluri, P., Karamcheti,
S., Keeling, G., Khani, F., Khattab, O., Koh, P. W., Krass, M. S., Krishna, R., Kuditipudi, R.,
Kumar, A., Ladhak, F., Lee, M., Lee, T., Leskovec, J., Levent, I., Li, X. L., Li, X., Ma, T.,
Malik, A., Manning, C. D., Mirchandani, S. P., Mitchell, E., Munyikwa, Z., Nair, S., Narayan, A.,
Narayanan, D., Newman, B., Nie, A., Niebles, J. C., Nilforoshan, H., Nyarko, J. F., Ogut, G., Orr,
L., Papadimitriou, I., Park, J. S., Piech, C., Portelance, E., Potts, C., Raghunathan, A., Reich, R.,
Ren, H., Rong, F., Roohani, Y. H., Ruiz, C., Ryan, J., R’e, C., Sadigh, D., Sagawa, S., Santhanam,
K., Shih, A., Srinivasan, K. P., Tamkin, A., Taori, R., Thomas, A. W., Tramèr, F., Wang, R. E.,
Wang, W., Wu, B., Wu, J., Wu, Y., Xie, S. M., Yasunaga, M., You, J., Zaharia, M. A., Zhang, M.,
Zhang, T., Zhang, X., Zhang, Y., Zheng, L., Zhou, K., and Liang, P. (2021). On the opportunities
and risks of foundation models.ArXiv.
[6]
Bourtoule, L., Chandrasekaran, V., Choquette-Choo, C. A., Jia, H., Travers, A., Zhang, B., Lie,
D., and Papernot, N. (2021). Machine unlearning. In2021 IEEE Symposium on Security and
Privacy (SP), pages 141–159. IEEE.
[7]
Brown, H., Lee, K., Mireshghallah, F., Shokri, R., and Tramèr, F. (2022). What does it mean for
a language model to preserve privacy? InProceedings of the 2022 ACM Conference on Fairness,
Accountability, and Transparency, pages 2280–2292.
[8]
Carlini, N., Hayes, J., Nasr, M., Jagielski, M., Sehwag, V., Tramer, F., Balle, B., Ippolito, D., and
Wallace, E. (2023). Extracting training data from diffusion models. In32nd USENIX Security
Symposium (USENIX Security 23), pages 5253–5270.
[9]
Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., and Zhang, C. (2022). Quantifying
memorization across neural language models.arXiv preprint arXiv:2202.07646.
[10]
Carlini, N., Liu, C., Erlingsson, Ú., Kos, J., and Song, D. (2019). The secret sharer: Evaluating
and testing unintended memorization in neural networks. In28th USENIX security symposium
(USENIX security 19), pages 267–284.
[11]
Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown,
T., Song, D., Erlingsson, U., et al. (2021). Extracting training data from large language models. In
30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650.
10
[12]
Chang, K., Cramer, M., Soni, S., and Bamman, D. (2023). Speak, memory: An archaeology of
books known to ChatGPT/GPT-4. In Bouamor, H., Pino, J., and Bali, K., editors,Proceedings of
the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7312–7327,
Singapore. Association for Computational Linguistics.
[13]
Dai, D., Dong, L., Hao, Y., Sui, Z., Chang, B., and Wei, F. (2021). Knowledge neurons in
pretrained transformers.arXiv preprint arXiv:2104.08696.
[14]
Dar, G., Geva, M., Gupta, A., and Berant, J. (2023). Analyzing transformers in embedding space.
In Rogers, A., Boyd-Graber, J., and Okazaki, N., editors,Proceedings of the 61st Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16124–16170,
Toronto, Canada. Association for Computational Linguistics.
[15]
De Cao, N., Aziz, W., and Titov, I. (2021). Editing factual knowledge in language models.
arXiv preprint arXiv:2104.08164.
[16]
Debenedetti, E., Severi, G., Carlini, N., Choquette-Choo, C. A., Jagielski, M., Nasr, M., Wallace,
E., and Tramèr, F. (2023). Privacy side channels in machine learning systems.arXiv preprint
arXiv:2309.05610.
[17]
Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006). Calibrating noise to sensitivity in
private data analysis. InTheory of Cryptography: Third Theory of Cryptography Conference, TCC
2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pages 265–284. Springer.
[18]
Dwork, C., Roth, A., et al. (2014). The algorithmic foundations of differential privacy.Founda-
tions and Trends® in Theoretical Computer Science, 9(3–4):211–407.
[19]
Eldan, R. and Russinovich, M. (2023). Who’s harry potter? approximate unlearning in llms.
arXiv preprint arXiv:2310.02238.
[20]
Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen,
A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones,
A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish,
S., and Olah, C. (2021). A mathematical framework for transformer circuits.Transformer Circuits
Thread.
[21]
European Union (2016). Regulation (EU) 2016/679 of the European Parliament and of the
Council of 27 april 2016 on the protection of natural persons with regard to the processing of
personal data and on the free movement of such data, and repealing Directive 95/46/EC (General
Data Protection Regulation).Official Journal, L 110:1–88.
[22]
Fredrikson, M., Lantz, E., Jha, S., Lin, S., Page, D., and Ristenpart, T. (2014). Privacy in
pharmacogenetics: An{End-to-End}case study of personalized warfarin dosing. In23rd USENIX
security symposium (USENIX Security 14), pages 17–32.
[23]
Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A.,
Nabeshima, N., et al. (2020). The pile: An 800gb dataset of diverse text for language modeling.
arXiv preprint arXiv:2101.00027.
[24]
Geva, M., Caciularu, A., Wang, K. R., and Goldberg, Y. (2022). Transformer feed-forward layers
build predictions by promoting concepts in the vocabulary space.arXiv preprint arXiv:2203.14680.
[25]Geva, M., Schuster, R., Berant, J., and Levy, O. (2020). Transformer feed-forward layers are
key-value memories.arXiv preprint arXiv:2012.14913.
[26]
Henderson, P., Sinha, K., Angelard-Gontier, N., Ke, N. R., Fried, G., Lowe, R., and Pineau, J.
(2018). Ethical challenges in data-driven dialogue systems. InProceedings of the 2018 AAAI/ACM
Conference on AI, Ethics, and Society, pages 123–129.
[27]
Hoory, S., Feder, A., Tendler, A., Erell, S., Peled-Cohen, A., Laish, I., Nakhost, H., Stemmer, U.,
Benjamini, A., Hassidim, A., and Matias, Y. (2021). Learning and evaluating a differentially private
pre-trained language model. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t., editors,
Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1178–1189,
Punta Cana, Dominican Republic. Association for Computational Linguistics.
11
[28]
Ippolito, D., Tramèr, F., Nasr, M., Zhang, C., Jagielski, M., Lee, K., Choquette-Choo, C. A.,
and Carlini, N. (2023). Preventing generation of verbatim memorization in language models gives
a false sense of privacy. InProceedings of the 16th International Natural Language Generation
Conference, pages 28–53. Association for Computational Linguistics.
[29]
Jang, J., Yoon, D., Yang, S., Cha, S., Lee, M., Logeswaran, L., and Seo, M. (2022). Knowledge
unlearning for mitigating privacy risks in language models.arXiv preprint arXiv:2210.01504.
[30]
Kandpal, N., Wallace, E., and Raffel, C. (2022). Deduplicating training data mitigates privacy
risks in language models. InInternational Conference on Machine Learning, pages 10697–10707.
PMLR.
[31]
Krishna, K., Song, Y., Karpinska, M., Wieting, J., and Iyyer, M. (2024). Paraphrasing evades
detectors of ai-generated text, but retrieval is an effective defense.Advances in Neural Information
Processing Systems, 36.
[32]
Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., and Car-
lini, N. (2021). Deduplicating training data makes language models better.arXiv preprint
arXiv:2107.06499.
[33]
Li, H., Guo, D., Fan, W., Xu, M., Huang, J., Meng, F., and Song, Y. (2023). Multi-step
jailbreaking privacy attacks on chatgpt.arXiv preprint arXiv:2304.05197.
[34]
Li, X., Tramer, F., Liang, P., and Hashimoto, T. (2021). Large language models can be strong
differentially private learners.arXiv preprint arXiv:2110.05679.
[35]
Liu, S., Yao, Y., Jia, J., Casper, S., Baracaldo, N., Hase, P., Xu, X., Yao, Y., Li, H., Varshney,
K. R., et al. (2024). Rethinking machine unlearning for large language models.arXiv preprint
arXiv:2402.08787.
[36]
Lukas, N., Salem, A., Sim, R., Tople, S., Wutschitz, L., and Zanella-Béguelin, S. (2023).
Analyzing leakage of personally identifiable information in language models. In2023 IEEE
Symposium on Security and Privacy (SP), pages 346–363. IEEE.
[37]
Madaan, A., Tandon, N., Clark, P., and Yang, Y. (2022). Memory-assisted prompt editing to
improve gpt-3 after deployment.arXiv preprint arXiv:2201.06009.
[38]
Meng, K., Bau, D., Andonian, A., and Belinkov, Y. (2022a). Locating and editing factual
associations in gpt.Advances in Neural Information Processing Systems, 35:17359–17372.
[39]
Meng, K., Sharma, A. S., Andonian, A., Belinkov, Y., and Bau, D. (2022b). Mass-editing
memory in a transformer.arXiv preprint arXiv:2210.07229.
[40]
[41]
Neel, S., Roth, A., and Sharifi-Malvajerdi, S. (2021). Descent-to-delete: Gradient-based
methods for machine unlearning. InAlgorithmic Learning Theory, pages 931–962. PMLR.
[42]
[43]
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal,
S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human
feedback.Advances in neural information processing systems, 35:27730–27744.
[44]
Patil, V., Hase, P., and Bansal, M. (2023). Can sensitive information be deleted from llms?
objectives for defending against extraction attacks.arXiv preprint arXiv:2309.17410.
[45]
Pawelczyk, M., Neel, S., and Lakkaraju, H. (2023). In-context unlearning: Language models as
few shot unlearners.arXiv preprint arXiv:2310.07579.
[46]
Petroni, F., Rocktäschel, T., Riedel, S., Lewis, P., Bakhtin, A., Wu, Y., and Miller, A. (2019).
Language models as knowledge bases? In Inui, K., Jiang, J., Ng, V., and Wan, X., editors,
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and
the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages
2463–2473, Hong Kong, China. Association for Computational Linguistics.
12
[47]
Ramaswamy, S., Thakkar, O., Mathews, R., Andrew, G., McMahan, H. B., and Beaufays, F.
(2020). Training production language models without memorizing user data.arXiv preprint
arXiv:2009.10031.
[48]
Rao, A., Vashistha, S., Naik, A., Aditya, S., and Choudhury, M. (2023). Tricking llms into dis-
obedience: Understanding, analyzing, and preventing jailbreaks.arXiv preprint arXiv:2305.14965.
[49]Shokri, R., Stronati, M., Song, C., and Shmatikov, V. (2017). Membership inference attacks
against machine learning models. In2017 IEEE symposium on security and privacy (SP), pages
3–18. IEEE.
[50]
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u.,
and Polosukhin, I. (2017). Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio,
S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors,Advances in Neural
Information Processing Systems, volume 30. Curran Associates, Inc.
[51]
Wang, B. and Komatsuzaki, A. (2021a). Gpt-j-6b: A 6 billion parameter autoregressive language
model.
[52]
Wang, B. and Komatsuzaki, A. (2021b). GPT-J-6B: A 6 Billion Parameter Autoregressive
Language Model.https://github.com/kingoflolz/mesh-transformer-jax.
[53] Washington and Lee Law Review, 25(1):166.
[54]
Wu, X., Li, J., Xu, M., Dong, W., Wu, S., Bian, C., and Xiong, D. (2023). Depn: Detecting and
editing privacy neurons in pretrained language models.arXiv preprint arXiv:2310.20138.
[55]
Yao, Y., Xu, X., and Liu, Y. (2023). Large language model unlearning.arXiv preprint
arXiv:2310.10683.
[56]
Yu, C., Jeoung, S., Kasi, A., Yu, P., and Ji, H. (2023). Unlearning bias in language models by
partitioning gradients. InFindings of the Association for Computational Linguistics: ACL 2023,
pages 6032–6048.
[57]
Zhang, C., Ippolito, D., Lee, K., Jagielski, M., Tramer, F., and Carlini, N. (2023). Counterfactual
memorization in neural language models. In Oh, A., Naumann, T., Globerson, A., Saenko, K.,
Hardt, M., and Levine, S., editors,Advances in Neural Information Processing Systems, volume 36,
pages 39321–39362. Curran Associates, Inc.
[58]
Zheng, C., Li, L., Dong, Q., Fan, Y., Wu, Z., Xu, J., and Chang, B. (2023). Can we edit factual
knowledge by in-context learning? In Bouamor, H., Pino, J., and Bali, K., editors,Proceedings of
the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4862–4876,
Singapore. Association for Computational Linguistics.
[59]
Zhu, C., Rawat, A. S., Zaheer, M., Bhojanapalli, S., Li, D., Yu, F., and Kumar, S. (2020).
Modifying memories in transformer models.arXiv preprint arXiv:2012.00363.
13
A Unlearning methods: implementation and experimental details
This section provides implementation details for each unlearning method, describes the hyper-
parameter search and experiments conducted for REVS and the MEMIT baseline, and outlines the
attempts to make the FT-L baseline work effectively.
A.1 REVS
An extensive hyper-parameter search was performed, focusing on the maximum number of neurons,
neuron choosing strategy, desired rank of the target token in the edited neurons, and desired rank
of the target token within the residual hidden state layer. The best configuration found through this
tuning process was as follows: A maximum ofNmax= 30 neurons were edited per layer. The
neurons were chosen by first selecting the topk= 100neurons with the highest activation values,
and then from within these neurons, the ones with the highest rank for the target token when projected
to the vocabulary logit space were selected. For the target token, the desired rankrhwithin the
residual hidden layer was set torh= 200. The desired rankrnwithin the edited neurons was set to
rn= 30000, where maximum possible rank for any token in the GPT-J vocabulary is50400.
A.2 MEMIT Baseline
We used a modified version of MEMIT [39], originally proposed for inserting new knowledge into
language models by editing theF F2matrices in model MLPs. In our case, we optimized the objec-
tive todecreasethe probability of generating the target tokens instead of increasing it, to facilitate
unlearning. The original MEMIT method was proposed for editing memorized facts of the form
(subject s, relation r, object o), e.g., (s = Michael Jordan, r = plays sport, o = basketball), where
the subject is used for calculating the updated weights. However, the targeted sensitive information
might not have an explicit subject (e.g., "Send me an email with your decision soon as you get this;
david.lewis@email.com"), which is the case for many samples in the Email dataset (Table 5). Further-
more, the targeted sensitive information may be memorized by various inherently different prompts
where each sentence might have unique subject (e.g., "David’s email is:david.lewis@email.com"
and "Contact Lewis at:david.lewis@email.com). To address this challenge, we introduced a custom
hyper-parameter that sets the subject as the lastncharacters of the prompt, which, when fed to the
model, generate the memorized sensitive information. We optimized this hyper-parameter and found
that the best results were achieved whenn= 50. Concretely, the following tuple was used as the
input to the modified model: (s = last characters n of the prompt, r = <empty>, o = target token to
unlearn)
The method's hyper-parameters include the learning rate, the token's probability loss coefficient, and the loss break (a loss threshold for early stopping). The latter two are hyper-parameters we added to the adjusted method: the token's probability loss coefficient determines the trade-off between the Kullback-Leibler (KL) divergence loss, which keeps the edited model close to the original, and the term representing the probability of generating the targeted token; the loss break is an early stopping criterion that determines when to stop the optimization. We conducted an exhaustive experiment to find the best configuration of hyper-parameters for MEMIT in our unlearning setting. The tuned hyper-parameters were the learning rate, the token's probability loss coefficient, and the loss break. The optimal configuration was a learning rate of 0.05, a token's probability loss coefficient of 20, and a loss break of 0.1. The rest of the hyper-parameters were left as suggested in the original paper [39].
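As a sketch of how the two terms might be combined (the exact form of the de-promotion term is not specified here, so the -log(1 - p) choice below is our assumption), with the coefficient and loss-break values matching the configuration above:

import torch

def unlearning_loss(p_target: torch.Tensor, kl_term: torch.Tensor,
                    prob_coef: float = 20.0) -> torch.Tensor:
    """Trade off de-promoting the target token against staying close to the
    original model (KL term). One plausible de-promotion term is -log(1 - p),
    which shrinks as the probability of generating the target token drops."""
    depromote = -torch.log(1.0 - p_target + 1e-8)
    return prob_coef * depromote + kl_term

# Optimization stops early once the loss falls below loss_break = 0.1.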
A.3 FT-L baseline
We performed an extensive hyper-parameter search for the FT-L baseline. The hyper-parameters
tuned were the learning rate, targeted layers for fine-tuning, and regularization techniques. Despite
this extensive tuning, FT-L performed poorly compared to REVS and MEMIT across most metrics,
as reported in the main results. The best configuration for FT-L involved targeting all feed-forward
layers, with a learning rate of 2e-07 and a norm constraint of 1e-06.
We explored various strategies for improving FT-L, including fine-tuning only a single layer, multiple
layers, and all feed-forward layers, with and without norm constraint and weight decay. However,
none of these strategies yielded better results than the reported configuration.
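A minimal sketch of the norm-constraint step is given below, assuming it is implemented as an L-infinity projection of the fine-tuned weights around their original values (the standard constrained fine-tuning formulation); the helper name is ours.

import torch

@torch.no_grad()
def project_to_norm_ball(param: torch.Tensor, original: torch.Tensor,
                         eps: float = 1e-6) -> None:
    """After each optimizer step, clamp the fine-tuned weights into an
    L-infinity ball of radius eps around the original (pre-edit) weights."""
    delta = (param - original).clamp_(min=-eps, max=eps)
    param.copy_(original + delta)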
A key challenge with FT-L is the extremely low specificity achieved on the Email dataset across all
hyper-parameter configurations. While perfect efficacy could theoretically be achieved by making the unlearning stronger, specificity remained very low even at moderate efficacy levels. This
trade-off between efficacy and specificity resulted in a consistently low overall harmonic mean score
for FT-L on the Email dataset.
A.4 Hardware details
The experiments described in this work were conducted on a computing system equipped with 32 Intel(R) Xeon(R) Gold 6430 CPUs and 1.0 TB of RAM. For running the REVS, MEMIT, and FT-L methods on the GPT-J model, we used NVIDIA RTX 6000 Ada Generation GPUs with 49 GB of VRAM. A single GPU was sufficient for applying REVS and MEMIT, while a setup of four GPUs was required for the FT-L method.
B Adjusting neuron rank
The AdjustRank step is designed to find the logit value for the edited neurons that yields the desired rank r_n for the target token t_i in the vector of logits ⃗v. To achieve this, the algorithm iteratively adjusts the logit value of t_i and projects the updated logits back and forth between the vocabulary space and the hidden state space until the desired rank is obtained. The process begins by initializing the logit value of t_i to the mean of the k lowest logit values in ⃗v. This initial value is then iteratively adjusted by increasing or decreasing it based on the difference between the current rank of t_i and the desired rank r_n. The adjustment factors, δ_h and δ_l, control the rate at which the logit value is increased or decreased, respectively. After each adjustment, the updated logits are projected back to the hidden state space and then to the vocabulary space to obtain a new rank for t_i. This iterative process continues until the rank of t_i falls within the allowed margin of error ϵ from the desired rank r_n. The iterative approach is necessary because projecting from the vocabulary space to the hidden state space using the pseudo-inverse of the unembedding matrix W† is an approximation, making it infeasible to deterministically calculate the exact logit value that yields the desired rank.
The algorithm, shown in Algorithm 2, takes the vector of logits ⃗v, the target token t_i, the desired rank r_n, the number of lowest logits k to consider for initialization, the maximum allowed rank difference ϵ, and the factors δ_h and δ_l for increasing or decreasing the logit value, respectively. In practice, we use AdjustRank with k = 10,000, ϵ = 7,500, δ_h = 1.3, and δ_l = 0.8.
Algorithm 2 AdjustRank Algorithm
procedure ADJUSTRANK(⃗v, t_i, r_n, k, ϵ, δ_h, δ_l)
    v_{t_i} ← MeanKLowestLogits(⃗v, k)      ▷ Initialize as mean of k lowest logits in ⃗v
    repeat
        ⃗v[t_i] ← v_{t_i}                   ▷ Update logit value of t_i in ⃗v
        ⃗n ← W† ⃗v                           ▷ Project updated logits to neuron
        ⃗v ← ⃗n W                            ▷ Project updated neuron to logits
        r ← Rank(t_i, ⃗v)                   ▷ Get rank of t_i in projected logits
        if |r − r_n| ≤ ϵ then
            break                          ▷ Stop if rank is within ϵ of desired rank
        else if r > r_n then
            v_{t_i} ← v_{t_i} · δ_l        ▷ Decrease v_{t_i} if rank is too high
        else
            v_{t_i} ← v_{t_i} · δ_h        ▷ Increase v_{t_i} if rank is too low
        end if
    until true
    return ⃗v                              ▷ Return updated logit vector
end procedure
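For readers who prefer runnable code, a direct Python transcription of Algorithm 2 is sketched below; the tensor shapes and the helper rank_of are our own choices, and W_U stands for the unembedding matrix.

import torch

def rank_of(token_id: int, logits: torch.Tensor) -> int:
    """Rank of token_id among the logits (0 = highest logit)."""
    return int((logits > logits[token_id]).sum())

def adjust_rank(v: torch.Tensor, token_id: int, r_n: int, W_U: torch.Tensor,
                k: int = 10_000, eps: int = 7_500,
                delta_h: float = 1.3, delta_l: float = 0.8) -> torch.Tensor:
    """Python sketch of Algorithm 2 (AdjustRank).

    v:   (vocab,) logit vector of the edited neuron
    W_U: (d_model, vocab) unembedding matrix; its pseudo-inverse maps logits
         back to the hidden (neuron) space only approximately, hence the loop.
    """
    W_pinv = torch.linalg.pinv(W_U)                   # (vocab, d_model)
    v = v.clone()
    v_t = v.topk(k, largest=False).values.mean()      # mean of the k lowest logits
    while True:
        v[token_id] = v_t                             # set candidate logit for the target token
        n = v @ W_pinv                                # project logits to the neuron space
        v = n @ W_U                                   # project the neuron back to logits
        r = rank_of(token_id, v)
        if abs(r - r_n) <= eps:                       # stop once within eps of the desired rank
            return v
        v_t = v_t * (delta_l if r > r_n else delta_h) # shrink or grow the candidate logit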
C Alternative neuron selection strategies
We explored several alternative methods for neuron selection, besides our hybrid approach that considers both neuron activation and association with the target token (Section 3.4). These alternatives include: selecting the most activated neurons given the prompt, selecting neurons solely based on their association with the target token (by choosing the top-k neurons with the highest rank for the target token in their projected logits vector), selecting neurons with the highest gradients with respect to the prompt [13], and random neuron selection as a baseline. As shown in Table 3, our evaluation on the Emails dataset, using the same seed that was used for finding the best hyper-parameters, demonstrated that the hybrid approach (rank + activations) outperformed these alternatives on both reported metrics: the harmonic mean of unlearning effectiveness and model integrity, and extraction resistance.

Table 3: Comparison of neuron selection strategies for unlearning on the Emails dataset (both columns are harmonic mean scores).

Method                       | Unlearning Effectiveness and Model Integrity | Extraction Resistance
Random                       | 0    | 0
Gradient-based               | 71.1 | 78.2
Rank                         | 22.4 | 29.2
Activations                  | 68.8 | 81.8
Rank + Activations (hybrid)  | 75.2 | 84.2
D Datasets curation
D.1 Synthetic SSN dataset
We curated the synthetic SSN dataset by using the following prompt with Claude 3 Sonnet:
You are a synthetic data generation tool.
• I need you to create template sentences across different domains that could potentially contain sensitive personal information like names, social security numbers, and dates. However, instead of using real data, I want you to use placeholder tags like [NAME], [SSN], and [DATE] within the sentences.
• For each domain listed below, provide 20 full sentences following this format:
• The sentences should be written in a way that reads naturally, as if they could appear in real documents from that domain.
• The placeholders [NAME] and [SSN] should appear mid-sentence or towards the end, not at the beginning.
• Use a variety of different sentence structures and phrasings for the sentences within each domain to make them as diverse as possible.
• Do not include any actual personal names, numbers, or dates - only use the placeholder tags.
Example sentences from the medical domain:
• Adjusting the medication dosage on [DATE] to better manage a thyroid condition, the endocrinologist treated [NAME], bearer of SSN [SSN].
• Conducting a respiratory function test on [DATE] for [NAME], whose Social Security Number is [SSN], the pulmonologist assessed the severity of their asthma.
The domains I need diverse template sentences for are:
• Employment Records
• Tax Documents
• Financial Records
• Government Records
• Medical Records
Please proceed by providing 20 unique and diverse sentences per domain in the specified format using the placeholder tags. Let me know if you need any clarification on the instructions.
We kept asking the model to generate examples until we had a total of 200 sentences.
We then randomly assigned fake sensitive information to each placeholder. To populate the placeholders, we used Claude 3 Sonnet to generate a list of 40 unique names (several sentences are assigned the same name and SSN pair, which allows evaluating generalization). Additionally, we generated 200 random dates and 40 random Social Security Numbers (SSNs). To get the model to memorize the sentences containing the sensitive information, we fine-tuned the model with a low learning rate of 2e-6 to minimize unintended impact. The fine-tuning process ran for 4 epochs with a batch size of 1.
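A minimal sketch of this memorization fine-tuning with HuggingFace Transformers is shown below, assuming the 200 populated synthetic sentences are available as plain strings; the variable names and data handling are ours.

from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model_name = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# `sentences` stands in for the 200 populated synthetic SSN sentences.
sentences = ["During a routine check-up on 23-3-2021, Michael Choi, with the "
             "Social Security Number 135-54-3368, discussed their recent test results."]
train_dataset = [tokenizer(s) for s in sentences]

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal-LM labels
args = TrainingArguments(
    output_dir="ssn-memorization",
    learning_rate=2e-6,                # low LR to minimize unintended impact
    num_train_epochs=4,
    per_device_train_batch_size=1,
)
Trainer(model=model, args=args, train_dataset=train_dataset,
        data_collator=collator).train()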
Table 4 shows 10 example sentences from this dataset.
Table 4: Example sentences containing personal information from the synthetic SSN dataset.

1. During a routine check-up on 23-3-2021, Michael Choi, with the Social Security Number 135-54-3368, discussed their recent test results and any necessary follow-up steps with the attending physician.
2. On 2-2-2021, Michael Choi, whose Social Security Number is 135-54-3368, participated in a group therapy session focused on coping mechanisms for managing chronic pain.
3. At a scheduled appointment on 19-6-2016, Michael Choi, with an SSN of 135-54-3368, received guidance from a physical therapist on exercises to improve mobility and strengthen core muscles.
4. During a visit on 22-10-2018, Michael Choi, whose Social Security Number is 135-54-3368, consulted with a dermatologist about a concerning mole and underwent a biopsy procedure.
5. On 22-10-2019, Michael Choi, with the Social Security Number 135-54-3368, attended a seminar on the importance of regular preventive screenings and early detection of various health conditions.
6. The patient, a 45-year-old Mark Evans, with the Social Security Number 404-61-5777, visited the clinic on 28-11-2015 for a routine physical examination.
7. During a follow-up appointment on 8-1-2014, the physician reviewed the recent test results with Mark Evans, whose Social Security Number is 404-61-5777.
8. On 12-12-2014, the physical therapist designed a customized exercise plan for Mark Evans, with an SSN of 404-61-5777, to aid in their recovery from a knee injury.
9. The nutritionist provided dietary recommendations on 7-5-2021 to Mark Evans, whose Social Security Number is 404-61-5777, to help manage their high cholesterol levels.
10. At the mental health clinic on 13-6-2013, Mark Evans, with the Social Security Number 404-61-5777, participated in a group therapy session focused on stress management techniques.
D.2 Organic Email dataset
Handling of sensitive data. While The Pile is a publicly available dataset, appropriate measures were taken to handle the sensitive private information present in the dataset, which contained real email addresses. The raw dataset was accessed only by the authors. At no point were any individual email addresses or other PII exposed or included in the research outputs or publicly released datasets. We will disseminate the email addresses memorized by GPT-J upon request for research purposes, provided that researchers state they will adhere to the same standards.
Table 5: Example sentences containing obscured email addresses from the dataset.

1. Copyright 2004-2006, 2009-2010 Colin Green (sharp***@gmail.com)
2. Let Gus know by sending an email to g***@flyin***.com.
3. If you need assistance with this program, you may contact: m***@laptop.org.
4. You can contact SugarCRM, Inc. headquarters at c***@sugarcrm.com.
5. Authors: # Mike Auty <mike.***@gmail.com>
6. Please send bug reports and support requests to l***@saillard.org.
E Perturbation attack (PA)
The Perturbation Attack (PA) is a white-box attack that attempts to elicit the unlearned sensitive information by introducing subtle perturbations to the input prompts. In this attack, the original prompts are modified by randomly inserting characters at various positions. We experimented with different perturbation strategies, including varying the types of characters inserted and the number of insertion points. Our findings indicate that the most effective approach is to insert spaces into the original prompt at 10 different random indices, as well as inserting a space immediately after the prompt. Similar to the Logit-Lens Attack (LLA), the candidate set C_PA for the PA is obtained by projecting each layer's residual hidden state to the vocabulary space and considering the k highest- and k lowest-ranked tokens as candidates.
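A minimal sketch of the prompt-perturbation step is given below, assuming spaces are inserted at 10 uniformly random positions plus one trailing space; the candidate-set construction is summarized only in comments.

import random

def perturb_prompt(prompt: str, n_insertions: int = 10, seed: int = 0) -> str:
    """Insert spaces at random positions in the prompt, plus one trailing space."""
    rng = random.Random(seed)
    chars = list(prompt)
    for _ in range(n_insertions):
        chars.insert(rng.randrange(len(chars) + 1), " ")
    return "".join(chars) + " "

# The candidate set C_PA is then built as for the Logit-Lens Attack: run the
# perturbed prompt through the model, project each layer's residual hidden state
# through the unembedding matrix, and collect the k highest- and k lowest-ranked
# tokens as candidates.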