Adaptively Sparse Transformers
Gonçalo M. Correia†  goncalo.correia@lx.it.pt
Vlad Niculae†  vlad@vene.ro
André F.T. Martins†‡  andre.martins@unbabel.com

† Instituto de Telecomunicações, Lisbon, Portugal
‡ Unbabel, Lisbon, Portugal
Abstract

Attention mechanisms have become ubiquitous in NLP. Recent architectures, notably the Transformer, learn powerful context-aware word representations through layered, multi-headed attention. The multiple heads learn diverse types of word relationships. However, with standard softmax attention, all attention heads are dense, assigning a non-zero weight to all context words. In this work, we introduce the adaptively sparse Transformer, wherein attention heads have flexible, context-dependent sparsity patterns. This sparsity is accomplished by replacing softmax with α-entmax: a differentiable generalization of softmax that allows low-scoring words to receive precisely zero weight. Moreover, we derive a method to automatically learn the α parameter, which controls the shape and sparsity of α-entmax, allowing attention heads to choose between focused or spread-out behavior. Our adaptively sparse Transformer improves interpretability and head diversity when compared to softmax Transformers on machine translation datasets. Findings of our quantitative and qualitative analysis include that heads in different layers learn different sparsity preferences and tend to be more diverse in their attention distributions than softmax Transformers. Furthermore, at no cost in accuracy, sparsity in attention heads helps to uncover different head specializations.
1 Introduction
The Transformer architecture (Vaswani et al., 2017) for deep neural networks has quickly risen to prominence in NLP through its efficiency and performance, leading to improvements in the state of the art of Neural Machine Translation (NMT; Junczys-Dowmunt et al., 2018; Ott et al., 2018), as well as inspiring other powerful general-purpose models like BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019).

Figure 1: Attention distributions of different self-attention heads (heads 1-4, over the tokens "The quick brown fox jumps over") for the time step of the token "over", shown to compare our model to related work: the sparse Transformer, the adaptive span Transformer, and our adaptively sparse Transformer. While the sparse Transformer (Child et al., 2019) and the adaptive span Transformer (Sukhbaatar et al., 2019) only attend to words within a contiguous span of the past tokens, our model is not only able to obtain different and not necessarily contiguous sparsity patterns for each attention head, but is also able to tune its support over which tokens to attend adaptively.

At the heart of the Transformer
lie multi-head attention mechanisms: each word is represented by multiple different weighted averages of its relevant context. As suggested by recent works on interpreting attention head roles, separate attention heads may learn to look for various relationships between tokens (Tang et al., 2018; Raganato and Tiedemann, 2018; Marecek and Rosa, 2018; Voita et al., 2019; Tenney et al., 2019).

The attention distribution of each head is typically predicted using the softmax normalizing transform. As a result, all context words have non-zero attention weight. Recent work on single attention architectures suggests that using sparse normalizing transforms in attention mechanisms, such as sparsemax, which can yield exactly zero probabilities for irrelevant words, may improve performance and interpretability (Malaviya et al., 2018; Deng et al., 2018; Peters et al., 2019). Qualitative analysis of attention heads (Vaswani et al., 2017, Figure 5) suggests that, depending on what phenomena they capture, heads tend to favor flatter or more peaked distributions.
Recent works have proposed sparse Transformers (Child et al., 2019) and adaptive span Transformers (Sukhbaatar et al., 2019). However, the sparsity of those models only limits the attention to a contiguous span of past tokens, while in this work we propose a highly adaptive Transformer model that is capable of attending to a sparse set of words that are not necessarily contiguous. Figure 1 shows the relationship of these methods with ours.

Our contributions are the following:

- We introduce sparse attention into the Transformer architecture, showing that it eases interpretability and leads to slight accuracy gains.
- We propose an adaptive version of sparse attention, where the shape of each attention head is learnable and can vary continuously and dynamically between the dense limit case of softmax and the sparse, piecewise-linear sparsemax case.¹
- We make an extensive analysis of the added interpretability of these models, identifying both crisper examples of attention head behavior observed in previous work, as well as novel behaviors unraveled thanks to the sparsity and adaptivity of our proposed model.
2 Background
2.1 The Transformer

In NMT, the Transformer (Vaswani et al., 2017) is a sequence-to-sequence (seq2seq) model which maps an input sequence to an output sequence through hierarchical multi-head attention mechanisms, yielding a dynamic, context-dependent strategy for propagating information within and across sentences. It contrasts with previous seq2seq models, which usually rely either on costly gated recurrent operations (often LSTMs: Luong et al., 2015; Peters et al., 2019) or static convolutions (Gehring et al., 2017).
Given n query contexts and m sequence items under consideration, attention mechanisms compute, for each query, a weighted representation of the items. The particular attention mechanism used in Vaswani et al. (2017) is called scaled dot-product attention, and it is computed in the following way:

    \mathrm{Att}(Q, K, V) = \pi\!\left( \frac{Q K^\top}{\sqrt{d}} \right) V,    (1)
¹ Code and pip package available at https://github.com/deep-spin/entmax.
where Q ∈ R^{n×d} contains representations of the queries, K, V ∈ R^{m×d} are the keys and values of the items attended over, and d is the dimensionality of these representations. The \pi mapping normalizes row-wise using softmax, \pi(Z)_{ij} = \mathrm{softmax}(z_i)_j, where

    \mathrm{softmax}(z)_j = \frac{\exp(z_j)}{\sum_{j'} \exp(z_{j'})}.    (2)

In words, the keys are used to compute a relevance score between each item and query. Then, normalized attention weights are computed using softmax, and these are used to weight the values of each item at each query context.
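To make Equations 1-2 concrete, here is a minimal NumPy sketch of scaled dot-product attention (our own illustration; function and variable names are not from the released codebase):

```python
import numpy as np

def softmax(z, axis=-1):
    # Equation 2: exponentiate and normalize (max-subtraction for numerical stability).
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Equation 1: Q is (n, d); K and V are (m, d).
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n, m) relevance scores
    weights = softmax(scores, axis=-1)   # the row-wise pi mapping
    return weights @ V                   # (n, d) weighted averages of the values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 4)
```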
However, for complex tasks, different parts of a sequence may be relevant in different ways, motivating multi-head attention in Transformers. This is simply the application of Equation 1 in parallel H times, each with a different, learned linear transformation that allows specialization:

    \mathrm{Head}_i(Q, K, V) = \mathrm{Att}(Q W_i^Q, K W_i^K, V W_i^V).    (3)
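A compact PyTorch sketch of Equation 3 follows, using the usual head-splitting implementation (each head operating on a d/H-dimensional slice); this is a generic illustration, not the authors' code:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Sketch of Equation 3: H heads, each with its own learned projections."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.dk = n_heads, d_model // n_heads
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.wo = nn.Linear(d_model, d_model)  # output projection combining the heads

    def forward(self, q, k, v):
        B, n, _ = q.shape
        m = k.shape[1]
        # Project and split into heads: (B, H, seq, d_k).
        qh = self.wq(q).view(B, n, self.h, self.dk).transpose(1, 2)
        kh = self.wk(k).view(B, m, self.h, self.dk).transpose(1, 2)
        vh = self.wv(v).view(B, m, self.h, self.dk).transpose(1, 2)
        scores = qh @ kh.transpose(-2, -1) / self.dk ** 0.5
        weights = torch.softmax(scores, dim=-1)  # one distribution per head and query
        out = (weights @ vh).transpose(1, 2).reshape(B, n, -1)
        return self.wo(out)

x = torch.randn(2, 5, 512)
print(MultiHeadAttention()(x, x, x).shape)  # torch.Size([2, 5, 512])
```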
In the Transformer, there are three separate multi-head attention mechanisms for distinct purposes:

- Encoder self-attention: builds rich, layered representations of each input word, by attending on the entire input sentence.
- Context attention: selects a representative weighted average of the encodings of the input words, at each time step of the decoder.
- Decoder self-attention: attends over the partial output sentence fragment produced so far.

Together, these mechanisms enable the contextualized flow of information between the input sentence and the sequential decoder.
2.2 Sparse Attention

The softmax mapping (Equation 2) is elementwise proportional to exp, therefore it can never assign a weight of exactly zero. Thus, unnecessary items are still taken into consideration to some extent. Since its output sums to one, this invariably means less weight is assigned to the relevant items, potentially harming performance and interpretability (Jain and Wallace, 2019). This has motivated a line of research on learning networks with sparse mappings (Martins and Astudillo, 2016; Niculae and Blondel, 2017; Louizos et al., 2018; Shao et al., 2019). We focus on a recently-introduced flexible family of transformations, α-entmax (Blondel et al., 2019; Peters et al., 2019), defined as:
    \alpha\text{-entmax}(z) := \operatorname*{argmax}_{p \in \triangle^d} \; p^\top z + H^T_\alpha(p),    (4)

where \triangle^d := \{ p \in R^d : \sum_i p_i = 1 \} is the probability simplex and, for \alpha \ge 1, H^T_\alpha is the Tsallis continuous family of entropies (Tsallis, 1988):
continuous family of entropies (Tsallis,):
H
T
(p):=
(
1
( 1)
P
j
pj p
j
; 6= 1;
P
j
pjlogpj; = 1:
(5)
This family contains the well-known Shannon and
Gini entropies, corresponding to the cases= 1
and= 2, respectively.
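As a quick illustration of Equation 5, the following NumPy function evaluates the Tsallis entropy for any α ≥ 1 (a sketch of ours, not part of the released package):

```python
import numpy as np

def tsallis_entropy(p, alpha):
    # Equation 5: the Tsallis alpha-entropy of a probability vector p.
    p = np.asarray(p, dtype=float)
    if np.isclose(alpha, 1.0):
        nz = p > 0                      # convention: 0 log 0 = 0
        return -np.sum(p[nz] * np.log(p[nz]))
    return np.sum(p - p ** alpha) / (alpha * (alpha - 1.0))

p = np.array([0.7, 0.2, 0.1])
print(tsallis_entropy(p, 1.0))    # Shannon entropy
print(tsallis_entropy(p, 2.0))    # half the Gini impurity, (1 - sum_j p_j^2) / 2
print(tsallis_entropy(p, 1.001))  # approaches the Shannon value as alpha -> 1
```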
Equation 4 involves a convex optimization problem. Using the definition of H^T_\alpha, the optimality conditions may be used to derive the following form for the solution (Appendix B.2):

    \alpha\text{-entmax}(z) = [(\alpha - 1) z - \tau \mathbf{1}]_+^{1/(\alpha-1)},    (6)

where [\cdot]_+ is the positive part (ReLU) function, \mathbf{1} denotes the vector of all ones, and \tau, which acts like a threshold, is the Lagrange multiplier corresponding to the \sum_i p_i = 1 constraint.
Properties of α-entmax. The appeal of α-entmax for attention rests on the following properties. For \alpha = 1 (i.e., when H^T_\alpha becomes the Shannon entropy), it exactly recovers the softmax mapping (we provide a short derivation in Appendix B.3). For all \alpha > 1 it permits sparse solutions, in stark contrast to softmax. In particular, for \alpha = 2, it recovers the sparsemax mapping (Martins and Astudillo, 2016), which is piecewise linear. In between, as \alpha increases, the mapping continuously becomes sparser as its curvature changes.
To compute the value of α-entmax, one must find the threshold \tau such that the r.h.s. in Equation 6 sums to one. Blondel et al. (2019) propose a general bisection algorithm. Peters et al. (2019) introduce a faster, exact algorithm for \alpha = 1.5, and enable using α-entmax with fixed \alpha within a neural network by showing that the α-entmax Jacobian w.r.t. z for p^\star = \alpha\text{-entmax}(z) is

    \frac{\partial\, \alpha\text{-entmax}(z)}{\partial z} = \mathrm{diag}(s) - \frac{1}{\sum_j s_j} s s^\top,
    \quad \text{where } s_i = \begin{cases} (p_i^\star)^{2-\alpha}, & p_i^\star > 0, \\ 0, & p_i^\star = 0. \end{cases}    (7)
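The released entmax package (footnote 1) provides efficient implementations of these mappings and their Jacobians; purely for illustration, the following NumPy sketch approximates α-entmax by bisection on the threshold of Equation 6, assuming α > 1 (names are ours, not the package API):

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=50):
    """Approximate alpha-entmax(z) (Equation 6) by bisection on the threshold tau.

    Finds tau such that sum_i [(alpha-1) z_i - tau]_+^{1/(alpha-1)} = 1,
    then returns the resulting probabilities. Requires alpha > 1.
    """
    z = (alpha - 1.0) * np.asarray(z, dtype=float)
    # The total mass is decreasing in tau; these bounds bracket the solution.
    tau_lo, tau_hi = z.max() - 1.0, z.max()
    for _ in range(n_iter):
        tau = 0.5 * (tau_lo + tau_hi)
        p = np.maximum(z - tau, 0.0) ** (1.0 / (alpha - 1.0))
        if p.sum() >= 1.0:
            tau_lo = tau   # mass too large: raise the threshold
        else:
            tau_hi = tau   # mass too small: lower the threshold
    p = np.maximum(z - tau_lo, 0.0) ** (1.0 / (alpha - 1.0))
    return p / p.sum()     # final renormalization absorbs the bisection error

z = np.array([1.5, 0.3, -2.0, 0.4])
print(entmax_bisect(z, alpha=1.5))    # sparse: the -2.0 logit gets exactly zero weight
print(entmax_bisect(z, alpha=1.001))  # nearly dense, close to softmax(z)
```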
Our work furthers the study of α-entmax by providing a derivation of the Jacobian w.r.t. the hyper-parameter \alpha (Section 3), thereby allowing the shape and sparsity of the mapping to be learned automatically. This is particularly appealing in the context of multi-head attention mechanisms, where we shall show in Section 5 that different heads tend to learn different sparsity behaviors.
3 Adaptively Sparse Transformers with α-entmax

We now propose a novel Transformer architecture wherein we simply replace softmax with α-entmax in the attention heads. Concretely, we replace the row normalization \pi in Equation 1 by

    \pi(Z)_{ij} = \alpha\text{-entmax}(z_i)_j.    (8)

This change leads to sparse attention weights, as long as \alpha > 1; in particular, \alpha = 1.5 is a sensible starting point (Peters et al., 2019).
Different α per head. Unlike LSTM-based seq2seq models, where \alpha can be more easily tuned by grid search, in a Transformer there are many attention heads in multiple layers. Crucial to the power of such models, the different heads capture different linguistic phenomena, some of them isolating important words, others spreading out attention across phrases (Vaswani et al., 2017, Figure 5). This motivates using different, adaptive \alpha values for each attention head, such that some heads may learn to be sparser, and others may become closer to softmax. We propose doing so by treating the \alpha values as neural network parameters, optimized via stochastic gradients along with the other weights.
Derivatives w.r.t. α. In order to optimize \alpha automatically via gradient methods, we must compute the Jacobian of the entmax output w.r.t. \alpha. Since entmax is defined through an optimization problem, this is non-trivial and cannot be simply handled through automatic differentiation; it falls within the domain of argmin differentiation, an active research topic in optimization (Gould et al., 2016; Amos and Kolter, 2017).

One of our key contributions is the derivation of a closed-form expression for this Jacobian. The next proposition provides such an expression, enabling entmax layers with adaptive \alpha. To the best of our knowledge, ours is the first neural network module that can automatically, continuously vary in shape away from softmax and toward sparse mappings like sparsemax.
Proposition 1. Let p^\star := \alpha\text{-entmax}(z) be the solution of Equation 4. Denote the distribution \tilde{p}_i := (p_i^\star)^{2-\alpha} / \sum_j (p_j^\star)^{2-\alpha} and let h_i := -p_i^\star \log p_i^\star. The i-th component of the Jacobian g := \frac{\partial\, \alpha\text{-entmax}(z)}{\partial \alpha} is

    g_i = \begin{cases}
      \dfrac{p_i^\star - \tilde{p}_i}{(\alpha - 1)^2} + \dfrac{h_i - \tilde{p}_i \sum_j h_j}{\alpha - 1}, & \alpha > 1, \\[1.5ex]
      \dfrac{h_i \log p_i^\star - p_i^\star \sum_j h_j \log p_j^\star}{2}, & \alpha = 1.
    \end{cases}    (9)

The proof uses implicit function differentiation and is given in Appendix C.
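For concreteness, the following NumPy sketch evaluates Equation 9 given an already-computed output p* (for example, from the bisection sketch in Section 2.2); in practice this gradient would be registered with an autograd framework, which we do not show here:

```python
import numpy as np

def entmax_alpha_grad(p_star, alpha):
    # Equation 9: g_i = d(alpha-entmax(z))_i / d(alpha), given p_star = alpha-entmax(z).
    p = np.asarray(p_star, dtype=float)
    supp = p > 0
    logp = np.zeros_like(p)
    logp[supp] = np.log(p[supp])      # log p_i* on the support; zero coordinates contribute nothing
    h = -p * logp                     # h_i = -p_i* log p_i*
    if np.isclose(alpha, 1.0):
        return (h * logp - p * np.sum(h * logp)) / 2.0
    p_tilde = np.zeros_like(p)
    p_tilde[supp] = p[supp] ** (2.0 - alpha)
    p_tilde /= p_tilde.sum()          # the skewed distribution of Proposition 1
    return (p - p_tilde) / (alpha - 1.0) ** 2 + (h - p_tilde * h.sum()) / (alpha - 1.0)

p_star = np.array([0.6, 0.4, 0.0])    # a sparse attention distribution
print(entmax_alpha_grad(p_star, alpha=1.5))
print(entmax_alpha_grad(p_star, alpha=1.5).sum())  # ~0: the components must sum to zero
```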
Proposition 1 provides the missing piece needed for training adaptively sparse Transformers. In the following section, we evaluate this strategy on neural machine translation, and analyze the behavior of the learned attention heads.
4 Experiments

We apply our adaptively sparse Transformers on four machine translation tasks. For comparison, a natural baseline is the standard Transformer architecture using the softmax transform in its multi-head attention mechanisms. We consider two other model variants in our experiments that make use of different normalizing transformations:

- 1.5-entmax: a Transformer with sparse entmax attention with fixed \alpha = 1.5 for all heads. This is a novel model, since 1.5-entmax had only been proposed for RNN-based NMT models (Peters et al., 2019), but never in Transformers, where attention modules are not just one single component of the seq2seq model but rather an integral part of all of the model components.

- α-entmax: an adaptive Transformer with sparse entmax attention with a different, learned \alpha_{i,j}^t for each head.
The adaptive model has an additional scalar parameter per attention head per layer for each of the three attention mechanisms (encoder self-attention, context attention, and decoder self-attention), i.e.,

    a_{i,j}^t \in R : \quad i \in \{1, \ldots, L\}, \quad j \in \{1, \ldots, H\}, \quad t \in \{\text{enc}, \text{ctx}, \text{dec}\},    (10)

and we set \alpha_{i,j}^t = 1 + \mathrm{sigmoid}(a_{i,j}^t) \in\ ]1, 2[. All or some of the \alpha values can be tied if desired, but we keep them independent for analysis purposes.
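A minimal PyTorch sketch of this parametrization is shown below; the module and its names are ours, and the entmax mapping that would consume these α values is omitted:

```python
import torch
import torch.nn as nn

class PerHeadAlpha(nn.Module):
    """One learnable scalar a_{i,j} per (layer, head) for a given attention block,
    mapped to alpha = 1 + sigmoid(a) in ]1, 2[ as in Equation 10 (illustrative sketch)."""

    def __init__(self, n_layers=6, n_heads=8):
        super().__init__()
        self.a = nn.Parameter(torch.randn(n_layers, n_heads))  # randomly initialized

    def forward(self):
        # Trained jointly with the other weights via stochastic gradients.
        return 1.0 + torch.sigmoid(self.a)

# One such parameter block per attention mechanism, as in Equation 10.
alphas = nn.ModuleDict({t: PerHeadAlpha() for t in ("enc", "ctx", "dec")})
print(alphas["enc"]().shape)  # torch.Size([6, 8]): one alpha per layer and head
```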
Datasets. Our models were trained on 4 machine translation datasets of different training sizes:

- IWSLT 2017 German→English (DE→EN, Cettolo et al., 2017): 200K sentence pairs.
- KFTT Japanese→English (JA→EN, Neubig, 2011): 300K sentence pairs.
- WMT 2016 Romanian→English (RO→EN, Bojar et al., 2016): 600K sentence pairs.
- WMT 2014 English→German (EN→DE, Bojar et al., 2014): 4.5M sentence pairs.

All of these datasets were preprocessed with byte-pair encoding (BPE; Sennrich et al., 2016), using joint segmentations of 32k merge operations.
Training. We follow the dimensions of the Transformer-Base model of Vaswani et al. (2017): the number of layers is L = 6 and the number of heads is H = 8 in the encoder self-attention, the context attention, and the decoder self-attention. We use a mini-batch size of 8192 tokens and warm up the learning rate linearly until 20k steps, after which it decays according to an inverse square root schedule. All models were trained until convergence of validation accuracy, and evaluation was done every 10k steps for RO→EN and EN→DE and every 5k steps for DE→EN and JA→EN. The end-to-end computational overhead of our methods, when compared to standard softmax, is relatively small; in training tokens per second, the models using α-entmax and 1.5-entmax are, respectively, 75% and 90% of the speed of the softmax model.
Results. We report test set tokenized BLEU (Papineni et al., 2002) results in Table 1. We can see that replacing softmax by entmax does not hurt performance on any of the datasets; indeed, sparse attention Transformers tend to have slightly higher BLEU, but their sparsity leads to a better potential for analysis. In the next section, we make use of this potential by exploring the learned internal mechanics of the self-attention heads.
5 Analysis

We conduct an analysis for the higher-resource dataset WMT 2014 English→German of the attention in the sparse adaptive Transformer model (α-entmax) at multiple levels: we analyze high-level statistics as well as individual head behavior. Moreover, we make a qualitative analysis of the interpretability capabilities of our models.

activation    DE→EN   JA→EN   RO→EN   EN→DE
softmax       29.79   21.57   32.70   26.02
1.5-entmax    29.83   22.13   33.10   25.89
α-entmax      29.90   21.74   32.89   26.93

Table 1: Machine translation tokenized BLEU test results on IWSLT 2017 DE→EN, KFTT JA→EN, WMT 2016 RO→EN and WMT 2014 EN→DE, respectively.
5.1 High-Level Statistics

What kind of α values are learned? Figure 2 shows the learning trajectories of the \alpha parameters of a selected subset of heads. We generally observe a tendency for the randomly-initialized \alpha parameters to decrease initially, suggesting that softmax-like behavior may be preferable while the model is still very uncertain. After around one thousand steps, some heads change direction and become sparser, perhaps as they become more confident and specialized. This shows that the initialization of \alpha does not predetermine its sparsity level or the role the head will have throughout. In particular, head 8 in the encoder self-attention layer 2 first drops to around \alpha = 1.3 before becoming one of the sparsest heads, with \alpha \approx 2.

The overall distribution of \alpha values at convergence can be seen in Figure 3. We can observe that the encoder self-attention blocks learn to concentrate the \alpha values in two modes: a very sparse one around \alpha \to 2, and a dense one between softmax and 1.5-entmax. However, the decoder self-attention and context attention only learn to distribute these parameters in a single mode. We show next that this is reflected in the average density of attention weight vectors as well.
Attention weight density when translating. For any \alpha > 1, it would still be possible for the weight matrices in Equation 3 to learn re-scalings so as to make attention sparser or denser. To visualize the impact of adaptive \alpha values, we compare the empirical attention weight density (the average number of tokens receiving non-zero attention) within each module, against sparse Transformers with fixed \alpha = 1.5.
Figure 4 shows that, with fixed \alpha = 1.5, heads tend to be sparse and similarly-distributed in all three attention modules. With learned \alpha, there are two notable changes: (i) a prominent mode corresponding to fully dense probabilities, showing that our models learn to combine sparse and dense attention, and (ii) a distinction between the encoder self-attention, whose background distribution tends toward extreme sparsity, and the other two modules, which exhibit more uniform background distributions. This suggests that perhaps entirely sparse Transformers are suboptimal.

Figure 2: Trajectories of α values for a subset of the heads during training (decoder layer 1, head 8; encoder layer 1, heads 3 and 4; encoder layer 2, head 8; encoder layer 6, head 2). Initialized at random, most heads become denser in the beginning, before converging. This suggests that dense attention may be more beneficial while the network is still uncertain, being replaced by sparse attention afterwards.
The fact that the decoder seems to prefer denser attention distributions might be attributed to it being auto-regressive, only having access to past tokens and not the full sentence. We speculate that it might lose too much information if it assigned weights of zero to too many tokens in the self-attention, since there are fewer tokens to attend to in the first place.
Teasing this down into separate layers, Figure 5 shows the average (sorted) density of each head for each layer. We observe that α-entmax is able to learn different sparsity patterns at each layer, leading to more variance in individual head behavior, to clearly-identified dense and sparse heads, and overall to different tendencies compared to the fixed case of \alpha = 1.5.
Figure 3: Distribution of learned α values per attention block (encoder self-attention, context attention, decoder self-attention). While the encoder self-attention has a bimodal distribution of values of α, the decoder self-attention and context attention have a single mode.

Head diversity. To measure the overall disagreement between attention heads, as a measure of head diversity, we use the following generalization of the Jensen-Shannon divergence:

    \mathrm{JS} = H^S\!\left( \frac{1}{H} \sum_{j=1}^{H} p_j \right) - \frac{1}{H} \sum_{j=1}^{H} H^S(p_j),    (11)

where p_j is the vector of attention weights assigned by head j to each word in the sequence, and H^S is the Shannon entropy, base-adjusted based on the dimension of p such that JS \le 1. We average this measure over the entire validation set. The higher this metric is, the more the heads are taking different roles in the model.
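A NumPy sketch of Equation 11 for a single sentence and attention block might look as follows (our own illustration; the base adjustment divides by log n so that JS ≤ 1):

```python
import numpy as np

def shannon_entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=-1)

def head_disagreement(P):
    """Generalized Jensen-Shannon divergence of Equation 11.

    P has shape (H, n): one attention distribution over n tokens per head.
    Entropies are base-adjusted (divided by log n) so that the result is <= 1.
    """
    n = P.shape[1]
    mean_p = P.mean(axis=0)
    js = shannon_entropy(mean_p) - shannon_entropy(P).mean()
    return js / np.log(n)

# Two identical heads: no disagreement; orthogonal one-hot heads: maximal disagreement.
P_same = np.tile(np.array([0.5, 0.5, 0.0]), (2, 1))
P_diff = np.eye(3)
print(head_disagreement(P_same))  # ~0.0
print(head_disagreement(P_diff))  # ~1.0
```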
Figure 6 shows that both sparse Transformer variants show more diversity than the traditional softmax one. Interestingly, diversity seems to peak in the middle layers of the encoder self-attention and context attention, while this is not the case for the decoder self-attention.

The statistics shown in this section can be found for the other language pairs in Appendix A.
5.2 Identifying Head Specializations

Previous work pointed out some specific roles played by different heads in the softmax Transformer model (Voita et al., 2018; Tang et al., 2018; Voita et al., 2019). Identifying the specialization of a head can be done by observing the type of tokens or sequences that the head often assigns most of its attention weight to; this is facilitated by sparsity.

Figure 4: Distribution of attention densities (average number of tokens receiving non-zero attention weight) for all attention heads and all validation sentences, for 1.5-entmax and α-entmax, in the encoder self-attention, context attention, and decoder self-attention. When compared to 1.5-entmax, α-entmax distributes the sparsity in a more uniform manner, with a clear mode at fully dense attentions, corresponding to the heads with low α. In the softmax case, this distribution would lead to a single bar with density 1.
Positional heads. One particular type of head, as noted by Voita et al. (2019), is the positional head. These heads tend to focus their attention on either the previous or next token in the sequence, thus obtaining representations of the neighborhood of the current time step. In Figure 7, we show attention plots for such heads, found in each of the studied models. The sparsity of our models allows these heads to be more confident in their representations, by assigning the whole probability distribution to a single token in the sequence. Concretely, we may measure a positional head's confidence as the average attention weight assigned to the previous token. The softmax model has three heads for position -1, with median confidence 93.5%. The 1.5-entmax model also has three heads for this position, with median confidence 94.4%. The adaptive model has four heads, with median confidence 95.9%, the lowest-confidence head being dense with \alpha = 1.18, while the highest-confidence head is sparse (\alpha = 1.91).
Figure 5: Head density per layer for fixed and learned α (encoder self-attention, context attention, and decoder self-attention; layers 1-6). Each line corresponds to an attention head; lower values mean that that attention head is sparser. Learned α has higher variance.

For position +1, the models each dedicate one head, with confidence around 95%, slightly higher for entmax. The adaptive model sets \alpha = 1.96 for this head.
BPE-merging head. Due to the sparsity of our models, we are able to identify other head specializations, easily identifying which heads should be further analysed. In Figure 8 we show one such head, for which the \alpha value is particularly high (the encoder, layer 1, head 4 depicted in Figure 2). We found that this head most often looks at the current time step with high confidence, making it a positional head with offset 0. However, this head often spreads weight sparsely over 2-3 neighboring tokens, when the tokens are part of the same BPE cluster² or hyphenated words. As this head is in the first layer, it provides a useful service to the higher layers by combining information evenly within some BPE clusters.
For each BPE cluster or cluster of hyphenated words, we computed a score between 0 and 1 that corresponds to the maximum attention mass assigned by any token to the rest of the tokens inside the cluster, in order to quantify the BPE-merging capabilities of these heads.³ There are no attention heads in the softmax model that are able to obtain a score over 80%, while for 1.5-entmax and α-entmax there are two heads in each (83.3% and 85.6% for 1.5-entmax, and 88.5% and 89.8% for α-entmax).

² BPE-segmented words are denoted by a separator symbol in the figures.
³ If the cluster has size 1, the score is the weight the token assigns to itself.

Figure 6: Jensen-Shannon divergence between heads at each layer (softmax, 1.5-entmax, and α-entmax; encoder self-attention, context attention, and decoder self-attention). It measures the disagreement between heads: the higher the value, the more the heads disagree with each other in terms of where to attend. Models using sparse entmax have more diverse attention than the softmax baseline.
Interrogation head. On the other hand, in Figure 9 we show a head for which our adaptively sparse model chose an \alpha close to 1, making it closer to softmax (also shown in the encoder, layer 1, head 3 depicted in Figure 2). We observe that this head assigns a high probability to question marks at the end of the sentence in time steps where the current token is interrogative, thus making it an interrogation-detecting head. We also observe this type of head in the other models, which we also depict in Figure 9. The average attention weight placed on the question mark when the current token is an interrogative word is 98.5% for softmax, 97.0% for 1.5-entmax, and 99.5% for α-entmax.

Furthermore, we can examine sentences where some tendentially sparse heads become less so, thus identifying sources of ambiguity where the head is less confident in its prediction. An example is shown in Figure 10, where the head's behavior differs for sentences of similar length.

Figure 7: Self-attention from the most confidently previous-position head in each model. The learned parameter in the α-entmax model is α = 1.91. Quantitatively more confident, visual inspection confirms that the adaptive head behaves more consistently.

Figure 8: BPE-merging head (α = 1.91) discovered in the α-entmax model. Found in the first encoder layer, this head learns to discover some subword units and combine their information, leaving most words intact. It places 99.09% of its probability mass within the same BPE cluster as the current token: more than any head in any other model.
6 Related Work

Sparse attention. Prior work has developed sparse attention mechanisms, including applications to NMT (Martins and Astudillo, 2016; Malaviya et al., 2018; Niculae and Blondel, 2017; Shao et al., 2019; Maruf et al., 2019). Peters et al. (2019) introduced the entmax function this work builds upon. In their work, there is a single attention mechanism which is controlled by a fixed \alpha. In contrast, this is the first work to allow such attention mappings to dynamically adapt their curvature and sparsity, by automatically adjusting the continuous \alpha parameter. We also provide the first results using sparse attention in a Transformer model.

Figure 9: Interrogation-detecting heads in the three models. The top sentence is interrogative while the bottom one is declarative but includes the interrogative word "what". In the top example, these interrogation heads assign a high probability to the question mark in the time step of the interrogative word (with 97.0% probability), while in the bottom example, since there is no question mark, the same head does not assign a high probability to the last token in the sentence during the interrogative word's time step. Surprisingly, this head prefers a low α = 1.05, as can be seen from the dense weights. This allows the head to identify the noun phrase "Armani Polo" better.
Fixed sparsity patterns. Recent research improves the scalability of Transformer-like networks through static, fixed sparsity patterns (Child et al., 2019; Wu et al., 2019). Our adaptively-sparse Transformer can dynamically select a sparsity pattern that finds relevant words regardless of their position (e.g., Figure 9). Moreover, the two strategies could be combined. In a concurrent line of research, Sukhbaatar et al. (2019) propose an adaptive attention span for Transformer language models. While their work has each head learn a different contiguous span of context tokens to attend to, our work finds different sparsity patterns in the same span. Interestingly, some of their findings mirror ours: we found that attention heads in the last layers tend to be denser on average when compared to the ones in the first layers, while their work has found that lower layers tend to have a shorter attention span compared to higher layers.
References

M Cettolo, M Federico, L Bentivogli, J Niehues, S Stüker, K Sudoh, K Yoshino, and C Federmann. 2017. Overview of the IWSLT 2017 evaluation campaign. In Proc. IWSLT.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse Transformers. preprint arXiv:1904.10509.

Frank H Clarke. 1990. Optimization and Nonsmooth Analysis. SIAM.

Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and Alexander Rush. 2018. Latent alignment and variational attention. In Proc. NeurIPS.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL-HLT.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. In Proc. ICML.

Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. 2016. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. preprint arXiv:1607.05447.

Michael Held, Philip Wolfe, and Harlan P Crowder. 1974. Validation of subgradient optimization. Mathematical Programming, 6(1):62-88.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proc. NAACL-HLT.

Marcin Junczys-Dowmunt, Kenneth Heafield, Hieu Hoang, Roman Grundkiewicz, and Anthony Aue. 2018. Marian: Cost-effective high-quality neural machine translation in C++. In Proc. WNMT.

Jian Li, Zhaopeng Tu, Baosong Yang, Michael R Lyu, and Tong Zhang. 2018. Multi-Head Attention with Disagreement Regularization. In Proc. EMNLP.

Christos Louizos, Max Welling, and Diederik P Kingma. 2018. Learning sparse neural networks through L0 regularization. In Proc. ICLR.

Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. In Proc. EMNLP.

Chaitanya Malaviya, Pedro Ferreira, and André FT Martins. 2018. Sparse and constrained attention for neural machine translation. In Proc. ACL.

David Marecek and Rudolf Rosa. 2018. Extracting syntactic trees from Transformer encoder self-attentions. In Proc. BlackboxNLP.

André FT Martins and Ramón Fernandez Astudillo. 2016. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proc. of ICML.

Sameen Maruf, André FT Martins, and Gholamreza Haffari. 2019. Selective attention for context-aware neural machine translation. preprint arXiv:1903.08788.

Graham Neubig. 2011. The Kyoto free translation task. http://www.phontron.com/kftt.

Vlad Niculae and Mathieu Blondel. 2017. A regularized framework for sparse and structured neural attention. In Proc. NeurIPS.

Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In Proc. WMT.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. ACL.

Ben Peters, Vlad Niculae, and André FT Martins. 2019. Sparse sequence-to-sequence models. In Proc. ACL.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. preprint.

Alessandro Raganato and Jörg Tiedemann. 2018. An analysis of encoder representations in Transformer-based machine translation. In Proc. BlackboxNLP.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proc. ACL.

Wenqi Shao, Tianjian Meng, Jingyu Li, Ruimao Zhang, Yudian Li, Xiaogang Wang, and Ping Luo. 2019. SSN: Learning sparse switchable normalization via SparsestMax. In Proc. CVPR.

Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019. Adaptive Attention Span in Transformers. In Proc. ACL.

Gongbo Tang, Mathias Müller, Annette Rios, and Rico Sennrich. 2018. Why self-attention? A targeted evaluation of neural machine translation architectures. In Proc. EMNLP.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proc. ACL.

Constantino Tsallis. 1988. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52:479-487.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. NeurIPS.

Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In Proc. ACL.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proc. ACL.

Felix Wu, Angela Fan, Alexei Baevski, Yann N Dauphin, and Michael Auli. 2019. Pay less attention with lightweight and dynamic convolutions. In Proc. ICLR.
Supplementary Material

A High-Level Statistics Analysis of Other Language Pairs

Figure 11: Histograms of α values per attention block (encoder self-attention, context attention, decoder self-attention) for (a) WMT 2016 RO→EN, (b) KFTT JA→EN, (c) WMT 2014 EN→DE, and (d) IWSLT 2017 DE→EN.

Figure 12: Histograms of head densities for the same four language pairs.

Figure 13: Jensen-Shannon divergence over layers (softmax, 1.5-entmax, α-entmax) for the same four language pairs.

Figure 14: Head densities over layers (fixed α = 1.5 vs. learned α) for the same four language pairs.
B Background

B.1 Regularized Fenchel-Young prediction functions

Definition 1 (Blondel et al., 2019). Let \Omega : \triangle^d \to R \cup \{\infty\} be a strictly convex regularization function. We define the prediction function \pi_\Omega as

    \pi_\Omega(z) = \operatorname*{argmax}_{p \in \triangle^d} \left( p^\top z - \Omega(p) \right).    (12)

B.2 Characterizing the α-entmax mapping

Lemma 1 (Peters et al., 2019). For any z, there exists a unique \tau^\star such that

    \alpha\text{-entmax}(z) = [(\alpha - 1) z - \tau^\star \mathbf{1}]_+^{1/(\alpha-1)}.    (13)

Proof: From the definition of α-entmax,

    \alpha\text{-entmax}(z) := \operatorname*{argmax}_{p \in \triangle^d} \; p^\top z + H^T_\alpha(p),    (14)

we may easily identify it with a regularized prediction function (Def. 1):

    \alpha\text{-entmax}(z) \equiv \pi_{-H^T_\alpha}(z).

We first note that for all p \in \triangle^d,

    -(\alpha - 1)\, H^T_\alpha(p) = \frac{1}{\alpha} \sum_{i=1}^{d} p_i^\alpha + \text{const}.    (15)

From the constant invariance and scaling properties of \pi_\Omega (Blondel et al., 2019, Proposition 1, items 4-5),

    \pi_{-H^T_\alpha}(z) = \pi_\Omega((\alpha - 1) z), \quad \text{with} \quad \Omega(p) = \sum_{j=1}^{d} g(p_j), \quad g(t) = \frac{t^\alpha}{\alpha}.

Using (Blondel et al., 2019, Proposition 5), noting that g'(t) = t^{\alpha-1} and (g')^{-1}(u) = u^{1/(\alpha-1)}, yields

    \pi_\Omega(z) = [z - \tau^\star \mathbf{1}]_+^{1/(\alpha-1)}, \quad \text{and therefore} \quad \alpha\text{-entmax}(z) = [(\alpha-1) z - \tau^\star \mathbf{1}]_+^{1/(\alpha-1)}.    (16)

Since H^T_\alpha is strictly convex on the simplex, α-entmax has a unique solution p^\star. Equation 16 then defines a one-to-one mapping between p^\star and \tau^\star as long as p^\star \in \triangle^d, therefore \tau^\star is also unique.
B.3 Connections to softmax and sparsemax

The Euclidean projection onto the simplex, sometimes referred to, in the context of neural attention, as sparsemax (Martins and Astudillo, 2016), is defined as

    \mathrm{sparsemax}(z) := \operatorname*{argmin}_{p \in \triangle^d} \|p - z\|_2^2.    (17)

The solution can be characterized through the unique threshold \tau such that \sum_i \mathrm{sparsemax}(z)_i = 1 and (Held et al., 1974)

    \mathrm{sparsemax}(z) = [z - \tau \mathbf{1}]_+.    (18)

Thus, each coordinate of the sparsemax solution is a piecewise-linear function. Visibly, this expression is recovered when setting \alpha = 2 in the α-entmax expression (Equation 6); for other values of \alpha, the exponent induces curvature.
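For reference, here is a NumPy sketch of Equations 17-18 using the standard sort-and-threshold computation of the threshold (in the spirit of Held et al., 1974; Martins and Astudillo, 2016); an illustrative sketch, not the authors' implementation:

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the simplex (Equations 17-18),
    via the usual sort-and-threshold procedure."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cssv = np.cumsum(z_sorted) - 1.0
    k = np.arange(1, z.size + 1)
    support = k * z_sorted > cssv      # coordinates kept in the support
    k_max = support.sum()              # size of the support
    tau = cssv[k_max - 1] / k_max      # threshold such that the output sums to 1
    return np.maximum(z - tau, 0.0)

z = np.array([1.0, 0.8, -1.5])
print(sparsemax(z))        # [0.6, 0.4, 0.0]: the low score gets exactly zero weight
print(sparsemax(z).sum())  # 1.0
```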
On the other hand, the well-known softmax is usually defined through the expression

    \mathrm{softmax}(z)_i := \frac{\exp(z_i)}{\sum_j \exp(z_j)},    (19)

which can be shown to be the unique solution of the optimization problem

    \mathrm{softmax}(z) = \operatorname*{argmax}_{p \in \triangle^d} \; p^\top z + H^S(p),    (20)

where H^S(p) := -\sum_i p_i \log p_i is the Shannon entropy. Indeed, setting the gradient to 0 yields the condition \log p_i = z_i + \nu_i - \tau - 1, where \tau and \nu \ge 0 are Lagrange multipliers for the simplex constraints \sum_i p_i = 1 and p_i \ge 0, respectively. Since the l.h.s. is only finite for p_i > 0, we must have \nu_i = 0 for all i, by complementary slackness. Thus, the solution must have the form p_i = \exp(z_i)/Z, yielding Equation 19.
C Jacobian of α-entmax w.r.t. the shape parameter α: Proof of Proposition 1

Recall that the entmax transformation is defined as:

    \alpha\text{-entmax}(z) := \operatorname*{argmax}_{p \in \triangle^d} \; p^\top z + H^T_\alpha(p),    (21)

where \alpha \ge 1 and H^T_\alpha is the Tsallis entropy,

    H^T_\alpha(p) := \begin{cases} \frac{1}{\alpha(\alpha-1)} \sum_j \left( p_j - p_j^\alpha \right), & \alpha \ne 1, \\ H^S(p), & \alpha = 1, \end{cases}    (22)

and H^S(p) := -\sum_j p_j \log p_j is the Shannon entropy.

In this section, we derive the Jacobian of entmax with respect to the scalar parameter \alpha.
C.1 General case of α > 1

From the KKT conditions associated with the optimization problem in Eq. 21, we have that the solution p^\star has the following form, coordinate-wise:

    p_i^\star = [(\alpha - 1)(z_i - \tau^\star)]_+^{1/(\alpha-1)},    (23)

where \tau^\star is a scalar Lagrange multiplier that ensures that p^\star normalizes to 1, i.e., it is defined implicitly by the condition:

    \sum_i [(\alpha - 1)(z_i - \tau^\star)]_+^{1/(\alpha-1)} = 1.    (24)

For general values of \alpha, Eq. 24 lacks a closed-form solution, which makes the computation of the Jacobian

    \frac{\partial\, \alpha\text{-entmax}(z)}{\partial \alpha}    (25)

non-trivial. Fortunately, we can use the technique of implicit differentiation to obtain this Jacobian.

The Jacobian exists almost everywhere, and the expressions we derive yield a generalized Jacobian (Clarke, 1990) at any non-differentiable points that may occur for certain (\alpha, z) pairs. We begin by noting that \frac{\partial p_i^\star}{\partial \alpha} = 0 if p_i^\star = 0, because increasing \alpha keeps sparse coordinates sparse.⁴ Therefore we need to worry only about coordinates that are in the support of p^\star. We will assume hereafter that the i-th coordinate of p^\star is non-zero. We have:

    \frac{\partial p_i^\star}{\partial \alpha}
    = \frac{\partial}{\partial \alpha} [(\alpha-1)(z_i - \tau^\star)]^{1/(\alpha-1)}
    = \frac{\partial}{\partial \alpha} \exp\!\left( \frac{1}{\alpha-1} \log[(\alpha-1)(z_i - \tau^\star)] \right)
    = p_i^\star \, \frac{\partial}{\partial \alpha} \left( \frac{1}{\alpha-1} \log[(\alpha-1)(z_i - \tau^\star)] \right)
    = \frac{p_i^\star}{(\alpha-1)^2} \left( \frac{\frac{\partial}{\partial \alpha}\big[(\alpha-1)(z_i - \tau^\star)\big]}{z_i - \tau^\star} - \log[(\alpha-1)(z_i - \tau^\star)] \right)
    = \frac{p_i^\star}{(\alpha-1)^2} \left( \frac{z_i - \tau^\star - (\alpha-1)\frac{\partial \tau^\star}{\partial \alpha}}{z_i - \tau^\star} - \log[(\alpha-1)(z_i - \tau^\star)] \right)
    = \frac{p_i^\star}{(\alpha-1)^2} \left( 1 - \frac{\alpha-1}{z_i - \tau^\star}\frac{\partial \tau^\star}{\partial \alpha} - \log[(\alpha-1)(z_i - \tau^\star)] \right).    (26)

We can see that this Jacobian depends on \frac{\partial \tau^\star}{\partial \alpha}, which we now compute using implicit differentiation. Let S = \{ i : p_i^\star > 0 \}. By differentiating both sides of Eq. 24, re-using some of the steps in Eq. 26, and recalling Eq. 23, we get

    0 = \sum_{i \in S} \frac{\partial}{\partial \alpha} [(\alpha-1)(z_i - \tau^\star)]^{1/(\alpha-1)}
      = \sum_{i \in S} \frac{p_i^\star}{(\alpha-1)^2} \left( 1 - \frac{\alpha-1}{z_i - \tau^\star}\frac{\partial \tau^\star}{\partial \alpha} - \log[(\alpha-1)(z_i - \tau^\star)] \right)
      = \frac{1}{(\alpha-1)^2} \left( \sum_{i \in S} p_i^\star - \frac{\partial \tau^\star}{\partial \alpha} \sum_{i \in S} \frac{(\alpha-1)\, p_i^\star}{z_i - \tau^\star} - \sum_{i \in S} p_i^\star \log[(\alpha-1)(z_i - \tau^\star)] \right)
      = \frac{1}{(\alpha-1)^2} \left( 1 - (\alpha-1)^2 \frac{\partial \tau^\star}{\partial \alpha} \sum_i (p_i^\star)^{2-\alpha} - (\alpha-1) \sum_i p_i^\star \log p_i^\star \right)
      = \frac{1}{(\alpha-1)^2} \left( 1 - (\alpha-1)^2 \frac{\partial \tau^\star}{\partial \alpha} \sum_i (p_i^\star)^{2-\alpha} + (\alpha-1)\, H^S(p^\star) \right),    (27)

where we used (\alpha-1)(z_i - \tau^\star) = (p_i^\star)^{\alpha-1}, so that \frac{(\alpha-1) p_i^\star}{z_i - \tau^\star} = (\alpha-1)^2 (p_i^\star)^{2-\alpha} and \log[(\alpha-1)(z_i - \tau^\star)] = (\alpha-1) \log p_i^\star. From Eq. 27 we obtain:

    \frac{\partial \tau^\star}{\partial \alpha} = \frac{\frac{1}{(\alpha-1)^2} + \frac{H^S(p^\star)}{\alpha-1}}{\sum_i (p_i^\star)^{2-\alpha}}.    (28)

Finally, plugging Eq. 28 into Eq. 26, we get:

    \frac{\partial p_i^\star}{\partial \alpha}
    = \frac{p_i^\star}{(\alpha-1)^2} \left( 1 - (\alpha-1)^2 (p_i^\star)^{1-\alpha} \frac{\partial \tau^\star}{\partial \alpha} - (\alpha-1) \log p_i^\star \right)
    = \frac{p_i^\star}{(\alpha-1)^2} \left( 1 - \frac{\tilde{p}_i(\alpha)}{p_i^\star}\big( 1 + (\alpha-1) H^S(p^\star) \big) - (\alpha-1) \log p_i^\star \right)
    = \frac{p_i^\star - \tilde{p}_i(\alpha)}{(\alpha-1)^2} - \frac{p_i^\star \log p_i^\star + \tilde{p}_i(\alpha)\, H^S(p^\star)}{\alpha-1},    (29)

where we denote by

    \tilde{p}_i(\alpha) = \frac{(p_i^\star)^{2-\alpha}}{\sum_j (p_j^\star)^{2-\alpha}}.    (30)

The distribution \tilde{p}(\alpha) can be interpreted as a skewed distribution obtained from p^\star, which appears in the Jacobian of \alpha\text{-entmax}(z) w.r.t. z as well (Peters et al., 2019).

⁴ This follows from the margin property of H^T_\alpha (Blondel et al., 2019).
C.2 Solving the indetermination for α = 1

We can write Eq. 29 as

    \frac{\partial p_i^\star}{\partial \alpha} = \frac{p_i^\star - \tilde{p}_i(\alpha) - (\alpha-1)\left( p_i^\star \log p_i^\star + \tilde{p}_i(\alpha)\, H^S(p^\star) \right)}{(\alpha-1)^2}.    (31)

When \alpha \to 1^+, we have \tilde{p}(\alpha) \to p^\star, which leads to a \frac{0}{0} indetermination.

To solve this indetermination, we will need to apply L'Hopital's rule twice. Let us first compute the derivative of \tilde{p}_i(\alpha) with respect to \alpha. We have

    \frac{\partial}{\partial \alpha} (p_i^\star)^{2-\alpha} = -(p_i^\star)^{2-\alpha} \log p_i^\star,    (32)

therefore

    \frac{\partial}{\partial \alpha} \tilde{p}_i(\alpha)
    = \frac{\partial}{\partial \alpha} \frac{(p_i^\star)^{2-\alpha}}{\sum_j (p_j^\star)^{2-\alpha}}
    = \frac{-(p_i^\star)^{2-\alpha} \log p_i^\star}{\sum_j (p_j^\star)^{2-\alpha}} + \frac{(p_i^\star)^{2-\alpha} \sum_j (p_j^\star)^{2-\alpha} \log p_j^\star}{\left( \sum_j (p_j^\star)^{2-\alpha} \right)^2}
    = -\tilde{p}_i(\alpha) \log p_i^\star + \tilde{p}_i(\alpha) \sum_j \tilde{p}_j(\alpha) \log p_j^\star.    (33)

Differentiating the numerator and denominator in Eq. 31, we get:

    \frac{\partial p_i^\star}{\partial \alpha}\bigg|_{\alpha=1}
    = \lim_{\alpha \to 1^+} \frac{\big(1 + (\alpha-1) H^S(p^\star)\big)\, \tilde{p}_i(\alpha) \left( \log p_i^\star - \sum_j \tilde{p}_j(\alpha) \log p_j^\star \right) - p_i^\star \log p_i^\star - \tilde{p}_i(\alpha)\, H^S(p^\star)}{2(\alpha-1)}
    = A + B,    (34)

with

    A = \lim_{\alpha \to 1^+} \frac{H^S(p^\star)\, \tilde{p}_i(\alpha) \left( \log p_i^\star - \sum_j \tilde{p}_j(\alpha) \log p_j^\star \right)}{2}
      = \frac{H^S(p^\star)\, p_i^\star \log p_i^\star + p_i^\star \left( H^S(p^\star) \right)^2}{2},    (35)

and

    B = \lim_{\alpha \to 1^+} \frac{\tilde{p}_i(\alpha) \left( \log p_i^\star - \sum_j \tilde{p}_j(\alpha) \log p_j^\star \right) - p_i^\star \log p_i^\star - \tilde{p}_i(\alpha)\, H^S(p^\star)}{2(\alpha-1)}.    (36)

When \alpha \to 1^+, B becomes again a \frac{0}{0} indetermination, which we can solve by applying again L'Hopital's rule. Differentiating the numerator and denominator in Eq. 36:

    B = \frac{1}{2} \lim_{\alpha \to 1^+} \Bigg\{
        \tilde{p}_i(\alpha) \log p_i^\star \bigg( \sum_j \tilde{p}_j(\alpha) \log p_j^\star - \log p_i^\star \bigg)
        - \tilde{p}_i(\alpha) \bigg( \sum_j \tilde{p}_j(\alpha) \log p_j^\star - \log p_i^\star \bigg) \bigg( \sum_j \tilde{p}_j(\alpha) \log p_j^\star + H^S(p^\star) \bigg)
        - \tilde{p}_i(\alpha) \sum_j \tilde{p}_j(\alpha) \log p_j^\star \bigg( \sum_k \tilde{p}_k(\alpha) \log p_k^\star - \log p_j^\star \bigg)
      \Bigg\}
    = \frac{-p_i^\star \log p_i^\star \left( H^S(p^\star) + \log p_i^\star \right) + p_i^\star \sum_j p_j^\star \log p_j^\star \left( H^S(p^\star) + \log p_j^\star \right)}{2}
    = \frac{-H^S(p^\star)\, p_i^\star \log p_i^\star - p_i^\star \left( H^S(p^\star) \right)^2 - p_i^\star \log^2 p_i^\star + p_i^\star \sum_j p_j^\star \log^2 p_j^\star}{2}.    (37)

Finally, summing Eq. 35 and Eq. 37, we get

    \frac{\partial p_i^\star}{\partial \alpha}\bigg|_{\alpha=1} = \frac{-p_i^\star \log^2 p_i^\star + p_i^\star \sum_j p_j^\star \log^2 p_j^\star}{2}.    (38)
C.3 Summary

To sum up, we have the following expression for the Jacobian of α-entmax with respect to α:

    \frac{\partial p_i^\star}{\partial \alpha} = \begin{cases}
      \dfrac{p_i^\star - \tilde{p}_i(\alpha)}{(\alpha-1)^2} - \dfrac{p_i^\star \log p_i^\star + \tilde{p}_i(\alpha)\, H^S(p^\star)}{\alpha-1}, & \text{for } \alpha > 1, \\[1.5ex]
      \dfrac{-p_i^\star \log^2 p_i^\star + p_i^\star \sum_j p_j^\star \log^2 p_j^\star}{2}, & \text{for } \alpha = 1.
    \end{cases}    (39)
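As a sanity check of Equation 39 (our own addition, not part of the original derivation), the following NumPy sketch compares the closed-form Jacobian against a central finite difference of a bisection-based α-entmax; all function and variable names are ours:

```python
import numpy as np

def entmax_bisect(z, alpha, n_iter=80):
    # Bisection on the threshold tau of Eqs. 23-24 (same idea as the sketch in Section 2.2).
    z = (alpha - 1.0) * np.asarray(z, dtype=float)
    lo, hi = z.max() - 1.0, z.max()
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        mass = (np.maximum(z - tau, 0.0) ** (1.0 / (alpha - 1.0))).sum()
        lo, hi = (tau, hi) if mass >= 1.0 else (lo, tau)
    p = np.maximum(z - lo, 0.0) ** (1.0 / (alpha - 1.0))
    return p / p.sum()

def alpha_jacobian(p, alpha):
    # Equation 39, alpha > 1 branch (the alpha = 1 case is the limit handled in C.2).
    supp = p > 0
    logp = np.zeros_like(p)
    logp[supp] = np.log(p[supp])
    shannon = -np.sum(p * logp)                  # H^S(p*)
    p_tilde = np.zeros_like(p)
    p_tilde[supp] = p[supp] ** (2.0 - alpha)
    p_tilde /= p_tilde.sum()                     # the skewed distribution of Eq. 30
    return (p - p_tilde) / (alpha - 1.0) ** 2 - (p * logp + p_tilde * shannon) / (alpha - 1.0)

z = np.array([0.9, 0.2, -0.4, -2.0])
alpha, eps = 1.5, 1e-4
analytic = alpha_jacobian(entmax_bisect(z, alpha), alpha)
numeric = (entmax_bisect(z, alpha + eps) - entmax_bisect(z, alpha - eps)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))  # close to zero (up to finite-difference error)
```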
Gonc¸alo M. Correia
ä
goncalo.correia@lx.it.pt
Vlad Niculae
ä
vlad@vene.ro
ä
Instituto de Telecomunicac¸oes, Lisbon, Portugal
ã
Unbabel, Lisbon, Portugal
Andr´e F.T. Martins
ä ã
andre.martins@unbabel.com
Abstract
Attention mechanisms have become ubiqui-
tous in NLP. Recent architectures, notably
the Transformer, learn powerful context-aware
word representations through layered, multi-
headed attention. The multiple heads learn
diverse types of word relationships. How-
ever, with standard softmax attention, all at-
tention heads are dense, assigning a non-zero
weight to all context words. In this work, we
introduce the adaptively sparse Transformer,
wherein attention heads have exible, context-
dependent sparsity patterns. This sparsity is
accomplished by replacing softmax with-
entmax: a differentiable generalization of soft-
max that allows low-scoring words to receive
precisely zero weight. Moreover, we derive a
method to automatically learn theparameter
which controls the shape and sparsity of-
entmax allowing attention heads to choose
between focused or spread-out behavior. Our
adaptively sparse Transformer improves inter-
pretability and head diversity when compared
to softmax Transformers on machine transla-
tion datasets. Findings of the quantitative and
qualitative analysis of our approach include
that heads in different layers learn different
sparsity preferences and tend to be more di-
verse in their attention distributions than soft-
max Transformers. Furthermore, at no cost in
accuracy, sparsity in attention heads helps to
uncover different head specializations.
1 Introduction
The Transformer architecture (Vaswani et al.,)
for deep neural networks has quickly risen to promi-
nence in NLP through its efciency and perfor-
mance, leading to improvements in the state of the
art of Neural Machine Translation (NMT;
Dowmunt et al.,;,), as well as
inspiring other powerful general-purpose models
like BERT (Devlin et al.,) and GPT-2(Rad-
ford et al.,). At the heart of the TransformerThequickbrownfoxjumpsoverThequickbrownfoxjumpsoverThequickbrownfoxjumpsover
head 1
head 4
head 3
head 2
Sparse Transformer
Adaptive Span
Transformer
Adaptively Sparse
Transformer (Ours)
Figure 1: Attention distributions of different self-
attention heads for the time step of the token over,
shown to compare our model to other related work.
While the sparse Transformer (Child et al.,) and
the adaptive span Transformer (Sukhbaatar et al.,)
only attend to words within a contiguous span of the
past tokens, our model is not only able to obtain differ-
ent and not necessarily contiguous sparsity patterns for
each attention head, but is also able to tune its support
over which tokens to attend adaptively.
liemulti-head attentionmechanisms: each word
is represented by multiple different weighted aver-
ages of its relevant context. As suggested by recent
works on interpreting attention head roles, sepa-
rate attention heads may learn to look for various
relationships between tokens (Tang et al.,;
ganato and Tiedemann,; cek and Rosa,
2018;,;,).
The attention distribution of each head is pre-
dicted typically using thesoftmaxnormalizing
transform. As a result, all context words have
non-zero attention weight. Recent work on sin-
gle attention architectures suggest that using sparse
normalizing transforms in attention mechanisms
such as sparsemax which can yield exactly zero
probabilities for irrelevant words may improve
performance and interpretability (Malaviya et al.,
2018;,;,). Qual-
itative analysis of attention heads (Vaswani et al.,
2017, Figure 5) suggests that, depending on what
phenomena they capture, heads tend to favor atter
or more peaked distributions.
Recent works have proposed sparse Transform-
ers (Child et al.,) and adaptive span Trans-
formers (Sukhbaatar et al.,). However, the
sparsity of those models only limits the attention
to a contiguous span of past tokens, while in this
work we propose ahighly adaptiveTransformer
model that is capable of attending to a sparse set of
words that are not necessarily contiguous. Figure
shows the relationship of these methods with ours.
Our contributions are the following:
We introducesparse attentioninto the Trans-
former architecture, showing that it eases inter-
pretability and leads to slight accuracy gains.
We propose an adaptive version of sparse at-
tention, where the shape of each attention
head islearnableand can vary continuously
and dynamically between the dense limit case
ofsoftmaxand the sparse, piecewise-linear
sparsemaxcase.
1
We make an extensive analysis of the added
interpretability of these models, identifying
both crisper examples of attention head behav-
ior observed in previous work, as well as novel
behaviors unraveled thanks to the sparsity and
adaptivity of our proposed model.
2 Background
2.1 The Transformer
In NMT, the Transformer (Vaswani et al.,)
is a sequence-to-sequence (seq2seq) model which
maps an input sequence to an output sequence
through hierarchicalmulti-head attentionmech-
anisms, yielding a dynamic, context-dependent
strategy for propagating information within and
across sentences. It contrasts with previous seq2seq
models, which usually rely either on costly gated
recurrent operations (often LSTMs:
et al.,;,) or static convo-
lutions (Gehring et al.,).
Givennquery contexts andmsequence items
under consideration, attention mechanisms com-
pute, for each query, a weighted representation of
the items. The particular attention mechanism used
in2017) is called scaled dot-product
attention, and it is computed in the following way:
Att(Q;K;V) =
QK
>
p
d
V;(1)
1
Code and pip package available athttps://github.
com/deep-spin/entmax.
whereQ2R
nd contains representations of the
queries,K;V2R
md are thekeysandvalues
of the items attended over, anddis the dimen-
sionality of these representations. Themapping
normalizes row-wise usingsoftmax,(Z)ij=
softmax(zi)j, where
softmax(z) =
exp(zj)
P
j
0exp(zj
0)
: (2)
In words, thekeysare used to compute a relevance
score between each item and query. Then, normal-
ized attention weights are computed using softmax,
and these are used to weight thevaluesof each item
at each query context.
However, for complex tasks, different parts of a
sequence may be relevant in different ways, moti-
vatingmulti-head attentionin Transformers. This
is simply the application of Equation
lelHtimes, each with a different, learned linear
transformation that allows specialization:
Headi(Q;K;V)=Att(QW
Q
i
;KW
K
i;V W
V
i)(3)
In the Transformer, there are three separate multi-
head attention mechanisms for distinct purposes:
Encoder self-attention:
builds rich, layered
representations of each input word, by attend-
ing on the entire input sentence.
Context attention:
selects a representative
weighted average of the encodings of the input
words, at each time step of the decoder.
Decoder self-attention:
attends over the par-
tial output sentence fragment produced so far.
Together, these mechanisms enable the contextual-
ized ow of information between the input sentence
and the sequential decoder.
2.2 Sparse Attention
The softmax mapping (Equation) is elementwise
proportional toexp, therefore it can never assign a
weight ofexactly zero. Thus, unnecessary items
are still taken into consideration to some extent.
Since its output sums to one, this invariably means
less weight is assigned to the relevant items, po-
tentially harming performance and interpretabil-
ity (Jain and Wallace,). This has motivated a
line of research on learning networks withsparse
mappings (Martins and Astudillo,;
and Blondel,;,;,
2019). We focus on a recently-introduced exible
family of transformations,-entmax (Blondel et al.,
2019;,), dened as:
-entmax(z):=argmax
p24
d
p
>
z+H
T
(p);(4)
where4
d
:=fp2R
d
:
P
i
pi= 1g is theprob-
ability simplex, and, for1 ,H
T
is the Tsallis
continuous family of entropies (Tsallis,):
H
T
(p):=
(
1
( 1)
P
j
pj p
j
; 6= 1;
P
j
pjlogpj; = 1:
(5)
This family contains the well-known Shannon and
Gini entropies, corresponding to the cases= 1
and= 2, respectively.
Equation
problem. Using the denition ofH
T
, the optimality
conditions may be used to derive the following
form for the solution (Appendix):
-entmax(z) = [( 1)z 1]
1= 1
+
;(6)
where[]+is the positive part (ReLU) function,
1denotes the vector of all ones, and which
acts like a threshold is the Lagrange multiplier
corresponding to the
P
i
pi= 1constraint.
Properties of-entmax.
The appeal of-
entmax for attention rests on the following prop-
erties. For= 1(i.e., whenH
T
becomes the
Shannon entropy), it exactly recovers the softmax
mapping (We provide a short derivation in Ap-
pendix.). For all >1 it permits sparse solu-
tions, in stark contrast to softmax. In particular, for
= 2, it recovers the sparsemax mapping (Martins
and Astudillo,), which is piecewise linear. In-
between, asincreases, the mapping continuously
gets sparser as its curvature changes.
To compute the value of-entmax, one must
nd the thresholdsuch that ther.h.s.in Equa-
tion2019) propose
a general bisection algorithm.2019)
introduce a faster, exact algorithm for= 1:5 , and
enable using-entmaxwith xedwithin a neu-
ral network by showing that the-entmax Jacobian
w.r.t.zforp
?
=-entmax(z)is
@ -entmax(z)
@z
=diag(s)
1
P
j
sj
ss
>
;
wheresi=
(
(p
?
i
)
2
; p
?
i
>0;
0; p
?
i
= 0:
(7)
Our work furthers the study of-entmax by
providing a derivation of the Jacobianw.r.t.the
hyper-parameter
(Section), thereby allowing
the shape and sparsity of the mapping to be learned
automatically. This is particularly appealing in the
context of multi-head attention mechanisms, where
we shall show in Section
tend to learn different sparsity behaviors.
3 Adaptively Sparse Transformers
with-entmax
We now propose a novel Transformer architecture
wherein we simply replace softmax with-entmax
in the attention heads. Concretely, we replace the
row normalizationin Equation
(Z)ij=-entmax(zi)j (8)
This change leads to sparse attention weights, as
long as >1; in particular,= 1:5 is a sensible
starting point (Peters et al.,).
Differentper head.
Unlike LSTM-based
seq2seq models, wherecan be more easily tuned
by grid search, in a Transformer, there are many
attention heads in multiple layers. Crucial to the
power of such models, the different heads capture
different linguistic phenomena, some of them iso-
lating important words, others spreading out atten-
tion across phrases (Vaswani et al.,, Figure 5).
This motivates using different, adaptivevalues
for each attention head, such that some heads may
learn to be sparser, and others may become closer
to softmax. We propose doing so by treating the
values as neural network parameters, optimized via
stochastic gradients along with the other weights.
Derivativesw.r.t..
In order to optimizeau-
tomatically via gradient methods, we must com-
pute the Jacobian of the entmax outputw.r.t..
Since entmax is dened through an optimization
problem, this is non-trivial and cannot be simply
handled through automatic differentiation; it falls
within the domain ofargmin differentiation, an ac-
tive research topic in optimization (Gould et al.,
2016;,).
One of our key contributions is the derivation
of a closed-form expression for this Jacobian. The
next proposition provides such an expression, en-
abling entmax layers with adaptive. To the best
of our knowledge, ours is the rst neural network
module that can automatically, continuously vary
in shape away from softmax and toward sparse
mappings like sparsemax.
Proposition 1.
Letp
?
:=-entmax(z) be the so-
lution of Equation. Denote the distribution ~pi:=
(p
?
i
)
2
=
P
j
(p
?
j
)
2
and lethi:= p
?
i
logp
?
i . Thei
th
component of the Jacobiang:=
@ -entmax(z)
@
is
gi=
8
<
:
p
?
i
~pi
( 1)
2+
hi ~pi
P
j
hj
1
; >1;
hilogp
?
i
p
?
i
P
j
hjlogp
?
j
2
; = 1:
(9)
The proof uses implicit function differentiation and
is given in Appendix.
Proposition
piece needed for training adaptively sparse Trans-
formers. In the following section, we evaluate this
strategy on neural machine translation, and analyze
the behavior of the learned attention heads.
4 Experiments
We apply our adaptively sparse Transformers on
four machine translation tasks. For comparison,
a natural baseline is the standard Transformer ar-
chitecture using the softmax transform in its multi-
head attention mechanisms. We consider two other
model variants in our experiments that make use of
different normalizing transformations:
1.5-entmax:
a Transformer with sparse ent-
max attention with xed= 1:5 for all heads.
This is a novel model, since 1.5-entmax had
only been proposed for RNN-based NMT
models (Peters et al.,), but never in
Transformers, where attention modules are
not just one single component of the seq2seq
model but rather an integral part of all of the
model components.
-entmax:
anadaptiveTransformer with
sparse entmax attention with a different,
learned
t
i;j
for each head.
The adaptive model has an additional scalar pa-
rameter per attention head per layer for each of the
three attention mechanisms (encoder self-attention,
context attention, and decoder self-attention),i.e.,
a
t
i;j2R:i2 f1; : : : ; Lg; j2 f1; : : : ; Hg;
t2 fenc;ctx;decg
;
(10)
and we set
t
i;j
= 1 +sigmoid(a
t
i;j
)2]1;2[ . All or
some of thevalues can be tied if desired, but we
keep them independent for analysis purposes.
Datasets.
Our models were trained on 4 machine
translation datasets of different training sizes:
IWSLT 2017 German!English (DEEN,
tolo et al.,): 200K sentence pairs.
KFTT Japanese!English (JAEN,,
2011): 300K sentence pairs.
WMT 2016 Romanian!English (ROEN,
jar et al.,): 600K sentence pairs.
WMT 2014 English!German (ENDE,
et al.,): 4.5M sentence pairs.
All of these datasets were preprocessed with
byte-pair encoding (BPE;,),
using joint segmentations of 32k merge operations.
Training.We follow the dimensions of the
Transformer-Base model of2017):
The number of layers isL= 6and number of
heads isH= 8in the encoder self-attention, the
context attention, and the decoder self-attention.
We use a mini-batch size of 8192 tokens and warm
up the learning rate linearly until 20k steps, after
which it decays according to an inverse square root
schedule. All models were trained until conver-
gence of validation accuracy, and evaluation was
done at each 10k steps forROENandENDE
and at each 5k steps forDEENandJAEN. The
end-to-end computational overhead of our methods,
when compared to standard softmax, is relatively
small; in training tokens per second, the models
using-entmax and1:5-entmax are, respectively,
75%and90%the speed of the softmax model.
Results. We report test set tokenized BLEU (Papineni et al., 2002) results in Table 1. We can see that replacing softmax by entmax does not hurt performance in any of the datasets; indeed, sparse attention Transformers tend to have slightly higher BLEU, but their sparsity leads to a better potential for analysis. In the next section, we make use of this potential by exploring the learned internal mechanics of the self-attention heads.
5 Analysis

We conduct an analysis for the higher-resource dataset WMT 2014 English→German of the attention in the sparse adaptive Transformer model (α-entmax) at multiple levels: we analyze high-level statistics as well as individual head behavior. Moreover, we make a qualitative analysis of the interpretability capabilities of our models.
activation    DE→EN   JA→EN   RO→EN   EN→DE
softmax       29.79   21.57   32.70   26.02
1.5-entmax    29.83   22.13   33.10   25.89
α-entmax      29.90   21.74   32.89   26.93

Table 1: Machine translation tokenized BLEU test results on IWSLT 2017 DE→EN, KFTT JA→EN, WMT 2016 RO→EN and WMT 2014 EN→DE, respectively.
5.1 High-Level Statistics

What kind of α values are learned? Figure 2 shows the learning trajectories of the α parameters of a selected subset of heads. We generally observe a tendency for the randomly-initialized α parameters to decrease initially, suggesting that softmax-like behavior may be preferable while the model is still very uncertain. After around one thousand steps, some heads change direction and become sparser, perhaps as they become more confident and specialized. This shows that the initialization of α does not predetermine its sparsity level or the role the head will have throughout. In particular, head 8 in the encoder self-attention layer 2 first drops to around α = 1.3 before becoming one of the sparsest heads, with α close to 2.

The overall distribution of α values at convergence can be seen in Figure 3. We can observe that the encoder self-attention blocks learn to concentrate the α values in two modes: a very sparse one around α → 2, and a dense one between softmax and 1.5-entmax. However, the decoder self-attention and context attention only learn to distribute these parameters in a single mode. We show next that this is reflected in the average density of attention weight vectors as well.
Attention weight density when translating. For any α > 1, it would still be possible for the attention weight matrices to learn re-scalings so as to make attention sparser or denser. To visualize the impact of adaptive α values, we compare the empirical attention weight density (the average number of tokens receiving non-zero attention) within each module, against sparse Transformers with fixed α = 1.5.
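A minimal sketch of the density statistic used here (our own helper, assuming access to the per-query attention rows of a head):

```python
import numpy as np

def attention_density(attn_rows):
    """Average number of tokens receiving non-zero attention weight.

    attn_rows -- iterable of 1-D attention distributions (one per query token).
    """
    return float(np.mean([np.count_nonzero(row) for row in attn_rows]))
```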
Figure 4 shows that, with fixed α = 1.5, heads tend to be sparse and similarly-distributed in all three attention modules. With learned α, there are two notable changes: (i) a prominent mode corresponding to fully dense probabilities, showing that our models learn to combine sparse and dense attention, and (ii) a distinction between the encoder self-attention,
Figure 2: Trajectories of α values for a subset of the heads during training (decoder layer 1, head 8; encoder layer 1, heads 3 and 4; encoder layer 2, head 8; encoder layer 6, head 2). Initialized at random, most heads become denser in the beginning, before converging. This suggests that dense attention may be more beneficial while the network is still uncertain, being replaced by sparse attention afterwards.
whose background distribution tends toward extreme sparsity, and the other two modules, which exhibit more uniform background distributions. This suggests that perhaps entirely sparse Transformers are suboptimal.

The fact that the decoder seems to prefer denser attention distributions might be attributed to it being auto-regressive, only having access to past tokens and not the full sentence. We speculate that it might lose too much information if it assigned weights of zero to too many tokens in the self-attention, since there are fewer tokens to attend to in the first place.
Teasing this down into separate layers, Figure 5 shows the average (sorted) density of each head for each layer. We observe that α-entmax is able to learn different sparsity patterns at each layer, leading to more variance in individual head behavior, to clearly-identified dense and sparse heads, and overall to different tendencies compared to the fixed case of α = 1.5.
Figure 3: Distribution of learned α values per attention block. While the encoder self-attention has a bimodal distribution of values of α, the decoder self-attention and context attention have a single mode.

Head diversity. To measure the overall disagreement between attention heads, as a measure of head
diversity, we use the following generalization of
the Jensen-Shannon divergence:
$$
\mathrm{JS} = H^{\mathrm S}\!\left(\frac{1}{H}\sum_{j=1}^{H} p_j\right) - \frac{1}{H}\sum_{j=1}^{H} H^{\mathrm S}(p_j), \tag{11}
$$
where $p_j$ is the vector of attention weights assigned by head $j$ to each word in the sequence, and $H^{\mathrm S}$ is the Shannon entropy, base-adjusted based on the dimension of $p$ such that JS ≤ 1. We average
this measure over the entire validation set. The
higher this metric is, the more the heads are taking
different roles in the model.
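As an illustrative sketch (our own, not the paper's evaluation code), the head-diversity score of Equation 11 can be computed from one attention row per head as follows:

```python
import numpy as np

def head_diversity_js(P, eps=1e-12):
    """Generalized Jensen-Shannon divergence across heads (Eq. 11).

    P -- array of shape (H, d): one attention distribution per head.
    Entropies use log base d so that the score is at most 1.
    """
    H, d = P.shape

    def entropy(p):
        p = np.clip(p, eps, 1.0)
        return -np.sum(p * np.log(p)) / np.log(d)   # base-d Shannon entropy

    mean_p = P.mean(axis=0)
    return entropy(mean_p) - np.mean([entropy(P[j]) for j in range(H)])
```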
Figure 6 shows that the sparse Transformer variants show more diversity than the traditional softmax one. Interestingly, diversity seems to peak in the middle layers of the encoder self-attention and context attention, while this is not the case for the decoder self-attention.

The statistics shown in this section can be found for the other language pairs in Appendix A.
5.2 Identifying Head Specializations

Previous work pointed out some specific roles played by different heads in the softmax Transformer model (Voita et al., 2018; Voita et al., 2019). Identifying the specialization of a head can be done by observing the type of tokens
or sequences that the head often assigns most of its attention weight; this is facilitated by sparsity.

Figure 4: Distribution of attention densities (average number of tokens receiving non-zero attention weight) for all attention heads and all validation sentences, for 1.5-entmax and α-entmax across the encoder self-attention, context attention, and decoder self-attention. When compared to 1.5-entmax, α-entmax distributes the sparsity in a more uniform manner, with a clear mode at fully dense attentions, corresponding to the heads with low α. In the softmax case, this distribution would lead to a single bar with density 1.
Positional heads. One particular type of head, as noted by Voita et al. (2019), is the positional head. These heads tend to focus their attention on either the previous or next token in the sequence, thus obtaining representations of the neighborhood of the current time step. In Figure 7, we show attention plots for such heads, found for each of the studied models. The sparsity of our models allows these heads to be more confident in their representations, by assigning the whole probability distribution to a single token in the sequence. Concretely, we may measure a positional head's confidence as the average attention weight assigned to the previous token. The softmax model has three heads for position −1, with median confidence of 93.5%. The 1.5-entmax model also has three heads for this position, with median confidence of 94.4%. The adaptive model has four heads, with median confidence of 95.9%, the lowest-confidence head being dense with α = 1.18, and the highest-confidence head being sparse (α = 1.91).
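A small sketch of the confidence measure used here (our own helper; it assumes a square attention matrix for one head):

```python
import numpy as np

def previous_position_confidence(attn):
    """Average weight a head assigns to the previous token (offset -1).

    attn -- (n, n) attention matrix of one head; row t is the distribution for query t.
    """
    n = attn.shape[0]
    return float(np.mean([attn[t, t - 1] for t in range(1, n)]))
```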
Figure 5: Head density per layer for fixed and learned α (encoder self-attention, context attention, and decoder self-attention). Each line corresponds to an attention head; lower values mean that that attention head is sparser. Learned α has higher variance.
For position +1, the models each dedicate one head, with confidence around 95%, slightly higher for entmax. The adaptive model sets α = 1.96 for this head.
BPE-merging head. Due to the sparsity of our models, we are able to identify other head specializations, easily identifying which heads should be further analysed. In Figure 8 we show one such head where the α value is particularly high (encoder, layer 1, head 4, depicted in Figure 2). We found that this head most often looks at the current time step with high confidence, making it a positional head with offset 0. However, this head often spreads weight sparsely over 2-3 neighboring tokens, when the tokens are part of the same BPE cluster² or hyphenated words. As this head is in the first layer, it provides a useful service to the higher layers by combining information evenly within some BPE clusters.
For each BPE cluster or cluster of hyphenated
words, we computed a score between 0 and 1 that
corresponds to the maximum attention mass as-
signed by any token to the rest of the tokens inside
the cluster in order to quantify the BPE-merging
capabilities of these heads.³

² BPE-segmented words are denoted by a special marker in the figures.

Figure 6: Jensen-Shannon Divergence between heads at each layer (softmax, 1.5-entmax, and α-entmax; encoder self-attention, context attention, and decoder self-attention). Measures the disagreement between heads: the higher the value, the more the heads are disagreeing with each other in terms of where to attend. Models using sparse entmax have more diverse attention than the softmax baseline.
There are not any attention heads in the softmax model that are able to obtain a score over 80%, while for 1.5-entmax and α-entmax there are two heads in each (83.3% and 85.6% for 1.5-entmax, and 88.5% and 89.8% for α-entmax).
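A minimal sketch of the per-cluster score described above (the maximum attention mass any token of the cluster assigns to the rest of the cluster); this is our own reconstruction with hypothetical names, not the authors' code:

```python
import numpy as np

def bpe_merge_score(attn, cluster):
    """Score in [0, 1] for one BPE (or hyphenated-word) cluster.

    attn    -- (n, n) attention matrix of one head (rows sum to 1)
    cluster -- list of token indices belonging to the cluster
    For singleton clusters, returns the weight the token assigns to itself.
    """
    if len(cluster) == 1:
        i = cluster[0]
        return attn[i, i]
    scores = []
    for i in cluster:
        others = [j for j in cluster if j != i]
        scores.append(attn[i, others].sum())   # mass placed on the rest of the cluster
    return max(scores)
```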
Interrogation head. On the other hand, in Figure 9 we show a head for which our adaptively sparse model chose an α close to 1, making it closer to softmax (encoder, layer 1, head 3, depicted in Figure 2). We observe that this head assigns a high probability to question marks at the end of the sentence in time steps where the current token is interrogative, thus making it an interrogation-detecting head. We also observe this type of heads in the other models, which we also depict in Figure 9. The average attention weight placed on the question mark when the current token is an interrogative word is 98.5% for softmax, 97.0% for 1.5-entmax, and 99.5% for α-entmax.
Furthermore, we can examine sentences where
some tendentially sparse heads become less so, thus
identifying sources of ambiguity where the head
is less confident in its prediction. An example is shown in Figure 10, where the behavior of the head differs for sentences of similar length.

³ If the cluster has size 1, the score is the weight the token assigns to itself.

Figure 7: Self-attention from the most confidently previous-position head in each model. The learned parameter in the α-entmax model is α = 1.91. Besides being quantitatively more confident, visual inspection confirms that the adaptive head behaves more consistently.

Figure 8: BPE-merging head (α = 1.91) discovered in the α-entmax model. Found in the first encoder layer, this head learns to discover some subword units and combine their information, leaving most words intact. It places 99.09% of its probability mass within the same BPE cluster as the current token: more than any head in any other model.
6 Related Work

Sparse attention. Prior work has developed sparse attention mechanisms, including applications to NMT (Martins and Astudillo, 2016; Malaviya et al., 2018; Niculae and Blondel, 2017; Shao et al., 2019; Maruf et al., 2019). Peters et al. (2019) introduced the entmax function this work builds upon. In their work, there is a single attention mechanism which is controlled by a fixed α. In contrast, this is the first work to allow such attention mappings to dynamically adapt their curvature and sparsity, by automatically adjusting the continuous parameter α. We also provide the first results using sparse attention in a Transformer model.

Figure 9: Interrogation-detecting heads in the three models. The top sentence is interrogative while the bottom one is declarative but includes the interrogative word "what". In the top example, these interrogation heads assign a high probability to the question mark in the time step of the interrogative word (with 97.0% probability), while in the bottom example, since there is no question mark, the same head does not assign a high probability to the last token in the sentence during the interrogative word time step. Surprisingly, this head prefers a low α = 1.05, as can be seen from the dense weights. This allows the head to identify the noun phrase "Armani Polo" better.
Fixed sparsity patterns. Recent research improves the scalability of Transformer-like networks through static, fixed sparsity patterns (Child et al., 2019; Wu et al., 2019). Our adaptively-sparse Transformer can dynamically select a sparsity pattern that finds relevant words regardless of their position (e.g., Figure 1). Moreover, the two strategies could be combined. In a concurrent line of research, Sukhbaatar et al. (2019) propose an adaptive attention span for Transformer language models. While their work has each head learn a different contiguous span of context tokens to attend to, our work finds different sparsity patterns in the same span. Interestingly, some of their findings mirror ours: we found that attention heads in the last layers tend to be denser on average when compared to the ones in the first layers, while their work has found that lower layers tend to have a shorter attention span compared to higher layers.
References

M. Cettolo, M. Federico, L. Bentivogli, J. Niehues, S. Stüker, K. Sudoh, K. Yoshino, and C. Federmann. 2017. Overview of the IWSLT 2017 evaluation campaign. In Proc. IWSLT.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse Transformers. preprint arXiv:1904.10509.

Frank H. Clarke. 1990. Optimization and Nonsmooth Analysis. SIAM.

Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and Alexander Rush. 2018. Latent alignment and variational attention. In Proc. NeurIPS.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL-HLT.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proc. ICML.

Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. 2016. On differentiating parameterized argmax and argmin problems with application to bi-level optimization. preprint arXiv:1607.05447.

Michael Held, Philip Wolfe, and Harlan P. Crowder. 1974. Validation of subgradient optimization. Mathematical Programming, 6(1):62–88.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proc. NAACL-HLT.

Marcin Junczys-Dowmunt, Kenneth Heafield, Hieu Hoang, Roman Grundkiewicz, and Anthony Aue. 2018. Marian: Cost-effective high-quality neural machine translation in C++. In Proc. WNMT.

Jian Li, Zhaopeng Tu, Baosong Yang, Michael R. Lyu, and Tong Zhang. 2018. Multi-Head Attention with Disagreement Regularization. In Proc. EMNLP.

Christos Louizos, Max Welling, and Diederik P. Kingma. 2018. Learning sparse neural networks through L0 regularization. In Proc. ICLR.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proc. EMNLP.

Chaitanya Malaviya, Pedro Ferreira, and André F.T. Martins. 2018. Sparse and constrained attention for neural machine translation. In Proc. ACL.

David Mareček and Rudolf Rosa. 2018. Extracting syntactic trees from Transformer encoder self-attentions. In Proc. BlackboxNLP.

André F.T. Martins and Ramón Fernandez Astudillo. 2016. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proc. of ICML.

Sameen Maruf, André F.T. Martins, and Gholamreza Haffari. 2019. Selective attention for context-aware neural machine translation. preprint arXiv:1903.08788.

Graham Neubig. 2011. The Kyoto free translation task. http://www.phontron.com/kftt.

Vlad Niculae and Mathieu Blondel. 2017. A regularized framework for sparse and structured neural attention. In Proc. NeurIPS.

Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In Proc. WMT.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. ACL.

Ben Peters, Vlad Niculae, and André F.T. Martins. 2019. Sparse sequence-to-sequence models. In Proc. ACL.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. preprint.

Alessandro Raganato and Jörg Tiedemann. 2018. An analysis of encoder representations in Transformer-based machine translation. In Proc. BlackboxNLP.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proc. ACL.

Wenqi Shao, Tianjian Meng, Jingyu Li, Ruimao Zhang, Yudian Li, Xiaogang Wang, and Ping Luo. 2019. SSN: Learning sparse switchable normalization via SparsestMax. In Proc. CVPR.

Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019. Adaptive Attention Span in Transformers. In Proc. ACL.

Gongbo Tang, Mathias Müller, Annette Rios, and Rico Sennrich. 2018. Why self-attention? A targeted evaluation of neural machine translation architectures. In Proc. EMNLP.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proc. ACL.

Constantino Tsallis. 1988. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52:479–487.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. NeurIPS.

Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In Proc. ACL.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proc. ACL.

Felix Wu, Angela Fan, Alexei Baevski, Yann N. Dauphin, and Michael Auli. 2019. Pay less attention with lightweight and dynamic convolutions. In Proc. ICLR.
Supplementary Material

A High-Level Statistics Analysis of Other Language Pairs

Figure 11: Histograms of α values (encoder self-attention, context attention, and decoder self-attention) for (a) WMT 2016 RO→EN, (b) KFTT JA→EN, (c) WMT 2014 EN→DE, and (d) IWSLT 2017 DE→EN.
Figure 12: Histograms of head densities (1.5-entmax vs. α-entmax; encoder self-attention, context attention, and decoder self-attention) for (a) WMT 2016 RO→EN, (b) KFTT JA→EN, (c) WMT 2014 EN→DE, and (d) IWSLT 2017 DE→EN.
Figure 13: Jensen-Shannon divergence over layers (softmax, 1.5-entmax, and α-entmax; encoder self-attention, context attention, and decoder self-attention) for (a) WMT 2016 RO→EN, (b) KFTT JA→EN, (c) WMT 2014 EN→DE, and (d) IWSLT 2017 DE→EN.
Figure 14: Head densities over layers (fixed α = 1.5 vs. learned α; encoder self-attention, context attention, and decoder self-attention) for (a) WMT 2016 RO→EN, (b) KFTT JA→EN, (c) WMT 2014 EN→DE, and (d) IWSLT 2017 DE→EN.
B Background

B.1 Regularized Fenchel-Young prediction functions

Definition 1 (Blondel et al.). Let $\Omega \colon \triangle^d \to \mathbb{R} \cup \{\infty\}$ be a strictly convex regularization function. We define the prediction function $\pi_\Omega$ as
$$
\pi_\Omega(z) = \operatorname*{argmax}_{p \in \triangle^d} \; p^\top z - \Omega(p). \tag{12}
$$
B.2 Characterizing the α-entmax mapping

Lemma 1 (Peters et al., 2019). For any $z$, there exists a unique $\tau^\star$ such that
$$
\alpha\text{-entmax}(z) = \bigl[(\alpha-1)z - \tau^\star \mathbf{1}\bigr]_+^{1/(\alpha-1)}. \tag{13}
$$

Proof: From the definition of α-entmax,
$$
\alpha\text{-entmax}(z) := \operatorname*{argmax}_{p \in \triangle^d} \; p^\top z + H^{\mathrm T}_\alpha(p), \tag{14}
$$
we may easily identify it with a regularized prediction function (Def. 1):
$$
\alpha\text{-entmax}(z) = \pi_{-H^{\mathrm T}_\alpha}(z).
$$
We first note that for all $p \in \triangle^d$,
$$
-(\alpha-1)\, H^{\mathrm T}_\alpha(p) = \frac{1}{\alpha}\sum_{i=1}^{d} p_i^{\alpha} + \mathrm{const}. \tag{15}
$$
From the constant invariance and scaling properties of $\pi_\Omega$ (Blondel et al., Proposition 1, items 4-5),
$$
\pi_{-H^{\mathrm T}_\alpha}(z) = \pi_\Omega\bigl((\alpha-1)z\bigr), \quad\text{with}\quad \Omega(p) = \sum_{j=1}^{d} g(p_j),\; g(t) = \frac{t^\alpha}{\alpha}.
$$
Using (Blondel et al., Proposition 5), noting that $g'(t) = t^{\alpha-1}$ and $(g')^{-1}(u) = u^{1/(\alpha-1)}$, yields
$$
\pi_\Omega(z) = \bigl[z - \tau^\star \mathbf{1}\bigr]_+^{1/(\alpha-1)}, \quad\text{and therefore}\quad \alpha\text{-entmax}(z) = \bigl[(\alpha-1)z - \tau^\star \mathbf{1}\bigr]_+^{1/(\alpha-1)}. \tag{16}
$$
Since $H^{\mathrm T}_\alpha$ is strictly convex on the simplex, α-entmax has a unique solution $p^\star$. Equation 16 defines a one-to-one mapping between $p^\star$ and $\tau^\star$ as long as $p^\star \in \triangle$, therefore $\tau^\star$ is also unique.
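The characterization in Equation 13 also suggests a simple way to evaluate α-entmax numerically: bisect on the threshold τ⋆ until the entries sum to one. The sketch below is didactic (not the authors' implementation) and assumes α > 1:

```python
import numpy as np

def entmax_bisect(z, alpha, n_iter=50):
    """alpha-entmax via bisection on the threshold tau* of Eq. 13 (alpha > 1)."""
    z = np.asarray(z, dtype=float) * (alpha - 1.0)
    tau_lo, tau_hi = z.max() - 1.0, z.max()   # sum >= 1 at tau_lo, sum = 0 at tau_hi
    for _ in range(n_iter):
        tau = 0.5 * (tau_lo + tau_hi)
        p = np.maximum(z - tau, 0.0) ** (1.0 / (alpha - 1.0))
        if p.sum() >= 1.0:
            tau_lo = tau
        else:
            tau_hi = tau
    p = np.maximum(z - 0.5 * (tau_lo + tau_hi), 0.0) ** (1.0 / (alpha - 1.0))
    return p / p.sum()                        # final renormalization for safety
```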
B.3 Connections to softmax and sparsemax

The Euclidean projection onto the simplex, sometimes referred to, in the context of neural attention, as sparsemax (Martins and Astudillo, 2016), is defined as
$$
\mathrm{sparsemax}(z) := \operatorname*{argmin}_{p \in \triangle} \; \|p - z\|_2^2. \tag{17}
$$
The solution can be characterized through the unique threshold $\tau$ such that $\sum_i \mathrm{sparsemax}(z)_i = 1$ and (Held et al., 1974)
$$
\mathrm{sparsemax}(z) = [z - \tau \mathbf{1}]_+. \tag{18}
$$
Thus, each coordinate of the sparsemax solution is a piecewise-linear function. Visibly, this expression is recovered when setting α = 2 in the α-entmax expression (Equation 13); for other values of α, the exponent induces curvature.

On the other hand, the well-known softmax is usually defined through the expression
$$
\mathrm{softmax}(z)_i := \frac{\exp(z_i)}{\sum_j \exp(z_j)}, \tag{19}
$$
which can be shown to be the unique solution of the optimization problem
$$
\mathrm{softmax}(z) = \operatorname*{argmax}_{p \in \triangle} \; p^\top z + H^{\mathrm S}(p), \tag{20}
$$
where $H^{\mathrm S}(p) := -\sum_i p_i \log p_i$ is the Shannon entropy. Indeed, setting the gradient to zero yields the condition $\log p_i = z_i - \nu + \mu_i - 1$, where $\nu$ and $\mu \ge 0$ are Lagrange multipliers for the simplex constraints $\sum_i p_i = 1$ and $p_i \ge 0$, respectively. Since the l.h.s. is only finite for $p_i > 0$, we must have $\mu_i = 0$ for all $i$, by complementary slackness. Thus, the solution must have the form $p_i = \exp(z_i)/Z$, yielding Equation 19.
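For completeness, the threshold τ in Equation 18 can be found in closed form by sorting; the following is a standard sketch of that projection (not taken from the paper):

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection onto the simplex, Eqs. 17-18: [z - tau]_+."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    # largest k such that 1 + k * z_sorted[k-1] > cumsum[k-1]
    support = k[1.0 + k * z_sorted > cumsum]
    k_star = support[-1]
    tau = (cumsum[k_star - 1] - 1.0) / k_star
    return np.maximum(z - tau, 0.0)
```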
C Jacobian of α-entmax w.r.t. the shape parameter α: Proof of Proposition 1

Recall that the entmax transformation is defined as:
$$
\alpha\text{-entmax}(z) := \operatorname*{argmax}_{p \in \triangle^d} \; p^\top z + H^{\mathrm T}_\alpha(p), \tag{21}
$$
where $\alpha \ge 1$ and $H^{\mathrm T}_\alpha$ is the Tsallis entropy,
$$
H^{\mathrm T}_\alpha(p) :=
\begin{cases}
\frac{1}{\alpha(\alpha-1)} \sum_j \bigl(p_j - p_j^\alpha\bigr), & \alpha \ne 1,\\[1ex]
H^{\mathrm S}(p), & \alpha = 1,
\end{cases} \tag{22}
$$
and $H^{\mathrm S}(p) := -\sum_j p_j \log p_j$ is the Shannon entropy.

In this section, we derive the Jacobian of entmax with respect to the scalar parameter α.
C.1 General case of α > 1

From the KKT conditions associated with the optimization problem in Eq. 21, we have that the solution $p^\star$ has the following form, coordinate-wise:
$$
p^\star_i = \bigl[(\alpha-1)(z_i - \tau^\star)\bigr]_+^{1/(\alpha-1)}, \tag{23}
$$
where $\tau^\star$ is a scalar Lagrange multiplier that ensures that $p^\star$ normalizes to 1, i.e., it is defined implicitly by the condition:
$$
\sum_i \bigl[(\alpha-1)(z_i - \tau^\star)\bigr]_+^{1/(\alpha-1)} = 1. \tag{24}
$$
For general values of α, Eq. 24 lacks a closed-form solution. This makes the computation of the Jacobian
$$
\frac{\partial\, \alpha\text{-entmax}(z)}{\partial \alpha} \tag{25}
$$
non-trivial. Fortunately, we can use the technique of implicit differentiation to obtain this Jacobian. The Jacobian exists almost everywhere, and the expressions we derive yield a generalized Jacobian (Clarke, 1990) at any non-differentiable points that may occur for certain (α, z) pairs. We begin by noting that $\frac{\partial p^\star_i}{\partial \alpha} = 0$ if $p^\star_i = 0$, because increasing α keeps sparse coordinates sparse.⁴ Therefore we need to worry only about coordinates that are in the support of $p^\star$. We will assume hereafter that the $i$-th coordinate of $p^\star$ is non-zero. We have:
$$
\begin{aligned}
\frac{\partial p^\star_i}{\partial \alpha}
&= \frac{\partial}{\partial \alpha} \bigl[(\alpha-1)(z_i - \tau^\star)\bigr]^{\frac{1}{\alpha-1}}\\
&= \frac{\partial}{\partial \alpha} \exp\!\left[\frac{1}{\alpha-1} \log\bigl[(\alpha-1)(z_i - \tau^\star)\bigr]\right]\\
&= p^\star_i \, \frac{\partial}{\partial \alpha} \left[\frac{1}{\alpha-1} \log\bigl[(\alpha-1)(z_i - \tau^\star)\bigr]\right]\\
&= \frac{p^\star_i}{(\alpha-1)^2} \left[\frac{\frac{\partial}{\partial \alpha}\bigl[(\alpha-1)(z_i - \tau^\star)\bigr]}{z_i - \tau^\star} - \log\bigl[(\alpha-1)(z_i - \tau^\star)\bigr]\right]\\
&= \frac{p^\star_i}{(\alpha-1)^2} \left[\frac{z_i - \tau^\star - (\alpha-1)\frac{\partial \tau^\star}{\partial \alpha}}{z_i - \tau^\star} - \log\bigl[(\alpha-1)(z_i - \tau^\star)\bigr]\right]\\
&= \frac{p^\star_i}{(\alpha-1)^2} \left[1 - \frac{\alpha-1}{z_i - \tau^\star}\,\frac{\partial \tau^\star}{\partial \alpha} - \log\bigl[(\alpha-1)(z_i - \tau^\star)\bigr]\right].
\end{aligned} \tag{26}
$$
We can see that this Jacobian depends on $\frac{\partial \tau^\star}{\partial \alpha}$, which we now compute using implicit differentiation.

Let $S = \{i \colon p^\star_i > 0\}$. By differentiating both sides of Eq. 24, re-using some of the steps in Eq. 26, and recalling Eq. 23, we get
$$
\begin{aligned}
0 &= \sum_{i \in S} \frac{\partial}{\partial \alpha} \bigl[(\alpha-1)(z_i - \tau^\star)\bigr]^{1/(\alpha-1)}\\
&= \sum_{i \in S} \frac{p^\star_i}{(\alpha-1)^2} \left[1 - \frac{\alpha-1}{z_i - \tau^\star}\,\frac{\partial \tau^\star}{\partial \alpha} - \log\bigl[(\alpha-1)(z_i - \tau^\star)\bigr]\right]\\
&= \frac{1}{(\alpha-1)^2} - \frac{\partial \tau^\star}{\partial \alpha} \sum_{i \in S} \frac{(\alpha-1)\, p^\star_i}{(\alpha-1)^2 (z_i - \tau^\star)} - \sum_{i \in S} \frac{p^\star_i}{(\alpha-1)^2} \log\bigl[(\alpha-1)(z_i - \tau^\star)\bigr]\\
&= \frac{1}{(\alpha-1)^2} - \frac{\partial \tau^\star}{\partial \alpha} \sum_{i} (p^\star_i)^{2-\alpha} - \sum_{i} \frac{p^\star_i}{\alpha-1} \log p^\star_i\\
&= \frac{1}{(\alpha-1)^2} - \frac{\partial \tau^\star}{\partial \alpha} \sum_{i} (p^\star_i)^{2-\alpha} + \frac{H^{\mathrm S}(p^\star)}{\alpha-1},
\end{aligned} \tag{27}
$$
from which we obtain:
$$
\frac{\partial \tau^\star}{\partial \alpha} = \frac{\frac{1}{(\alpha-1)^2} + \frac{H^{\mathrm S}(p^\star)}{\alpha-1}}{\sum_i (p^\star_i)^{2-\alpha}}. \tag{28}
$$
Finally, plugging Eq. 28 into Eq. 26, we get:
$$
\begin{aligned}
\frac{\partial p^\star_i}{\partial \alpha}
&= \frac{p^\star_i}{(\alpha-1)^2} \left[1 - (\alpha-1)^2 (p^\star_i)^{1-\alpha}\, \frac{\partial \tau^\star}{\partial \alpha} - (\alpha-1)\log p^\star_i\right]\\
&= \frac{p^\star_i}{(\alpha-1)^2} \left[1 - \frac{(p^\star_i)^{1-\alpha}\bigl(1 + (\alpha-1) H^{\mathrm S}(p^\star)\bigr)}{\sum_j (p^\star_j)^{2-\alpha}} - (\alpha-1)\log p^\star_i\right]\\
&= \frac{p^\star_i - \tilde{p}_i(\alpha)}{(\alpha-1)^2} - \frac{p^\star_i \log p^\star_i + \tilde{p}_i(\alpha)\, H^{\mathrm S}(p^\star)}{\alpha-1},
\end{aligned} \tag{29}
$$
where we denote by
$$
\tilde{p}_i(\alpha) = \frac{(p^\star_i)^{2-\alpha}}{\sum_j (p^\star_j)^{2-\alpha}}. \tag{30}
$$
The distribution $\tilde{p}(\alpha)$ can be interpreted as a skewed distribution obtained from $p^\star$, which appears in the Jacobian of α-entmax(z) w.r.t. $z$ as well (Peters et al., 2019).

⁴ This follows from the margin property of $H^{\mathrm T}_\alpha$ (Blondel et al.).
C.2 Solving the indetermination for α = 1

We can write Eq. 29 as
$$
\frac{\partial p^\star_i}{\partial \alpha} = \frac{p^\star_i - \tilde{p}_i(\alpha) - (\alpha-1)\bigl(p^\star_i \log p^\star_i + \tilde{p}_i(\alpha)\, H^{\mathrm S}(p^\star)\bigr)}{(\alpha-1)^2}. \tag{31}
$$
When $\alpha \to 1^+$, we have $\tilde{p}(\alpha) \to p^\star$, which leads to a $\tfrac{0}{0}$ indetermination.

To solve this indetermination, we will need to apply L'Hôpital's rule twice. Let us first compute the derivative of $\tilde{p}_i(\alpha)$ with respect to α. We have
$$
\frac{\partial}{\partial \alpha} (p^\star_i)^{2-\alpha} = -(p^\star_i)^{2-\alpha} \log p^\star_i, \tag{32}
$$
therefore
$$
\begin{aligned}
\frac{\partial}{\partial \alpha}\, \tilde{p}_i(\alpha)
&= \frac{\partial}{\partial \alpha}\, \frac{(p^\star_i)^{2-\alpha}}{\sum_j (p^\star_j)^{2-\alpha}}\\
&= \frac{-(p^\star_i)^{2-\alpha} \log p^\star_i \sum_j (p^\star_j)^{2-\alpha} + (p^\star_i)^{2-\alpha} \sum_j (p^\star_j)^{2-\alpha} \log p^\star_j}{\bigl(\sum_j (p^\star_j)^{2-\alpha}\bigr)^2}\\
&= -\tilde{p}_i(\alpha) \log p^\star_i + \tilde{p}_i(\alpha) \sum_j \tilde{p}_j(\alpha) \log p^\star_j. 
\end{aligned} \tag{33}
$$
Differentiating the numerator and denominator in Eq. 31, we get:
$$
\frac{\partial p^\star_i}{\partial \alpha}
= \lim_{\alpha \to 1^+} \frac{\bigl(1 + (\alpha-1) H^{\mathrm S}(p^\star)\bigr)\, \tilde{p}_i(\alpha)\bigl(\log p^\star_i - \sum_j \tilde{p}_j(\alpha) \log p^\star_j\bigr) - p^\star_i \log p^\star_i - \tilde{p}_i(\alpha)\, H^{\mathrm S}(p^\star)}{2(\alpha-1)}
= A + B, \tag{34}
$$
with
$$
A = \lim_{\alpha \to 1^+} \frac{H^{\mathrm S}(p^\star)\, \tilde{p}_i(\alpha)\bigl(\log p^\star_i - \sum_j \tilde{p}_j(\alpha)\log p^\star_j\bigr)}{2}
= \frac{H^{\mathrm S}(p^\star)\, p^\star_i \log p^\star_i + p^\star_i \bigl(H^{\mathrm S}(p^\star)\bigr)^2}{2}, \tag{35}
$$
and
$$
B = \lim_{\alpha \to 1^+} \frac{\tilde{p}_i(\alpha)\bigl(\log p^\star_i - \sum_j \tilde{p}_j(\alpha)\log p^\star_j\bigr) - p^\star_i \log p^\star_i - \tilde{p}_i(\alpha)\, H^{\mathrm S}(p^\star)}{2(\alpha-1)}. \tag{36}
$$
When $\alpha \to 1^+$, $B$ becomes again a $\tfrac{0}{0}$ indetermination, which we can solve by applying L'Hôpital's rule again. Differentiating the numerator and denominator in Eq. 36:
$$
\begin{aligned}
B &= \frac{1}{2} \lim_{\alpha \to 1^+} \Biggl\{
\tilde{p}_i(\alpha) \log p^\star_i \Bigl(\sum_j \tilde{p}_j(\alpha) \log p^\star_j - \log p^\star_i\Bigr)
- \tilde{p}_i(\alpha) \Bigl(\sum_j \tilde{p}_j(\alpha) \log p^\star_j - \log p^\star_i\Bigr)\Bigl(\sum_j \tilde{p}_j(\alpha) \log p^\star_j + H^{\mathrm S}(p^\star)\Bigr)\\
&\qquad\qquad - \tilde{p}_i(\alpha) \sum_j \tilde{p}_j(\alpha) \log p^\star_j \Bigl(\sum_k \tilde{p}_k(\alpha) \log p^\star_k - \log p^\star_j\Bigr)
\Biggr\}\\
&= \frac{-p^\star_i \log p^\star_i \bigl(H^{\mathrm S}(p^\star) + \log p^\star_i\bigr) + p^\star_i \sum_j p^\star_j \log p^\star_j \bigl(H^{\mathrm S}(p^\star) + \log p^\star_j\bigr)}{2}\\
&= \frac{-H^{\mathrm S}(p^\star)\, p^\star_i \log p^\star_i - p^\star_i \bigl(H^{\mathrm S}(p^\star)\bigr)^2 - p^\star_i \log^2 p^\star_i + p^\star_i \sum_j p^\star_j \log^2 p^\star_j}{2}. 
\end{aligned} \tag{37}
$$
Finally, summing Eq. 35 and Eq. 37, we get
$$
\left.\frac{\partial p^\star_i}{\partial \alpha}\right|_{\alpha=1} = \frac{-p^\star_i \log^2 p^\star_i + p^\star_i \sum_j p^\star_j \log^2 p^\star_j}{2}. \tag{38}
$$

C.3 Summary

To sum up, we have the following expression for the Jacobian of α-entmax with respect to α:
$$
\frac{\partial p^\star_i}{\partial \alpha} =
\begin{cases}
\dfrac{p^\star_i - \tilde{p}_i(\alpha)}{(\alpha-1)^2} - \dfrac{p^\star_i \log p^\star_i + \tilde{p}_i(\alpha)\, H^{\mathrm S}(p^\star)}{\alpha-1}, & \text{for } \alpha > 1,\\[2ex]
\dfrac{-p^\star_i \log^2 p^\star_i + p^\star_i \sum_j p^\star_j \log^2 p^\star_j}{2}, & \text{for } \alpha = 1.
\end{cases} \tag{39}
$$
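As a sanity check on Equation 39 (a small self-contained script of ours, not part of the paper), the closed-form derivative for α > 1 can be compared against a central finite difference computed with a bisection-based α-entmax:

```python
import numpy as np

def entmax_bisect(z, alpha, n_iter=60):
    """alpha-entmax via bisection on the threshold of Eq. 13 (alpha > 1)."""
    z = np.asarray(z, float) * (alpha - 1.0)
    lo, hi = z.max() - 1.0, z.max()
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        p = np.maximum(z - tau, 0.0) ** (1.0 / (alpha - 1.0))
        lo, hi = (tau, hi) if p.sum() >= 1.0 else (lo, tau)
    p = np.maximum(z - 0.5 * (lo + hi), 0.0) ** (1.0 / (alpha - 1.0))
    return p / p.sum()

def grad_alpha(p, alpha, eps=1e-12):
    """Closed-form d p*_i / d alpha from Eq. 39 (alpha > 1 branch)."""
    logp = np.log(np.clip(p, eps, None)) * (p > 0)
    H = -np.sum(p * logp)                              # Shannon entropy of p*
    p_tilde = np.where(p > 0, p ** (2.0 - alpha), 0.0)
    p_tilde /= p_tilde.sum()
    return (p - p_tilde) / (alpha - 1.0) ** 2 - (p * logp + p_tilde * H) / (alpha - 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z, alpha, d = rng.normal(size=8), 1.5, 1e-4
    p = entmax_bisect(z, alpha)
    numeric = (entmax_bisect(z, alpha + d) - entmax_bisect(z, alpha - d)) / (2 * d)
    print(np.max(np.abs(grad_alpha(p, alpha) - numeric)))   # should be small
```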