From Softmax to Sparsemax:
A Sparse Model of Attention and Multi-Label Classification

André F. T. Martins†‡    ANDRE.MARTINS@UNBABEL.COM
Ramón F. Astudillo†§    RAMON@UNBABEL.COM

†Unbabel Lda, Rua Visconde de Santarém, 67-B, 1000-286 Lisboa, Portugal
‡Instituto de Telecomunicações (IT), Instituto Superior Técnico, Av. Rovisco Pais, 1, 1049-001 Lisboa, Portugal
§Instituto de Engenharia de Sistemas e Computadores (INESC-ID), Rua Alves Redol, 9, 1000-029 Lisboa, Portugal
Abstract
We propose sparsemax, a new activation function similar to the traditional softmax, but able to output sparse probabilities. After deriving its properties, we show how its Jacobian can be efficiently computed, enabling its use in a network trained with backpropagation. Then, we propose a new smooth and convex loss function which is the sparsemax analogue of the logistic loss. We reveal an unexpected connection between this new loss and the Huber classification loss. We obtain promising empirical results in multi-label classification problems and in attention-based neural networks for natural language inference. For the latter, we achieve a similar performance as the traditional softmax, but with a selective, more compact, attention focus.
1. Introduction
The softmax transformation is a key component of several statistical learning models, encompassing multinomial logistic regression (McCullagh & Nelder, 1989), action selection in reinforcement learning (Sutton & Barto, 1998), and neural networks for multi-class classification (Bridle, 1990; Goodfellow et al., 2016). Recently, it has also been used to design attention mechanisms in neural networks, with important achievements in machine translation (Bahdanau et al., 2015), image caption generation (Xu et al., 2015), speech recognition (Chorowski et al., 2015), memory networks (Sukhbaatar et al., 2015), and various tasks in natural language understanding (Hermann et al., 2015; Rocktäschel et al., 2015; Rush et al., 2015) and computation learning (Graves et al., 2014; Grefenstette et al., 2015).
There are a number of reasons why the softmax transformation is so appealing. It is simple to evaluate and differentiate, and it can be turned into the (convex) negative log-likelihood loss function by taking the logarithm of its output. Alternatives proposed in the literature, such as the Bradley-Terry model (Bradley & Terry, 1952; Zadrozny, 2001; Menke & Martinez, 2008), the multinomial probit (Albert & Chib, 1993), the spherical softmax (Ollivier, 2013; Vincent, 2015; de Brébisson & Vincent, 2015), or softmax approximations (Bouchard, 2007), while theoretically or computationally advantageous for certain scenarios, lack some of the convenient properties of softmax.
In this paper, we propose the sparsemax transformation. Sparsemax has the distinctive feature that it can return sparse posterior distributions, that is, it may assign exactly zero probability to some of its output variables. This property makes it appealing as a filter for large output spaces, to predict multiple labels, or as a component to identify which of a group of variables are potentially relevant for a decision, making the model more interpretable. Crucially, this is done while preserving most of the attractive properties of softmax: we show that sparsemax is also simple to evaluate, that it is even cheaper to differentiate, and that it can be turned into a convex loss function.
To sum up, our contributions are as follows:

- We formalize the new sparsemax transformation, derive its properties, and show how it can be efficiently computed (§2.1–2.3). We show that in the binary case sparsemax reduces to a hard sigmoid (§2.4).

- We derive the Jacobian of sparsemax, comparing it to the softmax case, and show that it can lead to faster gradient backpropagation (§2.5).

- We propose the sparsemax loss, a new loss function that is the sparsemax analogue of logistic regression (§3). We show that it is convex, everywhere differentiable, and can be regarded as a multi-class generalization of the Huber classification loss, an important tool in robust statistics (Huber, 1964; Zhang, 2004).

- We apply the sparsemax loss to train multi-label linear classifiers (which predict a set of labels instead of a single label) on benchmark datasets (§4.1–4.2).

- Finally, we devise a neural selective attention mechanism using the sparsemax transformation, evaluating its performance on a natural language inference problem, with encouraging results (§4.3).
2. The Sparsemax Transformation
2.1. Definition

Let Δ^{K−1} := {p ∈ R^K | 1^⊤ p = 1, p ≥ 0} be the (K−1)-dimensional simplex. We are interested in functions that map vectors in R^K to probability distributions in Δ^{K−1}. Such functions are useful for converting a vector of real weights (e.g., label scores) to a probability distribution (e.g., posterior probabilities of labels). The classical example is the softmax function, defined componentwise as:

\mathrm{softmax}_i(z) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}.   (1)
A limitation of the softmax transformation is that the resulting probability distribution always has full support, i.e., softmax_i(z) ≠ 0 for every z and i. This is a disadvantage in applications where a sparse probability distribution is desired, in which case it is common to define a threshold below which small probability values are truncated to zero. In this paper, we propose as an alternative the following transformation, which we call sparsemax:

\mathrm{sparsemax}(z) := \operatorname*{argmin}_{p \in \Delta^{K-1}} \|p - z\|^2.   (2)

In words, sparsemax returns the Euclidean projection of the input vector z onto the probability simplex. This projection is likely to hit the boundary of the simplex, in which case sparsemax(z) becomes sparse. We will see that sparsemax retains most of the important properties of softmax, having in addition the ability of producing sparse distributions.
2.2. Closed-Form Solution
Projecting onto the simplex is a well-studied problem, for which linear-time algorithms are available (Michelot, 1986; Pardalos & Kovoor, 1990; Duchi et al., 2008). We start by recalling the well-known result that such projections correspond to a soft-thresholding operation. Below, we use the notation [K] := {1, ..., K} and [t]_+ := max{0, t}.
Proposition 1. The solution of Eq. 2 is of the form:

\mathrm{sparsemax}_i(z) = [z_i - \tau(z)]_+,   (3)

where τ : R^K → R is the (unique) function that satisfies Σ_j [z_j − τ(z)]_+ = 1 for every z. Furthermore, τ can be expressed as follows. Let z_(1) ≥ z_(2) ≥ ... ≥ z_(K) be the sorted coordinates of z, and define k(z) := max{ k ∈ [K] | 1 + k z_(k) > Σ_{j ≤ k} z_(j) }. Then,

\tau(z) = \frac{\sum_{j \le k(z)} z_{(j)} - 1}{k(z)} = \frac{\sum_{j \in S(z)} z_j - 1}{|S(z)|},   (4)

where S(z) := {j ∈ [K] | sparsemax_j(z) > 0} is the support of sparsemax(z).

Algorithm 1 Sparsemax Evaluation
  Input: z
  Sort z as z_(1) ≥ ... ≥ z_(K)
  Find k(z) := max{ k ∈ [K] | 1 + k z_(k) > Σ_{j ≤ k} z_(j) }
  Define τ(z) = (Σ_{j ≤ k(z)} z_(j) − 1) / k(z)
  Output: p such that p_i = [z_i − τ(z)]_+.
Proof: See App. A.1 in the supplemental material.
In essence, Prop. 1 states that all we need to evaluate the sparsemax transformation is to compute the threshold τ(z); all coordinates above this threshold (the ones in the set S(z)) will be shifted by this amount, and the others will be truncated to zero. We call τ in Eq. 4 the threshold function. This piecewise linear function will play an important role in the sequel. Alg. 1 illustrates a naïve O(K log K) algorithm that uses Prop. 1 for evaluating the sparsemax (more elaborate O(K) algorithms exist, based on linear-time selection; Blum et al., 1973; Pardalos & Kovoor, 1990).
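As a concrete companion to Alg. 1, the following NumPy sketch evaluates sparsemax via the sort-based O(K log K) procedure; the function name and the use of NumPy are our own choices, not part of the paper.

import numpy as np

def sparsemax(z):
    """Evaluate sparsemax(z) for a 1-D score vector z (Alg. 1, Eqs. 3-4)."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                 # z_(1) >= ... >= z_(K)
    k_range = np.arange(1, z.size + 1)          # candidate support sizes k = 1, ..., K
    cumsum = np.cumsum(z_sorted)                # running sums: sum_{j<=k} z_(j)
    feasible = 1 + k_range * z_sorted > cumsum  # condition defining k(z)
    k_z = k_range[feasible][-1]                 # k(z): the largest feasible k
    tau = (cumsum[k_z - 1] - 1.0) / k_z         # threshold tau(z), Eq. 4
    return np.maximum(z - tau, 0.0)             # p_i = [z_i - tau(z)]_+

For example, sparsemax([0.5, 1.2, -3.0]) returns approximately (0.15, 0.85, 0.0): the output sums to one and the low-scoring coordinate is exactly zero.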
2.3. Basic Properties
We now highlight some properties that are common to softmax and sparsemax. Let z_(1) := max_k z_k, and denote by A(z) := {k ∈ [K] | z_k = z_(1)} the set of maximal components of z. We define the indicator vector 1_{A(z)}, whose kth component is 1 if k ∈ A(z), and 0 otherwise. We further denote by γ(z) := z_(1) − max_{k ∉ A(z)} z_k the gap between the maximal components of z and the second largest. We let 0 and 1 be vectors of zeros and ones, respectively.

Proposition 2. The following properties hold for ρ ∈ {softmax, sparsemax}.

1. ρ(0) = (1/K) 1 and lim_{ε → 0^+} ρ(ε^{−1} z) = 1_{A(z)} / |A(z)| (uniform distribution, and distribution peaked on the maximal components of z, respectively). For sparsemax, the last equality holds for any ε ≤ γ(z) |A(z)|.

2. ρ(z) = ρ(z + c1), for any c ∈ R (i.e., ρ is invariant to adding a constant to each coordinate).

3. ρ(Pz) = P ρ(z) for any permutation matrix P (i.e., ρ commutes with permutations).

4. If z_i ≤ z_j, then 0 ≤ ρ_j(z) − ρ_i(z) ≤ η (z_j − z_i), where η = 1/2 for softmax, and η = 1 for sparsemax.

Proof: See App. A.2 in the supplemental material.

Figure 1. Comparison of softmax and sparsemax in 2D (left) and 3D (two rightmost plots): softmax_1([t, 0]) and sparsemax_1([t, 0]) as functions of t (left); sparsemax_1([t_1, t_2, 0]) and softmax_1([t_1, t_2, 0]) as functions of (t_1, t_2) (right).
Interpreting ε as a "temperature parameter," the first part of Prop. 2 shows that sparsemax has the same "zero-temperature limit" behaviour as softmax, but without the need to make the temperature arbitrarily small.

Prop. 2 is reassuring, since it shows that the sparsemax transformation, despite being defined very differently from the softmax, has a similar behaviour and preserves the same invariances. Note that some of these properties are not satisfied by other proposed replacements of the softmax: for example, the spherical softmax (Ollivier, 2013), defined as ρ_i(z) := z_i^2 / Σ_j z_j^2, does not satisfy properties 2 and 4.
2.4. Two- and Three-Dimensional Cases

For the two-class case, it is well known that the softmax activation becomes the logistic (sigmoid) function. More precisely, if z = (t, 0), then softmax_1(z) = σ(t) := (1 + exp(−t))^{−1}. We next show that the sparsemax analogue is the "hard" version of the sigmoid. In fact, using Prop. 1 and Eq. 4, we have that, for z = (t, 0),

\tau(z) = \begin{cases} t - 1, & \text{if } t > 1, \\ (t - 1)/2, & \text{if } -1 \le t \le 1, \\ -1, & \text{if } t < -1, \end{cases}   (5)

and therefore

\mathrm{sparsemax}_1(z) = \begin{cases} 1, & \text{if } t > 1, \\ (t + 1)/2, & \text{if } -1 \le t \le 1, \\ 0, & \text{if } t < -1. \end{cases}   (6)

Fig. 1 provides an illustration for the two- and three-dimensional cases. For the latter, we parameterize z = (t_1, t_2, 0) and plot softmax_1(z) and sparsemax_1(z) as a function of t_1 and t_2. We can see that sparsemax is piecewise linear, but asymptotically similar to the softmax.
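Eq. 6 can be cross-checked numerically against the generic projection. The snippet below compares the closed-form hard sigmoid with the sparsemax() sketch given in §2.2; it is only a sanity check under those assumptions.

import numpy as np

def hard_sigmoid(t):
    """Closed form of sparsemax_1([t, 0]) from Eq. 6."""
    return float(np.clip((t + 1.0) / 2.0, 0.0, 1.0))

# The closed form and the generic simplex projection agree up to floating-point error.
for t in np.linspace(-3.0, 3.0, 13):
    assert abs(sparsemax(np.array([t, 0.0]))[0] - hard_sigmoid(t)) < 1e-12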
2.5. Jacobian of Sparsemax
The Jacobian matrix of a transformation ρ, J_ρ(z) := [∂ρ_i(z)/∂z_j]_{i,j}, is of key importance to train models with gradient-based optimization. We next derive the Jacobian of the sparsemax activation, but before doing so, let us recall what the Jacobian of the softmax looks like. We have

\frac{\partial\,\mathrm{softmax}_i(z)}{\partial z_j} = \frac{\delta_{ij}\, e^{z_i} \sum_k e^{z_k} - e^{z_i} e^{z_j}}{\left(\sum_k e^{z_k}\right)^2} = \mathrm{softmax}_i(z)\,(\delta_{ij} - \mathrm{softmax}_j(z)),   (7)

where δ_ij is the Kronecker delta, which evaluates to 1 if i = j and 0 otherwise. Letting p = softmax(z), the full Jacobian can be written in matrix notation as

J_{\mathrm{softmax}}(z) = \mathrm{Diag}(p) - p p^\top,   (8)

where Diag(p) is a matrix with p in the main diagonal.
Let us now turn to the sparsemax case. The first thing to note is that sparsemax is differentiable everywhere except at splitting points z where the support set S(z) changes, i.e., where S(z) ≠ S(z + εd) for some d and infinitesimal ε.² From Eq. 3, we have:

\frac{\partial\,\mathrm{sparsemax}_i(z)}{\partial z_j} = \begin{cases} \delta_{ij} - \frac{\partial \tau(z)}{\partial z_j}, & \text{if } z_i > \tau(z), \\ 0, & \text{if } z_i \le \tau(z). \end{cases}   (9)

It remains to compute the gradient of the threshold function τ. From Eq. 4, we have:

\frac{\partial \tau(z)}{\partial z_j} = \begin{cases} \frac{1}{|S(z)|}, & \text{if } j \in S(z), \\ 0, & \text{if } j \notin S(z). \end{cases}   (10)

Note that j ∈ S(z) ⇔ z_j > τ(z). Therefore we obtain:

\frac{\partial\,\mathrm{sparsemax}_i(z)}{\partial z_j} = \begin{cases} \delta_{ij} - \frac{1}{|S(z)|}, & \text{if } i, j \in S(z), \\ 0, & \text{otherwise.} \end{cases}   (11)
²For those points, we can take an arbitrary matrix in the set of generalized Clarke's Jacobians (Clarke, 1983), the convex hull of all points of the form lim_{t→∞} J_sparsemax(z_t), where z_t → z.

Let s be an indicator vector whose ith entry is 1 if i ∈ S(z), and 0 otherwise. We can write the Jacobian matrix as

J_{\mathrm{sparsemax}}(z) = \mathrm{Diag}(s) - s s^\top / |S(z)|.   (12)

It is instructive to compare Eqs. 8 and 12. We may regard the Jacobian of sparsemax as the Laplacian of a graph whose elements of S(z) are fully connected. To compute it, we only need S(z), which can be obtained in O(K) time with the same algorithm that evaluates the sparsemax.

Often, e.g., in the gradient backpropagation algorithm, it is not necessary to compute the full Jacobian matrix, but only the product between the Jacobian and a given vector v. In the softmax case, from Eq. 8, we have:

J_{\mathrm{softmax}}(z)\, v = p \odot (v - \bar{v}\mathbf{1}), \quad \text{with } \bar{v} := \sum_j p_j v_j,   (13)

where ⊙ denotes the Hadamard product; this requires a linear-time computation. For the sparsemax case, we have:

J_{\mathrm{sparsemax}}(z)\, v = s \odot (v - \hat{v}\mathbf{1}), \quad \text{with } \hat{v} := \frac{\sum_{j \in S(z)} v_j}{|S(z)|}.   (14)

Interestingly, if sparsemax(z) has already been evaluated (i.e., in the forward step), then so has S(z), hence the nonzeros of J_sparsemax(z) v can be computed in only O(|S(z)|) time, which can be sublinear. This can be an important advantage of sparsemax over softmax if K is large.
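The sublinear backward computation of Eq. 14 is easy to realize once the forward output is cached. Below is a minimal NumPy sketch; sparsemax_jvp is our own helper name, and it assumes p = sparsemax(z) was computed in the forward pass.

import numpy as np

def sparsemax_jvp(p, v):
    """Product of the sparsemax Jacobian (Eq. 12) with a vector v, using only the
    forward output p = sparsemax(z); implements Eq. 14."""
    s = p > 0                                   # support indicator for S(z)
    v_hat = v[s].mean()                         # (sum_{j in S(z)} v_j) / |S(z)|
    out = np.zeros_like(v, dtype=float)
    out[s] = v[s] - v_hat                       # nonzeros live only on the support
    return out

Since the Jacobian in Eq. 12 is symmetric, the same routine also gives the vector-Jacobian product needed by backpropagation.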
3. A Loss Function for Sparsemax
Now that we have defined the sparsemax transformation and established its main properties, we show how to use this transformation to design a new loss function that resembles the logistic loss, but can yield sparse posterior distributions. Later (in §4.1–4.2), we apply this loss to label proportion estimation and multi-label classification.
3.1. Logistic Loss
Consider a dataset D := {(x_i, y_i)}_{i=1}^{N}, where each x_i ∈ R^D is an input vector and each y_i ∈ {1, ..., K} is a target output label. We consider regularized empirical risk minimization problems of the form

\min_{W \in \mathbb{R}^{K \times D},\; b \in \mathbb{R}^{K}} \;\; \frac{\lambda}{2}\|W\|_F^2 + \frac{1}{N}\sum_{i=1}^{N} L(W x_i + b;\, y_i),   (15)

where L is a loss function, W is a matrix of weights, and b is a bias vector. The loss function associated with the softmax is the logistic loss (or negative log-likelihood):

L_{\mathrm{softmax}}(z; k) = -\log \mathrm{softmax}_k(z) = -z_k + \log \sum_j \exp(z_j),   (16)

where z = W x_i + b, and k = y_i is the "gold" label. The gradient of this loss is, invoking Eq. 7,

\nabla_z L_{\mathrm{softmax}}(z; k) = -\delta_k + \mathrm{softmax}(z),   (17)

where δ_k denotes the delta distribution on k, [δ_k]_j = 1 if j = k, and 0 otherwise. This is a well-known result; when plugged into a gradient-based optimizer, it leads to updates that move probability mass from the distribution predicted by the current model (i.e., softmax(z)) to the gold label (via δ_k). Can we have something similar for sparsemax?
3.2. Sparsemax Loss
A nice aspect of the log-likelihood (Eq. 16) is that, adding up loss terms for several examples, assumed i.i.d., we obtain the log-probability of the full training data. Unfortunately, this idea cannot be carried over to sparsemax: now, some labels may have exactly probability zero, so any model that assigns zero probability to a gold label would zero out the probability of the entire training sample. This is of course highly undesirable. One possible workaround is to define

L^{\epsilon}_{\mathrm{sparsemax}}(z; k) = -\log \frac{\epsilon + \mathrm{sparsemax}_k(z)}{1 + \epsilon K},   (18)

where ε is a small constant, and (ε + sparsemax_k(z)) / (1 + εK) is a "perturbed" sparsemax. However, this loss is non-convex, unlike the one in Eq. 16.

Another possibility, which we explore here, is to construct an alternative loss function whose gradient resembles the one in Eq. 17. Note that the gradient is particularly important, since it is directly involved in the model updates for typical optimization algorithms. Formally, we want L_sparsemax to be a differentiable function such that

\nabla_z L_{\mathrm{sparsemax}}(z; k) = -\delta_k + \mathrm{sparsemax}(z).   (19)

We show below that this property is fulfilled by the following function, henceforth called the sparsemax loss:

L_{\mathrm{sparsemax}}(z; k) = -z_k + \frac{1}{2}\sum_{j \in S(z)} \left(z_j^2 - \tau^2(z)\right) + \frac{1}{2},   (20)

where τ² is the square of the threshold function in Eq. 4. This loss, which has never been considered in the literature to the best of our knowledge, has a number of interesting properties, stated in the next proposition.
Proposition 3. The following holds:

1. L_sparsemax is differentiable everywhere, and its gradient is given by the expression in Eq. 19.

2. L_sparsemax is convex.

3. L_sparsemax(z + c1; k) = L_sparsemax(z; k), for any c ∈ R.

4. L_sparsemax(z; k) ≥ 0, for all z and k.

5. The following statements are equivalent: (i) L_sparsemax(z; k) = 0; (ii) sparsemax(z) = δ_k; (iii) margin separation holds, z_k ≥ 1 + max_{j≠k} z_j.
Proof: See App. A.3 in the supplemental material.
Note that the first four properties in Prop. 3 are also satisfied by the logistic loss, except that the gradient is given by Eq. 17. The fifth property is particularly interesting, since it is satisfied by the hinge loss of support vector machines. However, unlike the hinge loss, L_sparsemax is everywhere differentiable, hence amenable to smooth optimization methods such as L-BFGS or accelerated gradient descent (Liu & Nocedal, 1989; Nesterov, 1983).
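The loss in Eq. 20 and its gradient in Eq. 19 are straightforward to implement on top of the forward pass. The sketch below reuses the sparsemax() function from §2.2; the function name is ours and the code is only illustrative.

import numpy as np

def sparsemax_loss_and_grad(z, k):
    """Sparsemax loss (Eq. 20) and its gradient (Eq. 19) for gold label index k."""
    z = np.asarray(z, dtype=float)
    p = sparsemax(z)                                 # forward pass
    support = p > 0                                  # S(z)
    tau = (z[support].sum() - 1.0) / support.sum()   # threshold, Eq. 4
    loss = -z[k] + 0.5 * np.sum(z[support] ** 2 - tau ** 2) + 0.5
    grad = p.copy()
    grad[k] -= 1.0                                   # -delta_k + sparsemax(z)
    return loss, grad

The gradient can be verified against finite differences of the loss, a useful check given that the support set changes discretely with z.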
3.3. Relation to the Huber Loss
Coincidentally, as we next show, the sparsemax loss in the binary case reduces to the Huber classification loss, an important loss function in robust statistics (Huber, 1964).

Let us note first that, from Eq. 20, we have, if |S(z)| = 1,

L_{\mathrm{sparsemax}}(z; k) = -z_k + z_{(1)},   (21)

and, if |S(z)| = 2,

L_{\mathrm{sparsemax}}(z; k) = -z_k + \frac{1 + (z_{(1)} - z_{(2)})^2}{4} + \frac{z_{(1)} + z_{(2)}}{2},   (22)

where z_(1) ≥ z_(2) ≥ ... are the sorted components of z. Note that the second expression, when z_(1) − z_(2) = 1, equals the first one, which asserts the continuity of the loss even though |S(z)| is non-continuous on z.

In the two-class case, we have |S(z)| = 1 if z_(1) ≥ 1 + z_(2) (unit margin separation), and |S(z)| = 2 otherwise. Assume without loss of generality that the correct label is k = 1, and define t = z_1 − z_2. From Eqs. 21–22, we have

L_{\mathrm{sparsemax}}(t) = \begin{cases} 0, & \text{if } t \ge 1, \\ -t, & \text{if } t \le -1, \\ \frac{(t-1)^2}{4}, & \text{if } -1 < t < 1, \end{cases}   (23)

whose graph is shown in Fig. 2. This loss is a variant of the Huber loss adapted for classification, and has been called "modified Huber loss" by Zhang (2004) and Zou et al. (2006).

Figure 2. Comparison between the sparsemax loss and other commonly used losses for binary classification.
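The reduction to the modified Huber loss can be checked numerically in the binary case. The snippet below compares Eq. 23 with the sparsemax loss sketch from §3.2 (gold label k = 1, encoded here as index 0); it is a sanity check only.

import numpy as np

def modified_huber(t):
    """Piecewise form of the binary sparsemax loss in Eq. 23, with t = z_1 - z_2."""
    if t >= 1.0:
        return 0.0
    if t <= -1.0:
        return -t
    return (t - 1.0) ** 2 / 4.0

# The binary sparsemax loss matches the piecewise expression on a grid of margins.
for t in np.linspace(-2.5, 2.5, 11):
    loss, _ = sparsemax_loss_and_grad(np.array([t, 0.0]), k=0)
    assert abs(loss - modified_huber(t)) < 1e-12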
3.4. Generalization to Multi-Label Classification

We end this section by showing a generalization of the loss functions in Eqs. 16 and 20 to multi-label classification, i.e., problems in which the target is a non-empty set of labels Y ∈ 2^[K] \ {∅} rather than a single label (not to be confused with "multi-class classification," which denotes problems where Y = [K] and K > 2). Such problems have attracted recent interest (Zhang & Zhou, 2014).

More generally, we consider the problem of estimating sparse label proportions, where the target is a probability distribution q ∈ Δ^{K−1}, such that Y = {k | q_k > 0}. We assume a training dataset D := {(x_i, q_i)}_{i=1}^{N}, where each x_i ∈ R^D is an input vector and each q_i ∈ Δ^{K−1} is a target distribution over outputs, assumed sparse. (This scenario is also relevant for "learning with a probabilistic teacher" (Agrawala, 1970) and semi-supervised learning (Chapelle et al., 2006), as it can model label uncertainty.) This subsumes single-label classification, where all q_i are delta distributions concentrated on a single class. The generalization of the multinomial logistic loss to this setting is

L_{\mathrm{softmax}}(z; q) = \mathrm{KL}(q \,\|\, \mathrm{softmax}(z)) = -H(q) - q^\top z + \log \sum_j \exp(z_j),   (24)

where KL(·‖·) and H(·) denote the Kullback-Leibler divergence and the Shannon entropy, respectively. Note that, up to a constant, this loss is equivalent to standard logistic regression with soft labels. The gradient of this loss is

\nabla_z L_{\mathrm{softmax}}(z; q) = -q + \mathrm{softmax}(z).   (25)

The corresponding generalization in the sparsemax case is:

L_{\mathrm{sparsemax}}(z; q) = -q^\top z + \frac{1}{2}\sum_{j \in S(z)} \left(z_j^2 - \tau^2(z)\right) + \frac{1}{2}\|q\|^2,   (26)

which satisfies the properties in Prop. 3 and has gradient

\nabla_z L_{\mathrm{sparsemax}}(z; q) = -q + \mathrm{sparsemax}(z).   (27)

We make use of these losses in our experiments (§4.1–4.2).
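For completeness, here is a sketch of the multi-label variant in Eqs. 26–27, again reusing the sparsemax() function from §2.2; encoding the gold label set as a uniform target q is one of the options used later in §4.2, and the function name is ours.

import numpy as np

def sparsemax_loss_multilabel(z, q):
    """Sparsemax loss (Eq. 26) and gradient (Eq. 27) for a target distribution q."""
    z = np.asarray(z, dtype=float)
    q = np.asarray(q, dtype=float)
    p = sparsemax(z)
    support = p > 0
    tau = (z[support].sum() - 1.0) / support.sum()   # threshold, Eq. 4
    loss = -q @ z + 0.5 * np.sum(z[support] ** 2 - tau ** 2) + 0.5 * (q @ q)
    return loss, -q + p                              # gradient: -q + sparsemax(z)

# Example: labels {2, 3, 5} of K = 5 are active, encoded as a uniform target distribution.
q = np.array([0.0, 1/3, 1/3, 0.0, 1/3])
loss, grad = sparsemax_loss_multilabel(np.array([0.1, 2.0, 1.5, -1.0, 1.8]), q)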
4. Experiments
We next empirically evaluate the ability of sparsemax to address two classes of problems:
1. Label proportion estimation and multi-label classification, via the sparsemax loss in Eq. 26 (§4.1–4.2).

2. Attention-based neural networks, via the sparsemax transformation of Eq. 2 (§4.3).
4.1. Label Proportion Estimation
We show simulation results for sparse label proportion estimation on synthetic data. Since sparsemax can predict sparse distributions, we expect it to be superior on this task.

We generated datasets with 1,200 training and 1,000 test examples. Each example emulates a "multi-labeled document": a variable-length sequence of word symbols, assigned to multiple topics (labels). We pick the number of labels N ∈ {1, ..., K} by sampling from a Poisson distribution with rejection sampling, and draw the N labels from a multinomial. Then, we pick a document length from a Poisson, and repeatedly sample its words from the mixture of the N label-specific multinomials. We experimented with two settings: uniform mixtures (q_{k_n} = 1/N for the N active labels k_1, ..., k_N) and random mixtures (whose label proportions q_{k_n} were drawn from a flat Dirichlet). Note that, with uniform mixtures, the problem becomes essentially multi-label classification.
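One possible implementation of this generative recipe is sketched below. The Poisson means, the label-sampling scheme, and the topic multinomials are free parameters here, since the text does not fully specify them; treat every concrete value as a placeholder.

import numpy as np

def sample_document(topics, mean_labels, mean_length, rng):
    """Draw one synthetic 'multi-labeled document' (random-mixture setting).
    topics has shape (K, vocab_size): one word multinomial per label."""
    K, vocab_size = topics.shape
    n = 0
    while not 1 <= n <= K:                          # Poisson with rejection into {1, ..., K}
        n = rng.poisson(mean_labels)
    labels = rng.choice(K, size=n, replace=False)   # assumed uniform label draws
    q = np.zeros(K)
    q[labels] = rng.dirichlet(np.ones(n))           # label proportions from a flat Dirichlet
    length = max(1, rng.poisson(mean_length))       # document length
    words = rng.choice(vocab_size, size=length, p=q @ topics)  # words from the mixture
    return words, q

rng = np.random.default_rng(0)
topics = rng.dirichlet(np.ones(10), size=10)        # K = 10 labels; vocabulary size equal to K
words, q = sample_document(topics, mean_labels=2.0, mean_length=400, rng=rng)

For the uniform-mixture setting, the Dirichlet draw would be replaced by q[labels] = 1.0 / n.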
We set the vocabulary size to be equal to the number of labels, K ∈ {10, 50}, and varied the average document length between 200 and 2,000 words. We trained models by optimizing Eq. 15 with L ∈ {L_softmax, L_sparsemax} (Eqs. 24 and 26). We picked the regularization constant λ ∈ {10^j}_{j=−9}^{0} with 5-fold cross-validation.
Results are shown in Fig. 3. We report the mean squared error (the average of ‖q − p‖² on the test set, where q and p are respectively the target and predicted label posteriors) and the Jensen-Shannon divergence (the average of JS(q, p) := (1/2) KL(q ‖ (p+q)/2) + (1/2) KL(p ‖ (p+q)/2)). Note that the KL divergence itself is not an appropriate metric here, since the sparsity of q and p could lead to infinite values.

Figure 3. Simulation results for the estimation of label posteriors, for uniform (top) and random mixtures (bottom). Shown are the mean squared error and the Jensen-Shannon divergence as a function of the document length, for the logistic and the sparsemax estimators.

We observe that the two losses perform similarly for small document lengths (where the signal is weaker), but as the average document length exceeds 400, the sparsemax loss starts outperforming the logistic loss consistently. This is because, with a stronger signal, the sparsemax estimator manages to correctly identify the support of the label proportions q, contributing to reduce both the mean squared error and the JS divergence. This occurs both for uniform and random mixtures.
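The two evaluation metrics above can be computed as follows; the small epsilon guard in the KL terms is our addition, to cope with the exact zeros that sparse q and p produce.

import numpy as np

def squared_error(q, p):
    """Per-example squared distance ||q - p||^2 between target and predicted posteriors."""
    return float(np.sum((q - p) ** 2))

def js_divergence(q, p, eps=1e-12):
    """Jensen-Shannon divergence JS(q, p), as reported in Fig. 3."""
    m = 0.5 * (q + p)
    kl = lambda a, b: float(np.sum(a * (np.log(a + eps) - np.log(b + eps))))
    return 0.5 * kl(q, m) + 0.5 * kl(p, m)

Both quantities are then averaged over the test set.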
4.2. Multi-Label Classification on Benchmark Datasets

Next, we ran experiments on five benchmark multi-label classification datasets: the four small-scale datasets used by Koyejo et al. (2015), obtained from http://mulan.sourceforge.net/datasets-mlc.html, and the much larger Reuters RCV1-v2 dataset of Lewis et al. (2004), obtained from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html.
Table 1. Statistics for the 5 multi-label classification datasets.

DATASET    DESCR.   #LABELS   #TRAIN    #TEST
SCENE      IMAGES         6     1211     1196
EMOTIONS   MUSIC          6      393      202
BIRDS      AUDIO         19      323      322
CAL500     MUSIC        174      400      100
REUTERS    TEXT         103   23,149  781,265
Table 2. Micro (left) and macro-averaged (right) F1 scores for the logistic, softmax, and sparsemax losses on the benchmark datasets.

DATASET    LOGISTIC        SOFTMAX         SPARSEMAX
SCENE      70.96 / 72.95   74.01 / 75.03   73.45 / 74.57
EMOTIONS   66.75 / 68.56   67.34 / 67.51   66.38 / 66.07
BIRDS      45.78 / 33.77   48.67 / 37.06   49.44 / 39.13
CAL500     48.88 / 24.49   47.46 / 23.51   48.47 / 26.20
REUTERS    81.19 / 60.02   79.47 / 56.30   80.00 / 61.27
For all datasets, we removed examples without labels (i.e., where Y = ∅). For all but the Reuters dataset, we normalized the features to have zero mean and unit variance. Statistics for these datasets are presented in Table 1.
Recent work has investigated the consistency of multi-label classifiers for various micro- and macro-averaged metrics (Gao & Zhou, 2013; Koyejo et al., 2015), among which is a plug-in classifier that trains independent binary logistic regressors on each label, and then tunes a probability threshold δ ∈ [0, 1] on validation data. At test time, those labels whose posteriors are above the threshold are predicted to be "on." We used this procedure (called LOGISTIC) as a baseline for comparison. Our second baseline (SOFTMAX) is a multinomial logistic regressor, using the loss function in Eq. 24, where the target distribution q is set to uniform over the active labels. A similar probability threshold p_0 is used for prediction, above which a label is predicted to be "on." We compare these two systems with the sparsemax loss function of Eq. 26. We found it beneficial to scale the label scores z by a constant t ≥ 1 at test time, before applying the sparsemax transformation, to make the resulting distribution p = sparsemax(tz) more sparse. We then predict the kth label to be "on" if p_k ≠ 0.
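The resulting prediction rule is a two-liner on top of the sparsemax() sketch from §2.2; the default value of t below is a placeholder, since in the experiments t is tuned on validation data.

import numpy as np

def predict_labels(scores, t=2.0):
    """Predict the set of 'on' labels: those with nonzero mass under sparsemax(t * z)."""
    p = sparsemax(t * np.asarray(scores, dtype=float))
    return np.nonzero(p)[0]                     # indices k with p_k != 0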
We optimized the three losses with L-BFGS (for a maximum of 100 epochs), tuning the hyperparameters on a held-out validation set (for the Reuters dataset) and with 5-fold cross-validation (for the other four datasets). The hyperparameters are the regularization constant λ ∈ {10^j}_{j=−8}^{2}, the probability thresholds δ ∈ {0.05 n}_{n=1}^{10} for LOGISTIC and p_0 ∈ {n/K}_{n=1}^{10} for SOFTMAX, and the coefficient t ∈ {0.5 n}_{n=1}^{10} for SPARSEMAX.
Results are shown in Table 2. Overall, the performances of the three losses are very similar, with a slight advantage for SPARSEMAX, which attained the highest results in 4 out of 10 experiments, while LOGISTIC and SOFTMAX won 3 times each. In particular, sparsemax appears better suited for problems with larger numbers of labels.
4.3. Neural Networks with Attention Mechanisms
We now assess the suitability of the sparsemax transformation for constructing a "sparse" neural attention mechanism. We ran experiments on the task of natural language inference, using the recently released SNLI 1.0 corpus (Bowman et al., 2015), a collection of 570,000 human-written English sentence pairs. Each pair consists of a premise and a hypothesis, manually labeled with one of the labels ENTAILMENT, CONTRADICTION, or NEUTRAL. We used the provided training, development, and test splits.
The architecture of our system, shown in Fig. 4, is the same as the one proposed by Rocktäschel et al. (2015). We compare the performance of four systems: NOATTENTION, a (gated) RNN-based system similar to Bowman et al. (2015); LOGISTICATTENTION, an attention-based system with independent logistic activations; SOFTATTENTION, a near-reproduction of Rocktäschel et al. (2015)'s attention-based system; and SPARSEATTENTION, which replaces the latter's softmax-activated attention mechanism with a sparsemax activation.

Figure 4. Network diagram for the NL inference problem. The premise and hypothesis are both fed into (gated) RNNs. The NOATTENTION system replaces the attention part (in green) with a direct connection from the last premise state to the output (dashed violet line). The LOGISTICATTENTION, SOFTATTENTION, and SPARSEATTENTION systems have, respectively, independent logistics, a softmax, and a sparsemax-activated attention mechanism. In this example, L = 5 and N = 9.

We represent the words in the premise and in the hypothesis with 300-dimensional GloVe vectors (Pennington et al., 2014), not optimized during training, which we linearly project onto a D-dimensional subspace (Astudillo et al., 2015). We used GloVe-840B embeddings trained on Common Crawl (http://nlp.stanford.edu/projects/glove/).
We denote by x_1, ..., x_L and x_{L+1}, ..., x_N, respectively, the projected premise and hypothesis word vectors. These sequences are then fed into two recurrent networks (one for each). Instead of long short-term memories, as in Rocktäschel et al. (2015), we used gated recurrent units (GRUs; Cho et al., 2014), which behave similarly but have fewer parameters. Our premise GRU generates a state sequence H_{1:L} := [h_1 ... h_L] ∈ R^{D×L} as follows:

z_t = \sigma(W^{xz} x_t + W^{hz} h_{t-1} + b^z)   (28)
r_t = \sigma(W^{xr} x_t + W^{hr} h_{t-1} + b^r)   (29)
\tilde{h}_t = \tanh(W^{xh} x_t + W^{hh} (r_t \odot h_{t-1}) + b^h)   (30)
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,   (31)

with model parameters W^{xz}, W^{xr}, W^{xh}, W^{hz}, W^{hr}, W^{hh} ∈ R^{D×D} and b^z, b^r, b^h ∈ R^D. Likewise, our hypothesis GRU (with distinct parameters) generates a state sequence [h_{L+1}, ..., h_N], being initialized with the last state from the premise (h_L). The NOATTENTION system then computes the final state u based on the last states from the premise and the hypothesis as follows:

u = \tanh(W^{pu} h_L + W^{hu} h_N + b^u),   (32)

where W^{pu}, W^{hu} ∈ R^{D×D} and b^u ∈ R^D. Finally, it predicts a label ŷ from u with a standard softmax layer.
The SOFTATTENTION system, instead of using the last premise state h_L, computes a weighted average of premise words with an attention mechanism, replacing Eq. 32 by

z_t = v^\top \tanh(W^{pm} h_t + W^{hm} h_N + b^m)   (33)
p = \mathrm{softmax}(z), \quad \text{where } z := (z_1, \ldots, z_L)   (34)
r = H_{1:L}\, p   (35)
u = \tanh(W^{pu} r + W^{hu} h_N + b^u),   (36)

where W^{pm}, W^{hm} ∈ R^{D×D} and b^m, v ∈ R^D. The LOGISTICATTENTION system, instead of Eq. 34, computes p = (σ(z_1), ..., σ(z_L)). Finally, the SPARSEATTENTION system replaces Eq. 34 by p = sparsemax(z).
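A minimal forward-pass sketch of the sparsemax-activated attention layer (Eqs. 33–35, with Eq. 34 replaced by sparsemax) is given below, reusing the sparsemax() function from §2.2. The GRU encoders, the output layer, and all training code are omitted, and the argument names simply mirror the notation in the text.

import numpy as np

def sparse_attention(H, h_N, W_pm, W_hm, b_m, v):
    """Sparse attention over premise states. H has shape (D, L); h_N, b_m, v have
    shape (D,); W_pm and W_hm have shape (D, D). Returns (r, p)."""
    # Eq. 33: one score per premise position.
    scores = v @ np.tanh(W_pm @ H + (W_hm @ h_N + b_m)[:, None])   # shape (L,)
    p = sparsemax(scores)        # sparse attention weights (sparsemax variant of Eq. 34)
    r = H @ p                    # Eq. 35: attention-weighted premise representation
    return r, p

Because p is typically sparse, only a few premise positions contribute to r, which is what produces the compact selections shown in Table 4.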
We optimized all the systems with Adam (Kingma & Ba, 2014), using the default parameters β_1 = 0.9, β_2 = 0.999, and ε = 10^{−8}, and setting the learning rate to 3 × 10^{−4}. We tuned the ℓ2-regularization coefficient in {0, 10^{−4}, 3 × 10^{−4}, 10^{−3}} and, as Rocktäschel et al. (2015), used a dropout probability of 0.1 at the inputs and outputs of the network.
Table 3. Accuracies for the natural language inference task. Shown are our implementations of a system without attention, and with logistic, soft, and sparse attentions.

                     DEV ACC.   TEST ACC.
NOATTENTION             81.84       80.99
LOGISTICATTENTION       82.11       80.84
SOFTATTENTION           82.86       82.08
SPARSEATTENTION         82.52       82.20

The results are shown in Table 3. We observe that the soft- and sparse-activated attention systems perform similarly, the latter being slightly more accurate on the test set, and that both outperform the NOATTENTION and LOGISTICATTENTION systems.¹⁰
¹⁰Rocktäschel et al. (2015) report scores slightly above ours: they reached a test accuracy of 82.3% for their implementation of SOFTATTENTION, and 83.5% with their best system, a more elaborate word-by-word attention model. Differences in the former case may be due to distinct word vectors and the use of LSTMs instead of GRUs.
Table 4. Examples of sparse attention for the natural language inference task. Nonzero attention coefficients fall on a few premise words (marked in bold in the original table). Our system classified all four examples correctly. The examples were picked from Rocktäschel et al. (2015).

A boy rides on a camel in a crowded area while talking on his cellphone.
Hypothesis: A boy is riding an animal. [entailment]

A young girl wearing a pink coat plays with a yellow toy golf club.
Hypothesis: A girl is wearing a blue jacket. [contradiction]

Two black dogs are frolicking around the grass together.
Hypothesis: Two dogs swim in the lake. [contradiction]

A man wearing a yellow striped shirt laughs while seated next to another man who is wearing a light blue shirt and clasping his hands together.
Hypothesis: Two mimes sit in complete silence. [contradiction]
Table 4 shows examples of sentence pairs, highlighting the premise words selected by the SPARSEATTENTION mechanism. We can see that, for all examples, only a small number of words are selected, and these are key to making the final decision. Compared to a softmax-activated mechanism, which provides a dense distribution over all the words, the sparsemax activation yields a compact and more interpretable selection, which can be particularly useful in long sentences such as the one in the bottom row.
5. Conclusions
We introduced the sparsemax transformation, which has similar properties to the traditional softmax, but is able to output sparse probability distributions. We derived a closed-form expression for its Jacobian, needed for the backpropagation algorithm, and we proposed a novel "sparsemax loss" function, a sparse analogue of the logistic loss, which is smooth and convex. Empirical results in multi-label classification and in attention networks for natural language inference attest to the validity of our approach.

The connection between sparse modeling and interpretability is key in signal processing (Hastie et al., 2015). Our approach is distinctive: it is not the model that is assumed sparse, but the label posteriors that the model parametrizes. Sparsity is also a desirable (and biologically plausible) property in neural networks, present in rectified units (Glorot et al., 2011) and maxout nets (Goodfellow et al., 2013).
There are several avenues for future research. The ability of sparsemax-activated attention to select only a few variables to attend to makes it potentially relevant to neural architectures with random access memory (Graves et al., 2014; Grefenstette et al., 2015; Sukhbaatar et al., 2015), since it offers a compromise between soft and hard operations, maintaining differentiability. In fact, "harder" forms of attention are often useful, arising as word alignments in machine translation pipelines, or as latent variables as in Xu et al. (2015). Sparsemax is also appealing for hierarchical attention: if we define a top-down product of distributions along the hierarchy, the sparse distributions produced by sparsemax will automatically prune the hierarchy, leading to computational savings. A possible disadvantage of sparsemax over softmax is that it seems less GPU-friendly, since it requires sort operations or linear-selection algorithms. There is, however, recent work providing efficient implementations of these algorithms on GPUs (Alabi et al., 2012).
Acknowledgements
We would like to thank Tim Rocktäschel for answering various implementation questions, and Mário Figueiredo and Chris Dyer for helpful comments on a draft of this report. This work was partially supported by Fundação para a Ciência e Tecnologia (FCT), through contracts UID/EEA/50008/2013 and UID/CEC/50021/2013.
References
Agrawala, Ashok K. Learning with a Probabilistic Teacher. IEEE Transactions on Information Theory, 16(4):373–379, 1970.

Alabi, Tolu, Blanchard, Jeffrey D, Gordon, Bradley, and Steinbach, Russel. Fast k-Selection Algorithms for Graphics Processing Units. Journal of Experimental Algorithmics (JEA), 17:4–2, 2012.

Albert, James H and Chib, Siddhartha. Bayesian Analysis of Binary and Polychotomous Response Data. Journal of the American Statistical Association, 88(422):669–679, 1993.

Astudillo, Ramon F, Amir, Silvio, Lin, Wang, Silva, Mário, and Trancoso, Isabel. Learning Word Representations from Scarce and Noisy Data with Embedding Subspaces. In Proc. of the Association for Computational Linguistics (ACL), Beijing, China, 2015.

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural Machine Translation by Jointly Learning to Align and Translate. In International Conference on Learning Representations, 2015.

Blum, Manuel, Floyd, Robert W, Pratt, Vaughan, Rivest, Ronald L, and Tarjan, Robert E. Time Bounds for Selection. Journal of Computer and System Sciences, 7(4):448–461, 1973.

Bouchard, Guillaume. Efficient Bounds for the Softmax Function and Applications to Approximate Inference in Hybrid Models. In NIPS Workshop for Approximate Bayesian Inference in Continuous/Hybrid Systems, 2007.

Bowman, Samuel R, Angeli, Gabor, Potts, Christopher, and Manning, Christopher D. A Large Annotated Corpus for Learning Natural Language Inference. In Proc. of Empirical Methods in Natural Language Processing, 2015.

Bradley, Ralph Allan and Terry, Milton E. Rank Analysis of Incomplete Block Designs: The Method of Paired Comparisons. Biometrika, 39(3-4):324–345, 1952.

Bridle, John S. Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition. In Neurocomputing, pp. 227–236. Springer, 1990.

Chapelle, Olivier, Schölkopf, Bernhard, and Zien, Alexander. Semi-Supervised Learning. MIT Press, Cambridge, 2006.

Cho, Kyunghyun, Van Merriënboer, Bart, Gulcehre, Caglar, Bahdanau, Dzmitry, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. In Proc. of Empirical Methods in Natural Language Processing, 2014.

Chorowski, Jan K, Bahdanau, Dzmitry, Serdyuk, Dmitriy, Cho, Kyunghyun, and Bengio, Yoshua. Attention-based Models for Speech Recognition. In Advances in Neural Information Processing Systems, pp. 577–585, 2015.

Clarke, Frank H. Optimization and Nonsmooth Analysis. Wiley, New York, 1983.

de Brébisson, Alexandre and Vincent, Pascal. An Exploration of Softmax Alternatives Belonging to the Spherical Loss Family. arXiv preprint arXiv:1511.05042, 2015.

Duchi, J., Shalev-Shwartz, S., Singer, Y., and Chandra, T. Efficient Projections onto the L1-Ball for Learning in High Dimensions. In Proc. of International Conference of Machine Learning, 2008.

Gao, Wei and Zhou, Zhi-Hua. On the Consistency of Multi-Label Learning. Artificial Intelligence, 199:22–44, 2013.

Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. Deep Sparse Rectifier Neural Networks. In International Conference on Artificial Intelligence and Statistics, pp. 315–323, 2011.

Goodfellow, Ian, Bengio, Yoshua, and Courville, Aaron. Deep Learning. Book in preparation for MIT Press, 2016. URL http://goodfeli.github.io/dlbook/.

Goodfellow, Ian J, Warde-Farley, David, Mirza, Mehdi, Courville, Aaron, and Bengio, Yoshua. Maxout Networks. In Proc. of International Conference on Machine Learning, 2013.

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural Turing Machines. arXiv preprint arXiv:1410.5401, 2014.

Grefenstette, Edward, Hermann, Karl Moritz, Suleyman, Mustafa, and Blunsom, Phil. Learning to Transduce with Unbounded Memory. In Advances in Neural Information Processing Systems, pp. 1819–1827, 2015.

Hastie, Trevor, Tibshirani, Robert, and Wainwright, Martin. Statistical Learning with Sparsity: the Lasso and Generalizations. CRC Press, 2015.

Hermann, Karl Moritz, Kocisky, Tomas, Grefenstette, Edward, Espeholt, Lasse, Kay, Will, Suleyman, Mustafa, and Blunsom, Phil. Teaching Machines to Read and Comprehend. In Advances in Neural Information Processing Systems, pp. 1684–1692, 2015.

Huber, Peter J. Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.

Kingma, Diederik and Ba, Jimmy. Adam: A Method for Stochastic Optimization. In Proc. of International Conference on Learning Representations, 2014.

Koyejo, Sanmi, Natarajan, Nagarajan, Ravikumar, Pradeep K, and Dhillon, Inderjit S. Consistent Multilabel Classification. In Advances in Neural Information Processing Systems, pp. 3303–3311, 2015.

Lewis, David D, Yang, Yiming, Rose, Tony G, and Li, Fan. RCV1: A New Benchmark Collection for Text Categorization Research. The Journal of Machine Learning Research, 5:361–397, 2004.

Liu, Dong C and Nocedal, Jorge. On the Limited Memory BFGS Method for Large Scale Optimization. Mathematical Programming, 45(1-3):503–528, 1989.

McCullagh, Peter and Nelder, John A. Generalized Linear Models, volume 37. CRC Press, 1989.

Menke, Joshua E and Martinez, Tony R. A Bradley–Terry Artificial Neural Network Model for Individual Ratings in Group Competitions. Neural Computing and Applications, 17(2):175–186, 2008.

Michelot, C. A Finite Algorithm for Finding the Projection of a Point onto the Canonical Simplex of R^n. Journal of Optimization Theory and Applications, 50(1):195–200, 1986.

Nesterov, Y. A Method of Solving a Convex Programming Problem with Convergence Rate O(1/k^2). Soviet Math. Doklady, 27:372–376, 1983.

Ollivier, Yann. Riemannian Metrics for Neural Networks. arXiv preprint arXiv:1303.0818, 2013.

Pardalos, Panos M. and Kovoor, Naina. An Algorithm for a Singly Constrained Class of Quadratic Programs Subject to Upper and Lower Bounds. Mathematical Programming, 46(1):321–328, 1990.

Pennington, Jeffrey, Socher, Richard, and Manning, Christopher D. Glove: Global Vectors for Word Representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), 12:1532–1543, 2014.

Penot, Jean-Paul. Conciliating Generalized Derivatives. In Demyanov, Vladimir F., Pardalos, Panos M., and Batsyn, Mikhail (eds.), Constructive Nonsmooth Analysis and Related Topics, pp. 217–230. Springer, 2014.

Rocktäschel, Tim, Grefenstette, Edward, Hermann, Karl Moritz, Kocisky, Tomás, and Blunsom, Phil. Reasoning about Entailment with Neural Attention. arXiv preprint arXiv:1509.06664, 2015.

Rush, Alexander M, Chopra, Sumit, and Weston, Jason. A Neural Attention Model for Abstractive Sentence Summarization. Proc. of Empirical Methods in Natural Language Processing, 2015.

Sukhbaatar, Sainbayar, Szlam, Arthur, Weston, Jason, and Fergus, Rob. End-to-End Memory Networks. In Advances in Neural Information Processing Systems, pp. 2431–2439, 2015.

Sutton, Richard S and Barto, Andrew G. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

Vincent, Pascal. Efficient Exact Gradient Update for Training Deep Networks with Very Large Sparse Targets. In Advances in Neural Information Processing Systems, 2015.

Xu, Kelvin, Ba, Jimmy, Kiros, Ryan, Courville, Aaron, Salakhutdinov, Ruslan, Zemel, Richard, and Bengio, Yoshua. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In International Conference on Machine Learning, 2015.

Zadrozny, Bianca. Reducing Multiclass to Binary by Coupling Probability Estimates. In Advances in Neural Information Processing Systems, pp. 1041–1048, 2001.

Zhang, Min-Ling and Zhou, Zhi-Hua. A Review on Multi-Label Learning Algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819–1837, 2014.

Zhang, Tong. Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization. Annals of Statistics, pp. 56–85, 2004.

Zou, Hui, Zhu, Ji, and Hastie, Trevor. The Margin Vector, Admissible Loss and Multi-class Margin-Based Classifiers. Technical report, Stanford University, 2006.

A. Supplementary Material
A.1. Proof of Prop. 1
The Lagrangian of the optimization problem in Eq. 2 is:

L(z; \mu, \tau) = \frac{1}{2}\|p - z\|^2 - \mu^\top p + \tau(\mathbf{1}^\top p - 1).   (37)

The optimal (p*, μ*, τ*) must satisfy the following Karush-Kuhn-Tucker conditions:

p^* - z - \mu^* + \tau^* \mathbf{1} = \mathbf{0},   (38)
\mathbf{1}^\top p^* = 1, \quad p^* \ge \mathbf{0}, \quad \mu^* \ge \mathbf{0},   (39)
\mu_i^* p_i^* = 0, \quad \forall i \in [K].   (40)

If for i ∈ [K] we have p_i* > 0, then from Eq. 40 we must have μ_i* = 0, which from Eq. 38 implies p_i* = z_i − τ*. Let S(z) = {j ∈ [K] | p_j* > 0}. From Eq. 39 we obtain Σ_{j∈S(z)} (z_j − τ*) = 1, which yields the right hand side of Eq. 4. Again from Eq. 40, we have that μ_i* > 0 implies p_i* = 0, which from Eq. 38 implies μ_i* = τ* − z_i ≥ 0, i.e., z_i ≤ τ* for i ∉ S(z). Therefore we have that k(z) = |S(z)|, which proves the first equality of Eq. 4.
A.2. Proof of Prop. 2
We start with the third property, which follows from the coordinate-symmetry in the definitions in Eqs. 1–2. The same argument can be used to prove the first part of the first property (uniform distribution).

Let us turn to the second part of the first property (peaked distribution on the maximal components of z), and define t = ε^{−1}. For the softmax case, this follows from

\lim_{t \to +\infty} \frac{e^{t z_i}}{\sum_k e^{t z_k}} = \lim_{t \to +\infty} \frac{e^{t z_i}}{\sum_{k \in A(z)} e^{t z_k}} = \lim_{t \to +\infty} \frac{e^{t(z_i - z_{(1)})}}{|A(z)|} = \begin{cases} 1/|A(z)|, & \text{if } i \in A(z), \\ 0, & \text{otherwise.} \end{cases}   (41)

For the sparsemax case, we invoke Eq. 4 and the fact that k(tz) = |A(z)| if γ(tz) ≥ 1/|A(z)|. Since γ(tz) = t γ(z), the result follows.

The second property holds for softmax, since e^{z_i + c} / Σ_k e^{z_k + c} = e^{z_i} / Σ_k e^{z_k}; and for sparsemax, since for any p ∈ Δ^{K−1} we have ‖p − z − c1‖² = ‖p − z‖² − 2c 1^⊤(p − z) + ‖c1‖², which equals ‖p − z‖² plus a constant (because 1^⊤ p = 1).

Finally, let us turn to the fourth property. The first inequality states that z_i ≤ z_j ⇒ ρ_i(z) ≤ ρ_j(z) (i.e., coordinate monotonicity). For the softmax case, this follows trivially from the fact that the exponential function is increasing. For the sparsemax case, we use a proof by contradiction. Suppose z_i ≤ z_j and sparsemax_i(z) > sparsemax_j(z). From the definition in Eq. 2, we must have ‖p − z‖² ≥ ‖sparsemax(z) − z‖² for any p ∈ Δ^{K−1}. This leads to a contradiction if we choose p_k = sparsemax_k(z) for k ∉ {i, j}, p_i = sparsemax_j(z), and p_j = sparsemax_i(z). To prove the second inequality in the fourth property for softmax, we need to show that, with z_i ≤ z_j, we have (e^{z_j} − e^{z_i}) / Σ_k e^{z_k} ≤ (z_j − z_i)/2. Since Σ_k e^{z_k} ≥ e^{z_j} + e^{z_i}, it suffices to consider the binary case, i.e., we need to prove that tanh((z_j − z_i)/2) = (e^{z_j} − e^{z_i}) / (e^{z_j} + e^{z_i}) ≤ (z_j − z_i)/2, that is, tanh(t) ≤ t for t ≥ 0. This comes from tanh(0) = 0 and tanh'(t) = 1 − tanh²(t) ≤ 1. For sparsemax, given two coordinates i, j, three things can happen: (i) both are in the support, in which case ρ_j(z) − ρ_i(z) = z_j − z_i; (ii) the smaller one (z_i) is truncated, in which case ρ_j(z) − ρ_i(z) = z_j − τ(z) ≤ z_j − z_i; (iii) both are truncated, in which case ρ_j(z) − ρ_i(z) = 0 ≤ z_j − z_i.
A.3. Proof of Prop. 3
To prove the first claim, note that, for j ∈ S(z),

\frac{\partial \tau^2(z)}{\partial z_j} = 2\tau(z)\,\frac{\partial \tau(z)}{\partial z_j} = \frac{2\tau(z)}{|S(z)|},   (42)

where we used Eq. 10. We then have

\frac{\partial L_{\mathrm{sparsemax}}(z; k)}{\partial z_j} = \begin{cases} -\delta_k(j) + z_j - \tau(z), & \text{if } j \in S(z), \\ -\delta_k(j), & \text{otherwise.} \end{cases}

That is, ∇_z L_sparsemax(z; k) = −δ_k + sparsemax(z).

To prove the second statement, from the expression for the Jacobian in Eq. 11, we have that the Hessian of L_sparsemax (strictly speaking, a "sub-Hessian" (Penot, 2014), since the loss is not twice-differentiable everywhere) is given by

\frac{\partial^2 L_{\mathrm{sparsemax}}(z; k)}{\partial z_i \partial z_j} = \begin{cases} \delta_{ij} - \frac{1}{|S(z)|}, & \text{if } i, j \in S(z), \\ 0, & \text{otherwise.} \end{cases}   (43)

This Hessian can be written in the form I − 1 1^⊤ / |S(z)| up to padding zeros (for the coordinates not in S(z)); hence it is positive semi-definite (with rank |S(z)| − 1), which establishes the convexity of L_sparsemax.

For the third claim, note that S(z + c1) = S(z) and τ(z + c1) = τ(z) + c. We thus have

L_{\mathrm{sparsemax}}(z + c\mathbf{1}; k) = -z_k - c + \frac{1}{2}\sum_{j \in S(z)} \left(z_j^2 - \tau^2(z) + 2c(z_j - \tau(z))\right) + \frac{1}{2} = -z_k - c + \frac{1}{2}\sum_{j \in S(z)} \left(z_j^2 - \tau^2(z) + 2c\,p_j\right) + \frac{1}{2} = L_{\mathrm{sparsemax}}(z; k),

since Σ_{j∈S(z)} p_j = 1.

From the first two claims, we have that the minima of L_sparsemax have zero gradient, i.e., satisfy the equation sparsemax(z) = δ_k. Furthermore, from Prop. 2, we have that the sparsemax never increases the distance between two coordinates, i.e., sparsemax_k(z) − sparsemax_j(z) ≤ z_k − z_j. Therefore sparsemax(z) = δ_k implies z_k ≥ 1 + max_{j≠k} z_j. To prove the converse statement, note that the distance above can only be decreased if the smallest coordinate is truncated to zero. This establishes the equivalence between (ii) and (iii) in the fifth claim. Finally, we have that the minimum loss value is achieved when S(z) = {k}, in which case τ(z) = z_k − 1, leading to

L_{\mathrm{sparsemax}}(z; k) = -z_k + \frac{1}{2}\left(z_k^2 - (z_k - 1)^2\right) + \frac{1}{2} = 0.   (44)

This proves the equivalence with (i) and also the fourth claim.