From Softmax to Sparsemax:
A Sparse Model of Attention and Multi-Label Classification

André F. T. Martins†‡    ANDRE.MARTINS@UNBABEL.COM
Ramón F. Astudillo†      RAMON@UNBABEL.COM

† Unbabel Lda, Rua Visconde de Santarém, 67-B, 1000-286 Lisboa, Portugal
‡ Instituto de Telecomunicações (IT), Instituto Superior Técnico, Av. Rovisco Pais, 1, 1049-001 Lisboa, Portugal
Instituto de Engenharia de Sistemas e Computadores (INESC-ID), Rua Alves Redol, 9, 1000-029 Lisboa, Portugal
Abstract

We propose sparsemax, a new activation function similar to the traditional softmax, but able to output sparse probabilities. After deriving its properties, we show how its Jacobian can be efficiently computed, enabling its use in a network trained with backpropagation. Then, we propose a new smooth and convex loss function which is the sparsemax analogue of the logistic loss. We reveal an unexpected connection between this new loss and the Huber classification loss. We obtain promising empirical results in multi-label classification problems and in attention-based neural networks for natural language inference. For the latter, we achieve similar performance to the traditional softmax, but with a selective, more compact attention focus.
1. Introduction

The softmax transformation is a key component of several statistical learning models, encompassing multinomial logistic regression (McCullagh & Nelder, 1989), action selection in reinforcement learning (Sutton & Barto, 1998), and neural networks for multi-class classification (Bridle, 1990; Goodfellow et al., 2016). Recently, it has also been used to design attention mechanisms in neural networks, with important achievements in machine translation (Bahdanau et al., 2015), image caption generation (Xu et al., 2015), speech recognition (Chorowski et al., 2015), memory networks (Sukhbaatar et al., 2015), and various tasks in natural language understanding (Hermann et al., 2015; Rocktäschel et al., 2015; Rush et al., 2015) and computation learning (Graves et al., 2014; Grefenstette et al., 2015).

There are a number of reasons why the softmax transformation is so appealing. It is simple to evaluate and differentiate, and it can be turned into the (convex) negative log-likelihood loss function by taking the logarithm of its output. Alternatives proposed in the literature, such as the Bradley-Terry model (Bradley & Terry, 1952; Zadrozny, 2001; Menke & Martinez, 2008), the multinomial probit (Albert & Chib, 1993), the spherical softmax (Ollivier, 2013; Vincent, 2015; de Brébisson & Vincent, 2015), or softmax approximations (Bouchard, 2007), while theoretically or computationally advantageous for certain scenarios, lack some of the convenient properties of softmax.
In this paper, we propose the sparsemax transformation. Sparsemax has the distinctive feature that it can return sparse posterior distributions, that is, it may assign exactly zero probability to some of its output variables. This property makes it appealing to be used as a filter for large output spaces, to predict multiple labels, or as a component to identify which of a group of variables are potentially relevant for a decision, making the model more interpretable. Crucially, this is done while preserving most of the attractive properties of softmax: we show that sparsemax is also simple to evaluate, it is even cheaper to differentiate, and that it can be turned into a convex loss function.
To sum up, our contributions are as follows:

• We formalize the new sparsemax transformation, derive its properties, and show how it can be efficiently computed (§2.1–2.3). We show that in the binary case sparsemax reduces to a hard sigmoid (§2.4).

• We derive the Jacobian of sparsemax, comparing it to the softmax case, and show that it can lead to faster gradient backpropagation (§2.5).

• We propose the sparsemax loss, a new loss function that is the sparsemax analogue of logistic regression (§3). We show that it is convex, everywhere differentiable, and can be regarded as a multi-class generalization of the Huber classification loss, an important tool in robust statistics (Huber, 1964; Zhang, 2004).
• We apply the sparsemax loss to train multi-label linear classifiers (which predict a set of labels instead of a single label) on benchmark datasets (§4.1–4.2).

• Finally, we devise a neural selective attention mechanism using the sparsemax transformation, evaluating its performance on a natural language inference problem, with encouraging results (§4.3).
2. The Sparsemax Transformation

2.1. Definition

Let Δ^{K−1} := {p ∈ R^K | 1⊤p = 1, p ≥ 0} be the (K−1)-dimensional simplex. We are interested in functions that map vectors in R^K to probability distributions in Δ^{K−1}. Such functions are useful for converting a vector of real weights (e.g., label scores) to a probability distribution (e.g., posterior probabilities of labels). The classical example is the softmax function, defined componentwise as:

    softmax_i(z) = exp(z_i) / ∑_j exp(z_j).   (1)
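
As a concrete illustration (a sketch of ours, not code from the paper), Eq. 1 can be evaluated in a few lines of NumPy; subtracting the maximum score before exponentiating is a standard numerical-stability safeguard that leaves the output unchanged (cf. the shift invariance in Prop. 2):

    import numpy as np

    def softmax(z):
        """Eq. 1: exponentiate the scores and normalize them to sum to one."""
        z = np.asarray(z, dtype=float)
        e = np.exp(z - z.max())   # max subtraction for numerical stability; output unchanged
        return e / e.sum()

    print(softmax([3.0, 1.0, 0.2]))   # all entries strictly positive: softmax has full support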
A limitation of the softmax transformation is that the resulting probability distribution always has full support, i.e., softmax_i(z) ≠ 0 for every z and i. This is a disadvantage in applications where a sparse probability distribution is desired, in which case it is common to define a threshold below which small probability values are truncated to zero.

In this paper, we propose as an alternative the following transformation, which we call sparsemax:

    sparsemax(z) := argmin_{p ∈ Δ^{K−1}} ‖p − z‖².   (2)

In words, sparsemax returns the Euclidean projection of the input vector z onto the probability simplex. This projection is likely to hit the boundary of the simplex, in which case sparsemax(z) becomes sparse. We will see that sparsemax retains most of the important properties of softmax, while in addition being able to produce sparse distributions.
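
As a sanity check of Eq. 2 (again a sketch of ours; the cvxpy dependency is an assumption, not something the paper uses), the projection can be computed with a generic convex solver. For a score vector with one dominant entry, the minimizer lands on the boundary of the simplex, i.e., it contains exact zeros:

    import cvxpy as cp
    import numpy as np

    z = np.array([3.0, 1.0, 0.2])                     # scores with one dominant entry
    p = cp.Variable(len(z))
    objective = cp.Minimize(cp.sum_squares(p - z))    # ||p - z||^2, as in Eq. 2
    constraints = [cp.sum(p) == 1, p >= 0]            # p must lie in the simplex
    cp.Problem(objective, constraints).solve()
    print(np.round(p.value, 4))                       # approx. [1. 0. 0.]: a sparse distribution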
2.2. Closed-Form Solution

Projecting onto the simplex is a well studied problem, for which linear-time algorithms are available (Michelot, 1986; Pardalos & Kovoor, 1990; Duchi et al., 2008). We start by recalling the well-known result that such projections correspond to a soft-thresholding operation. Below, we use the notation [K] := {1, ..., K} and [t]_+ := max{0, t}.

Algorithm 1 Sparsemax Evaluation
    Input: z
    Sort z as z_(1) ≥ ... ≥ z_(K)
    Find k(z) := max{ k ∈ [K] | 1 + k z_(k) > ∑_{j ≤ k} z_(j) }
    Define τ(z) = (∑_{j ≤ k(z)} z_(j) − 1) / k(z)
    Output: p s.t. p_i = [z_i − τ(z)]_+

Proposition 1 The solution of Eq. 2 is of the form:

    sparsemax_i(z) = [z_i − τ(z)]_+,   (3)

where τ : R^K → R is the (unique) function that satisfies ∑_j [z_j − τ(z)]_+ = 1 for every z. Furthermore, τ can be expressed as follows. Let z_(1) ≥ z_(2) ≥ ... ≥ z_(K) be the sorted coordinates of z, and define k(z) := max{ k ∈ [K] | 1 + k z_(k) > ∑_{j ≤ k} z_(j) }. Then,

    τ(z) = (∑_{j ≤ k(z)} z_(j) − 1) / k(z) = (∑_{j ∈ S(z)} z_j − 1) / |S(z)|,   (4)

where S(z) := { j ∈ [K] | sparsemax_j(z) > 0 } is the support of sparsemax(z).
Proof: See App. A.1 in the supplemental material.

In essence, Prop. 1 states that all we need for evaluating the sparsemax transformation is to compute the threshold τ(z); all coordinates above this threshold (the ones in the set S(z)) will be shifted by this amount, and the others will be truncated to zero. We call τ in Eq. 4 the threshold function. This piecewise linear function will play an important role in the sequel. Alg. 1 illustrates a naïve O(K log K) algorithm that uses Prop. 1 for evaluating the sparsemax.¹

¹ More elaborate O(K) algorithms exist based on linear-time selection (Blum et al., 1973; Pardalos & Kovoor, 1990).
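
A direct NumPy transcription of Alg. 1 (a minimal sketch of ours, not the authors' released implementation) makes the sort-based evaluation concrete; the later sketches in this section reuse this sparsemax helper:

    import numpy as np

    def sparsemax(z):
        """Evaluate sparsemax(z) via Alg. 1 (sort-based, O(K log K))."""
        z = np.asarray(z, dtype=float)
        z_sorted = np.sort(z)[::-1]                  # z_(1) >= ... >= z_(K)
        k = np.arange(1, len(z) + 1)
        cssv = np.cumsum(z_sorted)                   # running sums of the sorted scores
        k_z = k[1 + k * z_sorted > cssv][-1]         # k(z): largest k meeting the condition
        tau = (cssv[k_z - 1] - 1.0) / k_z            # threshold tau(z), Eq. 4
        return np.maximum(z - tau, 0.0)              # Eq. 3: shift by tau and truncate at zero

    print(sparsemax([3.0, 1.0, 0.2]))   # [1. 0. 0.] -- the projection is sparse
    print(sparsemax([0.5, 0.3, 0.2]))   # already on the simplex, so returned unchanged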
2.3. Basic Properties

We now highlight some properties that are common to softmax and sparsemax. Let z_(1) := max_k z_k, and denote by A(z) := { k ∈ [K] | z_k = z_(1) } the set of maximal components of z. We define the indicator vector 1_{A(z)}, whose kth component is 1 if k ∈ A(z), and 0 otherwise. We further denote by γ(z) := z_(1) − max_{k ∉ A(z)} z_k the gap between the maximal components of z and the second largest. We let 0 and 1 be vectors of zeros and ones, respectively.

Proposition 2 The following properties hold for ρ ∈ {softmax, sparsemax}.

1. ρ(0) = 1/K and lim_{ε→0⁺} ρ(ε⁻¹ z) = 1_{A(z)} / |A(z)| (uniform distribution, and distribution peaked on the maximal components of z, respectively). For sparsemax, the last equality holds for any ε ≤ γ(z) / |A(z)|.

2. ρ(z) = ρ(z + c1), for any c ∈ R (i.e., ρ is invariant to adding a constant to each coordinate).
Figure 1. Comparison of softmax and sparsemax in 2D (left) and 3D (two rightmost plots). [Left panel: softmax_1([t, 0]) and sparsemax_1([t, 0]) as functions of t; right panels: sparsemax_1([t_1, t_2, 0]) and softmax_1([t_1, t_2, 0]) over (t_1, t_2) ∈ [−3, 3]².]
3. ρ(Pz) = Pρ(z) for any permutation matrix P (i.e., ρ commutes with permutations).

4. If z_i ≤ z_j, then 0 ≤ ρ_j(z) − ρ_i(z) ≤ η(z_j − z_i), where η = 1/2 for softmax, and η = 1 for sparsemax.

Proof: See App. A.2 in the supplemental material.

Interpreting ε as a temperature parameter, the first part of Prop. 2 shows that sparsemax has the same zero-temperature limit behaviour as softmax, but without the need to make the temperature arbitrarily small.

Prop. 2 is reassuring, since it shows that the sparsemax transformation, despite being defined very differently from the softmax, has a similar behaviour and preserves the same invariances. Note that some of these properties are not satisfied by other proposed replacements of the softmax: for example, the spherical softmax (Ollivier, 2013), defined as ρ_i(z) := z_i² / ∑_j z_j², does not satisfy properties 2 and 4.
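
A quick numerical check (a sketch of ours, reusing the sparsemax helper from §2.2 and defining the spherical softmax inline) illustrates the contrast for property 2: sparsemax is invariant to adding a constant to every coordinate, whereas the spherical softmax is not.

    import numpy as np

    def spherical_softmax(z):
        """rho_i(z) = z_i^2 / sum_j z_j^2 (Ollivier, 2013)."""
        z = np.asarray(z, dtype=float)
        return z ** 2 / np.sum(z ** 2)

    z = np.array([1.2, 0.8, -0.1])
    c = 10.0
    print(np.allclose(sparsemax(z + c), sparsemax(z)))                  # True: property 2 holds
    print(np.allclose(spherical_softmax(z + c), spherical_softmax(z)))  # False: property 2 fails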
2.4. Two- and Three-Dimensional Cases

For the two-class case, it is well known that the softmax activation becomes the logistic (sigmoid) function. More precisely, if z = (t, 0), then softmax_1(z) = σ(t) := (1 + exp(−t))⁻¹. We next show that the analogue in sparsemax is the "hard" version of the sigmoid. In fact, using Prop. 1 and Eq. 4, we have that, for z = (t, 0),

    τ(z) = t − 1,        if t > 1,
           (t − 1)/2,    if −1 ≤ t ≤ 1,
           −1,           if t < −1,

and therefore sparsemax_1(z) = [t − τ(z)]_+ equals 1 if t > 1, (t + 1)/2 if −1 ≤ t ≤ 1, and 0 if t < −1, i.e., a "hard" sigmoid that saturates exactly at t = ±1.
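
This two-class equivalence can be checked numerically (again a sketch of ours, reusing the sparsemax helper sketched in §2.2): the first sparsemax coordinate of (t, 0) coincides with clip((t + 1)/2, 0, 1), which saturates exactly at t = ±1 instead of only asymptotically like σ(t).

    import numpy as np

    def hard_sigmoid(t):
        """Closed form of sparsemax_1([t, 0]): clip (t + 1)/2 to the interval [0, 1]."""
        return np.clip((t + 1.0) / 2.0, 0.0, 1.0)

    for t in (-2.5, -1.0, -0.3, 0.0, 0.7, 1.0, 3.0):
        assert abs(sparsemax([t, 0.0])[0] - hard_sigmoid(t)) < 1e-12   # agreement for all t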