arXiv:1702.08591

The Shattered Gradients Problem:
If resnets are the answer, then what is the question?

David Balduzzi¹, Marcus Frean¹, Lennox Leary¹, JP Lewis¹ ², Kurt Wan-Duo Ma¹, Brian McWilliams³
Abstract
A long-standing obstacle to progress in deep learning is the problem of vanishing and exploding gradients. Although the problem has largely been overcome via carefully constructed initializations and batch normalization, architectures incorporating skip-connections such as highway nets and resnets perform much better than standard feedforward architectures despite well-chosen initialization and batch normalization. In this paper, we identify the shattered gradients problem. Specifically, we show that the correlation between gradients in standard feedforward networks decays exponentially with depth, resulting in gradients that resemble white noise, whereas, in contrast, the gradients in architectures with skip-connections are far more resistant to shattering, decaying sublinearly. Detailed empirical evidence is presented in support of the analysis, on both fully-connected networks and convnets. Finally, we present a new "looks linear" (LL) initialization that prevents shattering, with preliminary experiments showing the new initialization allows training of very deep networks without the addition of skip-connections.
1. Introduction
Deep neural networks have achieved outstanding performance (Krizhevsky et al., 2012). Reducing the tendency of gradients to vanish or explode with depth (Hochreiter, 1991; Bengio et al., 1994) has been essential to this progress.
*Equal contribution. ¹Victoria University of Wellington, New Zealand. ²SEED, Electronic Arts. ³Disney Research, Zürich, Switzerland. Correspondence to: David Balduzzi <dbalduzzi@gmail.com>, Brian McWilliams <brian@disneyresearch.com>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Combining careful initialization (Glorot & Bengio, 2010; He et al., 2015) with batch normalization (Ioffe & Szegedy,
2015) bakes two solutions to the vanishing/exploding gra-
dient problem into a single architecture. The He initial-
ization ensures variance is preserved across rectifier lay-
ers, and batch normalization ensures that backpropagation
through layers is unaffected by the scale of the weights
(Ioffe & Szegedy, 2015).
It is perhaps surprising then that residual networks (resnets)
still perform so much better than standard architectures
when networks are sufficiently deep (He et al., 2016a;b).
This raises the question: if resnets are the solution, then
what is the problem? We identify the shattered gradients
problem: a previously unnoticed difficulty with gradients
in deep rectifier networks that is orthogonal to vanishing
and exploding gradients. The shattered gradients problem
is that, as depth increases, gradients in standard feedfor-
ward networks increasingly resemble white noise. Resnets
dramatically reduce the tendency of gradients to shatter.
Our analysis applies at initialization. Shattering should de-
crease during training. Understanding how shattering af-
fects training is an important open problem.
Terminology.We refer to networks without skip con-
nections as feedforward nets—in contrast to residual nets
(resnets) and highway nets. We distinguish between the
real-valued output of a rectifier and its binary activation:
the activation is 1 if the output is positive and 0 otherwise.
1.1. The Shattered Gradients Problem
The first step is to simply look at the gradients of neural net-
works. Gradients are averaged over minibatches, depend
on both the loss and the random sample from the data, and
are extremely high-dimensional, which introduces multiple
confounding factors and makes visualization difficult (but
see section). We therefore construct a minimal model
designed to eliminate these confounding factors. The min-
imal model is a neural network $f_W:\mathbb{R}\to\mathbb{R}$ taking scalars
to scalars; each hidden layer contains $N=200$ rectifier
neurons. The model is not intended to be applied to real
data. Rather, it is a laboratory where gradients can be iso-
lated and investigated.
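To make the setup concrete, here is a minimal sketch of such a scalar-to-scalar rectifier network, assuming NumPy and a He-style initialization (variance $2/N$, as used in the analysis of section 3); the mean-centering and batch normalization used for the deeper nets in figure 1 are omitted for brevity, and the function and parameter names are ours, not the authors'.

import numpy as np

def minimal_model_gradients(depth=24, N=200, M=256, seed=0):
    """Sketch: d f_W / dx of a scalar-to-scalar rectifier net on a 1-d grid."""
    rng = np.random.default_rng(seed)
    xs = np.linspace(-2.0, 2.0, M)                      # the M "data points"
    W_in = rng.normal(0.0, np.sqrt(2.0), size=(N, 1))   # fan-in 1 for the scalar input
    Ws = [rng.normal(0.0, np.sqrt(2.0 / N), size=(N, N)) for _ in range(depth - 1)]
    w_out = rng.normal(0.0, np.sqrt(2.0 / N), size=(1, N))

    grads = np.empty(M)
    for i, x in enumerate(xs):
        h = W_in * x                   # pre-activations of the first hidden layer, (N, 1)
        jac = W_in.copy()              # d(pre-activations)/dx, accumulated layer by layer
        for W in Ws:
            m = (h > 0).astype(float)  # rectifier activation pattern
            jac = W @ (m * jac)
            h = W @ (m * h)
        m = (h > 0).astype(float)
        grads[i] = float(w_out @ (m * jac))   # d f_W / dx at this grid point
    return xs, grads

Plotting grads against xs for increasing depth should reproduce, qualitatively, the whitening visible in figure 1.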
We are interested in how the gradient varies, at initialization, as a function of the input:
\[
\frac{d f_W}{dx}\big(x^{(i)}\big) \quad\text{where } x^{(i)}\in[-2,2] \text{ lies in a 1-dim grid of } M=256 \text{ ``data points''}. \tag{1}
\]

Figure 1: Comparison between noise and gradients of rectifier nets with 200 neurons per hidden layer. Columns a–c: gradients of neural nets plotted for inputs taken from a uniform grid; (a) is a 1-layer feedforward net, the 24-layer net uses mean-centering, and the 50-layer net uses batch normalization with $\beta = 0.1$, see Eq. (2). Columns d–e: brown and white noise. Bottom row: absolute covariance matrices.
Updates during training depend on derivatives with respect to weights, not inputs. Our results are relevant because, by the chain rule, $\frac{\partial f_W}{\partial w_{ij}} = \frac{\partial f_W}{\partial n_j}\cdot\frac{\partial n_j}{\partial w_{ij}}$. Weight updates thus depend on $\frac{\partial f_W}{\partial n_j}$, i.e. on how the output of the network varies with the output of neurons in one layer (which are just inputs to the next layer).
The top row of figure 1 shows $\frac{d f_W}{dx}(x^{(i)})$ for each point $x^{(i)}$ in the 1-dim grid. The bottom row shows the (absolute value of the) covariance matrix $|(\mathbf{g}-\bar g)(\mathbf{g}-\bar g)^{\top}|/\sigma^2_{g}$, where $\mathbf{g}$ is the 256-vector of gradients, $\bar g$ the mean, and $\sigma^2_{g}$ the variance.
If all the neurons were linear then the gradient would be
a horizontal line (i.e. the gradient would be constant as a
function of $x$). Rectifiers are not smooth, so the gradients
are discontinuous.
Gradients of shallow networks resemble brown noise.
Suppose the network has a single hidden layer: $f_{w,b}(x) = w^{\top}\sigma(xv - b)$. Following Glorot & Bengio (2010), weights $w$ and biases $b$ are sampled from $\mathcal N(0,\sigma^2)$ with $\sigma^2=\frac{1}{N}$. Set $v=(1,\dots,1)$.
Figure 1 shows this gradient for $x\in[-2,2]$ and its covariance matrix, alongside a discrete approximation to brownian motion: $B_N(t)=\sum_{s=1}^{t}W_s$ where $W_s\sim\mathcal N(0,\frac{1}{N})$. The plots are strikingly similar: both clearly exhibit spatial covariance structure. The resemblance is not coincidental: Donsker's theorem can be used to show that the gradient converges to brownian motion as $N\to\infty$ (see the supplementary material).
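As a quick illustration (a sketch, not the authors' code), the gradient of this one-hidden-layer network and the discrete Brownian motion $B_N$ can be sampled directly; note that with biases drawn from $\mathcal N(0,\frac{1}{N})$ the switch points $b_n$ concentrate near zero, so the random-walk structure of the gradient is concentrated in that part of the grid.

import numpy as np

N, M = 200, 256
rng = np.random.default_rng(0)
w = rng.normal(0.0, np.sqrt(1.0 / N), size=N)   # output weights
b = rng.normal(0.0, np.sqrt(1.0 / N), size=N)   # biases
xs = np.linspace(-2.0, 2.0, M)

# f(x) = w^T relu(x*1 - b), so df/dx = sum_n w_n * 1[x > b_n]:
# the gradient takes a step of size w_n each time x crosses a switch point b_n.
grad = np.array([(w * (x > b)).sum() for x in xs])

# Discrete approximation to Brownian motion: B_N(t) = sum_{s<=t} W_s, W_s ~ N(0, 1/N).
brownian = np.cumsum(rng.normal(0.0, np.sqrt(1.0 / N), size=M))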
Gradients of deep networks resemble white noise. Figure 1 also shows the gradient of a deep (24- and 50-layer) rectifier network alongside white noise, i.e. i.i.d. samples $W_k\sim\mathcal N(0,1)$. Again, the plots are strikingly similar.
Since the inputs lie on a 1-dim grid, it makes sense to compute the autocorrelation function (ACF) of the gradient. Figure 2 compares the ACF of the gradients of feedforward networks of different depth with white and brown noise. The ACF for shallow networks resembles the ACF of brown noise. As the network gets deeper, the ACF quickly comes to resemble that of white noise.
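The ACF comparison itself takes only a few lines (a sketch; acf is a helper we define here, not from the paper), applied to the gradient vectors from the minimal model above and to white and brown noise of the same length.

import numpy as np

def acf(signal, max_lag=15):
    """Sample autocorrelation of a 1-d signal at lags 1..max_lag."""
    s = signal - signal.mean()
    denom = float(np.dot(s, s))
    return np.array([np.dot(s[:-k], s[k:]) / denom for k in range(1, max_lag + 1)])

rng = np.random.default_rng(0)
white = rng.normal(size=256)
brown = np.cumsum(rng.normal(size=256))
# acf(grads) for a shallow net should look like acf(brown);
# for a deep feedforward net it should look like acf(white).
print(acf(brown)[:5], acf(white)[:5])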
Theorem 1 below shows that correlations between gradients decrease exponentially (as $\frac{1}{2^L}$) with depth in feedforward rectifier networks.
Training is difficult when gradients behave like white noise. The shattered gradients problem is that the spatial
structure of gradients is progressively obliterated as neural
nets deepen. The problem is clearly visible when inputs
are taken from a one-dimensional grid, but is difcult to
observe when inputs are randomly sampled from a high-
dimensional dataset.
Shattered gradients undermine the effectiveness of algo-
rithms that assume gradients at nearby points are sim-
ilar such as momentum-based and accelerated methods
(Sutskever et al., 2013). If $\frac{d f_W}{d n_j}$ be-
haves like white noise, then a neuron's effect on the output of the network (whether increasing weights causes the network to output more or less) becomes extremely unstable, making learning difficult.

Figure 2: Autocorrelation Function (ACF). Comparison of the ACF between white and brown noise, and feedforward nets and resnets of different depths (2, 4, 10, 24, 50). Average over 20 runs. Panels: (a) feedforward nets; (b) resnets ($\beta=1.0$); (c) resnets ($\beta=0.1$); (d) white and brown noise.
Gradients of deep resnets lie in between brown and white noise. Introducing skip-connections allows much deeper networks to be trained (Srivastava et al., 2015; He et al., 2016a). Skip-connections significantly change the correlation structure of gradients. Figure 2b shows the ACF for resnets of different depths, which has markedly more structure than that of the equivalent feedforward nets (figure 2a). Although the gradients become progressively less structured with depth, they do not whiten to the extent of the gradients in standard feedforward networks: there are still correlations in the 50-layer resnet, whereas in the equivalent feedforward net the gradients are indistinguishable from white noise. Figure 2c shows the effect of the recently proposed $\beta$-rescaling (Szegedy et al., 2016): the ACFs of even the 50-layer network resemble brown noise.
Theorem 3 shows that correlations between gradients decay sublinearly with depth ($\frac{1}{\sqrt{L}}$) for resnets with batch normalization. We also show (corollary 1) that modified highway networks (where the gates are scalars) can achieve a depth-independent correlation structure on gradients. The analysis explains why skip-connections, combined with suitable rescaling, preserve the structure of gradients.
1.2. Outline
Section 2 presents observations on how batch normalization affects rectifier efficiency. We explore how batch normalization behaves differently in feedforward nets and resnets, and draw out facts that are relevant to the main results.
The main results are in section 3. They explain why gradients shatter and how skip-connections reduce shattering. The proofs are for a mathematically amenable model: fully-connected rectifier networks with the same number of hidden neurons in each layer. Section 4 presents empirical results which show gradients similarly shatter in convnets on real data. It also shows that shattering causes average gradients over minibatches to decrease with depth (relative to the average variance of gradients).
Finally, section 5 presents a new initialization (the "looks linear" or LL initialization) which eliminates shattering. Preliminary experiments show the LL-init allows training of extremely deep networks (200 layers) without skip-connections.
1.3. Related work
Carefully initializing neural networks has led to a series of performance breakthroughs dating back (at least) to the unsupervised pretraining of Hinton et al. (2006). The insight of Glorot & Bengio (2010) is that controlling the variance of the distributions from which weights are sampled makes it possible to control how layers progressively amplify or dampen the variance of activations and error signals. More recently, He et al. (2015) refined the approach to take rectifiers into account. Rectifiers effectively halve the variance since, at initialization and on average, they are active for half their inputs. Orthogonalizing weight matrices can yield further improvements, albeit at a computational cost (Saxe et al., 2014). The observation that the norms of weights form a random walk was used by Sussillo & Abbott (2015) to tune the gains of neurons.
In short, it has proven useful to treat weights and gradients
as random variables, and carefully examine their effect on
the variance of the signals propagated through the network.
This paper presents a more detailed analysis that considers
correlations between gradients at different datapoints.
The closest work to ours is Veit et al. (2016), which shows resnets behave like ensembles of shallow networks. We provide a more detailed analysis of the effect of skip-connections on gradients. A recent paper showed resnets have universal finite-sample expressivity and may lack spurious local optima (Hardt & Ma, 2017), but does not explain why deep feedforward nets are harder to train than resnets. An interesting hypothesis is that skip-connections improve performance by breaking symmetries (Orhan, 2017).

Figure 3: Activations and coactivations in feedforward networks. Plots are averaged over 100 fully connected rectifier networks with 100 hidden units per layer. Without BN: solid. With BN: dotted. Activations (green): proportion of inputs for which neurons in a given layer are active, on average. Coactivations (blue): proportion of distinct pairs of inputs for which neurons are co-active, on average.
2. Observations on batch normalization
Batch normalization was introduced to reduce covariate
shift (Ioffe & Szegedy, 2015). However, it has other ef-
fects that are less well-known – and directly impact the
correlation structure of gradients. We investigate the effect
of batch normalization on neuronal activity at initialization
(i.e. when it mean-centers and rescales to unit variance).
We first investigate batch normalization's effect on neural activations. Neurons are active for half their inputs on average (figure 3), with or without batch normalization. Figure 3 also shows how often neurons are co-active for two inputs. With batch normalization, neurons are co-active for $\frac{1}{4}$ of
distinct pairs of inputs, which is what would happen if acti-
vations were decided by unbiased coin flips. Without batch
normalization, the co-active proportion climbs with depth,
suggesting neuronal responses are increasingly redundant.
Resnets with batch normalization behave the same as feed-
forward nets (not shown).
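These proportions are straightforward to measure; the following sketch (our code, with hypothetical names, using random Gaussian inputs rather than a real dataset) estimates the per-layer activation and coactivation proportions with and without batch normalization at initialization.

import numpy as np

def activation_stats(depth=50, N=100, n_inputs=64, batch_norm=True, seed=0):
    """Per-layer proportion of active units and of co-active distinct input pairs."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(n_inputs, N))                 # one random input per row
    act, coact = [], []
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(2.0 / N), size=(N, N))
        z = h @ W.T
        if batch_norm:                                  # mean-centre and rescale to unit variance
            z = (z - z.mean(axis=0)) / (z.std(axis=0) + 1e-8)
        a = z > 0
        act.append(a.mean())                            # fraction of (input, unit) pairs that are active
        k = a.sum(axis=0).astype(float)                 # per unit: number of inputs it is active for
        coact.append((k * (k - 1) / (n_inputs * (n_inputs - 1))).mean())
        h = np.maximum(z, 0.0)
    return np.array(act), np.array(coact)

# With batch_norm=True both curves should hover near 1/2 and 1/4 respectively;
# without it, the coactivation proportion should climb with depth, as in figure 3.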
Figure 4 shows that looking only at the proportion of inputs causing neurons to be active on average is misleading. The distribution becomes increasingly bimodal with depth. In particular, neurons are either always active or always inactive at layer 50 in the feedforward net without batch normalization (blue histogram in figure 4a). Batch normalization causes most neurons to be active for half the inputs (blue histograms in figures 4b,c).
Neurons that are always active may as well be linear. Neu-
rons that are always inactive may as well not exist. It fol-
lows that batch normalization increases the efficiency with which rectifier nonlinearities are utilized.
The increased efficiency comes at a price. The raster plot for feedforward networks resembles static television noise: the spatial structure is obliterated. Resnets (figure 4) exhibit a compromise where neurons are utilized efficiently but the spatial structure is also somewhat preserved. The preservation of spatial structure is quantified via the contiguity histograms, which count long runs of consistent activation. Resnets maintain a broad distribution of contiguity even with deep networks, whereas batch normalization on feedforward nets shatters these into small sections.
3. Analysis
This section analyzes the correlation structure of gradients
in neural nets at initialization. The main ideas and results are presented here; the details are provided in the supplementary material.
Perhaps the simplest way to probe the structure of a ran-
dom process is to measure the first few moments: the mean, variance and covariance. We investigate how the correlation between typical datapoints (defined below) changes with network structure and depth. Weaker correlations correspond to whiter gradients. The analysis is for fully-connected networks. Extending to convnets involves (significant) additional bookkeeping.
Proof strategy. The covariance defines an inner product on the vector space of real-valued random variables with mean zero and finite second moment. It was shown in Balduzzi et al. (2015) that the gradients in neural nets are sums of path-weights over active paths (see the supplementary material). The first step is to observe that path-weights are orthogonal with respect to the variance inner product. To express gradients as linear combinations of path-weights is thus to express them over an orthogonal basis.

Working in the path-weight basis reduces computing the covariance between gradients at different datapoints to counting the number of co-active paths through the network. The second step is to count co-active paths and adjust for rescaling factors (e.g. due to batch normalization).
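A back-of-the-envelope version of this counting, under Assumption 1 below (a sketch of the intuition, not the formal proof): a neuron separated from the output by $L$ layers has roughly $N^L$ paths to the output, each path-weight has variance $(2/N)^L$ under the He initialization, and a path is active for one typical datapoint with probability $(1/2)^L$ and co-active for two with probability $(1/4)^L$, so
\[
C^{\mathrm{fnn}}(i) \approx N^{L}\Big(\tfrac{2}{N}\Big)^{L}\Big(\tfrac{1}{2}\Big)^{L} = 1,
\qquad
C^{\mathrm{fnn}}(i,j) \approx N^{L}\Big(\tfrac{2}{N}\Big)^{L}\Big(\tfrac{1}{4}\Big)^{L} = \tfrac{1}{2^{L}},
\]
which is exactly the statement of Theorem 1 below.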
The following assumption is crucial to the analysis:
Assumption 1 (typical datapoints). We say $x^{(i)}$ and $x^{(j)}$ are typical datapoints if half of the neurons per layer are active for each and a quarter per layer are co-active for both. We assume all pairs of datapoints are typical.
The assumption will not hold for every pair of datapoints. Figure 3 shows that it holds on average under batch normalization, for both activations and coactivations. The initialization in He et al. (2015) assumes datapoints activate half the neurons per layer. The assumption on co-activations is implied by (and so weaker than) the assumption in Choromanska et al. (2015) that activations are Bernoulli random variables independent of the inputs.

Figure 4: Activation of rectifiers in deep networks (depths 2, 10 and 50; panel (a): feedforward net without batch norm). Raster-plots: activations of hidden units (y-axis) for inputs indexed by the x-axis. Left histogram (activation per unit): distribution of average activation levels per neuron. Right histogram (contiguity): distribution of "contiguity" (length of contiguous sequences of 0 or 1) along rows in the raster plot.
Correlations between gradients. Weight updates in a neural network are proportional to
\[
\Delta w_{jk} \;\propto\; \sum_{i=1}^{\#\mathrm{mb}} \sum_{p=1}^{P} \left(\frac{\partial \ell}{\partial f_p}\,\frac{\partial f_p}{\partial n_k}\,\frac{\partial n_k}{\partial w_{jk}}\right)\!\big(x^{(i)}\big),
\]
where $f_p$ is the $p^{\text{th}}$ coordinate of the output of the network and $n_k$ is the output of the $k^{\text{th}}$ neuron. The derivatives $\frac{\partial\ell}{\partial f_p}$ and $\frac{\partial n_k}{\partial w_{jk}}$ do not depend on the network's internal structure. We are interested in the middle term $\frac{\partial f_p}{\partial n_k}$, which does. It is mathematically convenient to work with the sum $\sum_{p=1}^{P} f_p$ over output coordinates of the network. Section 4 shows that our results hold for convnets on real data with the cross-entropy loss. See also the remark in the supplementary material.
Definition 1. Let $r_i := \sum_{p=1}^{P}\frac{\partial f_p}{\partial n}(x^{(i)})$ be the derivative with respect to neuron $n$ given input $x^{(i)}\in\mathcal D$. For each input $x^{(i)}$, the derivative $r_i$ is a real-valued random variable. It has mean zero since weights are sampled from distributions with mean zero. Denote the covariance and correlation of gradients by
\[
C(i,j) = \mathbb E[r_i r_j] \quad\text{and}\quad R(i,j) = \frac{\mathbb E[r_i r_j]}{\sqrt{\mathbb E[r_i^2]\,\mathbb E[r_j^2]}},
\]
where the expectations are w.r.t. the distribution on weights.
3.1. Feedforward networks
Without loss of generality, pick a neuron $n$ separated from the output by $L$ layers. The first major result is

Theorem 1 (covariance of gradients in feedforward nets). Suppose weights are initialized with variance $\sigma^2=\frac{2}{N}$ following He et al. (2015). Then

a) the variance of the gradient at $x^{(i)}$ is $C^{\mathrm{fnn}}(i) = 1$;

b) the covariance is $C^{\mathrm{fnn}}(i,j) = \frac{1}{2^{L}}$.
Part (a) recovers the observation in He et al. (2015) that setting $\sigma^2=\frac{2}{N}$ preserves the variance across layers in rectifier networks. Part (b) is new. It explains the empirical observation, figure 1, that gradients in feedforward nets whiten with depth. Intuitively, gradients whiten because the number of paths through the network grows exponentially faster with depth than the fraction of co-active paths; see the supplementary material.
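Theorem 1(b) can be sanity-checked numerically (a rough Monte Carlo sketch with our own setup: the derivative of the summed outputs with respect to one input coordinate plays the role of $r_i$, and the estimate is noisy at the level of $1/\sqrt{\text{trials}}$, so only small depths are informative).

import numpy as np

def gradient_correlation(L, N=100, trials=300, seed=0):
    """Monte Carlo estimate (over weight draws) of the correlation between
    d(sum of outputs)/dx_1 at two fixed inputs, for an L-layer rectifier net."""
    rng = np.random.default_rng(seed)
    x1, x2 = rng.normal(size=(2, N))
    g = np.zeros((2, trials))
    for t in range(trials):
        Ws = [rng.normal(0.0, np.sqrt(2.0 / N), size=(N, N)) for _ in range(L)]
        w_out = rng.normal(0.0, np.sqrt(2.0 / N), size=N)
        for which, x in enumerate((x1, x2)):
            h, jac = x, np.eye(N)
            for W in Ws:
                z = W @ h
                m = (z > 0).astype(float)
                jac = (m[:, None] * W) @ jac      # d(hidden)/d(input), layer by layer
                h = m * z
            g[which, t] = w_out @ jac[:, 0]       # derivative w.r.t. the first input coordinate
    return np.corrcoef(g)[0, 1]

for L in (1, 2, 4, 6):
    print(L, round(gradient_correlation(L), 3), "vs", 0.5 ** L)   # estimate vs predicted 1/2^L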
3.2. Residual networks
The residual modules introduced in He et al. (2016a) are
\[
x_l = x_{l-1} + W^{l}\,\sigma_{\mathrm{BN}}\!\big(V^{l}\,\sigma_{\mathrm{BN}}(x_{l-1})\big)
\]
where $\sigma_{\mathrm{BN}}(a) = \sigma(\mathrm{BN}(a))$ and $\sigma(a)=\max(0,a)$ is the rectifier. We analyse the stripped-down variant
\[
x_l = \alpha\big(x_{l-1} + \beta\, W^{l}\,\sigma_{\mathrm{BN}}(x_{l-1})\big) \tag{2}
\]
where $\alpha$ and $\beta$ are rescaling factors. Dropping $V^{l}\,\sigma_{\mathrm{BN}}$ makes no essential difference to the analysis. The $\beta$-rescaling was introduced in Szegedy et al. (2016), where it was observed that setting $\beta\in[0.1,0.3]$ reduces instability. We include $\alpha$ for reasons of symmetry.
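Extending the scalar minimal model above to Eq. (2) makes the contrast with figure 1 easy to reproduce (a sketch under our reading of Eq. (2), not the authors' code; the backward pass treats the batch-normalization statistics as constants, which is an approximation).

import numpy as np

def resnet_gradients(depth=50, N=200, M=256, alpha=1.0, beta=0.1, use_bn=True, seed=0):
    """d f_W / dx on the 1-d grid for the stripped-down residual model of Eq. (2)."""
    rng = np.random.default_rng(seed)
    xs = np.linspace(-2.0, 2.0, M)
    W_in = rng.normal(0.0, 1.0, size=(N, 1))             # embed the scalar input
    Ws = [rng.normal(0.0, np.sqrt(2.0 / N), size=(N, N)) for _ in range(depth)]
    w_out = rng.normal(0.0, np.sqrt(2.0 / N), size=(1, N))

    H = W_in @ xs[None, :]           # (N, M) hidden state, one column per grid point
    J = np.tile(W_in, (1, M))        # (N, M) running Jacobian d x_l / dx, columnwise
    for W in Ws:
        if use_bn:                   # normalize each unit over the grid ("batch")
            mu = H.mean(axis=1, keepdims=True)
            sd = H.std(axis=1, keepdims=True) + 1e-8
            Z, Jz = (H - mu) / sd, J / sd                 # batch statistics treated as constants
        else:
            Z, Jz = H, J
        mask = (Z > 0).astype(float)
        H = alpha * (H + beta * (W @ (mask * Z)))         # x_l = alpha (x_{l-1} + beta W sigma_BN(x_{l-1}))
        J = alpha * (J + beta * (W @ (mask * Jz)))
    return xs, (w_out @ J).ravel()   # d f_W / dx at each grid point

With use_bn=False and alpha = beta = 1 the gradient variance should grow roughly like $2^L$, as in Theorem 2 below; with batch normalization and beta = 0.1 the gradient over the grid keeps visible spatial structure even at 50 layers.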
Theorem 2 (covariance of gradients in resnets). Consider a resnet with batch normalization disabled and $\alpha=\beta=1$. Suppose $\sigma^2=\frac{2}{N}$ as above. Then

a) the variance of the gradient at $x^{(i)}$ is $C^{\mathrm{res}}(i) = 2^{L}$;

b) the covariance is $C^{\mathrm{res}}(i,j) = \left(\frac{3}{2}\right)^{L}$.

The correlation is $R^{\mathrm{res}}(i,j) = \left(\frac{3}{4}\right)^{L}$.
The theorem implies there are two problems in resnets
without batch normalization: (i) the variance of gradients
grows and (ii) their correlation decays exponentially with
depth. Both problems are visible empirically.
3.3. Rescaling in Resnets
A solution to the exploding variance of resnets is to rescale layers by $\alpha=\frac{1}{\sqrt{2}}$, which yields
\[
C^{\mathrm{res}}_{\alpha=1/\sqrt{2}}(i) = 1 \quad\text{and}\quad R^{\mathrm{res}}_{\alpha=1/\sqrt{2}}(i,j) = \left(\tfrac{3}{4}\right)^{L},
\]
and so controls the variance, but the correlation between gradients still decays exponentially with depth. Both theoretical predictions hold empirically.
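One way to see where these per-layer factors come from (a sketch under the typical-datapoint assumption, not the paper's proof): with batch normalization disabled, each block of Eq. (2) multiplies the gradient variance by $\alpha^2(1+\beta^2)$ (identity path plus branch) and the covariance by $\alpha^2\big(1+\tfrac{\beta^2}{2}\big)$ (only a quarter of branch paths are co-active), so
\[
C^{\mathrm{res}}(i) = \big(\alpha^2(1+\beta^2)\big)^{L},
\qquad
C^{\mathrm{res}}(i,j) = \Big(\alpha^2\big(1+\tfrac{\beta^2}{2}\big)\Big)^{L}.
\]
Setting $\alpha=\beta=1$ recovers $2^{L}$ and $(3/2)^{L}$ from Theorem 2, while $\alpha=\frac{1}{\sqrt{2}}$, $\beta=1$ gives variance $1$ and correlation $(3/4)^{L}$ as above.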
In practice, $\alpha$-rescaling is not used. Instead, activations are rescaled by batch normalization (Ioffe & Szegedy, 2015) and, more recently, by setting $\beta\in[0.1,0.3]$ per Szegedy et al. (2016). The effect is dramatic:
Theorem 3 (covariance of gradients in resnets with BN and $\beta$-rescaling). Under the assumptions above, for resnets with batch normalization and $\beta$-rescaling,

a) $C^{\mathrm{res}}_{\beta,\mathrm{BN}}(i) = \beta^2(L-1) + 1$;

b)¹ the covariance $C^{\mathrm{res}}_{\beta,\mathrm{BN}}(i,j)$ grows on the order of $\sqrt{L}$; and the correlation $R^{\mathrm{res}}_{\beta,\mathrm{BN}}(i,j)$ decays on the order of $\frac{1}{\sqrt{L}}$.

¹See the supplementary material.
The theorem explains the empirical observation, figure 2, that gradients in resnets whiten much more slowly with depth than in feedforward nets. It also explains why setting $\beta$ near zero further reduces whitening.
Batch normalization changes the decay of the correlations from $\frac{1}{2^L}$ to $\frac{1}{\sqrt{L}}$. Intuitively, the reason is that the variance of the outputs of layers grows linearly, so batch normalization rescales them by different amounts. Rescaling by $\beta$ introduces a constant factor. Concretely, the model predicts that using batch normalization with $\beta=0.1$ on a 100-layer resnet gives typical correlation $R^{\mathrm{res}}_{0.1,\mathrm{BN}}(i,j) = 0.7$. Setting $\beta=1.0$ gives $R^{\mathrm{res}}_{1.0,\mathrm{BN}}(i,j) = 0.1$. By contrast, a 100-layer feedforward net has correlation indistinguishable from zero.
3.4. Highway networks
Highway networks can be thought of as a generalization of resnets; in fact they were introduced slightly earlier (Srivastava et al., 2015). The standard highway network has layers of the form
\[
x_l = \big(1 - T(x_{l-1})\big)\odot x_{l-1} + T(x_{l-1})\odot H(x_{l-1})
\]
where $T(\cdot)$ and $H(\cdot)$ are learned gates and features respectively. Consider the following modification, where $\alpha_1$ and $\alpha_2$ are scalars satisfying $\alpha_1^2+\alpha_2^2=1$:
\[
x_l = \alpha_1\, x_{l-1} + \alpha_2\, W^{l}\,\sigma(x_{l-1}).
\]
The module can be recovered by judiciously choosing $\alpha$ and $\beta$ in equation (2). However, it is worth studying in its own right:
Corollary 1 (covariance of gradients in highway networks). Under the assumptions above, for modified highway networks with $\alpha$-rescaling,

a) $C^{\mathrm{HN}}(i) = 1$; and

b) $R^{\mathrm{HN}}(i,j) = \left(\alpha_1^2 + \tfrac{1}{2}\alpha_2^2\right)^{L}$.

In particular, if $\alpha_1 = \sqrt{1-\frac{1}{L}}$ and $\alpha_2 = \sqrt{\frac{1}{L}}$ then the correlation between gradients does not decay with depth:
\[
\lim_{L\to\infty} R^{\mathrm{HN}}(i,j) = \frac{1}{\sqrt{e}}.
\]
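Filling in the short calculation behind the limit: with this choice of $\alpha_1,\alpha_2$,
\[
R^{\mathrm{HN}}(i,j) = \Big(1-\tfrac{1}{L}+\tfrac{1}{2L}\Big)^{L} = \Big(1-\tfrac{1}{2L}\Big)^{L} \;\longrightarrow\; e^{-1/2} = \tfrac{1}{\sqrt{e}} \quad\text{as } L\to\infty.
\]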
The tradeoff is that the contributions of the layers become increasingly trivial (i.e. close to the identity) as $L\to\infty$.
4. Gradients shatter in convnets
In this section we provide empirical evidence that the main
results also hold for deep convnets using the CIFAR-10
dataset. We instantiate feedforward nets and resnets with 2, 4, 10, 24 and 50 layers of equivalent size. Using a slight modification of the "bottleneck" architecture in He et al. (2016a), we introduce one skip-connection for every two convolutional layers; both network architectures use batch normalization.
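One plausible reading of this block structure, assuming PyTorch (the class and parameter names are ours, and this is a hypothetical sketch rather than the authors' architecture): one identity skip wrapped around two batch-normalized convolutional layers, with the residual branch scaled by $\beta$.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoConvResidualBlock(nn.Module):
    """One skip-connection for every two convolutional layers, with batch norm."""
    def __init__(self, channels, beta=0.1):
        super().__init__()
        self.beta = beta
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        h = self.conv1(F.relu(self.bn1(x)))
        h = self.conv2(F.relu(self.bn2(h)))
        return x + self.beta * h        # beta-rescaled residual branch, cf. Eq. (2)

block = TwoConvResidualBlock(16)
print(block(torch.randn(2, 16, 32, 32)).shape)   # torch.Size([2, 16, 32, 32])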
Figure 5 shows the gradients at the first layer of feedforward nets and resnets ($\beta=0.1$) with a minibatch of 256 random samples from CIFAR-10, for networks of depth 2 and 50. To highlight the spatial structure of the gradients, the indices of the minibatches were reordered according to a $k$-means clustering ($k=10$) applied to the gradients of the two-layer networks. The same per-
mutation is used for all networks within a row. The spatial
structure is visible in both two-layer networks, although it
is more apparent in the resnet. In the feedforward network
the structure quickly disappears with depth. In the resnet,
the structure remains apparent at 50 layers.
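The reordering itself is simple (a sketch with hypothetical variable names, assuming scikit-learn; grads_2layer would be the matrix of per-example first-layer gradients of the two-layer network).

import numpy as np
from sklearn.cluster import KMeans

def cluster_order(grads_2layer, k=10, seed=0):
    """Reorder minibatch indices by a k-means clustering of the shallow network's gradients."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(grads_2layer)
    return np.argsort(labels, kind="stable")

# order = cluster_order(grads_2layer)          # computed once, from the 2-layer nets
# reordered = grads_50layer[order]             # the same permutation is applied at every depth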