Extracted Text

1602.07360.pdf

Under review as a conference paper at ICLR 2017
SQUEEZENET: ALEXNET-LEVEL ACCURACY WITH
50X FEWER PARAMETERS AND <0.5MBMODEL SIZE
Forrest N. Iandola
1
, Song Han
2
, Matthew W. Moskewicz
1
, Khalid Ashraf
1
,
William J. Dally
2
, Kurt Keutzer
1
1
DeepScale

& UC Berkeley
2
Stanford University
fforresti, moskewcz, kashraf, keutzer g@eecs.berkeley.edu
fsonghan, dallyg@stanford.edu
ABSTRACT
Recent research on deep convolutional neural networks (CNNs) has focused pri-
marily on improving accuracy. For a given accuracy level, it is typically possi-
ble to identify multiple CNN architectures that achieve that accuracy level. With
equivalent accuracy, smaller CNN architectures offer at least three advantages: (1)
Smaller CNNs require less communication across servers during distributed train-
ing. (2) Smaller CNNs require less bandwidth to export a new model from the
cloud to an autonomous car. (3) Smaller CNNs are more feasible to deploy on FP-
GAs and other hardware with limited memory. To provide all of these advantages,
we propose a small CNN architecture called SqueezeNet. SqueezeNet achieves
AlexNet-level accuracy on ImageNet with 50x fewer parameters. Additionally,
with model compression techniques, we are able to compress SqueezeNet to less
than 0.5MB (510smaller than AlexNet).
The SqueezeNet architecture is available for download here:
https://github.com/DeepScale/SqueezeNet
1 INTRODUCTION AND MOTIVATION
Much of the recent research on deep convolutional neural networks (CNNs) has focused on increas-
ing accuracy on computer vision datasets. For a given accuracy level, there typically exist multiple
CNN architectures that achieve that accuracy level. Given equivalent accuracy, a CNN architecture
with fewer parameters has several advantages:
More efcient distributed training.Communication among servers is the limiting factor
to the scalability of distributed CNN training. For distributed data-parallel training, com-
munication overhead is directly proportional to the number of parameters in the model (Ian-
dola et al., 2016). In short, small models train faster due to requiring less communication.
Less overhead when exporting new models to clients.For autonomous driving, compa-
nies such as Tesla periodically copy new models from their servers to customers' cars. This
practice is often referred to as anover-the-airupdate. Consumer Reports has found that
the safety of Tesla'sAutopilotsemi-autonomous driving functionality has incrementally
improved with recent over-the-air updates (Consumer Reports, 2016). However, over-the-
air updates of today's typical CNN/DNN models can require large data transfers. With
AlexNet, this would require 240MB of communication from the server to the car. Smaller
models require less communication, making frequent updates more feasible.
Feasible FPGA and embedded deployment.FPGAs often have less than 10MB
1
of on-
chip memory and no off-chip memory or storage. For inference, a sufciently small model
could be stored directly on the FPGA instead of being bottlenecked by memory band-
width (Qiu et al., 2016), while video frames stream through the FPGA in real time. Further,
when deploying CNNs on Application-Specic Integrated Circuits (ASICs), a sufciently
small model could be stored directly on-chip, and smaller models may enable the ASIC to
t on a smaller die.

http://deepscale.ai
1
For example, the Xilinx Vertex-7 FPGA has a maximum of 8.5 MBytes (i.e. 68 Mbits) of on-chip memory
and does not provide off-chip memory.
1

Under review as a conference paper at ICLR 2017
As you can see, there are several advantages of smaller CNN architectures. With this in mind, we
focus directly on the problem of identifying a CNN architecture with fewer parameters but equivalent
accuracy compared to a well-known model. We have discovered such an architecture, which we call
SqueezeNet.In addition, we present our attempt at a more disciplined approach to searching the
design space for novel CNN architectures.
The rest of the paper is organized as follows. In Section 2 we review the related work. Then, in
Sections 3 and 4 we describe and evaluate the SqueezeNet architecture. After that, we turn our
attention to understanding how CNN architectural design choices impact model size and accuracy.
We gain this understanding by exploring the design space of SqueezeNet-like architectures. In
Section 5, we do design space exploration on theCNN microarchitecture, which we dene as the
organization and dimensionality of individual layers and modules. In Section 6, we do design space
exploration on theCNN macroarchitecture, which we dene as high-level organization of layers in
a CNN. Finally, we conclude in Section 7. In short, Sections 3 and 4 are useful for CNN researchers
as well as practitioners who simply want to apply SqueezeNet to a new application. The remaining
sections are aimed at advanced researchers who intend to design their own CNN architectures.
2 RELATEDWORK
2.1 MODELCOMPRESSION
The overarching goal of our work is to identify a model that has very few parameters while preserv-
ing accuracy. To address this problem, a sensible approach is to take an existing CNN model and
compress it in a lossy fashion. In fact, a research community has emerged around the topic ofmodel
compression, and several approaches have been reported. A fairly straightforward approach by Den-
tonet al. is to apply singular value decomposition (SVD) to a pretrained CNN model (Denton et al.,
2014). Hanet al. developed Network Pruning, which begins with a pretrained model, then replaces
parameters that are below a certain threshold with zeros to form a sparse matrix, and nally performs
a few iterations of training on the sparse CNN (Han et al., 2015b). Recently, Hanet al. extended their
work by combining Network Pruning with quantization (to 8 bits or less) and huffman encoding to
create an approach called Deep Compression (Han et al., 2015a), and further designed a hardware
accelerator called EIE (Han et al., 2016a) that operates directly on the compressed model, achieving
substantial speedups and energy savings.
2.2 CNN MICROARCHITECTURE
Convolutions have been used in articial neural networks for at least 25 years; LeCunet al. helped
to popularize CNNs for digit recognition applications in the late 1980s (LeCun et al., 1989). In
neural networks, convolution lters are typically 3D, with height, width, and channels as the key
dimensions. When applied to images, CNN lters typically have 3 channels in their rst layer (i.e.
RGB), and in each subsequent layerLithe lters have the same number of channels asLi1has
lters. The early work by LeCunet al. (LeCun et al., 1989) uses 5x5xChannels
2
lters, and the
recent VGG (Simonyan & Zisserman, 2014) architectures extensively use 3x3 lters. Models such
as Network-in-Network (Lin et al., 2013) and the GoogLeNet family of architectures (Szegedy et al.,
2014; Ioffe & Szegedy, 2015; Szegedy et al., 2015; 2016) use 1x1 lters in some layers.
With the trend of designing very deep CNNs, it becomes cumbersome to manually select lter di-
mensions for each layer. To address this, various higher level building blocks, ormodules, comprised
of multiple convolution layers with a specic xed organization have been proposed. For example,
the GoogLeNet papers proposeInception modules, which are comprised of a number of different di-
mensionalities of lters, usually including 1x1 and 3x3, plus sometimes 5x5 (Szegedy et al., 2014)
and sometimes 1x3 and 3x1 (Szegedy et al., 2015). Many such modules are then combined, perhaps
with additionalad-hoclayers, to form a complete network. We use the termCNN microarchitecture
to refer to the particular organization and dimensions of the individual modules.
2.3 CNN MACROARCHITECTURE
While the CNN microarchitecture refers to individual layers and modules, we dene theCNN
macroarchitectureas the system-level organization of multiple modules into an end-to-end CNN
architecture.
2
From now on, we will simply abbreviate HxWxChannels to HxW.
2

Under review as a conference paper at ICLR 2017
Perhaps the mostly widely studied CNN macroarchitecture topic in the recent literature is the impact
ofdepth(i.e. number of layers) in networks. Simoyan and Zisserman proposed the VGG (Simonyan
& Zisserman, 2014) family of CNNs with 12 to 19 layers and reported that deeper networks produce
higher accuracy on the ImageNet-1k dataset (Deng et al., 2009). K. Heet al. proposed deeper CNNs
with up to 30 layers that deliver even higher ImageNet accuracy (He et al., 2015a).
The choice of connections across multiple layers or modules is an emerging area of CNN macroar-
chitectural research. Residual Networks (ResNet) (He et al., 2015b) and Highway Networks (Sri-
vastava et al., 2015) each propose the use of connections that skip over multiple layers, for example
additively connecting the activations from layer 3 to the activations from layer 6. We refer to these
connections asbypass connections. The authors of ResNet provide an A/B comparison of a 34-layer
CNN with and without bypass connections; adding bypass connections delivers a 2 percentage-point
improvement on Top-5 ImageNet accuracy.
2.4 NEURALNETWORKDESIGNSPACEEXPLORATION
Neural networks (including deep and convolutional NNs) have a large design space, with numerous
options for microarchitectures, macroarchitectures, solvers, and other hyperparameters. It seems
natural that the community would want to gain intuition about how these factors impact a NN's
accuracy (i.e. theshapeof the design space). Much of the work on design space exploration (DSE)
of NNs has focused on developing automated approaches for nding NN architectures that deliver
higher accuracy. These automated DSE approaches include bayesian optimization (Snoek et al.,
2012), simulated annealing (Ludermir et al., 2006), randomized search (Bergstra & Bengio, 2012),
and genetic algorithms (Stanley & Miikkulainen, 2002). To their credit, each of these papers pro-
vides a case in which the proposed DSE approach produces a NN architecture that achieves higher
accuracy compared to a representative baseline. However, these papers make no attempt to provide
intuition about the shape of the NN design space. Later in this paper, we eschew automated ap-
proaches instead, we refactor CNNs in such a way that we can do principled A/B comparisons to
investigate how CNN architectural decisions inuence model size and accuracy.
In the following sections, we rst propose and evaluate the SqueezeNet architecture with and with-
out model compression. Then, we explore the impact of design choices in microarchitecture and
macroarchitecture for SqueezeNet-like CNN architectures.
3 SQUEEZENET:PRESERVING ACCURACY WITH FEW PARAMETERS
In this section, we begin by outlining our design strategies for CNN architectures with few param-
eters. Then, we introduce theFire module, our new building block out of which to build CNN
architectures. Finally, we use our design strategies to constructSqueezeNet, which is comprised
mainly of Fire modules.
3.1 ARCHITECTURAL DESIGNSTRATEGIES
Our overarching objective in this paper is to identify CNN architectures that have few parameters
while maintaining competitive accuracy. To achieve this, we employ three main strategies when
designing CNN architectures:
Strategy 1.Replace 3x3 lters with 1x1 lters.Given a budget of a certain number of convolution
lters, we will choose to make the majority of these lters 1x1, since a 1x1 lter has 9X fewer
parameters than a 3x3 lter.
Strategy 2.Decrease the number of input channels to 3x3 lters.Consider a convolution layer
that is comprised entirely of 3x3 lters. The total quantity of parameters in this layer is (number of
input channels) * (number of lters) * (3*3). So, to maintain a small total number of parameters
in a CNN, it is important not only to decrease the number of 3x3 lters (see Strategy 1 above), but
also to decrease the number ofinput channelsto the 3x3 lters. We decrease the number of input
channels to 3x3 lters usingsqueeze layers, which we describe in the next section.
Strategy 3.Downsample late in the network so that convolution layers have large activation
maps.In a convolutional network, each convolution layer produces an output activation map with
a spatial resolution that is at least 1x1 and often much larger than 1x1. The height and width of
these activation maps are controlled by: (1) the size of the input data (e.g. 256x256 images) and (2)
3

Under review as a conference paper at ICLR 2017
Figure 1: Microarchitectural view: Organization of convolution lters in the Fire module . In this
example, s
1 x 1
= 3 , e
1 x 1
= 4 , and e
3 x 3
= 4 . We illustrate the convolution lters but not the
activations.
the choice of layers in which to downsample in the CNN architecture. Most commonly, downsam-
pling is engineered into CNN architectures by setting the (stride > 1) in some of the convolution or
pooling layers (e.g. (Szegedy et al., 2014; Simonyan & Zisserman, 2014; Krizhevsky et al., 2012)).
If early
3
layers in the network have large strides, then most layers will have small activation maps.
Conversely, if most layers in the network have a stride of 1, and the strides greater than 1 are con-
centrated toward the end
4
of the network, then many layers in the network will have large activation
maps. Our intuition is that large activation maps (due to delayed downsampling) can lead to higher
classication accuracy, with all else held equal. Indeed, K. He and H. Sun applied delayed down-
sampling to four different CNN architectures, and in each case delayed downsampling led to higher
classication accuracy (He & Sun, 2015).
Strategies 1 and 2 are about judiciously decreasing the quantity of parameters in a CNN while
attempting to preserve accuracy. Strategy 3 is about maximizing accuracy on a limited budget of
parameters. Next, we describe the Fire module, which is our building block for CNN architectures
that enables us to successfully employ Strategies 1, 2, and 3.
3 . 2 T H E F I R E M O D U L E
We dene the Fire module as follows. A Fire module is comprised of: a squeeze convolution layer
(which has only 1x1 lters), feeding into an expand layer that has a mix of 1x1 and 3x3 convolution
lters; we illustrate this in Figure 1. The liberal use of 1x1 lters in Fire modules is an application
of Strategy 1 from Section 3.1. We expose three tunable dimensions (hyperparameters) in a Fire
module: s
1 x 1
, e
1 x 1
, and e
3 x 3
. In a Fire module, s
1 x 1
is the number of lters in the squeeze layer
(all 1x1), e
1 x 1
is the number of 1x1 lters in the expand layer, and e
3 x 3
is the number of 3x3 lters
in the expand layer. When we use Fire modules we set s
1 x 1
to be less than ( e
1 x 1
+ e
3 x 3
), so the
squeeze layer helps to limit the number of input channels to the 3x3 lters, as per Strategy 2 from
Section 3.1.
3 . 3 T H E S Q U E E Z E N E T A R C H I T E C T U R E
We now describe the SqueezeNet CNN architecture. We illustrate in Figure 2 that SqueezeNet
begins with a standalone convolution layer (conv1), followed by 8 Fire modules (re2-9), ending
with a nal conv layer (conv10). We gradually increase the number of lters per re module from
the beginning to the end of the network. SqueezeNet performs max-pooling with a stride of 2 after
layers conv1, re4, re8, and conv10; these relatively late placements of pooling are per Strategy 3
from Section 3.1. We present the full SqueezeNet architecture in Table 1.
3
In our terminology, an early layer is close to the input data.
4
In our terminology, the end of the network is the classier.
4

Under review as a conference paper at ICLR 2017
ported the SqueezeNet CNN architecture for compatibility with a number of other CNN software
frameworks:
MXNet (Chen et al., 2015a) port of SqueezeNet: (Haria, 2016)
Chainer (Tokui et al., 2015) port of SqueezeNet: (Bell, 2016)
Keras (Chollet, 2016) port of SqueezeNet: (DT42, 2016)
Torch (Collobert et al., 2011) port of SqueezeNet's Fire Modules: (Waghmare, 2016)
4 EVALUATION OFSQUEEZENET
We now turn our attention to evaluating SqueezeNet. In each of the CNN model compression papers
reviewed in Section 2.1, the goal was to compress an AlexNet (Krizhevsky et al., 2012) model
that was trained to classify images using the ImageNet (Deng et al., 2009) (ILSVRC 2012) dataset.
Therefore, we use AlexNet
5
and the associated model compression results as a basis for comparison
when evaluating SqueezeNet.
Table 1: SqueezeNet architectural dimensions. (The formatting of this table was inspired by the
Inception2 paper (Ioffe & Szegedy, 2015).)
In Table 2, we review SqueezeNet in the context of recent model compression results. The SVD-
based approach is able to compress a pretrained AlexNet model by a factor of 5x, while diminishing
top-1 accuracy to 56.0% (Denton et al., 2014). Network Pruning achieves a 9x reduction in model
size while maintaining the baseline of 57.2% top-1 and 80.3% top-5 accuracy on ImageNet (Han
et al., 2015b). Deep Compression achieves a 35x reduction in model size while still maintaining the
baseline accuracy level (Han et al., 2015a). Now, with SqueezeNet, we achieve a 50X reduction in
model size compared to AlexNet, whilemeeting or exceeding the top-1 and top-5 accuracy of
AlexNet.We summarize all of the aforementioned results in Table 2.
It appears that we have surpassed the state-of-the-art results from the model compression commu-
nity: even when using uncompressed 32-bit values to represent the model, SqueezeNet has a1:4
smaller model size than the best efforts from the model compression community while maintain-
ing or exceeding the baseline accuracy. Until now, an open question has been:are small models
amenable to compression, or do small models need all of the representational power afforded
by dense oating-point values?To nd out, we applied Deep Compression (Han et al., 2015a)
5
Our baseline isbvlcalexnetfrom the Caffe codebase (Jia et al., 2014).
6

Under review as a conference paper at ICLR 2017
Table 2: Comparing SqueezeNet to model compression approaches. Bymodel size, we mean the
number of bytes required to store all of the parameters in the trained model.
CNN architectureCompression ApproachData
Type
Original!
Compressed Model
Size
Reduction in
Model Size
vs. AlexNet
Top-1
ImageNet
Accuracy
Top-5
ImageNet
Accuracy
AlexNet None (baseline) 32 bit 240MB 1x 57.2% 80.3%
AlexNet SVD (Denton et al.,
2014)
32 bit 240MB!48MB 5x 56.0% 79.4%
AlexNet Network Pruning (Han
et al., 2015b)
32 bit 240MB!27MB 9x 57.2% 80.3%
AlexNet Deep
Compression (Han
et al., 2015a)
5-8 bit240MB!6.9MB 35x 57.2% 80.3%
SqueezeNet (ours) None 32 bit 4.8MB 50x 57.5% 80.3%
SqueezeNet (ours) Deep Compression 8 bit 4.8MB!0.66MB 363x 57.5% 80.3%
SqueezeNet (ours) Deep Compression 6 bit 4.8MB!0.47MB 510x 57.5% 80.3%
to SqueezeNet, using 33% sparsity
6
and 8-bit quantization. This yields a 0.66 MB model (363
smaller than 32-bit AlexNet) with equivalent accuracy to AlexNet. Further, applying Deep Compres-
sion with 6-bit quantization and 33% sparsity on SqueezeNet, we produce a 0.47MB model (510
smaller than 32-bit AlexNet) with equivalent accuracy.Our small model is indeed amenable to
compression.
In addition, these results demonstrate that Deep Compression (Han et al., 2015a) not only works
well on CNN architectures with many parameters (e.g. AlexNet and VGG), but it is also able to
compress the already compact, fully convolutional SqueezeNet architecture. Deep Compression
compressed SqueezeNet by10while preserving the baseline accuracy. In summary: by combin-
ing CNN architectural innovation (SqueezeNet) with state-of-the-art compression techniques (Deep
Compression), we achieved a510reduction in model size with no decrease in accuracy compared
to the baseline.
Finally, note that Deep Compression (Han et al., 2015b) uses acodebookas part of its scheme for
quantizing CNN parameters to 6- or 8-bits of precision. Therefore, on most commodity processors,
it isnottrivial to achieve a speedup of
32
8
= 4xwith 8-bit quantization or
32
6
= 5:3xwith 6-bit
quantization using the scheme developed in Deep Compression. However, Hanet al. developed
custom hardware Efcient Inference Engine (EIE) that can compute codebook-quantized CNNs
more efciently (Han et al., 2016a). In addition, in the months since we released SqueezeNet,
P. Gysel developed a strategy calledRistrettofor linearly quantizing SqueezeNet to 8 bits (Gysel,
2016). Specically, Ristretto does computation in 8 bits, and it stores parameters and activations in
8-bit data types. Using the Ristretto strategy for 8-bit computation in SqueezeNet inference, Gysel
observed less than 1 percentage-point of drop in accuracy when using 8-bit instead of 32-bit data
types.
5 CNN MICROARCHITECTURE DESIGNSPACEEXPLORATION
So far, we have proposed architectural design strategies for small models, followed these principles
to create SqueezeNet, and discovered that SqueezeNet is 50x smaller than AlexNet with equivalent
accuracy. However, SqueezeNet and other models reside in a broad and largely unexplored design
space of CNN architectures. Now, in Sections 5 and 6, we explore several aspects of the design
space. We divide this architectural exploration into two main topics:microarchitectural exploration
(per-module layer dimensions and congurations) andmacroarchitectural exploration(high-level
end-to-end organization of modules and other layers).
In this section, we design and execute experiments with the goal of providing intuition about the
shape of the microarchitectural design space with respect to the design strategies that we proposed
in Section 3.1. Note that our goal here isnotto maximize accuracy in every experiment, but rather
to understand the impact of CNN architectural choices on model size and accuracy.
6
Note that, due to the storage overhead of storing sparse matrix indices, 33% sparsity leads to somewhat
less than a3decrease in model size.
7

Under review as a conference paper at ICLR 201713#MB#of#
weights#
85.3%#
accuracy#
86.0%#
accuracy#
19#MB#of#
weights#
4.8#MB#of#
weights#
80.3%#
accuracy#
SqueezeNet#
(a) Exploring the impact of the squeeze ratio (SR)
on model size and accuracy.21#MB#of#
weights#
13#MB#of#
weights#5.7#MB#of#
weights#
85.3%#
accuracy#
85.3%#
accuracy#76.3%#
accuracy#
(b) Exploring the impact of the ratio of 3x3 lters in
expand layers (pct3x3) on model size and accuracy.
Figure 3: Microarchitectural design space exploration.
5.1 CNN MICROARCHITECTURE METAPARAMETERS
In SqueezeNet, each Fire module has three dimensional hyperparameters that we dened in Sec-
tion 3.2:s1x1,e1x1, ande3x3. SqueezeNet has 8 Fire modules with a total of 24 dimensional
hyperparameters. To do broad sweeps of the design space of SqueezeNet-like architectures, we
dene the following set of higher levelmetaparameterswhich control the dimensions of all Fire
modules in a CNN. We denebaseeas the number ofexpandlters in the rst Fire module in a
CNN. After everyfreqFire modules, we increase the number of expand lters byincre. In other
words, for Fire modulei, the number of expand lters isei=basee+ (incre
j
i
f req
k
). In the
expand layer of a Fire module, some lters are 1x1 and some are 3x3; we deneei=ei;1x1+ei;3x3
withpct3x3(in the range[0;1], shared over all Fire modules) as the percentage of expand lters that
are 3x3. In other words,ei;3x3=eipct3x3, andei;1x1=ei(1pct3x3). Finally, we dene
the number of lters in the squeeze layer of a Fire module using a metaparameter called thesqueeze
ratio (SR)(again, in the range[0;1], shared by all Fire modules):si;1x1=SRei(or equivalently
si;1x1=SR(ei;1x1+ei;3x3)). SqueezeNet (Table 1) is an example architecture that we gen-
erated with the aforementioned set of metaparameters. Specically, SqueezeNet has the following
metaparameters:basee= 128,incre= 128,pct3x3= 0:5,freq= 2, andSR= 0:125.
5.2 SQUEEZERATIO
In Section 3.1, we proposed decreasing the number of parameters by usingsqueeze layersto decrease
the number of input channels seen by 3x3 lters. We dened thesqueeze ratio (SR)as the ratio
between the number of lters insqueezelayers and the number of lters inexpandlayers. We now
design an experiment to investigate the effect of the squeeze ratio on model size and accuracy.
In these experiments, we use SqueezeNet (Figure 2) as a starting point. As in SqueezeNet, these
experiments use the following metaparameters:basee= 128,incre= 128,pct3x3= 0:5, and
freq= 2. We train multiple models, where each model has a different squeeze ratio (SR)
7
in the
range [0.125, 1.0]. In Figure 3(a), we show the results of this experiment, where each point on the
graph is an independent model that was trained from scratch. SqueezeNet is the SR=0.125 point
in this gure.
8
From this gure, we learn that increasing SR beyond 0.125 can further increase
ImageNet top-5 accuracy from 80.3% (i.e. AlexNet-level) with a 4.8MB model to 86.0% with a
19MB model. Accuracy plateaus at 86.0% with SR=0.75 (a 19MB model), and setting SR=1.0
further increases model size without improving accuracy.
5.3 TRADING OFF1X1AND3X3FILTERS
In Section 3.1, we proposed decreasing the number of parameters in a CNN by replacing some 3x3
lters with 1x1 lters. An open question is,how important is spatial resolution in CNN lters?
7
Note that, for a given model, all Fire layers share the same squeeze ratio.
8
Note that we named itSqueezeNet because it has a low squeeze ratio (SR). That is, the squeeze layers in
SqueezeNet have 0.125x the number of lters as the expand layers.
8

Under review as a conference paper at ICLR 2017
The VGG (Simonyan & Zisserman, 2014) architectures have 3x3 spatial resolution in most layers'
lters; GoogLeNet (Szegedy et al., 2014) and Network-in-Network (NiN) (Lin et al., 2013) have
1x1 lters in some layers. In GoogLeNet and NiN, the authors simply propose a specic quantity of
1x1 and 3x3 lters without further analysis.
9
Here, we attempt to shed light on how the proportion
of 1x1 and 3x3 lters affects model size and accuracy.
We use the following metaparameters in this experiment:basee=incre= 128,freq= 2,SR=
0:500, and we varypct3x3from 1% to 99%. In other words, each Fire module's expand layer has a
predened number of lters partitioned between 1x1 and 3x3, and here we turn the knob on these
lters from mostly 1x1 to mostly 3x3. As in the previous experiment, these models have 8
Fire modules, following the same organization of layers as in Figure 2. We show the results of this
experiment in Figure 3(b). Note that the 13MB models in Figure 3(a) and Figure 3(b) are the same
architecture:SR= 0:500andpct3x3= 50%. We see in Figure 3(b) that the top-5 accuracy plateaus
at 85.6% using 50% 3x3 lters, and further increasing the percentage of 3x3 lters leads to a larger
model size but provides no improvement in accuracy on ImageNet.
6 CNN MACROARCHITECTURE DESIGNSPACEEXPLORATION
So far we have explored the design space at the microarchitecture level, i.e. the contents of individual
modules of the CNN. Now, we explore design decisions at the macroarchitecture level concerning
the high-level connections among Fire modules. Inspired by ResNet (He et al., 2015b), we explored
three different architectures:
Vanilla SqueezeNet (as per the prior sections).
SqueezeNet with simple bypass connections between some Fire modules. (Inspired by (Sri-
vastava et al., 2015; He et al., 2015b).)
SqueezeNet with complex bypass connections between the remaining Fire modules.
We illustrate these three variants of SqueezeNet in Figure 2.
Oursimple bypassarchitecture adds bypass connections around Fire modules 3, 5, 7, and 9, requiring
these modules to learn a residual function between input and output. As in ResNet, to implement
a bypass connection around Fire3, we set the input to Fire4 equal to (output of Fire2 + output of
Fire3), where the + operator is elementwise addition. This changes the regularization applied to the
parameters of these Fire modules, and, as per ResNet, can improve the nal accuracy and/or ability
to train the full model.
One limitation is that, in the straightforward case, the number of input channels and number of
output channels has to be the same; as a result, only half of the Fire modules can have simple
bypass connections, as shown in the middle diagram of Fig 2. When the same number of channels
requirement can't be met, we use acomplex bypassconnection, as illustrated on the right of Figure 2.
While a simple bypass is just a wire, we dene a complex bypass as a bypass that includes a 1x1
convolution layer with the number of lters set equal to the number of output channels that are
needed. Note that complex bypass connections add extra parameters to the model, while simple
bypass connections do not.
In addition to changing the regularization, it is intuitive to us that adding bypass connections would
help to alleviate the representational bottleneck introduced by squeeze layers. In SqueezeNet, the
squeeze ratio (SR) is 0.125, meaning that every squeeze layer has 8x fewer output channels than the
accompanying expand layer. Due to this severe dimensionality reduction, a limited amount of in-
formation can pass through squeeze layers. However, by adding bypass connections to SqueezeNet,
we open up avenues for information to owaroundthe squeeze layers.
We trained SqueezeNet with the three macroarchitectures in Figure 2 and compared the accuracy
and model size in Table 3. We xed the microarchitecture to match SqueezeNet as described in
Table 1 throughout the macroarchitecture exploration. Complex and simple bypass connections
both yielded an accuracy improvement over the vanilla SqueezeNet architecture. Interestingly, the
simple bypass enabled a higher accuracy accuracy improvement than complex bypass. Adding the
9
To be clear, each lter is 1x1xChannels or 3x3xChannels, which we abbreviate to 1x1 and 3x3.
9

Under review as a conference paper at ICLR 2017
Table 3: SqueezeNet accuracy and model size using different macroarchitecture congurations
Architecture Top-1 AccuracyTop-5 AccuracyModel Size
Vanilla SqueezeNet 57.5% 80.3% 4.8MB
SqueezeNet + Simple Bypass 60.4% 82.5% 4.8MB
SqueezeNet + Complex Bypass 58.8% 82.0% 7.7MB
simple bypass connections yielded an increase of 2.9 percentage-points in top-1 accuracy and 2.2
percentage-points in top-5 accuracy without increasing model size.
7 CONCLUSIONS
In this paper, we have proposed steps toward a more disciplined approach to the design-space explo-
ration of convolutional neural networks. Toward this goal we have presented SqueezeNet, a CNN
architecture that has50fewer parameters than AlexNet and maintains AlexNet-level accuracy on
ImageNet. We also compressed SqueezeNet to less than 0.5MB, or510smaller than AlexNet
without compression. Since we released this paper as a technical report in 2016, Song Han and
his collaborators have experimented further with SqueezeNet and model compression. Using a new
approach calledDense-Sparse-Dense (DSD)(Han et al., 2016b), Hanet al. use model compres-
sion during training as a regularizer to further improve accuracy, producing a compressed set of
SqueezeNet parameters that is 1.2 percentage-points more accurate on ImageNet-1k, and also pro-
ducing an uncompressed set of SqueezeNet parameters that is 4.3 percentage-points more accurate,
compared to our results in Table 2.
We mentioned near the beginning of this paper that small models are more amenable to on-chip
implementations on FPGAs. Since we released the SqueezeNet model, Gschwend has developed
a variant of SqueezeNet and implemented it on an FPGA (Gschwend, 2016). As we anticipated,
Gschwend was able to able to store the parameters of a SqueezeNet-like model entirely within the
FPGA and eliminate the need for off-chip memory accesses to load model parameters.
In the context of this paper, we focused on ImageNet as a target dataset. However, it has become
common practice to apply ImageNet-trained CNN representations to a variety of applications such
as ne-grained object recognition (Zhang et al., 2013; Donahue et al., 2013), logo identication in
images (Iandola et al., 2015), and generating sentences about images (Fang et al., 2015). ImageNet-
trained CNNs have also been applied to a number of applications pertaining to autonomous driv-
ing, including pedestrian and vehicle detection in images (Iandola et al., 2014; Girshick et al.,
2015; Ashraf et al., 2016) and videos (Chen et al., 2015b), as well as segmenting the shape of the
road (Badrinarayanan et al., 2015). We think SqueezeNet will be a good candidate CNN architecture
for a variety of applications, especially those in which small model size is of importance.
SqueezeNet is one of several new CNNs that we have discovered while broadly exploring the de-
sign space of CNN architectures. We hope that SqueezeNet will inspire the reader to consider and
explore the broad range of possibilities in the design space of CNN architectures and to perform that
exploration in a more systematic manner.
REFERENCES
Khalid Ashraf, Bichen Wu, Forrest N. Iandola, Matthew W. Moskewicz, and Kurt Keutzer. Shallow
networks for high-accuracy road object-detection.arXiv:1606.01561, 2016.
Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-
decoder architecture for image segmentation.arxiv:1511.00561, 2015.
Eddie Bell. A implementation of squeezenet in chainer.https://github.com/ejlb/
squeezenet-chainer, 2016.
J. Bergstra and Y. Bengio. An optimization methodology for neural network weights and architec-
tures.JMLR, 2012.
Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu,
Chiyuan Zhang, and Zheng Zhang. Mxnet: A exible and efcient machine learning library for
heterogeneous distributed systems.arXiv:1512.01274, 2015a.
10

Under review as a conference paper at ICLR 2017
Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew G Berneshawi, Huimin Ma, Sanja Fidler, and
Raquel Urtasun. 3d object proposals for accurate object class detection. InNIPS, 2015b.
Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catan-
zaro, and Evan Shelhamer. cuDNN: efcient primitives for deep learning.arXiv:1410.0759,
2014.
Francois Chollet. Keras: Deep learning library for theano and tensorow.https://keras.io,
2016.
Ronan Collobert, Koray Kavukcuoglu, and Clement Farabet. Torch7: A matlab-like environment
for machine learning. InNIPS BigLearn Workshop, 2011.
Consumer Reports. Teslas new autopilot: Better but still needs
improvement. http://www.consumerreports.org/tesla/
tesla-new-autopilot-better-but-needs-improvement , 2016.
Dipankar Das, Sasikanth Avancha, Dheevatsa Mudigere, Karthikeyan Vaidyanathan, Srinivas Srid-
haran, Dhiraj D. Kalamkar, Bharat Kaul, and Pradeep Dubey. Distributed deep learning using
synchronous stochastic gradient descent.arXiv:1602.06709, 2016.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical
image database. InCVPR, 2009.
E.L Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within
convolutional networks for efcient evaluation. InNIPS, 2014.
Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor
Darrell. Decaf: A deep convolutional activation feature for generic visual recognition.
arXiv:1310.1531, 2013.
DT42. Squeezenet keras implementation.https://github.com/DT42/squeezenet_
demo, 2016.
Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Srivastava, Li Deng, Piotr Dollar, Jianfeng Gao,
Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, and Geoffrey Zweig. From
captions to visual concepts and back. InCVPR, 2015.
Ross B. Girshick, Forrest N. Iandola, Trevor Darrell, and Jitendra Malik. Deformable part models
are convolutional neural networks. InCVPR, 2015.
David Gschwend. Zynqnet: An fpga-accelerated embedded convolutional neural network. Master's
thesis, Swiss Federal Institute of Technology Zurich (ETH-Zurich), 2016.
Philipp Gysel. Ristretto: Hardware-oriented approximation of convolutional neural networks.
arXiv:1605.06402, 2016.
S. Han, H. Mao, and W. Dally. Deep compression: Compressing DNNs with pruning, trained
quantization and huffman coding.arxiv:1510.00149v3, 2015a.
S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efcient neural
networks. InNIPS, 2015b.
Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J
Dally. Eie: Efcient inference engine on compressed deep neural network.International Sympo-
sium on Computer Architecture (ISCA), 2016a.
Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Shijian Tang, Erich Elsen, Bryan Catanzaro, John
Tran, and William J. Dally. Dsd: Regularizing deep neural networks with dense-sparse-dense
training ow.arXiv:1607.04381, 2016b.
Guo Haria. convert squeezenet to mxnet.https://github.com/haria/SqueezeNet/
commit/0cf57539375fd5429275af36fc94c774503427c3 , 2016.
11

Under review as a conference paper at ICLR 2017
K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectiers: Surpassing human-level perfor-
mance on imagenet classication. InICCV, 2015a.
Kaiming He and Jian Sun. Convolutional neural networks at constrained time cost. InCVPR, 2015.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog-
nition.arXiv:1512.03385, 2015b.
Forrest N. Iandola, Matthew W. Moskewicz, Sergey Karayev, Ross B. Girshick, Trevor Darrell, and
Kurt Keutzer. Densenet: Implementing efcient convnet descriptor pyramids.arXiv:1404.1869,
2014.
Forrest N. Iandola, Anting Shen, Peter Gao, and Kurt Keutzer. DeepLogo: Hitting logo recognition
with the deep neural network hammer.arXiv:1510.02131, 2015.
Forrest N. Iandola, Khalid Ashraf, Matthew W. Moskewicz, and Kurt Keutzer. FireCaffe: near-linear
acceleration of deep neural network training on compute clusters. InCVPR, 2016.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by
reducing internal covariate shift.JMLR, 2015.
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Ser-
gio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embed-
ding.arXiv:1408.5093, 2014.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classication with Deep Con-
volutional Neural Networks. InNIPS, 2012.
Y. LeCun, B.Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.D. Jackel. Back-
propagation applied to handwritten zip code recognition.Neural Computation, 1989.
Min Lin, Qiang Chen, and Shuicheng Yan. Network in network.arXiv:1312.4400, 2013.
T.B. Ludermir, A. Yamazaki, and C. Zanchettin. An optimization methodology for neural network
weights and architectures.IEEE Trans. Neural Networks, 2006.
Dmytro Mishkin, Nikolay Sergievskiy, and Jiri Matas. Systematic evaluation of cnn advances on
the imagenet.arXiv:1606.02228, 2016.
Vinod Nair and Geoffrey E. Hinton. Rectied linear units improve restricted boltzmann machines.
InICML, 2010.
Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang,
Ningyi Xu, Sen Song, Yu Wang, and Huazhong Yang. Going deeper with embedded fpga platform
for convolutional neural network. InACM International Symposium on FPGA, 2016.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition.arXiv:1409.1556, 2014.
J. Snoek, H. Larochelle, and R.P. Adams. Practical bayesian optimization of machine learning
algorithms. InNIPS, 2012.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.
Dropout: a simple way to prevent neural networks from overtting.JMLR, 2014.
R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. InICML Deep Learning
Workshop, 2015.
K.O. Stanley and R. Miikkulainen. Evolving neural networks through augmenting topologies.Neu-
rocomputing, 2002.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Du-
mitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions.
arXiv:1409.4842, 2014.
12

Under review as a conference paper at ICLR 2017
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Re-
thinking the inception architecture for computer vision.arXiv:1512.00567, 2015.
Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. Inception-v4, inception-resnet and the
impact of residual connections on learning.arXiv:1602.07261, 2016.
S. Tokui, K. Oono, S. Hido, and J. Clayton. Chainer: a next-generation open source framework for
deep learning. InNIPS Workshop on Machine Learning Systems (LearningSys), 2015.
Sagar M Waghmare. FireModule.lua.https://github.com/Element-Research/dpnn/
blob/master/FireModule.lua , 2016.
Ning Zhang, Ryan Farrell, Forrest Iandola, and Trevor Darrell. Deformable part descriptors for
ne-grained recognition and attribute prediction. InICCV, 2013.
13