Published as a conference paper at ICLR 2019
PROXYLESSNAS: DIRECT NEURAL ARCHITECTURE SEARCH ON TARGET TASK AND HARDWARE
Han Cai, Ligeng Zhu, Song Han
Massachusetts Institute of Technology
{hancai, ligeng, songhan}@mit.edu
ABSTRACT
Neural architecture search (NAS) has a great impact by automatically designing effective neural network architectures. However, the prohibitive computational demand of conventional NAS algorithms (e.g. 10^4 GPU hours) makes it difficult to directly search the architectures on large-scale tasks (e.g. ImageNet). Differentiable NAS can reduce the cost of GPU hours via a continuous representation of network architecture but suffers from the high GPU memory consumption issue (growing linearly w.r.t. candidate set size). As a result, such methods need to utilize proxy tasks, such as training on a smaller dataset, learning with only a few blocks, or training just for a few epochs. Architectures optimized on proxy tasks are not guaranteed to be optimal on the target task. In this paper, we present ProxylessNAS, which can directly learn the architectures for large-scale target tasks and target hardware platforms. We address the high memory consumption issue of differentiable NAS and reduce the computational cost (GPU hours and GPU memory) to the same level as regular training while still allowing a large candidate set. Experiments on CIFAR-10 and ImageNet demonstrate the effectiveness of directness and specialization. On CIFAR-10, our model achieves 2.08% test error with only 5.7M parameters, better than the previous state-of-the-art architecture AmoebaNet-B, while using 6× fewer parameters. On ImageNet, our model achieves 3.1% better top-1 accuracy than MobileNetV2, while being 1.2× faster with measured GPU latency. We also apply ProxylessNAS to specialize neural architectures for hardware with direct hardware metrics (e.g. latency) and provide insights for efficient CNN architecture design.
1 INTRODUCTION
Neural architecture search (NAS) has demonstrated much success in automating neural network architecture design for various deep learning tasks, such as image recognition (Zoph et al., 2018; Cai et al., 2018a; Liu et al., 2018a; Zhong et al., 2018) and language modeling (Zoph & Le, 2017). Despite the remarkable results, conventional NAS algorithms are prohibitively computation-intensive, requiring thousands of models to be trained on the target task in a single experiment. Therefore, directly applying NAS to a large-scale task (e.g. ImageNet) is computationally expensive or impossible, which makes it difficult to deliver practical industry impact. As a trade-off, Zoph et al. (2018) propose to search for building blocks on proxy tasks, such as training for fewer epochs, starting with a smaller dataset (e.g. CIFAR-10), or learning with fewer blocks. Then top-performing blocks are stacked and transferred to the large-scale target task. This paradigm has been widely adopted in subsequent NAS algorithms (Liu et al., 2018a;b;c; Real et al., 2018; Cai et al., 2018b; Luo et al., 2018; Tan et al., 2018).
However, these blocks optimized on proxy tasks are not guaranteed to be optimal on the target
task, especially when taking hardware metrics such as latency into consideration. More importantly,
to enable transferability, such methods need to search for only a few architectural motifs and then
repeatedly stack the same pattern, which restricts the block diversity and thereby harms performance.
In this work, we propose a simple and effective solution to the aforementioned limitations, called
ProxylessNAS, which directly learns the architectures on the target task and hardware instead of with
¹Pretrained models and evaluation code are publicly released.

[Figure 1: schematic comparing (1) the previous proxy-based approach, where a learner transfers architecture updates from a proxy task to the target task and hardware, with (2) our proxy-less approach, where the learner optimizes directly on the target task and hardware. The inset table notes that normal NAS needs a meta-controller and a proxy, DARTS and One-shot need no meta-controller but still need a proxy, and Proxyless (ours) needs neither, saving both GPU hours and GPU memory.]
Figure 1: ProxylessNAS directly optimizes neural network architectures on target task and hardware. Benefiting from the directness and specialization, ProxylessNAS can achieve remarkably better results than previous proxy-based approaches. On ImageNet, with only 200 GPU hours (200× fewer than MnasNet (Tan et al., 2018)), our searched CNN model for mobile achieves the same level of top-1 accuracy as MobileNetV2 1.4 while being 1.8× faster.
proxy (Figure 1). We also remove the restriction of repeating blocks in previous NAS works (Zoph et al., 2018; Liu et al., 2018c) and allow all of the blocks to be learned and specified. To achieve this, we reduce the computational cost (GPU hours and GPU memory) of architecture search to the same level as regular training in the following ways.
GPU hour-wise, inspired by recent works (Liu et al., 2018c; Bender et al., 2018), we formulate NAS as a path-level pruning process. Specifically, we directly train an over-parameterized network that contains all candidate paths (Figure 2). During training, we explicitly introduce architecture parameters to learn which paths are redundant, and these redundant paths are pruned at the end of training to get a compact optimized architecture. In this way, we only need to train a single network without any meta-controller (or hypernetwork) during architecture search.
However, naively including all the candidate paths leads to GPU memory explosion (Liu et al., 2018c; Bender et al., 2018), as the memory consumption grows linearly w.r.t. the number of choices. Thus, GPU memory-wise, we binarize the architecture parameters (1 or 0) and force only one path to be active at run-time, which reduces the required memory to the same level as training a compact model. We propose a gradient-based approach to train these binarized parameters based on BinaryConnect (Courbariaux et al., 2015). Furthermore, to handle non-differentiable hardware objectives (using latency as an example) for learning specialized network architectures on target hardware, we model network latency as a continuous function and optimize it as a regularization loss. Additionally, we also present a REINFORCE-based (Williams, 1992) algorithm as an alternative strategy to handle hardware metrics.
In our experiments on CIFAR-10 and ImageNet, benefiting from the directness and specialization, our method achieves strong empirical results. On CIFAR-10, our model reaches 2.08% test error with only 5.7M parameters. On ImageNet, our model achieves 75.1% top-1 accuracy, which is 3.1% higher than MobileNetV2 (Sandler et al., 2018) while being 1.2× faster. Our contributions can be summarized as follows:
• ProxylessNAS is the first NAS algorithm that directly learns architectures on the large-scale dataset (e.g. ImageNet) without any proxy while still allowing a large candidate set and removing the restriction of repeating blocks. It effectively enlarges the search space and achieves better performance.
• We provide a new path-level pruning perspective for NAS, showing a close connection between NAS and model compression (Han et al., 2016). We save memory consumption by one order of magnitude by using path-level binarization.
• We propose a novel gradient-based approach (latency regularization loss) for handling hardware objectives (e.g. latency). Given different hardware platforms (CPU/GPU/mobile), ProxylessNAS enables hardware-aware neural network specialization that is exactly optimized for the target hardware. To our best knowledge, it is the first work to study specialized neural network architectures for different hardware architectures.
• Extensive experiments show the advantage of the directness and specialization properties of ProxylessNAS. It achieves state-of-the-art accuracy on CIFAR-10 and ImageNet under latency constraints on different hardware platforms (GPU, CPU and mobile phone). We also analyze the insights of efficient CNN models specialized for different hardware platforms and raise the awareness that specialized neural network architectures are needed on different hardware for efficient inference.
2 RELATED WORK
The use of machine learning techniques, such as reinforcement learning or neuro-evolution, to replace human experts in designing neural network architectures, usually referred to as neural architecture search, has drawn increasing interest (Zoph & Le, 2017; Liu et al., 2018a;b;c; Cai et al., 2018a;b; Pham et al., 2018; Brock et al., 2018; Bender et al., 2018; Elsken et al., 2017; 2018a;b; Kamath et al., 2018). In NAS, architecture search is typically considered as a meta-learning process, and a meta-controller (e.g. a recurrent neural network (RNN)) is introduced to explore a given architecture space, with a network trained in the inner loop to get an evaluation for guiding exploration. Consequently, such methods are computationally expensive to run, especially on large-scale tasks, e.g. ImageNet.
Some recent works (Brock et al., 2018; Pham et al., 2018) try to improve the efficiency of this meta-learning process by reducing the cost of getting an evaluation. In Brock et al. (2018), a hypernetwork is utilized to generate weights for each sampled network and hence can evaluate an architecture without training it. Similarly, Pham et al. (2018) propose to share weights among all sampled networks under the standard NAS framework (Zoph & Le, 2017). These methods speed up architecture search by orders of magnitude; however, they require a hypernetwork or an RNN controller and mainly focus on small-scale tasks (e.g. CIFAR) rather than large-scale tasks (e.g. ImageNet).
Our work is most closely related to One-Shot (Bender et al., 2018) and DARTS (Liu et al., 2018c), both of which get rid of the meta-controller (or hypernetwork) by modeling NAS as a single training process of an over-parameterized network that comprises all candidate paths. Specifically, One-Shot trains the over-parameterized network with DropPath (Zoph et al., 2018), which drops out each path with some fixed probability. They then use the pre-trained over-parameterized network to evaluate architectures, which are sampled by randomly zeroing out paths. DARTS additionally introduces a real-valued architecture parameter for each path and jointly trains weight parameters and architecture parameters via standard gradient descent. However, both suffer from the large GPU memory consumption issue and hence still need to utilize proxy tasks. In this work, we address the large memory issue in these two methods through path binarization.
Another relevant topic is network pruning (Han et al., 2016), which aims to improve the efficiency of neural networks by removing insignificant neurons (Han et al., 2015) or channels (Liu et al., 2017). Similar to these works, we start with an over-parameterized network and then prune the redundant parts to derive the optimized architecture. The distinction is that they focus on layer-level pruning that only modifies the filter (or unit) number of a layer but cannot change the topology of the network, while we focus on learning effective network architectures through path-level pruning. We also allow both pruning and growing the number of layers.
3 METHOD
We first describe the construction of the over-parameterized network with all candidate paths, then
introduce how we leverage binarized architecture parameters to reduce the memory consumption
of training the over-parameterized network to the same level as regular training. We propose a
gradient-based algorithm to train these binarized architecture parameters. Finally, we present two
techniques to handle non-differentiable objectives (e.g. latency) for specializing neural networks on
target hardware.
3.1 CONSTRUCTION OF OVER-PARAMETERIZED NETWORK
Denote a neural network as N(e_1, ..., e_n), where e_i represents a certain edge in the directed acyclic graph (DAG). Let O = {o_i} be the set of N candidate primitive operations (e.g. convolution, pooling, identity, zero, etc.). To construct the over-parameterized network that includes any architecture in the search space, instead of setting each edge to be a definite primitive operation, we set each edge to be a mixed operation that has N parallel paths (Figure 2), denoted as m_O. As such, the over-parameterized network can be expressed as N(e_1 = m_O^1, ..., e_n = m_O^n).

Given input x, the output of a mixed operation m_O is defined based on the outputs of its N paths. In One-Shot, m_O(x) is the sum of {o_i(x)}, while in DARTS, m_O(x) is the weighted sum of {o_i(x)}, where
Figure 2: Learning both weight parameters and binarized architecture parameters.
the weights are calculated by applying softmax to N real-valued architecture parameters {α_i}:

m_O^{\text{One-Shot}}(x) = \sum_{i=1}^{N} o_i(x), \qquad m_O^{\text{DARTS}}(x) = \sum_{i=1}^{N} p_i\, o_i(x) = \sum_{i=1}^{N} \frac{\exp(\alpha_i)}{\sum_j \exp(\alpha_j)}\, o_i(x). \quad (1)
As shown in Eq. (1), the output feature maps of all N paths are calculated and stored in memory, while training a compact model involves only one path. Therefore, One-Shot and DARTS roughly need N times the GPU memory and GPU hours of training a compact model. On large-scale datasets, this can easily exceed the memory limits of hardware when the design space is large. In the following section, we solve this memory issue based on the idea of path binarization.
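To make the memory issue concrete, here is a minimal PyTorch-style sketch of a DARTS-style mixed operation implementing Eq. (1). The class name, the candidate_ops argument, and the linear candidates in the usage line are illustrative assumptions rather than the paper's code; the point is that every candidate output o_i(x) enters the weighted sum, so all N feature maps are kept alive for backpropagation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DartsMixedOp(nn.Module):
    """Relaxed mixed operation of Eq. (1): a softmax-weighted sum of all
    N candidate outputs. Every o_i(x) is computed and retained for the
    backward pass, so activation memory grows linearly with N."""
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)
        self.alpha = nn.Parameter(torch.zeros(len(candidate_ops)))  # architecture parameters

    def forward(self, x):
        p = F.softmax(self.alpha, dim=0)                 # path weights p_i
        return sum(p[i] * op(x) for i, op in enumerate(self.ops))

# Toy usage with three candidates of matching output shapes.
mixed = DartsMixedOp([nn.Linear(16, 16), nn.Linear(16, 16), nn.Identity()])
y = mixed(torch.randn(4, 16))
```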
3.2 LEARNING BINARIZED PATH

To reduce the memory footprint, we keep only one path when training the over-parameterized network. Unlike BinaryConnect (Courbariaux et al., 2015), which binarizes individual weights, we binarize entire paths. We introduce N real-valued architecture parameters {α_i} and then transform the real-valued path weights to binary gates:

g = \text{binarize}(p_1, \cdots, p_N) =
\begin{cases}
[1, 0, \cdots, 0] & \text{with probability } p_1, \\
\quad \cdots \\
[0, 0, \cdots, 1] & \text{with probability } p_N.
\end{cases} \quad (2)
Based on the binary gates g, the output of the mixed operation is given as:

m_O^{\text{Binary}}(x) = \sum_{i=1}^{N} g_i\, o_i(x) =
\begin{cases}
o_1(x) & \text{with probability } p_1, \\
\quad \cdots \\
o_N(x) & \text{with probability } p_N.
\end{cases} \quad (3)
As illustrated in Eq. (3) and Figure 2, by using the binary gates rather than real-valued path weights (Liu et al., 2018c), only one path of activation is active in memory at run-time, and the memory requirement of training the over-parameterized network is thus reduced to the same level as training a compact model. That is more than an order of magnitude memory saving.
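For comparison, a minimal sketch of the binarized counterpart under the same assumptions (class and method names are illustrative): the active path is sampled from the multinomial defined by softmax(α) as in Eq. (2), and only that candidate is executed, so the other N−1 feature maps are never materialized.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryMixedOp(nn.Module):
    """Binary-gated mixed operation of Eqs. (2)-(3): one path is sampled
    per step and only that path's activations are stored."""
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)
        self.alpha = nn.Parameter(torch.zeros(len(candidate_ops)))  # architecture parameters
        self.active = 0

    def sample_gate(self):
        # Eq. (2): draw the single active path from Multinomial(softmax(alpha)).
        p = F.softmax(self.alpha.detach(), dim=0)
        self.active = int(torch.multinomial(p, 1))

    def forward(self, x):
        # Eq. (3): only the active candidate runs; memory matches a compact model.
        return self.ops[self.active](x)

mixed = BinaryMixedOp([nn.Linear(16, 16), nn.Linear(16, 16), nn.Identity()])
mixed.sample_gate()
y = mixed(torch.randn(4, 16))
```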
3.2.1 TRAINING BINARIZED ARCHITECTURE PARAMETERS

Figure 2 illustrates the training procedure of the weight parameters and binarized architecture parameters in the over-parameterized network. When training weight parameters, we first freeze the architecture parameters and stochastically sample binary gates according to Eq. (2) for each batch of input data. Then the weight parameters of active paths are updated via standard gradient descent on the training set (Figure 2 left). When training architecture parameters, the weight parameters are frozen; we then reset the binary gates and update the architecture parameters on the validation set (Figure 2 right). These two update steps are performed in an alternating manner. Once the training of architecture parameters is finished, we can derive the compact architecture by pruning redundant paths. In this work, we simply choose the path with the highest path weight.
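The alternating schedule can be sketched with a toy "over-parameterized network" consisting of a single mixed operation with two candidates; the data, learning rates and optimizers below are illustrative assumptions. For brevity, the architecture step uses a relaxed softmax-weighted surrogate loss; the paper's actual architecture update instead uses the binary-gate gradient estimate of Eq. (4) derived next.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ops = nn.ModuleList([nn.Linear(8, 4), nn.Linear(8, 4, bias=False)])   # two candidate paths
alpha = nn.Parameter(torch.zeros(len(ops)))                            # architecture parameters
weight_opt = torch.optim.SGD(ops.parameters(), lr=0.1)
arch_opt = torch.optim.Adam([alpha], lr=0.006)

for step in range(100):
    x_tr, y_tr = torch.randn(16, 8), torch.randint(0, 4, (16,))        # "training set" batch
    x_val, y_val = torch.randn(16, 8), torch.randint(0, 4, (16,))      # "validation set" batch

    # (1) Freeze alpha, sample a binary gate (Eq. 2), update the active path's weights.
    g = int(torch.multinomial(F.softmax(alpha.detach(), dim=0), 1))
    loss = F.cross_entropy(ops[g](x_tr), y_tr)
    weight_opt.zero_grad(); loss.backward(); weight_opt.step()

    # (2) Freeze weights, update alpha on validation data (relaxed surrogate here).
    p = F.softmax(alpha, dim=0)
    with torch.no_grad():
        per_path_loss = torch.stack([F.cross_entropy(op(x_val), y_val) for op in ops])
    arch_loss = (p * per_path_loss).sum()
    arch_opt.zero_grad(); arch_loss.backward(); arch_opt.step()

# Derive the compact architecture: keep the path with the highest path weight.
best_path = int(torch.argmax(alpha))
```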
Unlike weight parameters, the architecture parameters are not directly involved in the computation
graph and thereby cannot be updated using the standard gradient descent. In this section, we intro-
duce a gradient-based approach to learn the architecture parameters.
In BinaryConnect (Courbariaux et al., 2015), the real-valued weight is updated using the gradient w.r.t. its corresponding binary gate. In our case, analogously, the gradient w.r.t. the architecture parameters can be approximately estimated using ∂L/∂g_i in place of ∂L/∂p_i:

\frac{\partial L}{\partial \alpha_i} = \sum_{j=1}^{N} \frac{\partial L}{\partial p_j} \frac{\partial p_j}{\partial \alpha_i} \approx \sum_{j=1}^{N} \frac{\partial L}{\partial g_j} \frac{\partial p_j}{\partial \alpha_i} = \sum_{j=1}^{N} \frac{\partial L}{\partial g_j} \frac{\partial}{\partial \alpha_i}\left(\frac{\exp(\alpha_j)}{\sum_k \exp(\alpha_k)}\right) = \sum_{j=1}^{N} \frac{\partial L}{\partial g_j}\, p_j (\delta_{ij} - p_i), \quad (4)

where δ_ij = 1 if i = j and δ_ij = 0 if i ≠ j. Since the binary gates g are involved in the computation graph, as shown in Eq. (3), ∂L/∂g_j can be calculated through backpropagation. However, computing ∂L/∂g_j requires calculating and storing o_j(x). Therefore, directly using Eq. (4) to update the architecture parameters would also require roughly N times the GPU memory of training a compact model.
To address this issue, we consider factorizing the task of choosing one path out of N candidates into multiple binary selection tasks. The intuition is that if a path is the best choice at a particular position, it should be the better choice when solely compared to any other path.²
Following this idea, within an update step of the architecture parameters, we first sample two paths according to the multinomial distribution (p_1, ..., p_N) and mask all the other paths as if they do not exist. As such, the number of candidates temporarily decreases from N to 2, while the path weights {p_i} and binary gates {g_i} are reset accordingly. Then we update the architecture parameters of these two sampled paths using the gradients calculated via Eq. (4). Finally, as path weights are computed by applying softmax to the architecture parameters, we need to rescale the value of the two updated architecture parameters by multiplying a ratio to keep the path weights of the unsampled paths unchanged. As such, in each update step, one of the sampled paths is enhanced (its path weight increases) and the other sampled path is attenuated (its path weight decreases) while all other paths keep unchanged. In this way, regardless of the value of N, only two paths are involved in each update step of the architecture parameters, and the memory requirement is thereby reduced to the same level as training a compact model.
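The following NumPy sketch illustrates the estimate in Eq. (4) together with the factorized two-path update. The gate gradients dL_dg2, the step size, and the particular rescaling (an additive log-space offset that preserves the sum of exp(α) over the two sampled entries, and hence the unsampled paths' weights) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def arch_grad(alpha, dL_dg):
    """Eq. (4): dL/dalpha_i ~= sum_j dL/dg_j * p_j * (delta_ij - p_i)."""
    p = softmax(alpha)
    return np.array([np.sum(dL_dg * p * ((np.arange(len(p)) == i).astype(float) - p[i]))
                     for i in range(len(alpha))])

rng = np.random.default_rng(0)
alpha = np.zeros(5)                                   # N = 5 candidate paths
p = softmax(alpha)

# Sample two paths according to the multinomial (p_1, ..., p_N).
i, j = rng.choice(len(alpha), size=2, replace=False, p=p)

# Hypothetical gradients dL/dg for the two sampled gates (in practice
# obtained by backpropagation through Eq. (3)).
dL_dg2 = np.array([0.3, -0.1])

# Update the 2-path sub-problem with Eq. (4), then rescale so that the
# unsampled paths' weights under the full softmax stay unchanged.
alpha2 = alpha[[i, j]]
alpha2_new = alpha2 - 0.006 * arch_grad(alpha2, dL_dg2)
offset = np.log(np.exp(alpha2).sum() / np.exp(alpha2_new).sum())
alpha[[i, j]] = alpha2_new + offset

print(softmax(alpha))   # only the two sampled entries change
```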
3.3 HANDLING NON-DIFFERENTIABLE HARDWARE METRICS

Besides accuracy, latency (not FLOPs) is another very important objective when designing efficient neural network architectures for hardware. Unfortunately, unlike accuracy, which can be optimized using the gradient of the loss function, latency is non-differentiable. In this section, we present two algorithms to handle such non-differentiable objectives.
3.3.1 MAKING LATENCY DIFFERENTIABLE

To make latency differentiable, we model the latency of a network as a continuous function of the neural network dimensions.³ Consider a mixed operation with a candidate set {o_j}, where each o_j is associated with a path weight p_j that represents the probability of choosing o_j. As such, the expected latency of a mixed operation (i.e. a learnable block) is:

E[\text{latency}_i] = \sum_j p_j^i \cdot F(o_j^i), \quad (5)

where E[latency_i] is the expected latency of the i-th learnable block, F(·) denotes the latency prediction model, and F(o_j^i) is the predicted latency of o_j^i. The gradient of E[latency_i] w.r.t. the architecture parameters can thereby be given as: ∂E[latency_i] / ∂p_j^i = F(o_j^i).
For the whole network with a sequence of mixed operations (Figure 3), since these mixed operations are executed sequentially during inference, the expected latency of the network can be expressed as the sum of the mixed operations' expected latencies:

E[\text{latency}] = \sum_i E[\text{latency}_i]. \quad (6)
²In Appendix D, we provide another solution to this issue that does not require the approximation.
³Details of the latency prediction model are provided in Appendix B.
Figure 3: Making latency differentiable by introducing latency regularization loss.
We incorporate the expected latency of the network into the normal loss function by multiplying it with a scaling factor λ_2 (> 0), which controls the trade-off between accuracy and latency. The final loss function is given as (also shown in Figure 3):

\text{Loss} = \text{Loss}_{CE} + \lambda_1 \|w\|_2^2 + \lambda_2\, E[\text{latency}], \quad (7)

where Loss_CE denotes the cross-entropy loss and λ_1 ||w||_2^2 is the weight decay term.
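A minimal PyTorch-style sketch of Eqs. (5)-(7) follows. The per-candidate latencies, the λ values, and the random logits and weights are placeholders; in the paper the per-op latencies F(o_j^i) come from the latency prediction model described in Appendix B.

```python
import torch
import torch.nn.functional as F

def expected_latency(alphas, latencies):
    """Eqs. (5)-(6): E[latency] = sum_i sum_j p_j^i * F(o_j^i).
    `alphas` holds one architecture-parameter vector per learnable block;
    `latencies[i][j]` is the predicted latency of candidate j in block i."""
    total = 0.0
    for alpha_i, lat_i in zip(alphas, latencies):
        p_i = F.softmax(alpha_i, dim=0)
        total = total + (p_i * lat_i).sum()
    return total

# Eq. (7): cross-entropy + weight decay + latency regularization (placeholder data).
alphas = [torch.zeros(4, requires_grad=True) for _ in range(3)]
latencies = [torch.tensor([1.0, 2.5, 4.0, 0.0]) for _ in range(3)]   # ms, hypothetical
logits, targets = torch.randn(8, 10), torch.randint(0, 10, (8,))
weights = [torch.randn(10, 10, requires_grad=True)]

lambda1, lambda2 = 4e-5, 1e-2                                        # placeholder values
loss = (F.cross_entropy(logits, targets)
        + lambda1 * sum((w ** 2).sum() for w in weights)
        + lambda2 * expected_latency(alphas, latencies))
loss.backward()   # gradients flow into the architecture parameters via E[latency]
```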
3.3.2 REINFORCE-BASED APPROACH

As an alternative to BinaryConnect, we can utilize REINFORCE to train the binarized weights as well. Consider a network that has binarized parameters α; the goal of updating the binarized parameters is to find the optimal binary gates g that maximize a certain reward, denoted as R(·). Here we assume the network has only one mixed operation for ease of illustration. Therefore, according to REINFORCE (Williams, 1992), we have the following updates for the binarized parameters:
J(\alpha) = E_{g \sim \alpha}[R(N_g)] = \sum_i p_i\, R(N(e = o_i)),

\nabla_\alpha J(\alpha) = \sum_i R(N(e = o_i))\, \nabla_\alpha p_i = \sum_i R(N(e = o_i))\, p_i\, \nabla_\alpha \log(p_i)
 = E_{g \sim \alpha}[R(N_g) \nabla_\alpha \log(p(g))] \approx \frac{1}{M} \sum_{i=1}^{M} R(N_{g^i})\, \nabla_\alpha \log(p(g^i)), \quad (8)

where g^i denotes the i-th sampled binary gates, p(g^i) denotes the probability of sampling g^i according to Eq. (2), and N_{g^i} is the compact network determined by the binary gates g^i. Since Eq. (8) does not require R(N_g) to be differentiable w.r.t. g, it can handle non-differentiable objectives. An interesting observation is that Eq. (8) has a similar form to the standard NAS (Zoph & Le, 2017), while it is not a sequential decision-making process and no RNN meta-controller is used in our case.
Furthermore, since both gradient-based updates and REINFORCE-based updates are essentially two
different update rules to the same binarized architecture parameters, it is possible to combine them
to form a new update rule for the architecture parameters.
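A self-contained sketch of the Monte-Carlo estimator in Eq. (8), assuming a single mixed operation. Here reward_fn is a placeholder that would evaluate the compact network selected by the sampled gate (e.g. validation accuracy, possibly combined with a latency term), and the sample count and step size are arbitrary.

```python
import torch
import torch.nn.functional as F

def reinforce_arch_grad(alpha, reward_fn, num_samples=8):
    """Eq. (8): (1/M) * sum_i R(N_{g^i}) * grad_alpha log p(g^i)."""
    p = F.softmax(alpha, dim=0)
    grad = torch.zeros_like(alpha)
    for _ in range(num_samples):
        g = int(torch.multinomial(p.detach(), 1))    # sample binary gates per Eq. (2)
        reward = reward_fn(g)                        # need not be differentiable
        log_p = torch.log(p[g])
        grad += reward * torch.autograd.grad(log_p, alpha, retain_graph=True)[0]
    return grad / num_samples

# Toy usage: the reward favours path 2, so its architecture parameter increases.
alpha = torch.zeros(4, requires_grad=True)
grad = reinforce_arch_grad(alpha, reward_fn=lambda g: 1.0 if g == 2 else 0.0)
with torch.no_grad():
    alpha += 0.01 * grad                             # gradient *ascent* on expected reward
```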
4 EXPERIMENTS AND RESULTS
We demonstrate the effectiveness of our proposed method on two benchmark datasets (CIFAR-10 and ImageNet) for the image classification task. Unlike previous NAS works (Zoph et al., 2018; Liu et al., 2018a) that first learn CNN blocks on CIFAR-10 under a small-scale setting (e.g. fewer blocks) and then transfer the learned block to ImageNet or CIFAR-10 under the large-scale setting by repeatedly stacking it, we directly learn the architectures on the target task (either CIFAR-10 or ImageNet) and target hardware (GPU, CPU and mobile phone) while allowing each block to be specified.
4.1 EXPERIMENTS ON CIFAR-10
Architecture Space. For CIFAR-10 experiments, we use the tree-structured architecture space introduced by Cai et al. (2018b), with PyramidNet (Han et al., 2017) as the backbone.⁴

⁴The list of operations in the candidate set is provided in Appendix A.
Model | Params | Test error (%)
DenseNet-BC (Huang et al., 2017) | 25.6M | 3.46
PyramidNet (Han et al., 2017) | 26.0M | 3.31
Shake-Shake + c/o (DeVries & Taylor, 2017) | 26.2M | 2.56
PyramidNet + SD (Yamada et al., 2018) | 26.0M | 2.31
ENAS + c/o (Pham et al., 2018) | 4.6M | 2.89
DARTS + c/o (Liu et al., 2018c) | 3.4M | 2.83
NASNet-A + c/o (Zoph et al., 2018) | 27.6M | 2.40
PathLevel EAS + c/o (Cai et al., 2018b) | 14.3M | 2.30
AmoebaNet-B + c/o (Real et al., 2018) | 34.9M | 2.13
Proxyless-R + c/o (ours) | 5.8M | 2.30
Proxyless-G + c/o (ours) | 5.7M | 2.08

Table 1: ProxylessNAS achieves state-of-the-art performance on CIFAR-10.
Specifically, we replace all 3×3 convolution layers in the residual blocks of a PyramidNet with tree-structured cells, each of which has a depth of 3, and the number of branches is set to 2 at each node (except the leaf nodes). For further details about the tree-structured architecture space, we refer to the original paper (Cai et al., 2018b). Additionally, we use two hyperparameters to control the depth and width of a network in this architecture space, i.e. B and F, which respectively represent the number of blocks at each stage (3 stages in total) and the number of output channels of the final block.
Training Details. We randomly sample 5,000 images from the training set as a validation set for learning the architecture parameters, which are updated using the Adam optimizer with an initial learning rate of 0.006 for the gradient-based algorithm (Section 3.2.1) and 0.01 for the REINFORCE-based algorithm (Section 3.3.2). In the following discussions, we refer to these two algorithms as Proxyless-G (gradient) and Proxyless-R (REINFORCE) respectively.

After the training process of the over-parameterized network completes, a compact network is derived according to the architecture parameters, as discussed in Section 3.2.1. Next, we train the compact network using the same training settings, except that the number of training epochs increases from 200 to 300. Additionally, when DropPath regularization (Zoph et al., 2018) is adopted, we further increase the number of training epochs to 600.
Results. We apply the proposed method to learn architectures in the tree-structured architecture space with B = 18 and F = 400. Since we do not repeat cells and each cell has 12 learnable edges, in total 12 × 18 × 3 = 648 decisions are required to fully determine the architecture.

The test error rates of our proposed method and other state-of-the-art architectures on CIFAR-10 are summarized in Table 1, where "c/o" indicates the use of Cutout (DeVries & Taylor, 2017). Compared to these state-of-the-art architectures, our proposed method achieves not only a lower test error rate but also better parameter efficiency. Specifically, Proxyless-G reaches a test error rate of 2.08%, which is slightly better than AmoebaNet-B (Real et al., 2018), the previous best architecture on CIFAR-10. Notably, AmoebaNet-B uses 34.9M parameters while our model uses only 5.7M parameters, which is 6× fewer. Furthermore, compared with PathLevel EAS (Cai et al., 2018b), which also explores the tree-structured architecture space, both Proxyless-G and Proxyless-R achieve similar or lower test error rates with about half the parameters. The strong empirical results of ProxylessNAS demonstrate the benefits of directly exploring a large architecture space instead of repeatedly stacking the same block.
4.2 EXPERIMENTS ON IMAGENET
For ImageNet experiments, we focus on learning efficient CNN architectures (Iandola et al., 2016; Howard et al., 2017; Sandler et al., 2018; Ma et al., 2018) that have not only high accuracy but also low latency on specific hardware platforms. Therefore, it is a multi-objective NAS task (Hsu et al., 2018; Dong et al., 2018; Elsken et al., 2018a; He et al., 2018; Tan et al., 2018), where one of the objectives is non-differentiable (i.e. latency). We use three different hardware platforms, including mobile phone, GPU and CPU, in our experiments. The GPU latency is measured on a V100 GPU with a batch size of 8 (a single batch makes the GPU severely under-utilized). The CPU latency is measured with batch size 1 on a server with two 2.40GHz Intel(R) Xeon(R)
Model | Top-1 | Top-5 | Mobile latency | Hardware-aware | No proxy | No repeat | Search cost (GPU hours)
MobileNetV1 [16] | 70.6 | 89.5 | 113ms | - | - | ✗ | Manual
MobileNetV2 [30] | 72.0 | 91.0 | 75ms | - | - | ✗ | Manual
NASNet-A [38] | 74.0 | 91.3 | 183ms | ✗ | ✗ | ✗ | 48,000
AmoebaNet-A [29] | 74.5 | 92.0 | 190ms | ✗ | ✗ | ✗ | 75,600
MnasNet [31] | 74.0 | 91.8 | 76ms | ✓ | ✗ | ✗ | 40,000
MnasNet (our impl.) | 74.0 | 91.8 | 79ms | ✓ | ✗ | ✗ | 40,000
Proxyless-G (mobile) | 71.8 | 90.3 | 83ms | ✗ | ✓ | ✓ | 200
Proxyless-G + LL | 74.2 | 91.7 | 79ms | ✓ | ✓ | ✓ | 200
Proxyless-R (mobile) | 74.6 | 92.2 | 78ms | ✓ | ✓ | ✓ | 200

Table 2: ProxylessNAS achieves state-of-the-art accuracy (%) on ImageNet (under a mobile latency constraint of ≤ 80ms) with 200× less search cost in GPU hours. "LL" indicates latency regularization loss. Details of MnasNet's search cost are provided in the appendix.
Figure 4: ProxylessNAS consistently outperforms MobileNetV2 under various latency settings.

Figure 5: Our mobile latency model is close to y = x. The latency RMSE is 0.75ms.
CPU E5-2640 v4. The mobile latency is measured on a Google Pixel 1 phone with a batch size of 1. For Proxyless-R, we use ACC(m) × [LAT(m)/T]^w as the optimization goal, where ACC(m) denotes the accuracy of model m, LAT(m) denotes the latency of m, T is the target latency, and w is a hyperparameter for controlling the trade-off between accuracy and latency.
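A one-line sketch of this optimization goal; the exponent value and the example numbers are placeholders, not the paper's settings.

```python
def proxyless_r_reward(acc, lat_ms, target_ms=80.0, w=-0.07):
    """Proxyless-R optimization goal: ACC(m) * [LAT(m) / T]^w.
    With w < 0, models slower than the target T are penalized and faster
    ones are mildly rewarded; w = -0.07 is a placeholder value."""
    return acc * (lat_ms / target_ms) ** w

# With these placeholder numbers, a 74.6%/78ms model scores above a 74.0%/76ms one.
print(proxyless_r_reward(0.746, 78.0), proxyless_r_reward(0.740, 76.0))
```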
Additionally, on the mobile phone, we use the latency prediction model (Appendix B) during architecture search. As illustrated in Figure 5, we observe a strong correlation between the predicted latency and the real measured latency on the test set, suggesting that the latency prediction model can be used to replace the expensive mobile farm infrastructure (Tan et al., 2018) with little error introduced.
Architecture Space. We use MobileNetV2 (Sandler et al., 2018) as the backbone to build the architecture space. Specifically, rather than repeating the same mobile inverted bottleneck convolution (MBConv), we allow a set of MBConv layers with various kernel sizes {3, 5, 7} and expansion ratios {3, 6}. To enable a direct trade-off between width and depth, we initiate a deeper over-parameterized network and allow a block with a residual connection to be skipped by adding the zero operation to the candidate set of its mixed operation. In this way, with a limited latency budget, the network can either choose to be shallower and wider by skipping more blocks and using larger MBConv layers, or choose to be deeper and thinner by keeping more blocks and using smaller MBConv layers.
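A minimal sketch of the per-block candidate set implied by this description. The MBConv and ZeroOp classes are simplified illustrations (no shortcut handling or further tricks), not the paper's implementation.

```python
import torch.nn as nn

class MBConv(nn.Sequential):
    """Minimal mobile inverted bottleneck: 1x1 expand -> kxk depthwise -> 1x1 project."""
    def __init__(self, in_ch, out_ch, kernel, expand, stride):
        mid = in_ch * expand
        super().__init__(
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, kernel, stride, kernel // 2, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )

class ZeroOp(nn.Module):
    """Outputs zeros; with a residual connection the block reduces to identity,
    so selecting this op effectively skips the block (shallower network)."""
    def forward(self, x):
        return x * 0.0

def build_candidate_ops(in_ch, out_ch, stride, has_residual):
    ops = [MBConv(in_ch, out_ch, k, e, stride)
           for k in (3, 5, 7) for e in (3, 6)]          # 6 MBConv variants per block
    if has_residual:                                    # only skippable blocks get "zero"
        ops.append(ZeroOp())
    return ops
```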
Training Details. We randomly sample 50,000 images from the training set as a validation set during the architecture search. The settings for updating the architecture parameters are the same as in the CIFAR-10 experiments, except that the initial learning rate is 0.001. The over-parameterized network is trained on the remaining training images with batch size 256.
Model | Top-1 | Top-5 | GPU latency
MobileNetV2 (Sandler et al., 2018) | 72.0 | 91.0 | 6.1ms
ShuffleNetV2 (1.5) (Ma et al., 2018) | 72.6 | - | 7.3ms
ResNet-34 (He et al., 2016) | 73.3 | 91.4 | 8.0ms
NASNet-A (Zoph et al., 2018) | 74.0 | 91.3 | 38.3ms
DARTS (Liu et al., 2018c) | 73.1 | 91.0 | -
MnasNet (Tan et al., 2018) | 74.0 | 91.8 | 6.1ms
Proxyless (GPU) | 75.1 | 92.5 | 5.1ms

Table 3: ImageNet accuracy (%) and GPU latency (Tesla V100).
ImageNet Classification Results. We first apply ProxylessNAS to learn specialized CNN models on the mobile phone. The summarized results are reported in Table 2. Compared to MobileNetV2, our model improves the top-1 accuracy by 2.6% while maintaining a similar latency on the mobile phone. Furthermore, by rescaling the width of the networks using a multiplier (Sandler et al., 2018; Tan et al., 2018), Figure 4 shows that our model consistently outperforms MobileNetV2 by a significant margin under all latency settings. Specifically, to achieve the same level of top-1 accuracy (around 74.6%), MobileNetV2 has 143ms latency while our model needs only 78ms (1.83× faster). Compared with MnasNet (Tan et al., 2018), our model achieves 0.6% higher top-1 accuracy with slightly lower mobile latency. More importantly, we are much more resource efficient: the GPU-hour cost is 200× lower than MnasNet (Table 2).
Additionally, we also observe that Proxyless-G has no incentive to choose computation-cheap operations if it were not for the latency regularization loss. Its resulting architecture initially has 158ms latency on Pixel 1. After rescaling the network using the width multiplier, its latency reduces to 83ms. However, this model only achieves 71.8% top-1 accuracy on ImageNet, which is 2.4% lower than the result given by Proxyless-G with the latency regularization loss. Therefore, we conclude that it is essential to take latency as a direct objective when learning efficient neural networks.
Besides the mobile phone, we also apply ProxylessNAS to learn specialized CNN models on GPU and CPU. Table 3 reports the results on GPU, where our model still achieves superior performance compared to both human-designed and automatically searched architectures. Specifically, compared to MobileNetV2 and MnasNet, our model improves the top-1 accuracy by 3.1% and 1.1% respectively while being 1.2× faster. Table 4 summarizes the results of our searched models on the three different platforms. An interesting observation is that models optimized for GPU do not run fast on CPU and mobile phone, and vice versa. Therefore, it is essential to learn specialized neural networks for different hardware architectures to achieve the best efficiency on different hardware.
Specialized Models for Different Hardware. Figure 6 shows the detailed architectures of our searched CNN models on the three hardware platforms: GPU/CPU/mobile. We notice that the architecture shows different preferences when targeting different platforms: (i) the GPU model is shallower and wider, especially in early stages where the feature map has higher resolution; (ii) the GPU model prefers large MBConv operations (e.g. 7×7 MBConv6), while the CPU model goes for smaller MBConv operations. This is because GPU has much higher parallelism than CPU, so it can take advantage of large MBConv operations. Another interesting observation is that our searched models on all platforms prefer larger MBConv operations in the first block within each stage, where the feature map is downsampled. We suppose this might be because larger MBConv operations are beneficial for the network to preserve more information when downsampling. Notably, such patterns cannot be captured in previous NAS methods as they force the blocks to share the same structure (Zoph et al., 2018; Liu et al., 2018c).
5 CONCLUSION
We introduced ProxylessNAS that can directly learn neural network architectures on the target task
and target hardware without any proxy. We also reduced the search cost (GPU-hours and GPU
memory) of NAS to the same level as normal training using path binarization. Benefiting from the direct search, we achieve strong empirical results on CIFAR-10 and ImageNet. Furthermore,
[Figure 6: the three searched architectures, shown block by block with operator types and feature map sizes.]
(a) Efficient GPU model found by ProxylessNAS.
(b) Efficient CPU model found by ProxylessNAS.
(c) Efficient mobile model found by ProxylessNAS.
Figure 6: Efficient models optimized for different hardware. "MBConv3" and "MBConv6" denote mobile inverted bottleneck convolution layers with an expansion ratio of 3 and 6, respectively. Insights: GPU prefers a shallow and wide model with early pooling; CPU prefers a deep and narrow model with late pooling. Pooling layers prefer large and wide kernels. Early layers prefer small kernels. Late layers prefer large kernels.
Model | Top-1 (%) | GPU latency | CPU latency | Mobile latency
Proxyless (GPU) | 75.1 | 5.1ms | 204.9ms | 124ms
Proxyless (CPU) | 75.3 | 7.4ms | 138.7ms | 116ms
Proxyless (mobile) | 74.6 | 7.2ms | 164.1ms | 78ms

Table 4: Hardware prefers specialized models. Models optimized for GPU do not run fast on CPU and mobile phone, and vice versa. ProxylessNAS provides an efficient solution to search a specialized neural network architecture for a target hardware architecture, while cutting down the search cost by 200× compared with state-of-the-art NAS methods (Zoph & Le, 2017; Tan et al., 2018).
we allow specializing network architectures for different platforms by directly incorporating the
measured hardware latency into optimization objectives. We compared the optimized models on
CPU/GPU/mobile and raised the awareness of the need to specialize neural network architectures for different hardware architectures.
ACKNOWLEDGMENTS
We thank MIT Quest for Intelligence, MIT-IBM Watson AI lab, SenseTime, Xilinx, Snap Research
for supporting this work. We also thank the AWS Cloud Credits for Research Program for providing cloud computing resources.
REFERENCES

Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. In ICML, 2018.

Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. SMASH: One-shot model architecture search through hypernetworks. In ICLR, 2018.

Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture search by network transformation. In AAAI, 2018a.

Han Cai, Jiacheng Yang, Weinan Zhang, Song Han, and Yong Yu. Path-level network transformation for efficient architecture search. In ICML, 2018b.

Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In NIPS, 2015.

Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

Jin-Dong Dong, An-Chieh Cheng, Da-Cheng Juan, Wei Wei, and Min Sun. DPP-Net: Device-aware progressive search for pareto-optimal neural architectures. In ECCV, 2018.

Thomas Elsken, Jan-Hendrik Metzen, and Frank Hutter. Simple and efficient architecture search for convolutional neural networks. arXiv preprint arXiv:1711.04528, 2017.

Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Multi-objective architecture search for CNNs. arXiv preprint arXiv:1804.09081, 2018a.

Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. arXiv preprint arXiv:1808.05377, 2018b.

Dongyoon Han, Jiwhan Kim, and Junmo Kim. Deep pyramidal residual networks. In CVPR, 2017.

Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.

Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In ICLR, 2016.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, 2018.

Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Chi-Hung Hsu, Shu-Huan Chang, Da-Cheng Juan, Jia-Yu Pan, Yu-Ting Chen, Wei Wei, and Shih-Chieh Chang. MONAS: Multi-objective neural architecture search using reinforcement learning. arXiv preprint arXiv:1806.10332, 2018.

Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In ECCV, 2016.

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.

Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.

Purushotham Kamath, Abhishek Singh, and Debo Dutta. Neural architecture construction using envelopenets. arXiv preprint arXiv:1803.06744, 2018.

Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In ECCV, 2018a.

Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. In ICLR, 2018b.

Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018c.

Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, 2017.

Renqian Luo, Fei Tian, Tao Qin, and Tie-Yan Liu. Neural architecture optimization. arXiv preprint arXiv:1808.07233, 2018.

Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In ECCV, 2018.

Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. In ICML, 2018.

Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, 2018.

Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V Le. MnasNet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.

Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. HAQ: Hardware-aware automated quantization. arXiv, 2018.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, 1992.

Yoshihiro Yamada, Masakazu Iwamura, and Koichi Kise. Shakedrop regularization. arXiv preprint arXiv:1802.02375, 2018.

Zhao Zhong, Junjie Yan, Wei Wu, Jing Shao, and Cheng-Lin Liu. Practical block-wise neural network architecture generation. In CVPR, 2018.

Ligeng Zhu, Ruizhi Deng, Michael Maire, Zhiwei Deng, Greg Mori, and Ping Tan. Sparsely aggregated convolutional networks. In ECCV, 2018.

Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. In ICLR, 2017.

Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In CVPR, 2018.
A THE LIST OF CANDIDATE OPERATIONS USED ON CIFAR-10
We adopt the following 7 operations in our CIFAR-10 experiments:
• 3×3 dilated depthwise-separable convolution
• Identity
• 3×3 depthwise-separable convolution
• 5×5 depthwise-separable convolution
• 7×7 depthwise-separable convolution
• 3×3 average pooling
• 3×3 max pooling
B MOBILE LATENCY PREDICTION
Measuring the latency on-device is accurate but not ideal for scalable neural architecture search, for two reasons: (i) Slow. As suggested in TensorFlow-Lite, we need to average hundreds of runs to produce a precise measurement, which takes approximately 20 seconds. This is far slower than a single forward/backward execution. (ii) Expensive. A lot of mobile devices and software engineering work are required to build an automatic pipeline to gather the latency from a mobile farm. Instead of direct measurement, we build a model to estimate the latency. It needs only 1 phone rather than a farm of phones and has only 0.75ms latency RMSE. We use the latency model to search, and we use the measured latency to report the final model's latency.
We sampled 5k architectures from our candidate space, where 4k architectures are used to build the latency model and the rest are used for test. We measured the latency on a Google Pixel 1 phone using TensorFlow-Lite. The features include (i) type of the operator, (ii) input and output feature map size, and (iii) other attributes like kernel size, stride for convolution, and expansion ratio.
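A minimal sketch of such a latency predictor under these features: one feature vector per measured op, a least-squares fit on measured latencies, and per-op predictions that are summed over the network. The feature layout, the regression model, and the data below are illustrative assumptions; the paper does not commit to this exact model family here.

```python
import numpy as np

def featurize(op_type, in_hw, in_ch, out_ch, kernel, stride, expand, num_op_types=8):
    """One-hot operator type plus size/kernel/stride/expansion attributes
    and a FLOPs-like proxy term."""
    onehot = np.zeros(num_op_types)
    onehot[op_type] = 1.0
    flops_proxy = (in_hw / stride) ** 2 * in_ch * out_ch * kernel ** 2 * expand
    return np.concatenate([onehot, [in_hw, in_ch, out_ch, kernel, stride, expand, flops_proxy]])

# X: one feature row per measured op, y: measured latency in ms (placeholder data).
X = np.stack([featurize(1, 112, 16, 24, 3, 2, 3), featurize(2, 56, 24, 24, 5, 1, 6)])
y = np.array([1.8, 3.2])

# Least-squares fit with a bias term.
w, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)

def predict_latency(features):
    return float(np.append(features, 1.0) @ w)

# Whole-network latency estimate = sum of per-op predictions.
print(predict_latency(featurize(1, 112, 16, 24, 3, 2, 3)))
```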
C DETAILS OF MNASNET'S SEARCH COST
MnasNet (Tan et al., 2018) trains 8,000 mobile-sized models on ImageNet, each of which is trained for 5 epochs for learning architectures. If these models are trained on V100 GPUs, as done in our experiments, the search cost is roughly 40,000 GPU hours.
D IMPLEMENTATION OF THE GRADIENT-BASED ALGORITHM
A naive implementation of the gradient-based algorithm (see Eq. (4)) calculates and stores o_j(x) in the forward step in order to compute ∂L/∂g_j later in the backward step:

\partial L / \partial g_j = \mathrm{reduce\_sum}\big(\nabla_y L \odot o_j(x)\big), \quad (9)

where ∇_y L denotes the gradient w.r.t. the output y of the mixed operation, "⊙" denotes the element-wise product, and "reduce_sum(·)" denotes the sum of all elements.

Notice that o_j(x) is only used for calculating ∂L/∂g_j when the j-th path is not active (i.e. not involved in calculating y). So we do not need to actually allocate GPU memory to store o_j(x). Instead, we can calculate o_j(x) after getting ∇_y L in the backward step, use o_j(x) to compute ∂L/∂g_j following Eq. (9), and then release the occupied GPU memory. In this way, without the approximation discussed in Section 3.2.1, we can reduce the GPU memory cost to the same level as training a compact model.
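A sketch of this recompute-in-backward trick, assuming grad_y (the gradient w.r.t. the mixed operation's output) is already available from backpropagation; the function name and the toy convolutional candidates are illustrative.

```python
import torch
import torch.nn as nn

def arch_gate_grads(ops, x, grad_y):
    """Eq. (9): dL/dg_j = reduce_sum(grad_y * o_j(x)). Each o_j(x) is
    recomputed under no_grad only when needed and freed right after, so
    no extra activations are kept from the forward pass."""
    dL_dg = torch.zeros(len(ops))
    for j, op in enumerate(ops):
        with torch.no_grad():
            o_j = op(x)                       # recomputed on the fly
        dL_dg[j] = (grad_y * o_j).sum()       # element-wise product, then sum
        del o_j                               # release the memory immediately
    return dL_dg

# Toy usage with four convolutional candidates of matching output shapes.
ops = nn.ModuleList([nn.Conv2d(8, 8, k, padding=k // 2) for k in (1, 3, 5, 7)])
x = torch.randn(1, 8, 16, 16)
grad_y = torch.randn(1, 8, 16, 16)
print(arch_gate_grads(ops, x, grad_y))
```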