A Constructive Approach for One-Shot Training of Neural Networks Using
Hypercube-Based Topological Coverings
W. Brent Daniel, Enoch Yeung
Abstract— In this paper we present a novel constructive approach for training deep neural networks using geometric methods. We show that a topological covering can be used to define a class of distributed linear matrix inequalities, which in turn directly specify the shape and depth of a neural network architecture. The key insight is a fundamental relationship between linear matrix inequalities and their ability to bound the shape of data, and the rectified linear unit (ReLU) activation function employed in modern neural networks. We show that unit cover geometry and cover porosity are two design variables in cover-constructive learning that play a critical role in defining the complexity of the model and the generalizability of the resulting neural network classifier. In the context of cover-constructive learning, these findings underscore the age-old trade-off between model complexity and overfitting (as quantified by the number of elements in the data cover) on the one hand, and generalizability on test data on the other. Finally, we benchmark our algorithm on the Iris, MNIST, and Wine datasets and show that the constructive algorithm is able to train a deep neural network classifier in one shot, achieving equal or superior levels of training and test classification accuracy with reduced training time.
I. INTRODUCTION
Artificial neural networks have proven themselves to be useful, highly flexible tools for addressing many complex problems where first-principles solutions are infeasible, impractical, or undesirable. They have been used to address challenging classification problems ranging from wine typing to complex image analysis, voice recognition, language translation, and beyond.
The same flexibility that allows neural networks to be applied in such disparate contexts, however, can also lead to ambiguity in their appropriate definition and training. Deep neural networks, for example, are composed of multiple hidden layers with each hidden layer containing many nodes, each completely connected to the nodes in the preceding layer by a set of weights, $W_h$. Historically there has not been a functional relationship or algorithmic approach that allows researchers to define or derive a neural network's structural characteristics from either the problem specification or the associated training data. A neural network's topology can be optimized for a given problem, but this effectively results in a nested series of optimizations with the outer optimization steps tasked with incrementally assessing the most effective network topology [1].
Similarly, there has been no a priori way to specify reasonable initial values for the weights, $W_h$. In practice these weights are stochastically initialized. The training process then optimizes their values so as to maximize network performance against a specified metric or cost function. This cost function can have complex dependencies on the input parameters. Many algorithms, especially those that rely on gradient information, can become stuck in local minima, limiting the predictive quality of a network for a given training instance. The result is that the same structural topology and training data can yield neural networks with a broad distribution of predictive qualities from one training run to the next. These stochastic effects can be most marked when the volume of training data is relatively small, yielding an optimization problem with relatively few constraints compared to the dimensionality of the parameter space. Such effects can be reduced by the choice of training algorithm, its parameterization, or by repeated training restarts, but this correspondingly increases the computational complexity and training time. Additionally, it's typically impossible to unambiguously specify a finite stop condition for training. This is the result of three factors: the training process is stochastic, the metric space has the potential for local minima, and the global minimum value is unknown beforehand.
In this paper we introduce a constructive method for the design of neural networks that relies on a geometric representation of the data. The algorithm directly addresses the issues outlined above, including: 1) providing a concise structural definition of the neural network and its topology; 2) assigning network connection weights deterministically; 3) incorporating approximations that allow the algorithm to construct neural networks that in many cases have greater mean accuracy and better precision than traditionally trained networks, especially when training data is relatively sparse; 4) having a well-defined stop condition for training; and 5) inherently providing a clear interpretation of what information is encoded within the resulting neural network and how.
II. CONSTRUCTIVE LEARNING USING TOPOLOGICAL COVERS
In what follows, we introduce a three-step approach for constructive training of a ReLU neural network:
- One or more topological covering maps are defined between a network's desired input and output spaces;
- These covering spaces are encoded as a series of linear matrix inequalities; and, finally,
- The series of linear matrix inequalities is translated into a neural network topology with corresponding connection weights.
Many schemes for defining topological covering maps can be imagined, each with their own strengths and weaknesses. In the present work a relatively simple approach has been devised that prizes computational efficiency and classification efficacy under sparse training data. The resulting algorithm
scales roughly linearly in the number of training data points and only weakly with the number of input dimensions. This need not be the case. We'll first discuss a similar, but brute-force, approach, where memory requirements and computational complexity scale exponentially with input dimension.
A. Covering with Uniform Hypercubes
Let's let $X \subseteq \mathbb{R}^n$ and $Y \subseteq \mathbb{R}^m$ denote the potential input and output spaces of a neural network, where $n$ is the input dimension and $m$ is the number of classification categories. The neural network then performs a mapping/transformation, $f: X \rightarrow Y$, in such a way that an output vector, $y \in Y$, is, hopefully, more clearly and trivially indicative of membership within the appropriate category than the original point, $x \in X$. Let's assume that each potential training vector, $x_i$, has an associated label, $L_i$, that indicates its membership in a category, $c \in C$, where $C = \{c_1, \dots, c_m\}$. The set of points belonging to category $c$ is then given by $X^c = \{x_i \in X \mid L_i = c\}$.
The set of training data corresponding to a single category of input, $X^c$, can be thought of as a point cloud in an $n$-dimensional space. This point cloud can be characterized by its maximal and minimal extent, $b^<_j$ and $b^>_j$, in each of the $n$ dimensions, where $j \in \{1,\dots,n\}$. This set of $2n$ constraints is sufficient to define a bounding hypercube represented by a collection of $(n-1)$-dimensional hyperplanes, each plane being orthogonal to one of the $n$ directions. Defining $m$ such bounding hypercubes would allow one to rapidly characterize regions of the input parameter space corresponding to each category.
For complex classification problems, however, it's likely that portions of these bounding hypercubes may overlap, leading to apparent ambiguities in the classification of both training and evaluation data. This scheme could, however, be refined to reduce the potential for ambiguous classifications. Instead of defining one hypercube per classification category, let's define a more general set of hypercubes that will be assigned a classification category based on a posteriori point membership. Let's let $N_T$ denote the total number of training points across all categories. The full set of points, $X^T = \{x_i \mid i \in \{1,\dots,N_T\}\}$, has bounds $b^{T<}_j$ and $b^{T>}_j$ for $j \in \{1,\dots,n\}$. These bounds are sufficient to define a bounding hypercube, $H^T$.
Let's assume that the smallest point cloud feature that should be resolved is of length scale $l$, where $l$ denotes a distance in the parameter space. The full hypercube, $H^T$, may then be subdivided into a set of smaller hypercubes, $H$, by subdividing each axis of $H^T$ into bins of width $l$, with each of the resulting hypercubes having volume $l^n$. The set of cubes, $H$, represents a complete covering of $H^T$ with no overlapping regions and no empty spaces. The set of points that lies within one of these hypercubes, $H_k$, is given by:

$$X_k = \{x_i \mid \forall j \; (b^<_{kj} \le x_{ij} < b^>_{kj})\} \quad (1)$$

where, for example, $b^<_{kj}$ denotes the lower boundary in the $j$-th direction of the $k$-th hypercube. The corresponding set of categories represented by the points in $H_k$ is:

$$C_k = \{L_i \mid \forall j \; (b^<_{kj} \le x_{ij} < b^>_{kj})\} \quad (2)$$
A category may then be assigned to each $H_k$ based on the class membership, $C_k$, of the points that lie within its bounds. If the hypercube contains no points, $C_k = \emptyset$, it is marked as empty; if all points are of the same class, $(\exists c)(\forall i)(C_{ki} = c)$, the hypercube is assigned the category $c$; otherwise, if the hypercube contains points of multiple categories, its classification remains ambiguous.

If the length scale, $l$, is chosen to be sufficiently small, each member, $H_k$, of the set of hypercubes can be classified unambiguously as either empty or containing points from a single class. Let's let $Z^c$ indicate the indices of points in category $c$, $Z^c = \{i \mid L_i = c\}$. The minimum distance between two points of different categories is, then:

$$l^* = \min_{a \in C,\, b \in C,\, a \neq b} \left( \min_{i \in Z^a,\, j \in Z^b} |x_i - x_j| \right) \quad (3)$$

So long as $l$ is chosen roughly such that $l < l^*$, an unambiguous covering can be defined.
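To make the uniform-covering construction concrete, the following minimal Python/NumPy sketch (not the authors' implementation; the helper names are our own) bins training points into a uniform grid of side length $l$, labels each occupied cell as single-class or ambiguous, and computes the minimum inter-class distance $l^*$ of Eq. (3).

```python
import numpy as np

def uniform_cover_labels(X, y, l):
    """Assign each occupied cell of a uniform grid with side length l either its
    single class label or the marker 'ambiguous'; unoccupied cells are simply absent."""
    lo = X.min(axis=0)                              # lower corner b^{T<} of H^T
    idx = np.floor((X - lo) / l).astype(int)        # per-axis bin index of every point
    cells = {}
    for key, label in zip(map(tuple, idx), y):
        cells.setdefault(key, set()).add(label)
    return {k: (next(iter(s)) if len(s) == 1 else "ambiguous") for k, s in cells.items()}

def min_interclass_distance(X, y):
    """l* of Eq. (3): the smallest distance between two points of different classes."""
    best = np.inf
    for c in np.unique(y):
        A, B = X[y == c], X[y != c]
        if len(B) == 0:
            continue
        d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
        best = min(best, d.min())
    return best

# toy example: two well-separated classes in R^2
X = np.array([[0.1, 0.2], [0.3, 0.1], [2.0, 2.1], [2.2, 1.9]])
y = np.array([0, 0, 1, 1])
l_star = min_interclass_distance(X, y)
print(uniform_cover_labels(X, y, l=0.5 * l_star))   # every occupied cell is single-class here
```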
Operationally, the challenge with this approach is that the number of hypercubes, $K$, required to uniformly cover $H^T$ scales poorly with both the minimum feature size, $l$, and the dimension of the parameter space, $n$; that is, $K \propto (1/l)^n$, where $(1/l)$ is proportional to the number of divisions along each axis of $H^T$. For example, even if $(\forall j)\,[(b^{T>}_j - b^{T<}_j)/l \approx 10]$ and $n = 32$, both relatively small numbers for real-world problems, this brute-force approach is essentially computationally intractable, requiring $O(10^{32})$ hypercubes.
B. Hypercube Bisection
The inefficiency of the above approach arises from the need to cover $H^T$ completely with hypercubes of uniform dimension. Even if there are features as small as length scale $l^*$, it doesn't necessarily follow that that level of fidelity is required everywhere. In fact, one could argue that such a level of fidelity is only critical where it is required to delineate the boundary between different classification regions.

An alternative approach would, thus, be to adaptively subsect the hypercube, $H^T$, in such a way as to create a covering of the point cloud regions with a minimal, or at least relatively small, number of subsections. A practicable, reasonably efficient algorithm for accomplishing this is presented here. The bounding hypercube, $H^T$, containing the full set of training points, $X^T$, is first determined. This hypercube is then recursively bisected following a simple set of rules until no more inhomogeneous hypercubes that are unconstrained remain. As the algorithm proceeds, four different sets of hypercubes are tracked: those that contain a homogeneous distribution of point types, $H^H$; those that contain an inhomogeneous distribution of point types whose further bisection would not violate any constraints, $H^I$; those inhomogeneous cubes whose further bisection would violate a constraint, $H^V$; and those that are empty, $H^E$.

The algorithm is initiated with $H^I = \{H^T\}$ and $H^H = H^V = H^E = \emptyset$. The set of inhomogeneous cubes, $H^I$, represents the algorithm's working stack. During each iteration, while $H^I \neq \emptyset$, a hypercube with an inhomogeneous distribution of point categories is pulled from $H^I$. The axis along which the bisection of this hypercube would provide the most benefit, $\gamma^*$, is determined subject to a constraint on the minimum length scale of the daughter hypercubes. Bisection along an axis, $\gamma$, is forbidden if the corresponding dimension of the daughter hypercubes, $H_{(a),\gamma}$ and $H_{(b),\gamma}$, would be of length less than $l$.

The `best' bisection axis is determined by the degree to which bisection along that axis would homogenize the distribution of the point classes in the resulting daughter hypercubes. To determine the best bisection axis, a test bisection is performed along each axis, $\gamma$, and the points of the parent hypercube are assigned to each of the daughters based on their spatial location. The degree of homogeneity of the points within each daughter is then computed. Let's denote the indices of points in the class, $c$, within a daughter hypercube, $H_{(a),\gamma}$, by $Z^c_{(a)}$, and the number of such points by $|Z^c_{(a)}|$. A minimum homogeneity among point class pairs can be defined as:

$$h^<_{(a),\gamma} = \min_{c \in C,\, c' \in C,\, c \neq c'} f\!\left(Z^c_{(a)}, Z^{c'}_{(a)}\right) \quad (4)$$

where

$$f\!\left(Z^c_{(a)}, Z^{c'}_{(a)}\right) = 0, \quad (5)$$

if $|Z^c_{(a)}| = |Z^{c'}_{(a)}| = 0$; and

$$f\!\left(Z^c_{(a)}, Z^{c'}_{(a)}\right) = \left(|Z^c_{(a)}| - |Z^{c'}_{(a)}|\right) / \left(|Z^c_{(a)}| + |Z^{c'}_{(a)}|\right), \quad (6)$$

otherwise. The greater minimum homogeneity of the two daughter hypercubes is, then, computed for each bisection axis, resulting in the following set of tuples:

$$\mathcal{H} = \left\{\left(\gamma,\ \max\!\left(h^<_{(a),\gamma},\ h^<_{(b),\gamma}\right)\right) \mid \gamma \in \{1,\dots,n\}\right\} \quad (7)$$

A function, $g_i(\cdot)$, is then defined that extracts the $i$-th element of a tuple. The maximum benefit of bisection is $h^* = \max\{g_2(h) \mid h \in \mathcal{H}\}$, and the axes of all bisections that would yield the maximum benefit are:

$$\{g_1(h) \mid h \in \mathcal{H} \wedge g_2(h) = h^*\} \quad (8)$$

If the cardinality of this resulting set is one, that axis is selected for bisection. If the cardinality is greater than one, an axis of bisection, $\gamma^*$, is chosen at random from the set.
Note that by setting $f\!\left(Z^c_{(a)}, Z^{c'}_{(a)}\right) = 0$ when $|Z^c_{(a)}| = |Z^{c'}_{(a)}| = 0$, we have chosen to deemphasize the segregation of empty regions as a determining factor in choosing bisection axes. This need not be the case, but it will largely be rendered a moot point when hypercube aspect ratio constraints are added in a following section.
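A minimal sketch of the axis-selection step (Eqs. (4)-(8)) is given below. It is illustrative rather than the authors' code: the tuple set $\mathcal{H}$ is folded into a dictionary, and the numerator of Eq. (6) is taken in absolute value so that the measure is symmetric in the two classes, which we assume is the intended reading.

```python
import numpy as np

def pair_homogeneity(n_c, n_cp):
    """Eqs. (5)-(6): 0 if both classes are absent; otherwise the normalized count
    imbalance (absolute value assumed so the measure is symmetric in c and c')."""
    if n_c == 0 and n_cp == 0:
        return 0.0
    return abs(n_c - n_cp) / (n_c + n_cp)

def min_homogeneity(labels, classes):
    """Eq. (4): minimum pairwise homogeneity of the points inside one daughter cube."""
    counts = {c: int(np.sum(labels == c)) for c in classes}
    return min(pair_homogeneity(counts[a], counts[b])
               for i, a in enumerate(classes) for b in classes[i + 1:])

def best_bisection_axis(X, y, lo, hi, classes, l_min):
    """Test-bisect along every admissible axis and return the axis whose better
    daughter is most homogeneous (Eqs. (7)-(8)); ties are broken at random."""
    rng = np.random.default_rng()
    scores = {}
    for ax in range(X.shape[1]):
        if (hi[ax] - lo[ax]) / 2.0 < l_min:     # daughters would violate the length-scale constraint
            continue
        mid = 0.5 * (lo[ax] + hi[ax])
        in_a = X[:, ax] < mid
        scores[ax] = max(min_homogeneity(y[in_a], classes),
                         min_homogeneity(y[~in_a], classes))
    if not scores:
        return None                              # every axis is forbidden: the cube belongs in H^V
    h_star = max(scores.values())
    return int(rng.choice([ax for ax, h in scores.items() if h == h_star]))

# example: the classes separate cleanly along axis 0, so that axis is chosen
X = np.array([[0.0, 0.0], [0.1, 0.9], [1.0, 0.1], [0.9, 1.0]])
y = np.array([0, 0, 1, 1])
print(best_bisection_axis(X, y, X.min(axis=0), X.max(axis=0), classes=[0, 1], l_min=0.05))
```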
If bisection along all potential axes would run afoul of the length scale constraint, the hypercube is placed into the $H^V$ set and the algorithm is repeated on the next member of $H^I$. Otherwise, the parent hypercube is bisected into two daughter hypercubes along axis $\gamma^*$. Each of the daughter hypercubes is, then, assigned to one of the sets, $H^H$, $H^I$, or $H^E$, depending on the classes of the points that lie within it.

Fig. 1. The base bisection algorithm applied to point clouds of the first two principal component analysis (PCA) directions for parameters related to the classification of three categories of iris: setosa, versicolor, and virginica. Note that the algorithm is unconstrained by hypercube aspect ratio here, allowing for the development of narrow pancake-shaped hypercubes that may not generalize well to the classification of previously unseen data points.
The algorithm's use is illustrated in Fig. 1, where it has been applied to a standard benchmark problem classifying iris species based on their morphological features [2]. A principal component analysis (PCA) was done on the original four series included in the data set and only the two most important components were retained for analysis. Two useful observations are readily made: first, the algorithm achieves 100% accuracy in classifying the training data; second, one might suspect that the resulting covering would generalize poorly when used to classify previously unseen evaluation data. The latter is the result of two characteristics of the algorithm. First, because of the algorithm's sole focus on homogenization of the data points rather than regularization of the covering shapes, narrow peninsulas of one classification category's covering can often penetrate into other regions of the parameter space. Second, there are also large, sometimes inter-penetrating, regions of the relevant parameter space that have not been assigned a classification category at all. Both of these potential shortcomings will be addressed in a following section.
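The overall bookkeeping of the bisection described above can be summarized in a short sketch. This is a simplified illustration under our own assumptions, not the authors' code: for brevity the bisection axis is simply the longest admissible side, where the homogeneity-based rule of the previous sketch (and, below, the aspect-ratio constraint) would be substituted in practice.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Cube:
    lo: np.ndarray            # lower corner b^<
    hi: np.ndarray            # upper corner b^>
    idx: np.ndarray           # indices of the training points inside this cube

def bisect_cover(X, y, l_min):
    """Recursive bisection skeleton: H_I is the working stack; every cube eventually
    lands in H_H (homogeneous), H_E (empty), or H_V (bisection would violate the
    minimum length scale)."""
    root = Cube(X.min(axis=0), X.max(axis=0), np.arange(len(X)))
    if len(set(y)) <= 1:                          # degenerate case: the root is already homogeneous
        return [root], [], []
    H_I, H_H, H_V, H_E = [root], [], [], []
    while H_I:
        cube = H_I.pop()
        widths = cube.hi - cube.lo
        allowed = [j for j in range(X.shape[1]) if widths[j] / 2.0 >= l_min]
        if not allowed:
            H_V.append(cube)
            continue
        ax = max(allowed, key=lambda j: widths[j])   # placeholder for the best-axis rule
        mid = 0.5 * (cube.lo[ax] + cube.hi[ax])
        for side in ("a", "b"):
            lo, hi = cube.lo.copy(), cube.hi.copy()
            if side == "a":
                hi[ax] = mid
                mask = X[cube.idx, ax] < mid
            else:
                lo[ax] = mid
                mask = X[cube.idx, ax] >= mid
            child = Cube(lo, hi, cube.idx[mask])
            labels = set(y[child.idx])
            if not labels:
                H_E.append(child)
            elif len(labels) == 1:
                H_H.append(child)
            else:
                H_I.append(child)
    return H_H, H_V, H_E

# toy usage: two separable classes are covered after a single bisection
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
H_H, H_V, H_E = bisect_cover(X, y, l_min=0.05)
print(len(H_H), len(H_V), len(H_E))               # 2 0 0
```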
C. Direct Restrictions on Cover Geometry to Avoid Overfitting
As noted, the algorithm outlined above can lead to hypercubes with large aspect ratios, essentially high-dimensional pancakes whose extent along some direction(s) is vastly greater than along others. This is typically reflective of some underlying feature of the data that emphasizes bisection along a particular discriminatory axis. While this is of little consequence from the point of view of characterizing the training data, it can lead to poor performance when the resulting network is, then, asked to extrapolate to unseen data points.

One mechanism to address this issue is to limit the maximum aspect ratio that can be attained by a hypercube.

Fig. 2. Constraining daughter hypercube aspect ratio during each choice of bisection direction yields coverings that are more likely to generalize well, but which may still have holes in the covering.
An a posteriori approach would be to have any bisection that results in the generation of daughter cubes with aspect ratios exceeding some threshold recursively trigger the bisection of those daughter hypercubes along orthogonal directions until the aspect ratio of every resulting daughter is within bounds. Unfortunately, this leads to nearly the same explosion in hypercube number that results from the initial uniform covering approach. Alternatively, the aspect ratio may be constrained a priori so that bisection can occur only along directions such that the resulting daughters do not exceed a maximum aspect ratio, $r^*$. This alternative approach scales far better. It's slightly less efficient at developing coverings for training data than unconstrained bisection, in the sense that more optimal bisections are sometimes prevented. It results, however, in coverings with far fewer artifacts that could subsequently lead to odd generalizations, and does so with minimal additional computational burden.
If the breadths of a hypercube along each of its axes, $j$, are enumerated in a set, $B = \{(b^>_j - b^<_j) \mid j \in \{1,\dots,n\}\}$, the hypercube's corresponding maximum aspect ratio is $r = \max B / \min B$. The algorithm above is, thus, updated such that viable bisection axes are limited to those along which bisection would not violate the constraint, $r \le r^*$, for the daughter hypercubes. The impact of incorporating such a constraint is illustrated in Fig. 2. It should be noted that, in general, as well as specifically in the context of aspect ratio constraints, data sets should likely be normalized along each axis before analysis such that size and aspect ratio constraints are applied uniformly in all directions.
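The aspect-ratio check is straightforward to state in code; the sketch below (illustrative only, with our own helper names) computes $r = \max B / \min B$ for the would-be daughters and returns the axes along which bisection remains admissible under both the aspect-ratio bound $r^*$ and the minimum length scale $l$.

```python
import numpy as np

def axes_allowed_by_aspect(lo, hi, r_max, l_min):
    """Axes along which bisection keeps both daughters within the aspect-ratio
    bound r <= r_max and the minimum length scale l_min; daughters share the
    parent's sides except for the halved side along the bisection axis."""
    viable = []
    for ax in range(len(lo)):
        half = (hi[ax] - lo[ax]) / 2.0
        if half < l_min:
            continue
        widths = hi - lo
        widths[ax] = half                  # both daughters have this side halved
        if widths.max() / widths.min() <= r_max:
            viable.append(ax)
    return viable

# example: a 4 x 1 box under r_max = 3 may only be cut along its long axis
print(axes_allowed_by_aspect(np.array([0.0, 0.0]), np.array([4.0, 1.0]), r_max=3.0, l_min=0.1))   # [0]
```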
D. Reducing Model Complexity via Classification Porosity
As part of the bisection algorithm's implementation, the set of daughter hypercubes which are empty, $H^E$, is tracked and, for efficiency reasons, removed from consideration for further bisection. As these cells contain no members of the original point cloud from which to assign a type, they essentially represent holes in the classification space. These holes partly reflect the underlying geometry of the original point cloud, but also contain relics of the bisection algorithm, including those resulting from random choices of bisection direction in the daughter hypercube's parental lineage. Both reduce the resulting neural network's ability to generalize classification knowledge beyond regions contained within the set of non-empty hypercubes in the topological covering.

Fig. 3. Porous, empty regions of the classification/parameter space can be filled in using a recursive cellular-automata-based approach. The entire bisection algorithm scales well as the dimension, $d$, is increased. A 3D covering for iris classification based on the first three PCA directions is illustrated in the lower panel. The algorithm has been tested up to 32 dimensions. Computation time scales with the number of points in the point cloud rather than with the dimension.
How these empty regions are handled has a direct impact on network generalization. There is unlikely to be a single best approach, due to the no-free-lunch theorem. Currently, empty cells are recursively filled in using what is, effectively, a cellular automaton schema. Empty cells bordering non-empty cells are assigned a classification category based on their total area of contact with non-empty cells of each type. The classification with maximum boundary area is assigned to the empty cell. As some empty cells may initially touch no non-empty cells, this process is repeated until all empty cells within $H^T$ have been assigned a classification category, yielding a non-porous covering. The covering that results after a category is assigned to empty hypercubes is shown in Fig. 3 for both two- and three-dimensional versions of the iris classification problem.
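One simple realization of this fill-in step is sketched below (an O(K^2) illustration under our own data layout, not the authors' implementation): the "area of contact" between two cells of the cover is taken to be the measure of their shared face, and unlabeled cells repeatedly adopt the class with the largest total contact area among labeled neighbors.

```python
import numpy as np
from collections import defaultdict

def face_contact(lo1, hi1, lo2, hi2, tol=1e-9):
    """(n-1)-dimensional measure of the face shared by two axis-aligned boxes;
    0 if they meet only at an edge/corner or not at all."""
    touch_axis = None
    for ax in range(len(lo1)):
        if abs(hi1[ax] - lo2[ax]) < tol or abs(lo1[ax] - hi2[ax]) < tol:
            if touch_axis is not None:
                return 0.0
            touch_axis = ax
    if touch_axis is None:
        return 0.0
    area = 1.0
    for ax in range(len(lo1)):
        if ax == touch_axis:
            continue
        overlap = min(hi1[ax], hi2[ax]) - max(lo1[ax], lo2[ax])
        if overlap <= tol:
            return 0.0
        area *= overlap
    return area

def fill_empty_cells(cells):
    """cells: list of dicts {'lo', 'hi', 'label'} with label None for empty cells.
    Each pass labels every empty cell that touches a labeled one with the class
    owning the largest total contact area; passes repeat until nothing changes."""
    while any(c["label"] is None for c in cells):
        updates = {}
        for i, c in enumerate(cells):
            if c["label"] is not None:
                continue
            areas = defaultdict(float)
            for d in cells:
                if d["label"] is None or d is c:
                    continue
                areas[d["label"]] += face_contact(c["lo"], c["hi"], d["lo"], d["hi"])
            if areas and max(areas.values()) > 0:
                updates[i] = max(areas, key=areas.get)
        if not updates:
            break                          # remaining empty cells touch no labeled cell
        for i, label in updates.items():
            cells[i]["label"] = label
    return cells

# three unit squares in a row: both empty cells inherit 'setosa' within two passes
cells = [{"lo": np.array([0.0, 0.0]), "hi": np.array([1.0, 1.0]), "label": "setosa"},
         {"lo": np.array([1.0, 0.0]), "hi": np.array([2.0, 1.0]), "label": None},
         {"lo": np.array([2.0, 0.0]), "hi": np.array([3.0, 1.0]), "label": None}]
print([c["label"] for c in fill_empty_cells(cells)])
```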
E. Hypercubes As Linear Inequalities
An advantage to using hypercubes as the building blocks of a topological covering is that they are convex $n$-gons with simple bounding hyperplanes. This convexity allows their interior to be delineated by a logical combination of
inequalities, one per bound. Because the bounds are simple hyperplanes, the inequalities are all linear. For a given hypercube, $H_k$, the set of boundaries, $b^<_{kj}$ and $b^>_{kj}$, can, thus, be translated into a matrix of weights and a corresponding vector of biases:

$$W_k x + v_k \le 0, \quad (9)$$

where $W_k \in \mathbb{R}^{2n \times n}$ and $v_k \in \mathbb{R}^{2n}$. The $n$ constraints corresponding to the upper bounds, $b^>_{kj}$, have weights $W^>_{kij} = 1$ where $i, j \in \{1,\dots,n\} \wedge i = j$, $0$ for $i \neq j$, and biases $v^>_{kj} = -b^>_{kj}$. The corresponding entries for the lower bounds are $W^<_{kij} = -1$ where $i, j \in \{1,\dots,n\} \wedge i = j$, $0$ for $i \neq j$, and $v^<_{kj} = b^<_{kj}$. We then have:

$$W_k = \begin{bmatrix} W^>_k \\ W^<_k \end{bmatrix}, \quad v_k = \begin{bmatrix} v^>_k \\ v^<_k \end{bmatrix}. \quad (10)$$
The region of space composed of points, $x$, that satisfy all of the inequalities corresponds to the interior of hypercube $H_k$. If one or more of the inequalities is unsatisfied, the point lies on the exterior of $H_k$.
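Eqs. (9)-(10) translate directly into code; the sketch below (illustrative, with our own function name) builds $W_k$ and $v_k$ from the per-axis bounds and checks interior membership by testing all $2n$ inequalities.

```python
import numpy as np

def hypercube_lmi(b_lo, b_hi):
    """Build (W_k, v_k) of Eqs. (9)-(10) so that W_k x + v_k <= 0 holds element-wise
    exactly when x lies inside the hypercube with per-axis bounds b_lo and b_hi."""
    n = len(b_lo)
    W_upper = np.eye(n)                         # rows enforcing  x_j - b^>_{kj} <= 0
    v_upper = -np.asarray(b_hi, dtype=float)
    W_lower = -np.eye(n)                        # rows enforcing -x_j + b^<_{kj} <= 0
    v_lower = np.asarray(b_lo, dtype=float)
    return np.vstack([W_upper, W_lower]), np.concatenate([v_upper, v_lower])

W, v = hypercube_lmi([0, 0], [1, 1])            # the unit square in R^2
print(np.all(W @ np.array([0.5, 0.5]) + v <= 0))   # True: interior point
print(np.all(W @ np.array([1.5, 0.5]) + v <= 0))   # False: exterior point
```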
F. Linear Matrix Inequalities Directly Specify ReLU Layers in a Neural Network
This system of inequalities should look suspiciously like the set of equations governing the input to a hidden layer within a neural network, $W_k x + v_k$. And, in fact, using a ReLU activation function, the output of said layer would take the form:

$$z = \mathrm{ReLU}(W_k x + v_k), \quad (11)$$

where $\mathrm{ReLU}(x) = x\,\theta(x)$ with $\theta(x)$ being the Heaviside step function. Note that by $z = \mathrm{ReLU}(\cdot)$, we mean to imply the element-wise application of the ReLU function.

The nonlinearity of the ReLU function is similar to the boolean behavior of an inequality: yielding zero if the corresponding inequality is satisfied, and a finite value if it is not. Further, the sum, $\sum_i z_i$, will take on a value of zero if all of the constraints are satisfied, that is, if a point lies within the interior of the corresponding hypercube, and a finite value if it does not. Let's, then, define a function:

$$\phi_k(x) = \sum_i \left[\mathrm{ReLU}(W_k x + v_k)\right]_i, \quad (12)$$

which indicates whether a point, $x$, lies within the bounds of a hypercube, $H_k$. This function has a simple, one-to-one correspondence with a neural network that has $n$ input nodes, represented by $x$, and $2n$ hidden layer nodes, one corresponding to each constraint. The output of the hidden layer nodes is, then, summed, i.e., fed into a second hidden layer with a single node having a ReLU activation function, uniform weights of 1, and bias 0.
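The correspondence of Eq. (12) with a two-layer ReLU block is easy to verify numerically; the sketch below (illustrative only) evaluates $\phi_k$ for the unit square.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# constraints of the unit square [0,1]^2, assembled as in Eq. (10)
n = 2
b_lo, b_hi = np.zeros(n), np.ones(n)
W = np.vstack([np.eye(n), -np.eye(n)])          # 2n x n first-layer weights
v = np.concatenate([-b_hi, b_lo])               # 2n first-layer biases

def phi_k(x):
    """Eq. (12): the first hidden layer applies ReLU(W x + v) element-wise and a
    single second-layer node sums the result (its own ReLU is the identity here,
    since the summands are non-negative); 0 inside the cube, positive outside."""
    return float(np.sum(relu(W @ x + v)))

print(phi_k(np.array([0.5, 0.5])))   # 0.0: interior point, all constraints satisfied
print(phi_k(np.array([2.0, 0.5])))   # 1.0: the x_1 <= 1 constraint is violated by 1
```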
If there are $K_c$ hypercubes that cover a region in parameter space corresponding to classification category $c$, then there are $K_c$ corresponding functions, $\phi^c_k$, each of which, in turn, corresponds to an independent hidden layer topology with uniquely defined weights. These $K_c$ outputs can then be combined in such a way as to yield $\Phi^c(x) \approx 0$ if the point, $x$, lies on the exterior of all $K_c$ hypercubes, and $\Phi^c(x) \approx 1$ if it lies within one of them:

$$\Phi^c(x) = \sum_k \left[1 - \phi^c_k(x)\right]. \quad (13)$$

This corresponds to an indication by the corresponding neural network of class membership. The $|C|$ outputs, $\Phi^c(x)$, where $|\cdot|$ indicates cardinality, can then be combined with a softmax function to yield a traditional, one-hot neural network. The resulting neural network has the relatively shallow topology indicated in Fig. 4, where the nodes and connections corresponding to a classifier for a single category are illustrated. The outputs of each classifier are then combined, as suggested just above, using a softmax function.

Fig. 4. A schematic illustrating the realization of a neural network architecture generated by the constructive learning algorithm on the Iris training data set. Nodes represent variables in input, hidden, or output layers, while edges represent the multiplicative action of rows in the weight matrix of a given layer. Bias terms are not represented in this diagram. The realized neural network architecture spatially organizes nodes that represent individual hypercubes in the topological covering as collections of nodes. Each node and layer can be directly linked to an aspect of the underlying topological cover of the data, resulting in an interpretable deep neural network. (Labeled stages in the schematic: input layer; single hyperplane constraints; the set of constraints defining a single hypercube; approximate logical AND; approximate logical NEGATION; approximate logical OR; single approximate Boolean classifier; output.)
Notice that there is some ambiguity in the resulting classification, indicated by the approximate equalities, $\Phi^c(x) \approx 0$ and $\Phi^c(x) \approx 1$, above. This ambiguity results from the fact that a logical negation is necessary; that is, the point $x$ must be determined to lie within at least one hypercube, which is to say that it must not lie outside of all of the corresponding hypercubes. In the definition of the ReLU function, however, $\theta(x)$ yields a value of 0 both for values of $x < 0$ (points on the interior of a hypercube) and for values of $x = 0$ (points on, or infinitesimally close to, a hypercube boundary from the exterior). This ambiguity finds its way into $\Phi^c(x)$. What is needed, instead, is a function, $\Phi^c(x; B) = \Phi^c(|x - B|)$, with the following behavior:

$$\lim_{|x-B| \uparrow 0} \Phi^c(|x-B|) = 0 \quad \text{while} \quad \lim_{|x-B| \downarrow 0} \Phi^c(|x-B|) = 1, \quad (14)$$

where $|x - B|$ indicates a distance to the nearest boundary, and $\uparrow$ and $\downarrow$ indicate an approach to that boundary from the hypercube's interior and exterior, respectively. This can
effectively be achieved with a minor change to our previous definition of $\phi_k(x)$, by substituting:

$$\Phi^c(x; \epsilon) = \sum_k \left[1 - \phi^c_k(x)/\epsilon\right], \quad (15)$$

where $\epsilon$ represents a normalized length scale, assuming the data, itself, has been normalized. In the limit that $\epsilon \to 0$, the desired functional behavior is achieved. We note that for finite values of $\epsilon$, say $\epsilon = l$, the boundary of the hypercubes is essentially softened over a length scale $l$. In some cases this has yielded better accuracy with respect to the evaluation data, effectively providing a smoother, more natural generalization of the training data.
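Putting the pieces together, the sketch below illustrates one consistent realization of Eqs. (13)-(15): each hypercube indicator is negated as $\mathrm{ReLU}(1 - \phi^c_k(x)/\epsilon)$ (approximately 1 inside the cube and 0 outside; wrapping the negation in a ReLU is our assumption, consistent with the approximate-NEGATION stage of Fig. 4), the per-class terms are summed as an approximate OR, and the class scores are combined with a softmax.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def phi(x, W, v):
    """phi_k(x) of Eq. (12): zero inside the hypercube, positive outside."""
    return np.sum(relu(W @ x + v))

def class_scores(x, cubes_by_class, eps=1e-3):
    """Approximate Boolean classifier: negate each cube indicator with
    ReLU(1 - phi/eps), sum within each class (approximate OR), then apply a softmax."""
    scores = np.array([sum(relu(1.0 - phi(x, W, v) / eps) for W, v in cubes)
                       for cubes in cubes_by_class])
    e = np.exp(scores - scores.max())            # numerically stable softmax
    return e / e.sum()

def cube(lo, hi):
    """(W_k, v_k) for one hypercube, as in Eqs. (9)-(10)."""
    n = len(lo)
    return (np.vstack([np.eye(n), -np.eye(n)]),
            np.concatenate([-np.asarray(hi, float), np.asarray(lo, float)]))

# two classes, one unit hypercube each: [0,1]^2 for class 0 and [2,3]^2 for class 1
cubes_by_class = [[cube([0, 0], [1, 1])], [cube([2, 2], [3, 3])]]
print(class_scores(np.array([0.5, 0.5]), cubes_by_class))   # class 0 dominates
print(class_scores(np.array([2.5, 2.5]), cubes_by_class))   # class 1 dominates
```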
III. NUMERICAL RESULTS
The performance of correct-by-construction neural networks, developed using the bisection algorithm described above, was compared to that of standard neural networks using a number of simple benchmark datasets. As noted in the introduction, a direct comparison is hindered by a couple of challenges associated with the standard approach to neural network development. These challenges highlight potential benefits of the approach that's been laid out herein. Both represent ambiguous areas in traditional network development and training. The first arises in the determination of an appropriate neural network topology. Traditionally, this is not learned as part of the training process, but must be specified statically beforehand. Yet, there is no a priori rule for specifying network topology. It's historically been more art and experimentation than science. For the purposes of comparison here, many different topologies were explored, ranging from 8 to 128 nodes in each of between one and three hidden layers. For the simple classification problems explored here, neither variable appeared to affect outcomes strongly. As such, all reported accuracies were extracted from a network with a single hidden layer of 32 nodes.
The second ambiguity concerns a determination of when a network has been trained “enough”. In this particular study, evaluation accuracy appeared to depend far more strongly on the randomized weights initially assigned to the network than on the training time provided to it. The initial results presented here are a bit heuristic, as the number of training iterations for each benchmark was selected by `eye' when training performance appeared to plateau. Future work will include a quantitative metric for training closure. As can be seen in the last two rows of the second table, however, increasing training time by an order of magnitude had minimal impact on accuracy.

Notably, both of these ambiguities are addressed by the correct-by-construction algorithm. The determination of a hypercube covering uniquely specifies a resulting neural network topology. Additionally, the bisection algorithm simply runs to a deterministic completion, i.e., to the point that all training data are correctly classified, or a minimal scale has been reached, removing the ambiguity surrounding training time. Further, the performance of the bisection algorithm is far more repeatable than that of a traditionally trained neural network, especially when the number of data points available for training is relatively small.
Quantitative testing was performed on a number of relatively simple, standard classification test problems. The first of these was the Iris Data Set [2]. The Iris Data Set contains measurements of sepal length, sepal width, petal length, and petal width for 50 flower specimens from each of three iris species (for a total of 150 data points). A principal component analysis (PCA) was first done on these four parameters. Three different classification problems were then attempted using the first two, three, and four PCA directions. In each case a random selection of 70% of the points was used for training and the remaining 30% were used for evaluation. This process was replicated 20 times, as the random selection of training and evaluation sets, as well as stochastic processes within the training algorithms, can impact performance. The results are reported in the tables below, first for the correct-by-construction algorithm, then for a traditionally-trained neural network.
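For reference, the evaluation protocol just described can be reproduced with a few lines of scikit-learn; the snippet below is purely illustrative of the protocol (PCA projection, repeated random 70/30 splits) and uses a generic MLP as a stand-in for either classifier, not the authors' implementations.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# project the four Iris measurements onto the first two PCA directions
X, y = load_iris(return_X_y=True)
X2 = PCA(n_components=2).fit_transform(X)

# 20 random 70/30 train/evaluation splits; report mean and standard deviation of accuracy
accuracies = []
for seed in range(20):
    X_tr, X_ev, y_tr, y_ev = train_test_split(X2, y, test_size=0.3, random_state=seed)
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=seed)
    clf.fit(X_tr, y_tr)
    accuracies.append(clf.score(X_ev, y_ev))

print(f"mean accuracy: {np.mean(accuracies):.1%}, std dev: {np.std(accuracies):.1%}")
```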
There are two critical points to note when comparing the two. First, the bisection algorithm is able to achieve roughly 96% accuracy when using two, three, or four PCA directions, as compared to the roughly 75% mean accuracy achieved by a traditionally trained neural network. Second, this accuracy is repeatably achieved by the correct-by-construction algorithm, with standard deviations of only 3% or so from run to run. By comparison, the traditionally trained network has much more scattershot performance, with a typical standard deviation of between 19% and 24% between runs.
A similar comparison was made using a Wine Data Set [3]. In this data set, thirteen chemical and physical measurements were reported for each of 178 different wines, each wine an example of one of three different varietals. Measurements included such things as the alcohol content, non-flavanoid phenol levels, and visual hue. PCA was, again, done on the measurements before classification. The Hypercube Bisection Algorithm was able to achieve a mean accuracy of 87% with a standard deviation of 4.4%. The traditionally trained neural network yielded a mean accuracy of 82% with a standard deviation of 17%. Again, the bisection algorithm yielded better accuracy with significantly better reproducibility. Note that this isn't meant to imply that a traditionally-trained neural network can't perform at better than 82% accuracy on this classification problem, only that there is a distribution of performance outcomes, the mean of which is characterized by an 82% accuracy.
Finally, the two approaches were compared using the MNIST Digits data set. Each member of the data set represents a 28 x 28 pixel grayscale image of a handwritten digit, resulting in a total of 784 potential input values. In this comparison, rather than using a convolutional approach, PCA was performed on the data set and the first ten (10) PCA directions selected as inputs. Comparisons were made using both 200 and 2,000 point subsets of the data. Using 200 points, the mean accuracies of the bisection algorithm and traditionally trained network were fairly similar, 62% and 64%, respectively. Though, again, the bisection algorithm
yielded more repeatable results, 5.0% standard deviation
versus 11% standard deviation. With 2,000 training points
the mean accuracies were again similar, in the mid- to upper-
seventies, though the standard deviation of the bisection
algorithm accuracy was only 1.9%, while it was 8%-10%
for the traditionally trained network.
These results suggest that for relatively simple classification problems the correct-by-construction, bisection-algorithm-based neural networks described above may require less computational complexity (for small numbers of training points), yield comparable or greater accuracies, and provide significantly more reproducible results than traditionally-trained neural networks. They also remove ambiguities surrounding network topology and training closure that are inherent in traditional network definition and training approaches.
TABLE I
BENCHMARKS OF HYPERCUBE BISECTION ALGORITHM. ALGORITHM WAS TESTED USING A RANDOM SELECTION OF 70% OF THE DATA FOR TRAINING AND 30% FOR EVALUATION. RESULTS REPRESENT THE MEAN ACCURACY OVER 20 SUCH TESTS.
Dataset Accuracy Std Dev Clock Time
Iris 2D PCA (150 pts) 95% 3.6% 32ms
Iris 3D PCA (150 pts) 96% 2.8% 38ms
Iris 4D PCA (150 pts) 96% 2.3% 45ms
Wine PCA (178 pts) 87% 4.4% 199ms
MNIST 10D PCA (200 pts) 62% 5.0% 512ms
MNIST 10D PCA (2000 pts) 76% 1.9% 22s
TABLE II
BENCHMARKS FOR TRADITIONALLY TRAINED NEURAL NETWORK. ALGORITHM WAS TESTED USING A RANDOM SELECTION OF 70% OF THE DATA FOR TRAINING AND 30% FOR EVALUATION. RESULTS REPRESENT THE MEAN ACCURACY OVER 20 SUCH TESTS.
Dataset Accuracy Std Dev Clock Time
Iris 2D PCA (150 pts) 73% 24% 87ms
Iris 3D PCA (150 pts) 77% 22% 85ms
Iris 4D PCA (150 pts) 75% 19% 85ms
Wine PCA (178 pts) 82% 17% 123ms
MNIST 10D PCA (200 pts) 64% 11% 124ms
MNIST 10D PCA (2000 pts) 75% 8% 330ms
MNIST 10D PCA (2000 pts) 79% 10% 4.04s
A. Conclusions
In summary, deep learning algorithms rely on iterative and corrective refinement of a collection of model parameters and hyper-parameters to obtain a sufficiently accurate learning representation. Such approaches require intensive training and extensive computational resources.
In this paper we presented a novel constructive approach for training deep neural networks using geometric methods. The key insight is a fundamental relationship between linear matrix inequalities and their ability to bound the shape of data, and the rectified linear unit (ReLU) activation function employed in modern neural networks. We show that the constructive algorithm is able to train a deep neural network classifier in one shot, and we show that it achieves equal or superior levels of training and test classification accuracy with far less training time on classical machine learning classification benchmark datasets. In particular, we motivate the use of adaptive cover sizing approaches for generating data representations and their corresponding neural network representations. In the context of topological cover learning, we revisit the age-old trade-off between model complexity (as quantified by the number of elements in the data cover) and generalizability. Our algorithm is the first in a new class of algorithms for constructive deep learning; future work will investigate Reeb graph and Morse theory methods for data shape decomposition and neural network parameterization.
REFERENCES
[1] Lam, H. K., et al. "Tuning of the structure and parameters of neural network using an improved genetic algorithm." Industrial Electronics Society, 2001. IECON'01. The 27th Annual Conference of the IEEE. Vol. 1. IEEE, 2001.
[2] Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[3] Forina, M., et al. "PARVUS—An Extendible Package for Data Exploration, Classification and Correlation." Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.
[4] Bengio, Yoshua. "Practical recommendations for gradient-based training of deep architectures." Neural Networks: Tricks of the Trade. Springer, Berlin, Heidelberg, 2012. 437-478.
[5] Parekh, Rajesh, Jihoon Yang, and Vasant Honavar. "Constructive neural-network learning algorithms for pattern classification." IEEE Transactions on Neural Networks 11.2 (2000): 436-451.
[6] Nair, Vinod, and Geoffrey E. Hinton. "Rectified linear units improve restricted Boltzmann machines." Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010.
[7] Shang, Wenling, et al. "Understanding and improving convolutional neural networks via concatenated rectified linear units." International Conference on Machine Learning, 2016.
[8] Arora, Raman, et al. "Understanding deep neural networks with rectified linear units." arXiv preprint arXiv:1611.01491 (2016).
[9] Clevert, Djork-Arné, Thomas Unterthiner, and Sepp Hochreiter. "Fast and accurate deep network learning by exponential linear units (ELUs)." arXiv preprint arXiv:1511.07289 (2015).
[10] Baldi, Pierre. "Autoencoders, unsupervised learning, and deep architectures." Proceedings of ICML Workshop on Unsupervised and Transfer Learning. 2012.
[11] Boscaini, Davide, et al. "Learning shape correspondence with anisotropic convolutional neural networks." Advances in Neural Information Processing Systems. 2016.
[12] Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).
[13] Hadgu, Asmelash Teka, Aastha Nigam, and Ernesto Diaz-Aviles. "Large-scale learning with AdaGrad on Spark." Big Data (Big Data), 2015 IEEE International Conference on. IEEE, 2015.