Holographic Neural Architectures
Tariq Daouda
Institute for Research in Immunology and Cancer
Department of Biochemistry
Université de Montréal
tariq.daouda@umontreal.ca
Jeremie Zumer
Institute for Research in Immunology and Cancer
Department of Computer Science and Operations Research
Université de Montréal
jeremie.zumer@umontreal.ca
Claude Perreault
Institute for Research in Immunology and Cancer
Department of Medicine
Université de Montréal
claude.perreault@umontreal.ca
Sébastien Lemieux
Institute for Research in Immunology and Cancer
Department of Computer Science and Operations Research
Université de Montréal
s.lemieux@umontreal.ca
Abstract
Representation learning is at the heart of what makes deep learning effective. In
this work, we introduce a new framework for representation learning that we call
Holographic Neural Architectures (HNAs). In the same way that an observer
can experience the 3D structure of a holographed object by looking at its holo-
gram from several angles, HNAs derive Holographic Representations from the
training set. These representations can then be explored by moving along a con-
tinuous bounded single dimension. We show that HNAs can be used to build
generative networks and state-of-the-art regression models, and that they are inher-
ently highly resistant to noise. Finally, we argue that because of their denoising
abilities and their capacity to generalize well from very few examples, models
based upon HNAs are particularly well suited for biological applications where
training examples are rare or noisy.
1 Introduction
The success of deep learning algorithms during the last decade can be attributed to their ability to
learn relevant latent representations [6]. However, despite these recent advances, generalization from
few and potentially noisy examples remains an open problem. Most deep learning architectures have
many more parameters than the number of examples they are trained upon and are therefore partic-
ularly sensitive to noise and overfitting on small datasets. These limitations can make deep learning
algorithms hard to apply to real life situations where data is scarce and noisy. These problems could
in theory be overcome if we were able to effectively learn relevant, smooth, latent representations
from few examples and despite the presence of noise.
Manifold learning is an emergent property of some deep learning algorithms. A class of algorithms
that aims at deriving latent, lower-dimensional manifolds from the training set is Autoencoders
(AEs) with bottleneck layers [5]. Since these networks reconstruct their input using fewer units than
the input dimensionality, it is assumed that the bottleneck representation `summarises' the input.
More recently, the application of variational approaches to auto-encoders has led to the introduc-
tion of Variational Autoencoders (VAEs) [16]. VAEs address the problem of manifold learning by
constraining the parameter distribution to a simpler distribution family than the true distribution.
This approach forces the latent space to be dense, enabling straightforward sampling for generation.
In our approach, manifold learning is also a consequence of the network's training. We force the
network to generalize by placing an extremely severe bottleneck over the information received from
training examples. Here, we project all training examples into a single bounded dimension.
As with VAEs, we also combine the input information with an optimized prior. However, we treat
the prior as a separate input to the network. Because the network has very little information from the
training examples, it must complement this information with an accurate general representation of the training set.
Because these representations are continuous, multi-dimensional, and represent the whole training
set, we call them `Holographic Representations' and the architectures capable of generating them
`Holographic Neural Architectures' (HNAs).
In this work, we introduce the general framework of HNAs as well as an implementation of the
concept. We tested our models on several datasets: MNIST [18], Fashion-MNIST [29], SVHN [22],
Olivetti [26], TCGA [14] and IEDB [28]. We show that HNAs are inherently highly resistant to
input noise. We report that HNAs can be used to build noise resistant generative networks even on
very small datasets, as well as state-of-the-art regression models. Finally, we also explore the effects
of different activation functions on the performances of HNAs. Our results show that sine-type
activations outperform more widely used non-linearities.
2 Architecture
We implemented HNAs as two joined networks, as depicted in figure 1: a backbone network that
learns distributions over the training data, and an observer network whose output modulates the
activations of the backbone network neurons. The backbone is itself composed of two parts: an
optimized prior, and a posterior network that integrates the prior with the information received from
the observer. Only the observer network receives information specific to the desired output. The
combination in the backbone of the prior and the specific information encoded by the observer pro-
duces an approximation of the desired output. Interestingly, without the measurement made by the
observer network, the output of the backbone network alone represents an undifferentiated state, i.e.
it is a distribution of all possible outputs. Furthermore, because the output of the observer network
is directly connected to all neurons inside the backbone network, these neurons are entangled. In
other words, their activations are synchronized in a way that causes the output of the network to
collapse into values representing a more restricted set of possible outputs. Because the role of the
backbone network is to explicitly encode the space of shared variations among examples, it allows
for the specific information of the observer to be encoded in a very low-dimensional space. For
this paper we used a single neuron with a sine activation as the output of the observer network,
reducing all the specific information about the input to a single, bounded and continuous dimension.
Figure 1 shows the architecture that we used to implement our HNAs. The prior can easily be
conditioned (for example on the class label) by having one distribution for every condition. Here
we used Gaussian mixtures as priors and optimized the parameters of each mixture during training.
Every hidden layer in the backbone, as well as the output layer, receives the output of the previous
layer concatenated with a skip connection to the observer's output. These skip connections are
expanded to match the size of the previous hidden layer. Therefore, the activations of the layers of
the backbone network are computed using the following formula:
$h_n = f(W_n \cdot (h_{n-1} \mid s_n)) + b_n$    (1)
Figure 1: The HNA used in this work. The observer network maps the input to a single observer
value (Obs); the backbone network combines the (optionally conditioned) prior with skip connections
from the observer's output to produce the output.
where $h_n$ is the activation of the $n$-th layer, $f$ is an activation function, $s_n$ is the $n$-th skip connection
of the same size as $h_{n-1}$, $W_n$ a parameter matrix of size $N \times 2N$, $b_n$ is a bias parameter, $\mid$ indicates
a concatenation and $\cdot$ is matrix multiplication. Thanks to the skip connections, the integration of
the prior with the observer's measurement is performed throughout the network. Every layer in the
backbone network has its activation modified by the output of the observer, both directly and through
the activation of its previous layer.
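As a concrete illustration, the sketch below implements the layer update of equation (1) in plain numpy. The function and variable names are ours, and the expansion of the scalar observer output into a skip vector of size N follows the description above; it is a minimal sketch under those assumptions, not the authors' Mariana implementation.

```python
import numpy as np

def backbone_layer(h_prev, observer_value, W, b, f=np.sin):
    """One backbone layer update, following equation (1).

    h_prev:         activation of the previous layer, shape (N,)
    observer_value: scalar output of the observer network, in [-1, 1]
    W:              parameter matrix of shape (N, 2N)
    b:              bias vector of shape (N,)
    f:              activation function (sine here, as used in the paper)
    """
    N = h_prev.shape[0]
    s = np.full(N, observer_value)          # skip connection expanded to size N
    concat = np.concatenate([h_prev, s])    # (h_{n-1} | s_n), shape (2N,)
    return f(W @ concat) + b                # h_n = f(W_n . (h_{n-1} | s_n)) + b_n

# Toy usage: a 4-layer backbone modulated by a single observer value.
rng = np.random.default_rng(0)
N = 128
h = rng.normal(size=N)                      # e.g. a sample from the optimized prior
for _ in range(4):
    W = rng.normal(scale=0.05, size=(N, 2 * N))
    b = np.zeros(N)
    h = backbone_layer(h, observer_value=0.3, W=W, b=b)
```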
3 Holographic Generative Networks
Generative networks try to learn the distribution of training examples in order to generate new sam-
ples. Because HNAs explicitly learn priors over the training data, they should perform well as gener-
ative networks. We designed Holographic Generative Networks (HGN) as auto-encoders. Here the
information received by the observer is the image to be reconstructed, while the backbone network
receives random samples drawn from Gaussian mixtures conditioned on the image class. To ensure
the reproducibility of our results, and assess the ease of training of HGNs, we used a single architec-
ture for all experiments in this section regardless of the dataset: MNIST[18], Fashion-MNIST[29],
SVHN [22], Olivetti [26]. All networks minimized an MSE reconstruction loss and were trained us-
ing adagrad [11] with a learning rate of 0.01. Layers were initialized using Glorot initialization [12],
Gaussian mixture parameters were initialized using a [0, 1] bounded uniform distribution, and each
mixture contained 16 one-dimensional Gaussians. We used a layer size of 128, 2 hidden layers for
the observer network and 4 for the backbone. All networks were implemented and trained using the
deep learning framework Mariana [10].
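To make the prior concrete, here is a minimal sketch of a class-conditioned Gaussian-mixture prior with 16 one-dimensional components per class, with means, standard deviations and mixing weights initialized from a [0, 1] uniform distribution as described above. The names are ours, the parameters stay fixed in this sketch (in the paper they are optimized during training), and drawing one independent sample per prior dimension is our assumption, since the exact wiring of the samples into the backbone input is not spelled out in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N_CLASSES, N_COMPONENTS = 10, 16   # one 16-component mixture per class

# Mixture parameters, initialized from a uniform [0, 1] distribution.
means   = rng.uniform(0, 1, size=(N_CLASSES, N_COMPONENTS))
stds    = rng.uniform(0, 1, size=(N_CLASSES, N_COMPONENTS))
weights = rng.uniform(0, 1, size=(N_CLASSES, N_COMPONENTS))
weights /= weights.sum(axis=1, keepdims=True)   # normalize mixing weights

def sample_prior(class_label, size):
    """Draw `size` one-dimensional samples from the Gaussian mixture of one class."""
    comp = rng.choice(N_COMPONENTS, size=size, p=weights[class_label])
    return rng.normal(means[class_label, comp], stds[class_label, comp])

prior_input = sample_prior(class_label=3, size=128)   # fed to the backbone network
```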
Figure 2: Impact of non-linearities (lrelu, relu, sigmoid, sin, sin10, tanh) on the convergence of HGNs,
shown as MSE versus epoch (x 10^4) for Fashion-MNIST, Olivetti, SVHN and MNIST. Sine type
activations outperform more widely used non-linearities.
Because HNAs heavily rely on the modulation of the backbone network by the observer's output,
the backbone representations have to be rich and flexible enough to allow for a wide set of possible
modulations. To explore the impact of activation functions on HNAs, we compared convergence
performances of networks using the following non-linearities: sigmoid, tanh, relu [21], lrelu [20],
and sine. For networks using sine activations we used a normal sine activation for all layers but
the last one, for which we used a sine normalized between [0, 1] to match target values. As shown
in figure 2, HGNs using sine type activation functions consistently converge better than the oth-
ers. Surprisingly, the performances of non-sine activation functions are very dataset dependent. We also
found that multiplying the output of the observer by an arbitrary value (here 10) can improve conver-
gence (figure 2); we refer to this non-linearity as `sin10'. As shown in figure 2, HGNs using sin10 converge
better than those using a regular sine function. The only exception is the Olivetti dataset, on
which both networks showed very similar performances. These results suggest that HNAs benefit
from the non-linear processing performed by sine type functions.
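The exact form of `sin10' is not spelled out beyond the description above; the sketch below gives the sine variants under our reading (the observer's sine output scaled by 10, and the output-layer sine normalized to [0, 1]). The function names and the sin10 interpretation are our assumptions.

```python
import numpy as np

def sin_act(x):
    """Plain sine activation used for hidden layers."""
    return np.sin(x)

def sin10_act(x):
    """Our reading of `sin10': the observer's sine output multiplied by 10."""
    return 10.0 * np.sin(x)

def sin01_act(x):
    """Sine normalized to [0, 1], used for the output layer to match target values."""
    return (np.sin(x) + 1.0) / 2.0
```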
Figure 3: HGN generated images for Fashion-MNIST, MNIST, Olivetti and SVHN. The first row shows
generated images using an FSS method. The second row shows the closest images in the training set,
calculated using a Euclidean distance. The third row shows the observer activation.
Sampling from a trained HGN is trivial. After fixing the class label, the whole generation process
is controlled by the single output of the observer. Since this value is bounded between [-1, 1],
our sampling method simply generates images using an ascending vector of N evenly spaced values
between [-1, 1]. We call this method of sampling Full Spectrum Sampling (FSS). It has the advantage
of generating samples showing the whole spectrum of what the model can generate with arbitrary
precision. Furthermore, because all elements are arranged along a single dimension, we found that
examples will generally be arranged in a semantic way, with similar images corresponding to
similar observer activations. This is exemplified in figure 3, where we can see the gradual evolution
of generated outputs that occurs on all datasets.
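A minimal sketch of Full Spectrum Sampling, assuming a wrapper `generate(class_label, observer_value)` (the name is ours) that runs the backbone with a prior sample for the given class and the given observer value:

```python
import numpy as np

def full_spectrum_sampling(generate, class_label, n_samples=100):
    """Generate samples covering the whole observer range for one class.

    `generate(class_label, observer_value)` is assumed to run the backbone
    network conditioned on `class_label` with the given observer value.
    """
    observer_values = np.linspace(-1.0, 1.0, n_samples)   # evenly spaced in [-1, 1]
    return [generate(class_label, v) for v in observer_values]
```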
As shown in figure 3, the same architecture was able to generate realistic images for all datasets.
Some of the generated images are however a little blurry. Our hypothesis is that the blur is a con-
sequence of the architecture not having enough capacity to effectively model the distribution of
training examples. This is supported by the fact that generations on the Olivetti dataset are much
sharper. Of all datasets, the Olivetti dataset is the most challenging for generative models because
of the small number of samples. This dataset contains 10 images for every one of the 40 subjects.
The small number of examples makes it easy to overfit and harder to generalize. Nonetheless, the
HGN was able to generate smooth interpolations between facial expressions, as shown in figure 3.
Using the same architecture as for the other datasets, we were able to generate 100 images for each
subject from the 10 available in the training set. Because the model was conditioned on the image
class, here the subject, it learned a different movement pattern for each class. These movements
include continuous changes of facial expressions and 3D rotations, as shown in figure 3. On Olivetti
we found that adding a small fixed amount of stochastic Gaussian noise can slightly improve the
smoothness and quality of interpolation. Interestingly, SVHN generations show no side digits. This
is due to the lack of correlation between the center digit and the side digits. This behavior is related
to the noise reduction performed by HNAs. This aspect is explored further in the next section.
3.1 Demonstration of Noise Resistance
An interesting feature of HNAs is their ability to filter out noise from the training set. This general-
ization property is especially important when the signal-to-noise ratio cannot be as directly assessed
as with images, or when getting noise-free examples is virtually impossible.
To illustrate the nature of the noise filtering, we generated a synthetic dataset from three concentric
crescents, each one corresponding to a single class. For each class, we generated 1000 points and
added increasing amounts of normal noise. We then trained an HGN to reproduce the points of each
class. Here it is important to note that, contrary to the standard denoising Auto-Encoder (dAE) setup,
where the network receives noisy inputs but the costs are computed on clean targets [27, 7], here both
the input and the target are identical and thus share the same noise. This setup therefore mimics a
real-life situation where the true target cannot be known.
Figure 4: Comparison of denoising performances of an HGN, a standard AE, and a VAE tested on a
corrupted synthetic toy dataset.
We can see in figure 4 that the HGN
was able to remove a significant part of the noise in the reconstructions. As expected, the capacity
of the network to extract the underlying distributions decreases as we increase the noise standard
deviation. However, at high levels of noise (std=0.08), the network is still able to extract clear
distinct distributions for each class. At very high levels of noise (std=0.16), the network extracted
important features about the original data such as the bell shape and the 3 separate classes. These
results show that HGNs are capable of retrieving information about underlying distributions from
very corrupted examples. Figure 4 also shows the results obtained by a standard AE with relu units
and an architecture similar to the backbone network of the HGN, and by a variational autoencoder
(VAE). We used a standard VAE implementation [16]. The inference model is Gaussian with a
diagonal covariance, and the generative model is a Bernoulli vector:
$z \sim q(z \mid x) = \mathcal{N}(\mu(x), \sigma(x)^2)$    (2)
$x \sim p(x \mid z) = \mathcal{B}(\mu(z))$    (3)
We report results with the prior $p(z) = \mathcal{N}(0, I)$, as we have not found that learning the prior during
training helps the model perform the denoising operation. Under these conditions, the results from
figure 4 display a typical failure mode of VAEs when trained on low-dimensionality inputs (as hinted
also in Chen et al. [8]). As expected, the standard AE overfitted the training set and was unable to
distinguish the true input signal from the overall image distribution.
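For reference, here is a minimal sketch of the sampling path of this baseline VAE, covering only equations (2) and (3), not the training objective. The encoder/decoder functions `mu_enc`, `sigma_enc` and `mu_dec` are assumed placeholders (the names are ours).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(x, mu_enc, sigma_enc):
    """z ~ q(z|x) = N(mu(x), sigma(x)^2), diagonal covariance (equation 2)."""
    mu, sigma = mu_enc(x), sigma_enc(x)
    return mu + sigma * rng.normal(size=mu.shape)   # reparameterized sample

def sample_output(z, mu_dec):
    """x ~ p(x|z) = Bernoulli(mu(z)) (equation 3)."""
    p = mu_dec(z)
    return (rng.uniform(size=p.shape) < p).astype(float)
```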
We further designed an experiment on the Olivetti and MNIST datasets mimicking the variability
encountered in real life data capture. In both cases we corrupted images using the following formula:
$\tilde{X} = \max(0,\; X + \mathcal{N}(0, \sigma^2))$    (4)
The clipping to 0 for negative values simulates a situation where the signal goes below the detection
threshold. The corruption was performed only once on the whole dataset, not at every training
step. This simulates the real-life situation where only a finite number of noisy training examples are
available.
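A minimal sketch of this one-time corruption (equation 4); the function name is ours, and the corruption is applied a single time to the whole dataset before training:

```python
import numpy as np

def corrupt_once(X, sigma, seed=0):
    """Apply equation (4) a single time to the whole dataset.

    Negative values are clipped to 0, simulating a signal that falls
    below the detection threshold.
    """
    rng = np.random.default_rng(seed)
    return np.maximum(0.0, X + rng.normal(0.0, sigma, size=X.shape))

# e.g. images scaled to [0, 1], corrupted at std = 0.32:
# X_noisy = corrupt_once(X_train, sigma=0.32)
```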
Using these corrupted datasets we compared the denoising performances of HGN to those of other
models. Because standard implementations of AEs, dAEs and VAEs are not conditioned on the
class, we trained all models either on a single class for each dataset (figure 5), or on unlabeled
datasets by putting all examples under the same class (figure 6).
Figure 5: Comparison of denoising performances when models are trained on corrupted data of a
single class of: (A) MNIST and (B) Olivetti, at noise levels 0.08 and 0.32. Reconstructions obtained
with a standard AE, a dAE, a VAE and a HGN.
Figure 6: Comparison of denoising performances when models are trained on corrupted unlabeled
examples, at noise levels 0.08, 0.16, 0.32 and 0.64. Reconstructions obtained with a VAE and a HGN.
As shown in figure 5, at low noise
levels on MNIST (std=0.08), both the AE and the dAE appear to have overfitted the data. The
reconstructions of the VAE and HGN are both noiseless and look similar to each other. At higher
noise levels (std=0.32), the dAE was able to remove more of the noise than the AE. The VAE output
collapsed towards the average image. This shows a failure mode of VAEs: natural images rarely
have a multivariate Gaussian structure. Therefore, unless $z$ is lossless, which is not going to be
the case in practice, the best-fitting distribution for the generative model will average multiple
outputs [30, 15]. The HGN reconstructions were not affected by the noise increase.
On the Olivetti dataset, training on a single class means training on 10 images. Figure 5 therefore
shows the whole training set. At low noise level (std=0.08), the AE and dAE show similar outputs
and neither was able to perform any noise reduction. The VAE again collapsed on the average
face. The HGN reconstructions however are signicantly less noisy than the inputs and also show
6
variability in the face angle. On high noise levels (std=0.32) the HGN was the only model that did
not overt, and was able to remove a major part of the noise.
Figure 6 shows the comparison of a HGN and VAE when trained on an unlabeled, corrupted version
of Olivetti. At low noise levels (std=0.08), both models generated faces that are less noisy than the
inputs. Interestingly, some of the faces generated by the HGN do not reconstruct the inputs, such as
with the face of the third woman from the left. This behavior suggests that the model is relying more
on the general representation in the backbone than on the single value received from the observer.
The more we increase the noise level, the bigger the difference between the two models. At very
high noise levels (std=0.32 and std=0.64) the HGN was better able to extract a face structure than
the VAE, and still generate outputs showing some variability. In conclusion, despite being presented
with very corrupted inputs from very small datasets, HGNs did not overfit, and were able to filter
out a major part of the noise. These results show that HGNs are highly noise resistant.
4 Biomedical Datasets
4.1 The Cancer Genome Atlas dataset
Recent years have seen a big increase in the availability of RNA sequencing data [14, 19]. Despite
constant improvements in the devices and protocols being introduced, RNA sequencing remains
a complex process. Samples are often collected by different experimenters in different conditions.
These factors introduce an element of uncertainty that is very difficult to quantify. This renders RNA
sequencing, and therefore the gene expressions that are derived from the sequencing, susceptible to
noise. Another issue that is inherent to the cost and availability of biological data is that it can often
be impossible to get a large number of RNA sequencing samples for a given condition.
Figure 7: PCA of cancer gene expression samples for: (A) bladder cancer samples (BLCA), when
training examples and generated samples are reduced using the first 2 principal components of train-
ing examples. Round dots represent training examples, triangles represent samples generated using
an FSS method. The color represents the activation of the observer. (B) shows the same graph for
kidney cancers (KIPAN). (C) Distributions of the 3 cancer subtypes composing KIPAN (KICH, KIRC
and KIRP). The last row shows how well generated samples follow the distribution of training
examples when principal components (PC1 to PC4) are taken individually for BLCA (D) and KIPAN (E).
To investigate how well HGNs can model cancer gene expression data, we trained an HGN on the
TCGA dataset [14]. This dataset contains 13 246 samples from 37 cancers. The number of sam-
ples for each cancer type ranges from 45 to 1 212. Every sample contains the measured expressions
for 20 531 genes. We used the same HGN architecture as before, but with a layer size of 300 neu-
rons and 10 layers in the backbone network. Here the network receives gene expressions instead
of images and cancer types instead of image classes. Since gene expressions cannot be plotted and
visually evaluated, we elected to reduce the dimensionality of both training examples and recon-
structions using a PCA (figure 7). We chose PCA because it is a linear transformation; therefore
any non-linear behavior observed after the dimensionality reduction is due entirely to the network.
We computed, for every cancer type, a PCA transformation on the training examples, and used that
transformation to reduce both training examples and reconstructions. Reconstructions, therefore,
are not only presented from the `point of view' of training examples, but they also have no influence
on the PCA computation. The transformation is therefore not biased toward accurately representing
reconstructions.
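A minimal sketch of this evaluation protocol using scikit-learn, assuming `X_train` holds the gene expression examples of one cancer type and `X_gen` the samples generated by FSS (the names are ours); the PCA is fit on training examples only:

```python
import numpy as np
from sklearn.decomposition import PCA

def project_for_comparison(X_train, X_gen, n_components=4):
    """Fit a PCA on training examples only, then project both sets.

    Because the generated samples do not influence the fit, the projection
    is not biased toward representing them accurately.
    """
    pca = PCA(n_components=n_components)
    pca.fit(X_train)
    return pca.transform(X_train), pca.transform(X_gen)

# Z_train, Z_gen = project_for_comparison(X_train, X_gen)
```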
Figure 7 shows how well the distribution of generated gene expressions models the distribution of
training samples when both are reduced to their rst 4 principal components. Results on KIPAN are
particularly interesting. This cancer type is made of 3 distinct subtypes. Despite the heterogeneity
of the samples, the HGN was capable of modeling the 3 subtypes, and of generating intermediate
samples between these subtypes. These results strongly suggest that HGNs can accurately model
RNA sequencing data. Figure 7 also sheds light on the inner functioning of HGNs, as it shows
the trajectory followed in the input space as the observer value increases. Through the interaction
with the backbone network, the 1-dimensional manifold learned by the observer is embedded in the
higher-dimensional holographic representation of the backbone. The result is then projected onto
the input space.
4.2 IEDB
MHC-I associated peptides (MIPs) are small sections of proteins that are presented at the surface
of cells by the MHC-I molecules. MIPs present an image of the inner functioning of the cell to the
immune system and thus play an essential role in the identification of cancerous and virus-infected
cells. Given their central role in immunity, being able to reliably predict the binding affinity of
MIPs to MHC-I alleles has been identified as a milestone for the development of virus and cancer
vaccines [9, 3]. Modern high-throughput studies routinely generate tens of thousands of MIPs whose
binding affinity to the subjects' MHC-I molecules has to be assessed [23]. Given the impossibility
of experimentally measuring such a high number of affinities in the lab, machine learning affinity
predictors are now the norm [3, 2, 13, 17, 24]. MHC-I molecules are however encoded by extremely
polymorphic genes in the human species (13 324 alleles) [25]. This makes it impossible to gather
tens of thousands of training examples for every single allele and modern predictors suffer from a
lack of precision regarding rare alleles.
We previously showed on the Olivetti dataset that HGNs are capable of learning rich representations
from a small number of training examples. Here we tested whether this also extends to regression
models as well. Our Holographic Regression Network (HRN) shares the same architecture as the
HGNs. However, the network takes the label of the MHC-I allele instead of the class label, and
the sequence of the MIP in amino acids encoded using word embeddings [4] instead of the image.
Most MIPs vary between 8 and 11 amino acids in length. To account for this difference, we used a
default size of 11 and padded the remaining slots with zeros if needed. The output of the network
is the predicted affinity of the MIP sequence to the MHC-I allele. We trained our network on the
IEDB database [28], which contains 185 157 measurements of affinities of MIPs to MHC-I alleles, for
6 species including humans. We normalized all the measurements between [0, 1] by transforming
them using the following formula:
$X = \log_2(V + 2)$    (5)
where $V$ is the vector of measurements, and then by dividing by the maximum value of $X$. IEDB
also contains several MIP to MHC-I allele associations with underspecified binding measurements
that only give an upper or lower bound. We elected to keep these examples and use the lower or
upper bound as the target value. We evaluated our model on an 80/20 random split of the data. We
trained our model on 80% of the data and tested on the remaining 20%.
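A minimal sketch of this normalization (the function name is ours), assuming the raw affinity measurements are non-negative so that the result falls in [0, 1]:

```python
import numpy as np

def normalize_affinities(V):
    """Normalize raw affinity measurements to [0, 1]."""
    X = np.log2(V + 2.0)      # equation (5)
    return X / X.max()        # divide by the maximum value of X

# targets = normalize_affinities(raw_measurements)
```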
Our model achieved performances comparable to those reported by the state-of-the-art model
NetMHC 4.0 [1] (see table 1). HRN results show a higher Pearson correlation coefficient (PCC) than
NetMHC4.0 on all lengths except the 11-mers. Furthermore, our results also show very good
performances on the alleles that were removed in the study of Andreatta and Nielsen [1] for having too
few examples (less than 20 MIPs). These results show that the ability of HGNs to generalize from
very few examples extends to HRNs as well. This suggests that HNAs can improve the performances
of regression models, especially when the number of training examples is small.
                                                    NetMHC4.0   HRN
  8-mer                                             0.717       0.761
  9-mer                                             0.717       0.761
  10-mer                                            0.744       0.747
  11-mer                                            0.706       0.691
  All lengths, human MIPs only                      -           0.769
  All lengths                                       -           0.760
  All lengths, MHC-I alleles with less than 20 ex.  -           0.953*
Table 1: Pearson correlation coefficient (PCC), NetMHC4.0 vs HRN. NetMHC4.0 results are taken
from Andreatta and Nielsen [1]. *Training set contains 160 examples, testing set 19 examples.
5 Conclusion
In this work, we have introduced HNAs, a new framework for deep learning capable of deriving
holographic representations from training sets. We have shown that HNAs can generalize from very
few examples, can be used for both generative models and state-of-the-art regression models, and
are highly noise resistant. These characteristics make HNAs particularly well-suited for biological
applications where the available training data is limited or noisy. We also showed that, in the con-
text of HNAs, the choice of activation function can dramatically influence convergence. Our results
show that sine activations consistently outperform several more widely used activation functions.
Whether these results extend to other types of architectures remains to be investigated. Our exper-
iments also suggest that HNAs are easy to train. All the networks in this work follow the same
basic recipe, and we used the exact same architecture for all tasks on image datasets. Finally, a very
interesting aspect of HNAs is their ability to project all examples into a single bounded dimension.
This strongly suggests that they could also be used as effective dimensionality reduction methods.
References
[1] Massimo Andreatta and Morten Nielsen. Gapped sequence alignment using artificial neural
networks: application to the MHC class I system. Bioinformatics, 32(4):511–517, 2015.
[2] Diana Paola Granados, Dev Sriranganadane, Tariq Daouda, Antoine Zieger, Céline M.
Laumont, Olivier Caron-Lizotte, Geneviève Boucher, Marie Pierre Hardy, Patrick Gendron,
Caroline Côté, Sébastien Lemieux, Pierre Thibault, and Claude Perreault. Impact of genomic
polymorphisms on the repertoire of human MHC class I-associated peptides. Nature
Communications, 5:3600, apr 2014. ISSN 2041-1723. doi: 10.1038/ncomms4600. URL
http://www.nature.com/doifinder/10.1038/ncomms4600.
[3] Linus Backert and Oliver Kohlbacher. Immunoinformatics and epitope prediction in the age
of genomic medicine. Genome Medicine, 7(1):119, 2015. ISSN 1756-994X. doi:
10.1186/s13073-015-0245-0. URL http://genomemedicine.com/content/7/1/119.
[4] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic
language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
[5] Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. Unsupervised feature learning and
deep learning: A review and new perspectives. CoRR, abs/1206.5538, 2012. URL
http://arxiv.org/abs/1206.5538.
[6] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and
new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):
1798–1828, 2013.
[7] Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-
encoders as generative models. In Advances in Neural Information Processing Systems, pages
899–907, 2013.
[8] Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman,
Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. CoRR, abs/1611.02731, 2016.
[9] Joseph D. Comber and Ramila Philip. MHC class I antigen presentation and implications for
developing a new generation of therapeutic vaccines. Therapeutic Advances in Vaccines, 2(3):
77–89, 2014.
[10] Tariq Daouda. Mariana: The cutest deep learning framework which is also a wonderful
declarative language. https://github.com/tariqdaouda/Mariana, 2018.
[11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning
and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[12] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward
neural networks. In Proceedings of the Thirteenth International Conference on Artificial
Intelligence and Statistics, pages 249–256, 2010.
[13] Diana Paola Granados, Wafaa Yahyaoui, Céline M Laumont, Tariq Daouda, Tara L Muratore-
Schroeder, Caroline Côté, Jean-Philippe Laverdure, Sébastien Lemieux, Pierre Thibault, and
Claude Perreault. MHC I-associated peptides preferentially derive from transcripts bearing
miRNA response elements. Blood, 119(26):e181–91, jun 2012. ISSN 1528-0020. doi:
10.1182/blood-2012-02-412593. URL http://www.ncbi.nlm.nih.gov/pubmed/22438248.
[14] Robert L Grossman, Allison P Heath, Vincent Ferretti, Harold E Varmus, Douglas R Lowy,
Warren A Kibbe, and Louis M Staudt. Toward a shared vision for cancer genomic data. New
England Journal of Medicine, 375(12):1109–1112, 2016.
[15] Matthew Hoffman and Matthew Johnson. ELBO surgery: yet another way to carve up the
variational evidence lower bound. NIPS Workshop in Advances in Approximate Bayesian
Inference, 2016.
[16] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114,
2013.
[17] Céline M. Laumont, Tariq Daouda, Jean-Philippe Laverdure, Éric Bonneil, Olivier Caron-
Lizotte, Marie-Pierre Hardy, Diana P Granados, Chantal Durette, Sébastien Lemieux, Pierre
Thibault, and Claude Perreault. Global proteogenomic analysis of human MHC class I-
associated peptides derived from non-canonical reading frames. Nature Communications, 7:
10238, 2016. ISSN 2041-1723. doi: 10.1038/ncomms10238. URL
http://www.nature.com/doifinder/10.1038/ncomms10238.
[18] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[19] John Lonsdale, Jeffrey Thomas, Mike Salvatore, Rebecca Phillips, Edmund Lo, Saboor Shad,
Richard Hasz, Gary Walters, Fernando Garcia, Nancy Young, et al. The genotype-tissue
expression (GTEx) project. Nature Genetics, 45(6):580, 2013.
[20] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural
network acoustic models. In Proc. ICML, volume 30, page 3, 2013.
[21] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann
machines. In Proceedings of the 27th International Conference on Machine Learning
(ICML-10), pages 807–814, 2010.
[22] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng.
Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on
Deep Learning and Unsupervised Feature Learning, volume 2011, page 5, 2011.
[23] Hillary Pearson, Tariq Daouda, Diana Paola Granados, Chantal Durette, Eric Bonneil,
Mathieu Courcelles, Anja Rodenbrock, Jean-Philippe Laverdure, Caroline Cote, Sylvie Mader,
10
Tariq Daouda
Institute for Research in Immunology and Cancer
Department of biochemistry
Université de Montréal
tariq.daouda@umontreal.ca
Jeremie Zumer
Institute for Research in Immunology and Cancer
Department of Computer Science and Operations Research
Université de Montréal
jeremie.zumer@umontreal.ca
Claude Perreault
Institute for Research in Immunology and Cancer
Department of Medicine
Université de Montréal
claude.perreault@umontreal.ca
Sébastien Lemieux
Institute for Research in Immunology and Cancer
Department of Computer Science and Operations Research
Université de Montréal
s.lemieux@umontreal.ca
Abstract
Representation learning is at the heart of what makes deep learning effective. In
this work, we introduce a new framework for representation learning that we call
Holographic Neural Architectures (HNAs). In the same way that an observer
can experience the 3D structure of a holographed object by looking at its holo-
gram from several angles, HNAs derive Holographic Representations from the
training set. These representations can then be explored by moving along a con-
tinuous bounded single dimension. We show that HNAs can be used to make
generative networks, state-of-the-art regression models and that they are inher-
ently highly resistant to noise. Finally, we argue that because of their denoising
abilities and their capacity to generalize well from very few examples, models
based upon HNAs are particularly well suited for biological applications where
training examples are rare or noisy.
1 Introduction
The success of deep learning algorithms during the last decade can be attributed to their ability to
learn relevant latent representations [6]. However, despite these recent advances, generalization from
few and potentially noisy examples remains an open problem. Most deep learning architectures have
many more parameters than the number of examples they are trained upon and are therefore partic-
ularly sensitive to noise and overtting on small datasets. These limitations can make deep learning
Preprint. Work in progress.
algorithms hard to apply to real life situations where data is scarce and noisy. These problems could
in theory be overcome if we were able to effectively learn relevant, smooth, latent representations
from few examples and despite the presence of noise.
Manifold learning is an emergent property of some deep learning algorithms. A class of algorithms
that aims at deriving latent, lower-dimensional manifolds from the training set are Autoencoders
(AEs) with bottleneck layers [5]. Since these networks reconstruct their input using fewer units than
the input dimensionality, it is assumed that the bottleneck representation `summarises' the input.
More recently, the application of variational approaches to auto-encoders have led to the introduc-
tion of Variational Autoencoders (VAEs) [16]. VAEs address the problem of manifold learning by
constraining the parameter distribution to a simpler distribution family than the true distribution.
This approach forces the latent space to be dense, enabling straightforward sampling for generation.
In our approach, manifold learning is also a consequence of the network's training. We force the
network to generalize by placing an extremely severe bottleneck over the information received from
training examples. Here, we are projecting all training example into a single bounded dimension.
As with VAEs, we also combine the input information with an optimized prior. However, we treat
the prior as a separate input to the network. Because the network has very little information from the
training examples, it must complement it with an accurate general representation of the training set.
Because these representations are continuous, multi-dimensional, and represent the whole training
set, we call them `Holographic Representations' and the architectures capable of generating them
`Holographic Neural Architectures' (HNAs).
In this work, we introduce the general framework of HNAs as well as an implementation of the
concept. We tested our models on several datasets: MNIST [18], Fashion-MNIST [29], SVHN [22],
Olivetti [26], TCGA [14] and IEDB [28]. We show that HNAs are inherently highly resistant to
input noise. We report that HNAs can be used to build noise resistant generative networks even on
very small datasets, as well as state-of-the-art regression models. Finally, we also explore the effects
of different activation functions on the performances of HNAs. Our results show that sine types
activations outperform more widely used non-linearities.
2 Architecture
We implemented HNAs as two joined networks as depicted in gure 1: a backbone network, that
learns distributions over the training data, and an observer network whose output modulates the
activations of the backbone network neurons. The backbone is itself composed of two parts: an
optimized prior, and a posterior network that integrates the prior with the information received from
the observer. Only the observer network receives information specic to the desired output. The
combination in the backbone of the prior and the specic information encoded by the observer pro-
duces an approximation of the desired output. Interestingly, without the measurement made by the
observer network, the output of the backbone network alone represents an undifferentiated state, i.e.
it is a distribution of all possible outputs. Furthermore, because the output of the observer network
is directly connected to all neurons inside the backbone network, theses neurons are entangled. In
other words, their activations are synchronized in a way that causes the output of the network to
collapse into values representing a more restricted set of possible outputs. Because the role of the
backbone network is to explicitly encode the space of shared variations among examples, it allows
for the specic information of the observer to be encoded in a very low dimensionality space. For
this paper we used a single neuron with a sine activation as the output of the observer network,
reducing all the specic information about the input to a single, bounded and continuous dimension.
Figure 1 shows the architecture that we used to implement our HNAs. The prior can be easily
conditioned (for example on the class label), by having one distribution for every condition. Here
we used Gaussian mixtures as priors and optimized the parameters of each mixture during training.
Every hidden layers in the backbone as well as the output layer receive the output of the previous
layer, concatenated with a skip connection to the observer's output. These skip connections are
expanded to match the size of the previous hidden layer. Therefore, the activation of the layers of
the backbone network are computed using the following formula:
hn=f(Wn(hn 1jsn)) +bn (1)
2
Input
Prior
Condition
Obs
Skip
Skip
Output
Backbone networkObserver network
}
Backbone Figure 1: The HNA used in this work
Wherehnis the activation of then
th
layer,fis an activation function,snis then
th
skip connection
of the same size ashn 1,Wna parameter matrix of sizeN2N,bnis a bias parameter,jindicates
a concatenation andis matrix multiplication. Thanks to the skip connections, the integration of
the prior with the observer's measurement is performed throughout the network. Every layer in the
backbone network has its activation modied by the output of the observer, both directly and through
the activation of its previous layer.
3 Holographic Generative Networks
Generative networks try to learn the distribution of training examples in order to generate new sam-
ples. Because HNAs explicitly learn priors over the training data, they should perform well as gener-
ative networks. We designed Holographic Generative Networks (HGN) as auto-encoders. Here the
information received by the observer is the image to be reconstructed, while the backbone network
receives random samples drawn from Gaussian mixtures conditioned on the image class. To ensure
the reproducibility of our results, and assess the ease of training of HGNs, we used a single architec-
ture for all experiments in this section regardless of the dataset: MNIST[18], Fashion-MNIST[29],
SVHN[22], Olivetti[26]. All networks minimized a MSE reconstruction loss and were trained us-
ing adagrad[11] with a learning rate of 0.01. Layers were initialized using Glorot initialization[12],
Gaussian mixtures parameters were initialized using a [0, 1] bounded uniform distribution, each
mixture contained 16 1-dimensional Gaussians. We used a layer size of 128, 2 hidden layers for
the observer network and 4 for the backbone. All networks were implemented and trained using the
deep learning framework Mariana[10].Non-linearitites: lrelurelu sigmoid sinsin10tanh
0.02
0.04
0.06
0 2 4
M
S
E
Fashion−MNIST
0.025
0.030
0 2 4
0
0.1
0.2
Olivetti
0.01
0.02
024
EPOCH (x 10
4
)
0 2 4
0.05
0.1
SVHN
0.014
0.018
0 2 4 0 2 4
MNIST
0.03
0 2 4
0.04
Figure 2: Impact of non-linearities on the convergence of HGNs. Sine type activations outperform
more widely used non-linearities
Because HNAs heavily rely on the modulation of the backbone network by the observer's output,
the backbone representations have to be rich and exible enough to allow for a wide set of possible
modulations. To explore the impact of activation functions on HNAs, we compared convergence
performances of networks using the following non-linearities: sigmoid, tanh, relu[21], lrelu[20],
and sine. For networks using sine activations we used a normal sine activation for all layers but
the last one, for which we used a sine normalized between [0, 1] to match target values. As shown
in gure 2, HGNs using sine type activation functions consistently converge better than the oth-
ers. Surprisingly, performances of non-sine activation functions are very dataset dependent. We also
3
found that multiplying the output of the observer by an arbitrary value (here 10) can improve conver-
gence 2, we refer to this non-linearity as `sin10'. As shown in gure 2, HGNs using sin10 converge
better than those using a regular sine function. The only exception being the Olivetti dataset, on
which both networks showed very similar performances. These results suggest that HNAs benet
from the non-linear processing performed by sine type functions. Fashion-MNIST
MNIST
Olivetti
SVHN
Figure 3: HGN generated images. The rst row shows generated images using an FSS method. The
second row shows the closest images in the training set calculated using an euclidean distance. The
third row shows the observer activation.
Sampling from a trained HGN is trivial. After xing the class label, the whole generation process
is controlled by the single output of the observer. Since this value is bounded between [-1, 1],
our sampling method simply generates images using an ascending vector ofNevenly spaced values
between [-1, 1]. We call this method of sampling Full Spectrum Sampling (FSS). It has the advantage
of generating samples showing the whole spectrum of what the model can generate with arbitrary
precision. Furthermore, because all elements are arranged along a single dimension, we found that
examples will generally be arranged in a semantic way, with similar images corresponding to
similar observer activations. This is exemplied in gure 3 where we can see the gradual evolution
of generated outputs that occur on all datasets.
As shown in gure 3, the same architecture was able to generate realistic images for all datasets.
Some of the generated images are however a little blurry. Our hypothesis is that the blur is a con-
sequence of the architecture not having enough capacity to effectively model the distribution of
training examples. This is supported by the fact that generations on the Olivetti dataset are much
sharper. Of all datasets, the Olivetti dataset is the most challenging for generative models because
of the small number of samples. This dataset contains 10 images for every one of the 40 subjects.
The small number of examples makes it easy to overt and harder to generalize. Nonetheless, the
HGN was able to generate smooth interpolation between facial expression as shown in gure 3.
Using the same architecture as for the other datasets, we were able to generate 100 images for each
subject from the 10 available in the training set. Because the model was conditioned on the image
class, here the subject, it learned a different movement pattern for each class. These movements
include continuous changes of facial expressions and 3D rotations as shown in gure 3. On Olivetti
we've found that adding a small xed amount of stochastic Gaussian noise can slightly improve the
smoothness and quality of interpolation. Interestingly, SVHN generations show no side digits. This
is due to the lack of correlation between the center digit and the side digits. This behavior is related
to the noise reduction performed by HNAs. This aspect is explored further in the next section.
3.1 Demonstration of Noise Resistance
An interesting feature of HNAs is their ability to lter out noise from the training set. This general-
ization property is especially important when the signal to noise ratio cannot be as directly assessed
as with images, or when getting noise free examples is virtually impossible.
To illustrate the nature of the noise ltering, we generated a synthetic dataset from three concentric
crescents, each one corresponding to a single class. For each class, we generated 1000 points and
added increasing amounts of normal noise. We then trained an HGN to reproduce the points of each
class. Here it is important to note that, contrary to standard denoising Auto-Encoders (dAE) setup
where the network receives noisy inputs but the costs are computed on clean targets[27, 7], here both
4
Figure 4: Comparison of denoising performances of an HNG, a standard AE, and a VAE tested on a
corrupted synthetic toy dataset.
the input and the target are identical and thus share the same noise. This setup therefore mimics a
real life situation where the true target cannot be known. We can see in gure 4 that the HGN
was able to remove a signicant part of the noise in the reconstructions. As expected, the capacity
of the network to extract the underlying distributions decreases as we increase the noise standard
deviation. However, at high levels of noise (std=0.08), the network is still able to extract clear
distinct distributions for each class. At very high levels of noise (std=0.16), the network extracted
important features about the original data such as the bell shape and the 3 separate classes. These
results show that HGNs are capable of retrieving information about underlying distributions from
very corrupted examples. Figure 4 also shows the results obtained by a standard AE with relu units
and an architecture similar to the backbone network of the HGN, and by a variational autoencoder
(VAE). We've used a standard VAE implementation [16]. The inference model is gaussian with a
diagonal covariance, and the generative model is a bernoulli vector:
zq(zjx) =N((x);(x)
2
) (2)
xp(xjz) =B((z)) (3)
We report results with the priorp(z) =N(0;I)as we have not found that learning the prior during
training helps the model perform the denoising operation. Under these conditions, the results from
gure 3 display a typical failure mode of VAEs when trained on low-dimensionality inputs (as hinted
also in Chen et al. [8]). As expected, the standard AE overtted the training set and was unable to
distinguish the true input signal from the overall image distribution.
We further designed an experiment on the Olivetti and MNIST datasets mimicking the variability
encountered in real life data capture. In both cases we corrupted images using the following formula:
e
X=max(0; X+N(0;
2
)) (4)
The clipping to 0 for negative values simulates a situation where the signal goes below the detection
threshold. The corruption was performed only once on the whole dataset, not at every training
step. This simulates the real life situation where only a nite number of noisy training examples are
available.
Using these corrupted datasets we compared the denoising performances of HGN to those of other
models. Because standard implementations of AEs, dAEs and VAEs are not conditioned on the
class, we trained all models either on a single class for each dataset (gure 5), or on unlabeled
datasets by putting all examples under the same class (gure 6). As shown in gure 5, at low noise
5
AE
dAE
HGN
Noise 0.08
VAE
Input
Ground truth
Noise 0.32
AE
dAE
HGN
Noise 0.08
VAE
Input
Ground truth
Noise 0.32
A
B Figure 5: Comparison of denoising performances when models are trained on corrupted data of a
single class of: (A) MNIST and (B) Olivetti. Reconstructions obtained with a standard AE, a dAE,
a VAE and a HGN.HGN
Noise 0.08
VAE
Input
Ground truth
HGN
Noise 0.16
VAE
Input
Noise 0.32
Noise 0.64
Figure 6: Comparison of denoising performances when models are trained on corrupted unlabeled
examples. Reconstructions obtained with a VAE and a HGN.
levels on MNIST (std=0.08), both the AE and the dAE appear to have overtted the data. The
reconstructions of the VAE and HGN are both noiseless and look similar to each other. On higher
noise levels (std=0.32), the dAE was able to remove more of the noise than AE. The VAE output
collapsed towards the average image. This shows a failure mode of VAEs. Natural images rarely
have a multivariate gaussian structure. Therefore, unlesszis lossless, which is not going to be
the case in practice, the best-tting distribution for the generative model will be averaging multiple
outputs [30, 15]. The HGN reconstructions were not affected by the noise increase.
On the Olivetti dataset, training on a single class means training on 10 images. Figure 5 therefore
shows the whole training set. At low noise level (std=0.08), the AE and dAE show similar outputs
and neither was able to perform any noise reduction. The VAE again collapsed on the average
face. The HGN reconstructions however are signicantly less noisy than the inputs and also show
6
variability in the face angle. On high noise levels (std=0.32) the HGN was the only model that did
not overt, and was able to remove a major part of the noise.
Figure 6 shows the comparison of a HGN and VAE when trained on an unlabeled, corrupted version
of Olivetti. At low noise levels (std=0.08), both models generated faces that are less noisy than the
inputs. Interestingly, some of the faces generated by the HGN do not reconstruct the inputs, such as
with the face of the third woman from the left. This behavior suggests that the model is relying more
on the general representation in the backbone than on the single value received from the observer.
The more we increase the noise level, the bigger the difference between the two models. At very
high noise levels (std=0.32 and std=0.64) the HGN was more able to extract a face structure than
the VAE, and still generate outputs showing some variability. In conclusion, despite being presented
with very corrupted inputs from very small datasets, HGNs did not overt, and were able to lter
out a major part of the noise. These results show that HGNs are highly noise resistant.
4 Biomedical Datasets
4.1 The Cancer Genome Atlas dataset
Recent years have seen a big increase in the availability of RNA sequencing data [14, 19]. Despite
constant improvements in the devices and protocols being introduced, RNA sequencing remains
a complex process. Samples are often collected by different experimenters in different conditions.
These factors introduce an element of uncertainty that is very difcult to quantify. This renders RNA
sequencing, and therefore the gene expressions that are derived from the sequencing, susceptible to
noise. Another issue that is inherent to the cost and availability of biological data, is that it can often
be impossible to get a large number of RNA sequencing samples for a given condition.BLCA
−4
4
P
C
v
a
lu
e
Example
FSS
−4
4
0
5
−1 1
Observer value
0
5
−1 1
PC1 PC2 PC3 PC4
0
5
PC1
0
5
PC2
0
6
−1 1
PC3
−3
3
−1 1
PC4
Observer value
−1 1 1
−4
4
−4 4
PC1
P
C
2
0
5
0 5
PC1
P
C
2
KIPANA B
Example
FSS
−1 1
Observer
0
5
0 4
PC1
KICH
KIRC
KIRP
Subtypes
D E
BLCA
−1 1 −1 1
KIPAN
C
Figure 7: PCA of cancer gene expression samples for: (A) bladder cancer samples (BLCA) when
training examples and generated samples are reduced using the rst 2 principal components of train-
ing examples. Round dots represent training examples, triangles represent samples generated using
a FSS method. The color represents the activation of the observer. (B) shows the same graph for
kidney cancers (KIPAN). (C) Distributions of the 3 cancer subtypes composing KIPAN. The last
row shows how well generated samples follow the distribution of training examples when principal
components are taken individually for BLCA (D) and KIPAN (E).
To investigate how well HGNs can model cancer gene expression data, we trained an HGN on the
TCGA dataset [14]. This dataset contains 13 246 samples from 37 cancers. The number of sam-
ples for each cancer type ranges from 45 to 1 212. Every sample contains the measured expressions
for 20 531 genes. We used the same previous HGN architecture, but with a layer size of 300 neu-
rons and 10 layers in the backbone network. Here the network receives gene expressions instead
of images and cancer types instead of image classes. Since gene expressions cannot be plotted and
visually evaluated, we elected to reduce the dimensionality of both training examples and recon-
structions using a PCA (gure 7). We chose PCA because it is a linear transformation, therefore
7
any non-linear behavior observed after the dimensionality reduction is due entirely to the network.
We computed, for every cancer type, a PCA transformation on the training examples, and used that
transformation to reduce both training examples and reconstructions. Reconstructions, therefore,
are not only presented from the `point of view' of training examples, but they also have no inuence
on the PCA computation. The transformation is therefore not biased toward accurately representing
reconstructions.
Figure 7 shows how well the distribution of generated gene expressions, models the distribution of
training samples when both are reduced to their rst 4 principal components. Results on KIPAN are
particularly interesting. This cancer type is made of 3 distinct subtypes. Despite the heterogeneity
of the samples, the HGN was capable of modeling the 3 subtypes, and of generating intermediate
samples between these subtypes. These results strongly suggest that HGNs can accurately model
RNA sequencing data. Figure 7 also sheds light on the inner functioning of HGNs as it show
the trajectory followed in the input space as the observer value increases. Through the interaction
with the backbone network, the 1-dimensional manifold learned by the observer is embedded in the
higher-dimensional holographic representation of the backbone. The result is then projected onto
the input space.
4.2 IEDB
MHC-I associated peptides (MIPs) are small sections of proteins that are presented at the surface
of cells by MHC-I molecules. MIPs present an image of the inner functioning of the cell to the
immune system and thus play an essential role in the identification of cancerous and virus-infected
cells. Given their central role in immunity, being able to reliably predict the binding affinity of
MIPs to MHC-I alleles has been identified as a milestone for the development of virus and cancer
vaccines [9, 3]. Modern high-throughput studies routinely generate tens of thousands of MIPs whose
binding affinity to the subjects' MHC-I molecules has to be assessed [23]. Given the impossibility
of experimentally measuring such a high number of affinities in the lab, machine learning affinity
predictors are now the norm [3, 2, 13, 17, 24]. MHC-I molecules are, however, encoded by extremely
polymorphic genes in the human species (13 324 alleles) [25]. This makes it impossible to gather
tens of thousands of training examples for every single allele, and modern predictors suffer from a
lack of precision regarding rare alleles.
We previously showed on the Olivetti dataset that HGNs are capable of learning rich representations
from a small number of training examples. Here we tested whether this ability extends to regression
models as well. Our Holographic Regression Network (HRN) shares the same architecture as the
HGNs. However, the network takes the label of the MHC-I allele instead of the class label, and
the sequence of the MIP, with amino acids encoded using word embeddings [4], instead of the image.
Most MIPs vary between 8 and 11 amino acids in length. To account for this difference, we used a
default size of 11 and padded the remaining slots with zeros when needed. The output of the network
is the predicted affinity of the MIP sequence for the MHC-I allele. We trained our network on the
IEDB database [28], which contains 185 157 measurements of affinities of MIPs to MHC-I alleles, for
6 species including humans. We normalized all the measurements to [0, 1] by transforming
them using the following formula:
X = \log_2(V + 2)    (5)
where V is the vector of measurements, and then by dividing by the maximum value of X. IEDB
also contains several MIP-to-MHC-I allele associations with underspecified binding measurements
that only give an upper or lower bound. We elected to keep these examples and use the lower or
upper bound as the target value. We evaluated our model on an 80/20 random split of the data,
training on 80% of the data and testing on the remaining 20%.
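A minimal sketch of this preprocessing is given below, assuming the measurements are non-negative affinity values and the peptides have already been mapped to integer amino-acid indices (the embedding lookup itself is not shown); the function names are ours.

```python
import numpy as np

def normalize_affinities(measurements):
    """Equation 5: X = log2(V + 2), then scale by the maximum of X so all
    target values lie in [0, 1]."""
    X = np.log2(np.asarray(measurements, dtype=float) + 2.0)
    return X / X.max()

def pad_peptide(index_sequence, max_len=11):
    """Right-pad an index-encoded MIP (8 to 11 amino acids) with zeros so
    every sequence reaches the default length of 11."""
    padded = np.zeros(max_len, dtype=np.int64)
    padded[:len(index_sequence)] = index_sequence
    return padded
```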
Our model achieved performance comparable to that reported for the state-of-the-art model
NetMHC 4.0 [1] (see table 1). HRN results show a higher Pearson correlation coefficient (PCC) than
NetMHC4.0 for all lengths except the 11-mers. Furthermore, our results also show very good
performance on the alleles that were removed in the study of Andreatta and Nielsen [1] for having too
few examples (fewer than 20 MIPs). These results show that the ability of HGNs to generalize from
very few examples extends to HRNs as well. This suggests that HNAs can improve the performance
of regression models, especially when the number of training examples is small.
Peptide length / subset                              NetMHC4.0   HRN
8-mer                                                0.717       0.761
9-mer                                                0.717       0.761
10-mer                                               0.744       0.747
11-mer                                               0.706       0.691
All lengths, human MIPs only                         -           0.769
All lengths                                          -           0.760
All lengths, MHC-I alleles with fewer than 20 ex.    -           0.953*
Table 1: Pearson correlation coefficient (PCC), NetMHC4.0 vs. HRN. NetMHC4.0 results are taken
from Andreatta and Nielsen [1].
* Training set contains 160 examples, testing set 19 examples.
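Per-length PCC values such as those in table 1 can be computed with a routine like the one below; this is a sketch under the assumption that predictions and measured affinities have already been grouped by peptide length (data loading and grouping are not shown).

```python
import numpy as np
from scipy.stats import pearsonr

def pcc_by_length(grouped):
    """Compute the Pearson correlation coefficient per peptide length.
    `grouped` maps a length (e.g. 9) to a (predicted, measured) pair of arrays."""
    return {length: pearsonr(np.asarray(pred), np.asarray(meas))[0]
            for length, (pred, meas) in grouped.items()}
```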
5 Conclusion
In this work, we have introduced HNAs, a new framework for deep learning capable of deriving
holographic representations from training sets. We have shown that HNAs can generalize from very
few examples, can be used for both generative models and state-of-the-art regression models, and
are highly noise resistant. These characteristics make HNAs particularly well-suited for biological
applications where the available training data is limited or noisy. We also showed that, in the
context of HNAs, the choice of activation function can dramatically influence convergence. Our
results show that sine activations consistently outperform several more widely used activation
functions. Whether these results extend to other types of architectures remains to be investigated.
Our experiments also suggest that HNAs are easy to train: all the networks in this work follow the
same basic recipe, and we used the exact same architecture for all tasks on image datasets. Finally, a very
interesting aspect of HNAs is their ability to project all examples into a single bounded dimension.
This strongly suggests that they could also be used as effective dimensionality reduction methods.
References
[1] Massimo Andreatta and Morten Nielsen. Gapped sequence alignment using artificial neural networks: application to the MHC class I system. Bioinformatics, 32(4):511–517, 2015.
[2] Diana Paola Granados, Dev Sriranganadane, Tariq Daouda, Antoine Zieger, Céline M. Laumont, Olivier Caron-Lizotte, Geneviève Boucher, Marie Pierre Hardy, Patrick Gendron, Caroline Côté, Sébastien Lemieux, Pierre Thibault, and Claude Perreault. Impact of genomic polymorphisms on the repertoire of human MHC class I-associated peptides. Nature Communications, 5:3600, apr 2014. ISSN 2041-1723. doi: 10.1038/ncomms4600. URL http://www.nature.com/doifinder/10.1038/ncomms4600.
[3] Linus Backert and Oliver Kohlbacher. Immunoinformatics and epitope prediction in the age of genomic medicine. Genome Medicine, 7(1):119, 2015. ISSN 1756-994X. doi: 10.1186/s13073-015-0245-0. URL http://genomemedicine.com/content/7/1/119.
[4] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
[5] Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. Unsupervised feature learning and deep learning: A review and new perspectives. CoRR, abs/1206.5538, 2012. URL http://arxiv.org/abs/1206.5538.
[6] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[7] Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-encoders as generative models. In Advances in Neural Information Processing Systems, pages 899–907, 2013.
[8] Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. CoRR, abs/1611.02731, 2016.
[9] Joseph D. Comber and Ramila Philip. MHC class I antigen presentation and implications for developing a new generation of therapeutic vaccines. Therapeutic Advances in Vaccines, 2(3):77–89, 2014.
[10] Tariq Daouda. Mariana: The cutest deep learning framework which is also a wonderful declarative language. https://github.com/tariqdaouda/Mariana, 2018.
[11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[12] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[13] Diana Paola Granados, Wafaa Yahyaoui, Céline M Laumont, Tariq Daouda, Tara L Muratore-Schroeder, Caroline Côté, Jean-Philippe Laverdure, Sébastien Lemieux, Pierre Thibault, and Claude Perreault. MHC I-associated peptides preferentially derive from transcripts bearing miRNA response elements. Blood, 119(26):e181–91, jun 2012. ISSN 1528-0020. doi: 10.1182/blood-2012-02-412593. URL http://www.ncbi.nlm.nih.gov/pubmed/22438248.
[14] Robert L Grossman, Allison P Heath, Vincent Ferretti, Harold E Varmus, Douglas R Lowy, Warren A Kibbe, and Louis M Staudt. Toward a shared vision for cancer genomic data. New England Journal of Medicine, 375(12):1109–1112, 2016.
[15] Matthew Hoffman and Matthew Johnson. ELBO surgery: yet another way to carve up the variational evidence lower bound. NIPS Workshop on Advances in Approximate Bayesian Inference, 2016.
[16] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013.
[17] Céline M. Laumont, Tariq Daouda, Jean-Philippe Laverdure, Éric Bonneil, Olivier Caron-Lizotte, Marie-Pierre Hardy, Diana P Granados, Chantal Durette, Sébastien Lemieux, Pierre Thibault, and Claude Perreault. Global proteogenomic analysis of human MHC class I-associated peptides derived from non-canonical reading frames. Nature Communications, 7:10238, 2016. ISSN 2041-1723. doi: 10.1038/ncomms10238. URL http://www.nature.com/doifinder/10.1038/ncomms10238.
[18] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[19] John Lonsdale, Jeffrey Thomas, Mike Salvatore, Rebecca Phillips, Edmund Lo, Saboor Shad, Richard Hasz, Gary Walters, Fernando Garcia, Nancy Young, et al. The Genotype-Tissue Expression (GTEx) project. Nature Genetics, 45(6):580, 2013.
[20] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, page 3, 2013.
[21] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[22] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, page 5, 2011.
[23] Hillary Pearson, Tariq Daouda, Diana Paola Granados, Chantal Durette, Eric Bonneil, Mathieu Courcelles, Anja Rodenbrock, Jean-Philippe Laverdure, Caroline Cote, Sylvie Mader,