Probabilistic Face Embeddings
Yichun Shi and Anil K. Jain
Michigan State University, East Lansing, MI
shiyichu@msu.edu, jain@cse.msu.edu
Abstract
Embedding methods have achieved success in face
recognition by comparing facial features in a latent seman-
tic space. However, in a fully unconstrained face setting,
the facial features learned by the embedding model could
be ambiguous or may not even be present in the input face,
leading to noisy representations. We propose Probabilistic
Face Embeddings (PFEs), which represent each face image
as a Gaussian distribution in the latent space. The mean
of the distribution estimates the most likely feature values
while the variance shows the uncertainty in the feature val-
ues. Probabilistic solutions can then be naturally derived
for matching and fusing PFEs using the uncertainty infor-
mation. Empirical evaluation on different baseline models,
training datasets and benchmarks show that the proposed
method can improve the face recognition performance of
deterministic embeddings by converting them into PFEs.
The uncertainties estimated by PFEs also serve as good in-
dicators of the potential matching accuracy, which are im-
portant for a risk-controlled recognition system.
1. Introduction
When humans are asked to describe a face image, they
not only give the description of the facial attributes, but also
the confidence associated with them. For example, if the
eyes are blurred in the image, a person will regard the eye
size as uncertain information and focus on other features.
Furthermore, if the image is completely corrupted and no
attributes can be discerned, the subject may respond that
he/she cannot identify this face. This kind of uncertainty (or
confidence) estimation is common and important in human
decision making.
On the other hand, the representations used in state-of-the-art face recognition systems are generally confidence-agnostic. These methods depend on an embedding model (e.g., Deep Neural Networks) to give a deterministic point representation for each face image in the latent feature space [28]. A point in the latent space represents the model's estimation of the facial features in the given image.

Figure 1: Difference between deterministic face embeddings and probabilistic face embeddings (PFEs). (a) A deterministic embedding z = f(x) maps every face, whether ambiguous or distinguishable, to a point in the latent space without regard to its feature ambiguity. (b) A probabilistic face embedding p(z|x) instead gives a distributional estimation of features in the latent space. Best viewed in color.

If the error in the estimation is somehow
bounded, the distance between two points can effectively
measure the semantic similarity between the corresponding
face images. But given a low-quality input, where the expected facial features are ambiguous or absent in the image, a large shift in the embedded points is inevitable, leading to false recognition (Figure 1).
Given that face recognition systems have already achieved high recognition accuracies on relatively constrained face recognition benchmarks, e.g., LFW [10] and YTF [38], where most facial attributes can be clearly observed, recent face recognition challenges have moved on to more unconstrained scenarios, including surveillance videos [19, 24, 13] (see Figure 2). In these tasks, any type and degree of variation could exist in the face image, and most of the desired facial features learned by the representation model could be absent. Given this lack of information, it is unlikely that a feature set can be found that always matches these faces accurately. Hence state-of-the-art face recognition systems which obtain over 99% accuracy on LFW have suffered a large performance drop on the IARPA Janus benchmarks [19, 24, 13].
To address the above problems, we propose Probabilistic Face Embeddings (PFEs), which give a distributional estimation instead of a point estimation in the latent space for each input face image (Figure 1). The mean of the distribution can be interpreted as the most likely latent feature values, while the span of the distribution represents the uncertainty of these estimations.

(a) IJB-A [19]. (b) IJB-S [13].
Figure 2: Example images from IJB-A and IJB-S. The first columns show still images, followed by video frames of the respective subjects in the next three columns. These benchmarks present a more unconstrained recognition scenario where there is large variability in the image quality.

PFE can address the uncon-
strained face recognition problem in a two-fold way: (1)
During matching (face comparison), PFE penalizes uncertain features (dimensions) and pays more attention to more confident features. (2) For low-quality inputs, the confidence estimated by PFE can be used to reject the input or to actively ask for human assistance to avoid false recognition. Besides, a natural solution can be derived to aggregate the PFE representations of a set of face images into a new distribution with lower uncertainty to increase the recognition performance. The implementation of PFE is open-sourced¹.
The contributions of the paper can be summarized as below:
1. We propose the concept of Probabilistic Face Embeddings (PFEs), which represent face images as distributions instead of points.
2. We derive probabilistic solutions for face matching and feature fusion using PFEs.
3. We propose a simple method to convert existing deterministic embeddings into PFEs without additional training data.
4. Comprehensive experiments show that PFE can improve face recognition performance of deterministic embeddings and can effectively filter out low-quality inputs to enhance the robustness of face recognition systems.
2. Related Work
Uncertainty Learning in DNNs. To improve the robustness and interpretability of discriminative Deep Neural Networks (DNNs), deep uncertainty learning is receiving more attention [15, 16, 17]. There are two main types of uncertainty: model uncertainty and data uncertainty. Model uncertainty refers to the uncertainty of model parameters given the training data and can be reduced by collecting additional training data [23, 25]. Data uncertainty accounts for the uncertainty in the output whose primary source is the inherent noise in the input data and hence cannot be eliminated with more training data [16]. The uncertainty studied in our work can be categorized as data uncertainty. Although techniques have been developed for estimating data uncertainty in different tasks, including classification and regression [16], they are not suitable for our task since our target space is not well-defined by given labels². Variational Autoencoders [18] can also be regarded as a method for estimating data uncertainty, but they mainly serve a generation purpose. Specific to face recognition, some studies [6, 47] have leveraged model uncertainty for analysis and learning of face representations, but to our knowledge, ours is the first work that utilizes data uncertainty³ for recognition tasks.

¹ https://github.com/seasonSH/Probabilistic-Face-Embeddings
Probabilistic Face Representation. Modeling faces as probabilistic distributions is not a new idea. In the field of face template/video matching, there exists abundant literature on modeling faces as probabilistic distributions [30], subspaces [3] or manifolds [1, 11] in the feature space. However, the input for such methods is a set of face images rather than a single face image, and they use a between-distribution similarity or distance measure, e.g., KL-divergence, for comparison, which does not penalize the uncertainty. Meanwhile, some studies [20] have attempted to build a fuzzy model of a given face using the features of face parts. In comparison, the proposed PFE represents each single face image as a distribution in a latent space encoded by DNNs, and we use an uncertainty-aware log likelihood score to compare the distributions.
Quality-aware Pooling. In contrast to the methods above, recent work on face template/video matching aims to leverage the saliency of deep CNN embeddings by aggregating the deep features of all faces into a single compact vector [43, 22, 41, 7]. In these methods, a separate module learns to predict the quality of each face in the image set, which is then normalized for a weighted pooling of feature vectors. We show that such a solution can be naturally derived under our framework, which not only gives a probabilistic explanation for quality-aware pooling methods, but also leads to a more general solution where an image set can itself be modeled as a PFE representation.
3. Limitations of Deterministic Embeddings
In this section, we explain the problems of deterministic face embeddings from both theoretical and empirical views. Let $\mathcal{X}$ denote the image space and $\mathcal{Z}$ denote the latent feature space of $D$ dimensions. An ideal latent space $\mathcal{Z}$ should only encode identity-salient features and be disentangled from identity-irrelevant features. As such, each identity should have a unique intrinsic code $z \in \mathcal{Z}$ that best represents this person, and each face image $x \in \mathcal{X}$ is an observation sampled from $p(x|z)$. The process of training face embeddings can be viewed as a joint process of searching for such a latent space $\mathcal{Z}$ and learning the inverse mapping

² Although we are given the identity labels, they cannot directly serve as target vectors in the latent feature space.
³ Some work in the literature has also used the terminology "data uncertainty" for a different purpose [42].

1 9 17 25 33 41 49 57 65 73
0.25
0.00
0.25
0.50
0.75
1.00
impostor original vs. degraded (a) Gaussian Blur0.10.20.30.40.50.60.70.80.91.0
0.25
0.00
0.25
0.50
0.75
1.00
impostor original vs. degraded (b) Occlusion0.150.300.450.600.750.901.051.201.351.50
0.25
0.00
0.25
0.50
0.75
1.00
impostor original vs. degraded (c) Random Gaussian Noise
Figure 3: Illustration offeature ambiguity dilemma. The plots show the cosine similarity on LFW dataset with different degrees of degradation. Blue lines
show the similarity between original images and their respective degraded versions. Red lines show the similarity between impostor pairs of degraded
images. The shading indicates the standard deviation. With larger degrees of degradation, the model becomes more condent (very high/low scores) in a
wrong way.
$p(z|x)$. For deterministic embeddings, the inverse mapping is a Dirac delta function $p(z|x) = \delta(z - f(x))$, where $f$ is the embedding function. Clearly, for any space $\mathcal{Z}$, given the possibility of noise in $x$, it is unrealistic to recover the exact $z$, and the embedded point of a low-quality input would inevitably shift away from its intrinsic $z$ (no matter how much training data we have).
The question is whether this shift can be bounded such that we still have smaller intra-class distances compared to inter-class distances. However, this is unrealistic for fully unconstrained face recognition, and we conduct an experiment to illustrate it. Let us start with a simple example: given a pair of identical images, a deterministic embedding will always map them to the same point, and therefore the distance between them will always be 0, even if these images do not contain a face. This implies that "a pair of images being similar or even identical does not necessarily mean the probability of their belonging to the same person is high".
To demonstrate this, we conduct an experiment by manually degrading high-quality images and visualizing their similarity scores. We randomly select a high-quality image of each subject from the LFW dataset [10] and manually apply Gaussian blur, occlusion, and random Gaussian noise to the faces. In particular, we linearly increase the size of the Gaussian kernel, the occlusion ratio, and the standard deviation of the noise to control the degree of degradation. At each degradation level, we extract the feature vectors with a 64-layer CNN⁴, which is comparable to state-of-the-art face recognition systems. The features are normalized to a hyper-spherical embedding space. Then, two types of cosine similarity are reported: (1) similarity between each original image and its respective degraded version, and (2) similarity between degraded images of different identities.
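For concreteness, the following is a small Python sketch of the three degradation operators used above; the function name, the occlusion placement, and the level parameterization are our own illustrative choices, not the exact schedule from the paper.

```python
import numpy as np
from PIL import Image, ImageFilter

def degrade(img, kind, level):
    """Apply one of the three degradations at a given level.

    img: PIL RGB image; kind: 'blur' | 'occlusion' | 'noise';
    level: scalar controlling the degradation degree (illustrative).
    """
    if kind == "blur":
        return img.filter(ImageFilter.GaussianBlur(radius=level))
    arr = np.asarray(img).astype(np.float32)
    if kind == "occlusion":
        h, w = arr.shape[:2]
        arr[: int(h * level), : int(w * level)] = 0  # black out a corner block
    elif kind == "noise":
        arr += np.random.normal(0.0, 255.0 * level, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
```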
As shown in Figure 3, for all three types of degradation, the genuine similarity scores decrease to 0 while the

⁴ Trained on MS-Celeb-1M [8] with AM-Softmax [34].
(a) Low-similarity genuine pairs. (b) High-similarity impostor pairs.
Figure 4: Example genuine pairs from the IJB-A dataset estimated with the lowest similarity scores and impostor pairs estimated with the highest similarity scores (among all possible pairs) by a 64-layer CNN model. The genuine pairs mostly consist of one high-quality and one low-quality image, while the impostor pairs are all low-quality images. Note that these pairs are not templates in the verification protocol.
impostor similarity scores converge to 1.0! These indicate two types of errors that can be expected in a fully unconstrained scenario even when the model is very confident (very high/low similarity scores):
(1) genuine pairs of one high-quality and one low-quality image can be falsely rejected, and
(2) impostor pairs of two low-quality images can be falsely accepted.
To confirm this, we test the model on the IJB-A dataset by finding the impostor/genuine image pairs with the highest/lowest scores, respectively. The situation is exactly as we hypothesized (see Figure 4). We call this the feature ambiguity dilemma, which is observed when deterministic embeddings are forced to estimate the features of ambiguous faces. The experiment also implies that there exists a dark space where ambiguous inputs are mapped and the distance metric is distorted.
4. Probabilistic Face Embeddings
To address the aforementioned problem caused by data uncertainty, we propose to encode the uncertainty into the face representation and take it into account during matching. Specifically, instead of building a model that gives a point estimation in the latent space, we estimate a distribution $p(z|x)$ in the latent space to represent the potential appearance of a person's face⁵. In particular, we use a multivariate Gaussian distribution:

$$p(z|x_i) = \mathcal{N}(z;\, \mu_i,\, \sigma_i^2 I) \tag{1}$$

where $\mu_i$ and $\sigma_i$ are both $D$-dimensional vectors predicted by the network from the $i$-th input image $x_i$. Here we only consider a diagonal covariance matrix to reduce the complexity of the face representation. This representation should have the following properties:
1. $\mu(x)$ should encode the most likely facial features of the input image.
2. $\sigma(x)$ should encode the model's confidence along each feature dimension.
In addition, we wish to use a single network to predict the distribution. Considering that new approaches for training face embeddings are still being developed, we aim to develop a method that can convert existing deterministic face embedding networks into PFEs in an easy manner. In the following, we first show how to compare and fuse PFE representations to demonstrate their strength, and then propose our method for learning PFEs.
Note 1. Because of the space limit, we provide the proofs of all the propositions below in Section A.
4.1. Matching with PFEs
Given the PFE representations of a pair of images $(x_i, x_j)$, we can directly measure the "likelihood" of them belonging to the same person (sharing the same latent code): $p(z_i = z_j)$, where $z_i \sim p(z|x_i)$ and $z_j \sim p(z|x_j)$. Specifically,

$$p(z_i = z_j) = \int p(z_i|x_i)\, p(z_j|x_j)\, \delta(z_i - z_j)\, dz_i\, dz_j. \tag{2}$$

In practice, we would like to use the log likelihood instead, whose solution is given by:

$$s(x_i, x_j) = \log p(z_i = z_j) = -\frac{1}{2} \sum_{l=1}^{D} \left( \frac{(\mu_i^{(l)} - \mu_j^{(l)})^2}{\sigma_i^{2(l)} + \sigma_j^{2(l)}} + \log\left(\sigma_i^{2(l)} + \sigma_j^{2(l)}\right) \right) - \mathrm{const}, \tag{3}$$

where $\mathrm{const} = \frac{D}{2} \log 2\pi$, $\mu_i^{(l)}$ refers to the $l$-th dimension of $\mu_i$, and similarly for $\sigma_i^{2(l)}$.
Note that this symmetric measure can be viewed as the expectation of the likelihood of one input's latent code conditioned on the other, that is,

$$s(x_i, x_j) = \log \int p(z|x_i)\, p(z|x_j)\, dz = \log \mathbb{E}_{z \sim p(z|x_i)}\left[p(z|x_j)\right] = \log \mathbb{E}_{z \sim p(z|x_j)}\left[p(z|x_i)\right]. \tag{4}$$

⁵ Following the notation in Section 3.

Figure 5: Fusion with PFEs. (a) Illustration of the fusion process as a directed graphical model. (b) Given the Gaussian representations of faces (from the same identity), the fusion process outputs a new Gaussian distribution in the latent space with a more precise mean and lower uncertainty.
As such, we call it the mutual likelihood score (MLS). Different from KL-divergence, this score is unbounded and cannot be seen as a distance metric. It can be shown that the squared Euclidean distance is equivalent to a special case of MLS when all the uncertainties are assumed to be the same:

Property 1. If $\sigma_i^{(l)}$ is a fixed number for all data $x_i$ and dimensions $l$, MLS is equivalent to a scaled and shifted negative squared Euclidean distance.

Further, when the uncertainties are allowed to differ, MLS has some interesting properties that make it different from a distance metric:
1. Attention mechanism: the first term in the brackets in Equation (3) can be seen as a weighted distance which assigns larger weights to less uncertain dimensions.
2. Penalty mechanism: the second term in the brackets in Equation (3) can be seen as a penalty term which penalizes dimensions that have high uncertainty.
3. If either $x_i$ or $x_j$ has large uncertainty, MLS will be low (because of the penalty) irrespective of the distance between their means.
4. Only if both $x_i$ and $x_j$ have small uncertainty and their means are close to each other can MLS be very high.
The last two properties imply that PFE can solve the feature ambiguity dilemma if the network can effectively estimate $\sigma_i$. A minimal sketch of computing MLS is shown below.
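The following is a minimal NumPy sketch of Equation (3); the function name and the toy inputs are ours, not taken from the released implementation.

```python
import numpy as np

def mutual_likelihood_score(mu1, sigma_sq1, mu2, sigma_sq2):
    """Mutual likelihood score (Equation 3) between two PFEs.

    mu*: (D,) mean vectors; sigma_sq*: (D,) per-dimension variances.
    """
    s = sigma_sq1 + sigma_sq2
    # Weighted squared distance (attention) plus log-variance penalty.
    return -0.5 * np.sum((mu1 - mu2) ** 2 / s + np.log(s)) \
           - 0.5 * len(mu1) * np.log(2 * np.pi)

# Toy example: identical means, but one pair is far more uncertain.
mu = np.zeros(512)
low, high = np.full(512, 0.1), np.full(512, 1.0)
print(mutual_likelihood_score(mu, low, mu, low))   # confident pair: high MLS
print(mutual_likelihood_score(mu, high, mu, high)) # uncertain pair: lower MLS
```

Note how the penalty term keeps the uncertain pair's score low even though the means coincide, which is exactly property 3 above.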
4.2. Fusion with PFEs
In many cases we have a template (set) of face images, for which we need to build a compact representation for matching. With PFEs, a conjugate formula can be derived for representation fusion (Figure 5). Let $\{x_1, x_2, \ldots, x_n\}$ be a series of observations (face images) from the same identity and $p(z|x_1, x_2, \ldots, x_n)$ be the posterior distribution after the $n$-th observation. Then, assuming all the observations are conditionally independent (given the latent code $z$), it can be shown that:

$$p(z|x_1, x_2, \ldots, x_{n+1}) = \alpha\, \frac{p(z|x_{n+1})}{p(z)}\, p(z|x_1, x_2, \ldots, x_n), \tag{5}$$

where $\alpha$ is a normalization factor. To simplify the notation, let us only consider a one-dimensional case below; the solution can be easily extended to the multivariate case.
If $p(z)$ is assumed to be a noninformative prior, i.e., $p(z)$ is a Gaussian distribution whose variance approaches $\infty$, the posterior distribution in Equation (5) is a new Gaussian distribution with lower uncertainty (see Section A.3). Further, given a set of face images $\{x_1, x_2, \ldots, x_n\}$, the parameters of the fused representation can be directly given by:

$$\hat{\mu}_n = \sum_{i=1}^{n} \frac{\hat{\sigma}_n^2}{\sigma_i^2}\, \mu_i, \tag{6}$$

$$\frac{1}{\hat{\sigma}_n^2} = \sum_{i=1}^{n} \frac{1}{\sigma_i^2}. \tag{7}$$
In practice, because the conditional independence assumption is usually not true, e.g., video frames include a large amount of redundancy, Equation (7) will be biased by the number of images in the set. Therefore, we take the dimension-wise minimum to obtain the new uncertainty.

Relationship to Quality-aware Pooling. Consider the case where all the dimensions share the same uncertainty $\sigma_i$ for the $i$-th input, and let the quality value $q_i = \frac{1}{\sigma_i^2}$ be the output of the network. Then Equation (6) can be written as

$$\hat{\mu}_n = \frac{\sum_{i=1}^{n} q_i\, \mu_i}{\sum_{j=1}^{n} q_j}. \tag{8}$$

If we do not use the uncertainty after fusion, the algorithm is the same as recent quality-aware aggregation methods for set-to-set face recognition [43, 22, 41]. A sketch of the fusion rule follows.
4.3. Learning
Note that any deterministic embedding $f$, if properly optimized, can indeed satisfy the properties of PFEs: (1) the embedding space is a disentangled identity-salient latent space and (2) $f(x)$ represents the most likely features of the given input in the latent space. As such, in this work we consider a stage-wise training strategy: given a pre-trained embedding model $f$, we fix its parameters, take $\mu(x) = f(x)$, and optimize an additional uncertainty module to estimate $\sigma(x)$. When the uncertainty module is trained on the same dataset as the embedding model, this stage-wise training strategy allows a fairer comparison between PFE and the original embedding $f(x)$ than an end-to-end learning strategy.
The uncertainty module is a network with two fully-connected layers which shares the same input as the bottleneck layer⁶. The optimization criterion is to maximize the mutual likelihood score of all genuine pairs $(x_i, x_j)$. Formally, the loss function to minimize is

$$\mathcal{L} = -\frac{1}{|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} s(x_i, x_j), \tag{9}$$

where $\mathcal{P}$ is the set of all genuine pairs and $s$ is defined in Equation (3). In practice, the loss function is optimized within each mini-batch. Intuitively, this loss function can be understood as an alternative to maximizing $p(z|x)$: if the latent distributions of all possible genuine pairs have a large overlap, the latent target $z$ should have a large likelihood $p(z|x)$ for any corresponding $x$. Notice that because $\mu(x)$ is fixed, the optimization will not lead to the collapse of all the $\mu(x)$ to a single point. A sketch of the mini-batch loss is given below.

⁶ The bottleneck layer refers to the layer which outputs the original face embedding.
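The following is a minimal NumPy sketch of Equation (9), reusing the mutual_likelihood_score function from the sketch in Section 4.1; the naive pair enumeration is ours and is only for illustration.

```python
import numpy as np

def mls_loss(mus, sigma_sqs, labels):
    """Average negative MLS over genuine pairs in a mini-batch (Eq. 9).

    mus, sigma_sqs: (N, D) arrays of fixed means and predicted variances;
    labels: (N,) identity labels used to enumerate genuine pairs.
    """
    total, count = 0.0, 0
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if labels[i] == labels[j]:  # genuine pair
                total += mutual_likelihood_score(
                    mus[i], sigma_sqs[i], mus[j], sigma_sqs[j])
                count += 1
    return -total / max(count, 1)  # minimize the negative score
```

Since the means are frozen, only the variance-predicting parameters receive gradients from this loss, which is what prevents the embedding from collapsing.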
5. Experiments
In this section, we first test the proposed PFE method on standard face recognition protocols to compare with deterministic embeddings. Then we conduct qualitative analysis to gain more insight into how PFE behaves. Due to the space limit, we provide the implementation details in Section B.
To comprehensively evaluate the efficacy of PFEs, we conduct experiments on 7 benchmarks, including the well-known LFW [10], YTF [38] and MegaFace [14], and four other more unconstrained benchmarks:
CFP [29] contains 7,000 frontal/profile face photos of 500 subjects. We only test on the frontal-profile (FP) protocol, which includes 7,000 pairs of frontal-profile faces.
IJB-A [19] is a template-based benchmark, containing 25,813 face images of 500 subjects. Each template includes a set of still photos or video frames. Compared with previous benchmarks, the faces in IJB-A have larger variations and present a more unconstrained scenario.
IJB-C [24] is an extension of IJB-A with 140,740 face images of 3,531 subjects. The verification protocol of IJB-C includes more impostor pairs, so we can compute True Accept Rates (TAR) at lower False Accept Rates (FAR).
IJB-S [13] is a surveillance video benchmark containing 350 surveillance videos spanning 30 hours in total, 5,656 enrollment images, and 202 enrollment videos of 202 subjects. Many faces in this dataset are of extreme pose or low quality, making it one of the most challenging face recognition benchmarks (see Figure 2).
We use CASIA-WebFace [44] and a cleaned version⁷ of MS-Celeb-1M [8] as training data, from which we remove the subjects that are also included in the test datasets⁸.

⁷ https://github.com/inlmouse/MS-Celeb-1M_WashList
⁸ 84 and 4,182 subjects were removed from CASIA-WebFace and MS-Celeb-1M, respectively.

Base Model                  Representation  LFW    YTF    CFP-FP  IJB-A
Softmax + Center Loss [36]  Original        98.93  94.74  93.84   78.16
                            PFE             99.27  95.42  94.51   80.83
Triplet [28]                Original        97.65  93.36  89.76   60.82
                            PFE             98.45  93.96  90.04   61.00
A-Softmax [21]              Original        99.15  94.80  92.41   78.54
                            PFE             99.32  94.94  93.37   82.58
AM-Softmax [34]             Original        99.28  95.64  94.77   84.69
                            PFE             99.55  95.92  95.06   87.58

Table 1: Results of models trained on CASIA-WebFace. "Original" refers to the deterministic embeddings. The better performance for each base model is shown in bold. "PFE" uses the mutual likelihood score for matching. IJB-A results are verification rates at FAR = 0.1%.
Method           Training Data  LFW    YTF    MF1 Rank1  MF1 Veri.
DeepFace+ [32]   4M             97.35  91.4   -          -
FaceNet [28]     200M           99.63  95.1   -          -
DeepID2+ [31]    300K           99.47  93.2   -          -
CenterFace [36]  0.7M           99.28  94.9   65.23      76.52
SphereFace [21]  0.5M           99.42  95.0   75.77      89.14
ArcFace [4]      5.8M           99.83  98.02  81.03      96.98
CosFace [35]     5M             99.73  97.6   77.11      89.88
L2-Face [26]     3.7M           99.78  96.08  -          -
Baseline         4.4M           99.70  97.18  79.43      92.93
PFE_fuse         4.4M           -      97.32  -          -
PFE_fuse+match   4.4M           99.82  97.36  78.95      92.51

Table 2: Results of our models (last three rows) trained on MS-Celeb-1M and state-of-the-art methods on LFW, YTF and MegaFace (MF1). The MegaFace verification rates are computed at FAR = 0.0001%. "-" indicates that the authors did not report the performance on the corresponding protocol.
5.1. Experiments on Different Base Embeddings
Since our method works by converting existing deterministic embeddings, we want to evaluate how it works with different base embeddings, i.e., face representations trained with different loss functions. In particular, we implement the following state-of-the-art loss functions: Softmax+Center Loss [36], Triplet Loss [28], A-Softmax [21] and AM-Softmax [34]⁹. To be aligned with previous work [21, 34], we train a 64-layer residual network [21] with each of these loss functions on the CASIA-WebFace dataset as base models. All the features are $\ell_2$-normalized to a hyper-spherical embedding space. Then we train the uncertainty module for each base model on CASIA-WebFace again for 3,000 steps. We evaluate the performance on four benchmarks: LFW [10], YTF [38], CFP-FP [29] and IJB-A [19], which present different challenges in face recognition. The results are shown in Table 1. PFE improves over the original representation in all cases, indicating that the proposed method is robust across different embeddings and testing scenarios.
5.2. Comparison with State-of-the-Art
To compare with state-of-the-art face recognition methods, we use a different base model: a 64-layer

⁹ We also tried implementing ArcFace [4], but it did not converge well in our case, so we did not use it.
Method            Training Data  IJB-A TAR@FAR=0.1%  IJB-A TAR@FAR=1.0%  CFP-FP
DR-GAN [33]       1M             53.9 ± 4.3          77.4 ± 2.7          93.41
Yin et al. [46]   0.5M           73.9 ± 4.2          77.5 ± 2.5          94.39
TPE [27]          0.5M           90.0 ± 1.0          93.4 ± 0.5          89.17
NAN [43]          3M             88.1 ± 1.1          94.1 ± 0.8          -
QAN [22]          5M             89.31 ± 3.92        94.20 ± 1.53        -
Cao et al. [2]    3.3M           90.4 ± 1.4          95.8 ± 0.6          -
Multicolumn [41]  3.3M           92.0 ± 1.3          96.2 ± 0.5          -
L2-Face [26]      3.7M           94.3 ± 0.5          97.00 ± 0.4         -
Baseline          4.4M           93.30 ± 1.29        96.15 ± 0.71        92.78
PFE_fuse          4.4M           94.59 ± 0.72        95.92 ± 0.73        -
PFE_fuse+match    4.4M           95.25 ± 0.89        97.50 ± 0.43        93.34

Table 3: Results of our models (last three rows) trained on MS-Celeb-1M and state-of-the-art methods on CFP (frontal-profile protocol) and IJB-A.
Method            Training Data  IJB-C TAR@FAR
                                 0.001%  0.01%  0.1%   1%
Yin et al. [45]   0.5M           -       -      69.3   83.8
Cao et al. [2]    3.3M           74.7    84.0   91.0   96.0
Multicolumn [41]  3.3M           77.1    86.2   92.7   96.8
DCN [40]          3.3M           -       88.5   94.7   98.3
Baseline          4.4M           70.10   85.37  93.61  96.91
PFE_fuse          4.4M           83.14   92.38  95.47  97.36
PFE_fuse+match    4.4M           89.64   93.25  95.49  97.17

Table 4: Results of our models (last three rows) trained on MS-Celeb-1M and state-of-the-art methods on IJB-C.
network trained with AM-Softmax on the MS-Celeb-1M dataset. We then fix its parameters and train the uncertainty module on the same dataset for 12,000 steps. In the following experiments, we compare 3 methods:
Baseline only uses the original features of the 64-layer deterministic embedding along with cosine similarity for matching. Average pooling is used in the case of template/video benchmarks.
PFE_fuse uses the uncertainty σ estimated by PFE and Equation (6) to aggregate the features of templates, but uses cosine similarity for matching. If the uncertainty module estimates the feature uncertainty effectively, fusion with σ should outperform average pooling by assigning larger weights to confident features.
PFE_fuse+match uses σ both for fusion and matching (with mutual likelihood scores). Templates/videos are fused based on Equation (6) and Equation (7).
In Table 2 we compare with state-of-the-art methods on three benchmarks: LFW, YTF and MegaFace. Although the accuracies on LFW and YTF are nearly saturated, the proposed PFE still improves the performance of the original representation. Note that MegaFace is a biased dataset: because all the probes are high-quality images from FaceScrub, the positive pairs in MegaFace are both high-quality images while the negative pairs contain at most one low-quality image¹⁰. Therefore, neither of the two types of error caused

¹⁰ The negative pairs of MegaFace in the verification protocol only include those between probes and distractors.

                                Surveillance-to-Single              Surveillance-to-Booking             Surveillance-to-Surveillance
Method          Training Data   Rank-1 Rank-5 Rank-10  1%    10%    Rank-1 Rank-5 Rank-10  1%    10%    Rank-1 Rank-5 Rank-10  1%   10%
C-FAN [7]       5.0M            50.82  61.16  64.95    16.44 24.19  53.04  62.67  66.35    27.40 29.70  10.05  17.55  21.06    0.11 0.68
Baseline        4.4M            50.00  59.07  62.70    7.22  19.05  47.54  56.14  61.08    14.75 22.99  9.40   17.52  23.04    0.06 0.71
PFE_fuse        4.4M            53.44  61.40  65.05    10.53 22.87  55.45  63.17  66.38    16.70 26.20  8.18   14.52  19.31    0.09 0.63
PFE_fuse+match  4.4M            50.16  58.33  62.28    31.88 35.33  53.60  61.75  64.97    35.99 39.82  9.20   20.82  27.34    0.84 2.83

Table 5: Performance comparison on three protocols of IJB-S. The performance is reported in terms of rank retrieval (closed-set) and TPIR@FPIR (open-set) instead of the media-normalized version [13]. The numbers "1%" and "10%" in the second row refer to the FPIR.

Figure 6: Repeated experiments on the feature ambiguity dilemma with the proposed PFE: (a) Gaussian blur, (b) occlusion, (c) random noise. The same model as in Figure 3 is converted to a PFE by training an uncertainty module. No additional training data nor data augmentation is used for training.
(a) Low-score genuine pairs. (b) High-score impostor pairs.
Figure 7: Example genuine pairs from the IJB-A dataset estimated with the lowest mutual likelihood scores and impostor pairs with the highest scores by the PFE version of the same 64-layer CNN model in Section 3. In comparison to Figure 4, most images here are high-quality ones with clear features, which can mislead the model into being confident in a wrong way. Note that these pairs are not templates in the verification protocol.

by the feature ambiguity dilemma (Section 3) will show up in MegaFace, which naturally favors deterministic embeddings. However, PFE still maintains the performance in this case. We also note that such a bias, namely the target gallery images being of higher quality than the rest of the gallery, would not exist in real-world applications.
In Tables 3 and 4 we report the results on three more challenging datasets: CFP, IJB-A and IJB-C. The images in these datasets present larger variations in pose, occlusion, etc., and facial features can be more ambiguous. As such, PFE achieves a more significant improvement on these three benchmarks. In particular, on IJB-C at FAR = 0.001%, PFE reduces the error rate by 64%. In addition, simply fusing the original features with the learned uncertainty (PFE_fuse) also helps the performance.
In Table 5 we report the results on the latest benchmark, IJB-S. Again, PFE is able to improve the performance in most cases. Notice that the gallery templates in the "Surveillance-to-Single" and "Surveillance-to-Booking" protocols all include high-quality frontal mugshots, which
Figure 8: Distribution of estimated uncertainty on different datasets (LFW, IJB-A, IJB-S). Here, "uncertainty" refers to the harmonic mean of σ across all feature dimensions. Note that the estimated uncertainty is proportional to the complexity of the datasets. Best viewed in color.
present little feature ambiguity. Therefore, we only see a slight performance gap in these two protocols. But in the most challenging "Surveillance-to-Surveillance" protocol, a larger improvement can be achieved by using uncertainty for matching. Besides, PFE_fuse+match improves the performance significantly on all the open-set protocols, which indicates that MLS has more impact on the absolute pairwise score than on the relative ranking.
5.3. Qualitative Analysis
Why and when does PFE improve performance? We first repeat the experiments in Section 3 using the PFE representation and MLS. The same network is used as the base model here. As one can see in Figure 6, although the scores of low-quality impostor pairs still increase with the degradation, they converge to a point that is lower than the majority of genuine scores. Similarly, the scores of cross-quality genuine pairs converge to a point that is higher than the majority of impostor scores. This means the two types of errors discussed in Section 3 can be avoided, which is also confirmed by the IJB-A results in Figure 7. Figure 8 shows
Figure 9: Visualization results on a high-quality, a low-quality and a mis-detected image from IJB-A. For each input, 5 images are reconstructed by a pre-trained decoder using the mean and 4 randomly sampled z vectors from the estimated distribution p(z|x).
the distribution of estimated uncertainty on LFW, IJB-A and IJB-S. As one can see, the "variance" of uncertainty increases in the following order: LFW < IJB-A < IJB-S. Comparing this with the performance in Section 5.2, we can see that PFE tends to achieve a larger performance improvement on datasets with more diverse image quality.

What does the DNN see and not see? To answer this question, we train a decoder network on the original embedding and then apply it to PFE by sampling z from the estimated distribution p(z|x) of a given x. For a high-quality image (Figure 9, Row 1), the reconstructed images tend to be very consistent without much variation, implying the model is very certain about the facial features in this image. In contrast, for a lower-quality input (Figure 9, Row 2), more variation can be observed in the reconstructed images. In particular, attributes that can be clearly discerned from the image (e.g., thick eyebrows) are still consistent, while attributes that cannot be discerned (e.g., eye shape) show larger variation. As for a mis-detected image (Figure 9, Row 3), large variation can be observed in the reconstructed images: the model does not see any salient feature in the given image.
6. Risk-controlled Face Recognition
In many scenarios, we may expect a higher performance than our system is able to achieve, or we may want to make sure the system's performance can be controlled when facing complex application scenarios. Therefore, we would expect the model to reject input images if it is not confident. A common solution for this is to filter the images with a quality assessment tool. We show that PFE provides a natural solution for this task. We take all the images from the LFW and IJB-A datasets for image-level face verification (we do not follow the original protocols here). The system is allowed to "filter out" a proportion of all images to maintain better performance. We then report TAR@FAR = 0.001% against the "filter-out rate". We consider two criteria for filtering: (1) the detection score of MTCNN [37] and (2) a confidence value predicted by our uncertainty module. Here
the confidence for the i-th sample is defined as the inverse of the harmonic mean of σ_i across all dimensions.

Figure 10: Example images from LFW and IJB-A that are estimated with the highest (H) confidence/quality scores and the lowest (L) scores by our method and by the MTCNN face detector.

Figure 11: Comparison of verification performance on (a) LFW and (b) IJB-A (not the original protocol) when filtering a proportion of images using different quality criteria: TAR@FAR = 0.001% is plotted against the filter-out rate for our confidence and for the MTCNN score.

For fairness, both
methods use the original deterministic embedding representations and cosine similarity for matching. To avoid saturated results, we use the model trained on CASIA-WebFace with AM-Softmax. The results are shown in Figure 11. As one can see, the predicted confidence value is a better indicator of the potential recognition accuracy of the input image. This is an expected result since PFE is trained under supervision for the particular model, while an external quality estimator is unaware of the kind of features used for matching by the model. Example images with high/low confidence/quality scores are shown in Figure 10. A minimal sketch of this confidence-based filtering is given below.
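The following NumPy sketch illustrates the filtering step; the function names are our own, and only the confidence definition (inverse of the harmonic mean of σ) comes from the text above.

```python
import numpy as np

def pfe_confidence(sigma):
    """Confidence of one sample: the inverse of the harmonic mean of
    the per-dimension uncertainty sigma (Section 6). Since the
    harmonic mean is D / sum(1/sigma), its inverse is mean(1/sigma)."""
    return np.mean(1.0 / sigma)

def filter_by_confidence(sigmas, filter_out_rate):
    """Return indices of the most confident (1 - rate) fraction of images.

    sigmas: (N, D) array of per-image uncertainty vectors.
    """
    conf = np.array([pfe_confidence(s) for s in sigmas])
    n_keep = int(len(conf) * (1.0 - filter_out_rate))
    return np.argsort(-conf)[:n_keep]  # indices sorted by descending confidence
```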
7. Conclusion
We have proposed Probabilistic Face Embeddings (PFEs), which represent face images as distributions in the latent space. Probabilistic solutions were derived for comparing and aggregating the PFEs of face images. Unlike deterministic embeddings, PFEs do not suffer from the feature ambiguity dilemma in unconstrained face recognition. Quantitative and qualitative analysis on different settings showed that PFEs can effectively improve face recognition performance by converting deterministic embeddings into PFEs. We have also shown that the uncertainty in PFEs is a good indicator of the "discriminative" quality of face images. In future work, we will explore how to learn PFEs in an end-to-end manner and how to address the data dependency within face templates.

References
[1] Ognjen Arandjelović, Gregory Shakhnarovich, John Fisher, Roberto Cipolla, and Trevor Darrell. Face recognition with image sets using manifold density divergence. In CVPR, 2005.
[2] Qiong Cao, Li Shen, Weidi Xie, Omkar M. Parkhi, and Andrew Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In FG, 2018.
[3] Hakan Cevikalp and Bill Triggs. Face recognition based on image sets. In CVPR, 2010.
[4] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In CVPR, 2019.
[5] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.
[6] Sixue Gong, Vishnu Naresh Boddeti, and Anil K. Jain. On the capacity of face representation. arXiv:1709.10433, 2017.
[7] Sixue Gong, Yichun Shi, and Anil K. Jain. Low quality video face recognition: Component-wise feature aggregation network (C-FAN). In ICB, 2019.
[8] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In ECCV, 2016.
[9] Uncertainty in representation of facial features for face recognition. In Face Recognition, 2007.
[10] Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments. Technical report, University of Massachusetts, Amherst, 2007.
[11] Zhiwu Huang, Ruiping Wang, Shiguang Shan, Xianqiu Li, and Xilin Chen. Log-Euclidean metric learning on symmetric positive definite manifold with application to image set classification. In ICML, 2015.
[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[13] Nathan D. Kalka, Brianna Maze, James A. Duncan, Kevin O'Connor, Stephen Elliott, Kaleb Hebert, Julia Bryan, and Anil K. Jain. IJB-S: IARPA Janus surveillance video benchmark. In BTAS, 2018.
[14] Ira Kemelmacher-Shlizerman, Steven M. Seitz, Daniel Miller, and Evan Brossard. The MegaFace benchmark: 1 million faces for recognition at scale. In CVPR, 2016.
[15] Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. In BMVC, 2015.
[16] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In NIPS, 2017.
[17] Salman Khan, Munawar Hayat, Waqas Zamir, Jianbing Shen, and Ling Shao. Striking the right balance with uncertainty. arXiv:1901.07590, 2019.
[18] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2013.
[19] Brendan F. Klare, Ben Klein, Emma Taborsky, Austin Blanton, Jordan Cheney, Kristen Allen, Patrick Grother, Alan Mah, and Anil K. Jain. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A. In CVPR, 2015.
[20] Haoxiang Li, Gang Hua, Zhe Lin, Jonathan Brandt, and Jianchao Yang. Probabilistic elastic matching for pose variant face verification. In CVPR, 2013.
[21] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace: Deep hypersphere embedding for face recognition. In CVPR, 2017.
[22] Yu Liu, Junjie Yan, and Wanli Ouyang. Quality aware network for set to set recognition. In CVPR, 2017.
[23] David J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 1992.
[24] Brianna Maze, Jocelyn Adams, James A. Duncan, Nathan Kalka, Tim Miller, Charles Otto, Anil K. Jain, Tyler Niggel, Janet Anderson, Jordan Cheney, and Patrick Grother. IARPA Janus Benchmark-C: Face dataset and protocol. In ICB, 2018.
[25] Radford M. Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.
[26] Rajeev Ranjan, Carlos D. Castillo, and Rama Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv:1703.09507, 2017.
[27] Swami Sankaranarayanan, Azadeh Alavi, Carlos D. Castillo, and Rama Chellappa. Triplet probabilistic embedding for face verification and clustering. In BTAS, 2016.
[28] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.
[29] Soumyadip Sengupta, Jun-Cheng Chen, Carlos Castillo, Vishal M. Patel, Rama Chellappa, and David W. Jacobs. Frontal to profile face verification in the wild. In WACV, 2016.
[30] Gregory Shakhnarovich, John W. Fisher, and Trevor Darrell. Face recognition from long-term observations. In ECCV, 2002.
[31] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deeply learned face representations are sparse, selective, and robust. In CVPR, 2015.
[32] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. DeepFace: Closing the gap to human-level performance in face verification. In CVPR, 2014.
[33] Luan Tran, Xi Yin, and Xiaoming Liu. Disentangled representation learning GAN for pose-invariant face recognition. In CVPR, 2017.
[34] Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu. Additive margin softmax for face verification. IEEE Signal Processing Letters, 2018.
[35] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. CosFace: Large margin cosine loss for deep face recognition. In CVPR, 2018.
[36] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, 2016.
[37] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks (MTCNN). IEEE Signal Processing Letters, 2016.
[38] Lior Wolf, Tal Hassner, and Itay Maoz. Face recognition in unconstrained videos with matched background similarity. In CVPR, 2011.
[39] Xiang Wu, Ran He, Zhenan Sun, and Tieniu Tan. A light CNN for deep face representation with noisy labels. IEEE Trans. on Information Forensics and Security, 2015.
[40] Weidi Xie, Li Shen, and Andrew Zisserman. Comparator networks. In ECCV, 2018.
[41] Weidi Xie and Andrew Zisserman. Multicolumn networks for face recognition. In BMVC, 2018.
[42] Hong Liu, and Shaohua Teng. Data uncertainty in face recognition. IEEE Trans. on Cybernetics, 2014.
[43] Jiaolong Yang, Peiran Ren, Dongqing Zhang, Dong Chen, Fang Wen, Hongdong Li, and Gang Hua. Neural aggregation network for video face recognition. In CVPR, 2017.
[44] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z. Li. Learning face representation from scratch. arXiv:1411.7923, 2014.
[45] Bangjie Yin, Luan Tran, Haoxiang Li, Xiaohui Shen, and Xiaoming Liu. Towards interpretable face recognition. arXiv:1805.00611, 2018.
[46] Xi Yin and Xiaoming Liu. Multi-task convolutional neural network for pose-invariant face recognition. IEEE Trans. on Image Processing, 2018.
[47] Ahmed, Ahsan Latif, Kaleem Razzaq Malik, and Abdullahi Mohamud Sharif. Face recognition with Bayesian convolutional networks for robust surveillance systems. EURASIP Journal on Image and Video Processing, 2019.
A. Proofs
A.1. Mutual Likelihood Score
Here we prove Equation (3) in the main paper. For simplicity, we do not directly solve the integral. Instead, let us consider an alternative vector $\Delta z = z_i - z_j$, where $z_i \sim p(z|x_i)$, $z_j \sim p(z|x_j)$ and $(x_i, x_j)$ are the pair of images we need to compare. Then $p(z_i = z_j)$, i.e., Equation (2) in the main paper, is equivalent to the density value of $p(\Delta z = 0)$.
The $l$-th component (dimension) of $\Delta z$, $\Delta z^{(l)}$, is the difference of two Gaussian variables, which means:

$$\Delta z^{(l)} \sim \mathcal{N}\left(\mu_i^{(l)} - \mu_j^{(l)},\; \sigma_i^{2(l)} + \sigma_j^{2(l)}\right). \tag{10}$$

Therefore, the mutual likelihood score is given by:

$$s(x_i, x_j) = \log p(z_i = z_j) = \log p(\Delta z = 0) = \sum_{l=1}^{D} \log p(\Delta z^{(l)} = 0) = -\frac{1}{2} \sum_{l=1}^{D} \left( \frac{(\mu_i^{(l)} - \mu_j^{(l)})^2}{\sigma_i^{2(l)} + \sigma_j^{2(l)}} + \log\left(\sigma_i^{2(l)} + \sigma_j^{2(l)}\right) \right) - \frac{D}{2} \log 2\pi. \tag{11}$$

Note that directly solving the integral leads to the same solution.
A.2. Property 1
Let us consider the case where $\sigma_i^{2(l)}$ equals a constant $c > 0$ for any image $x_i$ and dimension $l$. Then the mutual likelihood score between a pair $(x_i, x_j)$ becomes:

$$s(x_i, x_j) = -\frac{1}{2} \sum_{l=1}^{D} \left( \frac{(\mu_i^{(l)} - \mu_j^{(l)})^2}{2c} + \log(2c) \right) - \frac{D}{2} \log 2\pi = -c_1 \left\lVert \mu_i - \mu_j \right\rVert^2 - c_2, \tag{12}$$

where $c_1 = \frac{1}{4c}$ and $c_2 = \frac{D}{2} \log(4\pi c)$ are both constants.
A.3. Representation Fusion
We first prove Equation (5) in the main paper, assuming all the observations $x_1, x_2, \ldots, x_{n+1}$ are conditionally independent given the latent code $z$. The posterior distribution is:

$$p(z|x_1, \ldots, x_{n+1}) = \frac{p(x_1, \ldots, x_{n+1}|z)\, p(z)}{p(x_1, \ldots, x_{n+1})} = \frac{p(x_1, \ldots, x_n|z)\, p(x_{n+1}|z)\, p(z)}{p(x_1, \ldots, x_{n+1})} = \alpha\, \frac{p(x_1, \ldots, x_n|z)\, p(x_{n+1}|z)\, p(z)}{p(x_1, \ldots, x_n)\, p(x_{n+1})} = \alpha\, \frac{p(x_1, \ldots, x_n, z)\, p(x_{n+1}, z)}{p(x_1, \ldots, x_n)\, p(x_{n+1})\, p(z)} = \alpha\, \frac{p(z|x_{n+1})}{p(z)}\, p(z|x_1, \ldots, x_n), \tag{13}$$

where $\alpha$ is a normalization constant. In this case, $\alpha = \frac{p(x_1, \ldots, x_n)\, p(x_{n+1})}{p(x_1, \ldots, x_{n+1})}$.
Without loss of generality, let us consider a one-dimensional case in the following. The solution can be easily extended to the multivariate case since all feature dimensions are assumed to be independent. It can be shown by induction that the posterior distribution in Equation (13) is a Gaussian distribution. Let us assume $p(z|x_1, \ldots, x_n)$ is a Gaussian distribution with $\hat{\mu}_n$ and $\hat{\sigma}_n^2$ as its mean and variance, respectively. Note that the initial case $p(z|x_1)$ is guaranteed to be a Gaussian distribution. Let $\mu_0$ and $\sigma_0^2$ denote the parameters of the noninformative prior of $z$. Then, taking the log on both sides of Equation (13), we have:

$$\begin{aligned} \log p(z|x_1, \ldots, x_{n+1}) &= \log p(z|x_{n+1}) + \log p(z|x_1, \ldots, x_n) - \log p(z) + \mathrm{const} \\ &= -\frac{(z - \mu_{n+1})^2}{2\sigma_{n+1}^2} - \frac{(z - \hat{\mu}_n)^2}{2\hat{\sigma}_n^2} + \frac{(z - \mu_0)^2}{2\sigma_0^2} + \mathrm{const} \\ &= -\frac{(z - \hat{\mu}_{n+1})^2}{2\hat{\sigma}_{n+1}^2} + \mathrm{const}, \end{aligned} \tag{14}$$

where "const" refers to the terms irrelevant to $z$ and
$$\hat{\mu}_{n+1} = \hat{\sigma}_{n+1}^2 \left( \frac{\mu_{n+1}}{\sigma_{n+1}^2} + \frac{\hat{\mu}_n}{\hat{\sigma}_n^2} - \frac{\mu_0}{\sigma_0^2} \right), \tag{15}$$

$$\frac{1}{\hat{\sigma}_{n+1}^2} = \frac{1}{\sigma_{n+1}^2} + \frac{1}{\hat{\sigma}_n^2} - \frac{1}{\sigma_0^2}. \tag{16}$$
Considering $\sigma_0 \to \infty$, we have

$$\hat{\mu}_{n+1} = \frac{\hat{\sigma}_n^2\, \mu_{n+1} + \sigma_{n+1}^2\, \hat{\mu}_n}{\sigma_{n+1}^2 + \hat{\sigma}_n^2}, \tag{17}$$

$$\hat{\sigma}_{n+1}^2 = \frac{\sigma_{n+1}^2\, \hat{\sigma}_n^2}{\sigma_{n+1}^2 + \hat{\sigma}_n^2}. \tag{18}$$
This means the posterior distribution is a new Gaussian distribution with a smaller variance. Further, we can directly give the solution for fusing $n$ samples:

$$\log p(z|x_1, \ldots, x_n) = \log\left[ \alpha\, p(z|x_1) \prod_{i=2}^{n} \frac{p(z|x_i)}{p(z)} \right] = -(n-1) \log p(z) + \sum_{i=1}^{n} \log p(z|x_i) + \mathrm{const} = (n-1)\, \frac{(z - \mu_0)^2}{2\sigma_0^2} - \sum_{i=1}^{n} \frac{(z - \mu_i)^2}{2\sigma_i^2} + \mathrm{const} = -\frac{(z - \hat{\mu}_n)^2}{2\hat{\sigma}_n^2} + \mathrm{const}, \tag{19}$$

where $\alpha = \frac{\prod_{i=1}^{n} p(x_i)}{p(x_1, \ldots, x_n)}$ and

$$\hat{\mu}_n = \sum_{i=1}^{n} \frac{\hat{\sigma}_n^2}{\sigma_i^2}\, \mu_i - (n-1)\, \frac{\hat{\sigma}_n^2}{\sigma_0^2}\, \mu_0, \tag{20}$$

$$\frac{1}{\hat{\sigma}_n^2} = \sum_{i=1}^{n} \frac{1}{\sigma_i^2} - (n-1)\, \frac{1}{\sigma_0^2}. \tag{21}$$
Considering $\sigma_0 \to \infty$, we have

$$\hat{\mu}_n = \sum_{i=1}^{n} \frac{\hat{\sigma}_n^2}{\sigma_i^2}\, \mu_i, \tag{22}$$

$$\hat{\sigma}_n^2 = \frac{1}{\sum_{i=1}^{n} \frac{1}{\sigma_i^2}}. \tag{23}$$
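As a sanity check, the following NumPy snippet verifies that sequentially applying Equations (17) and (18) matches the closed-form batch fusion of Equations (22) and (23) in the one-dimensional case; the toy numbers are arbitrary.

```python
import numpy as np

# Sequential two-term updates (Equations 17-18) vs. closed-form batch
# fusion (Equations 22-23), one-dimensional case with a flat prior.
mus = np.array([0.3, -0.5, 1.2])
vars_ = np.array([0.4, 0.9, 0.2])

mu_hat, var_hat = mus[0], vars_[0]
for mu, var in zip(mus[1:], vars_[1:]):
    mu_hat, var_hat = ((var_hat * mu + var * mu_hat) / (var + var_hat),
                       var * var_hat / (var + var_hat))

var_batch = 1.0 / np.sum(1.0 / vars_)       # Equation (23)
mu_batch = var_batch * np.sum(mus / vars_)  # Equation (22)
assert np.isclose(mu_hat, mu_batch) and np.isclose(var_hat, var_batch)
```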
B. Implementation Details
All the models in the paper are implemented using TensorFlow r1.9. Two and four GeForce GTX 1080 Ti GPUs are used for training the base models on CASIA-WebFace [44] and MS-Celeb-1M [8], respectively. The uncertainty modules are trained using one GPU.

B.1. Data Preprocessing
All face images are first passed through the MTCNN face detector [37] to detect 5 facial landmarks (two eyes, nose and two mouth corners). Then, a similarity transformation is used to normalize the face images based on the five landmarks. After transformation, the images are resized to 112×96. Before being passed into the networks, each pixel in the RGB image is normalized by subtracting 127.5 and dividing by 128.
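A one-line sketch of this pixel normalization step (the landmark-based alignment itself depends on the MTCNN outputs and is omitted):

```python
import numpy as np

def normalize_pixels(image_uint8):
    """Map RGB pixel values from [0, 255] to roughly [-1, 1]."""
    return (image_uint8.astype(np.float32) - 127.5) / 128.0
```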
B.2. Base Models
The base models for CASIA-WebFace [44] are trained for 28,000 steps using an SGD optimizer with a momentum of 0.9. The learning rate starts at 0.1 and is decreased to 0.01 and 0.001 after 16,000 and 24,000 steps, respectively. For the base model trained on MS-Celeb-1M [8], we train the network for 140,000 steps using the same optimizer settings. The learning rate starts at 0.1 and is decreased to 0.01 and 0.001 after 80,000 and 120,000 steps, respectively. The batch size, feature dimension and weight decay are set to 256, 512 and 0.0005, respectively, in both cases.
B.3. Uncertainty Module
Architecture. The uncertainty modules for all models are two-layer perceptrons with the same architecture: FC-BN-ReLU-FC-BN-exp, where FC refers to fully-connected layers, BN refers to batch normalization layers [12], and the exp function ensures that the outputs σ² are all positive values [16]. The first FC layer shares the same input as the bottleneck layer, i.e., the output feature map of the last convolutional layer. The outputs of both FC layers are D-dimensional vectors, where D is the dimensionality of the latent space. In addition, we constrain the last BN layer to share the same γ and β across all dimensions, which we found helps stabilize the training. A sketch of this module is shown below.
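The following is an inference-time NumPy sketch of this module; the parameter names are our own, and the batch-normalization statistics are assumed to be the learned running values.

```python
import numpy as np

def uncertainty_module(feat, p):
    """Inference-time sketch of the FC-BN-ReLU-FC-BN-exp module.

    feat: input shared with the bottleneck layer; p: dict of learned
    weights (our naming). The last BN uses scalar gamma/beta shared
    across all dimensions, as described above.
    """
    def bn(x, mean, var, gamma, beta, eps=1e-5):
        return gamma * (x - mean) / np.sqrt(var + eps) + beta

    h = p["W1"] @ feat + p["b1"]
    h = np.maximum(bn(h, p["m1"], p["v1"], p["g1"], p["beta1"]), 0.0)  # BN + ReLU
    h = p["W2"] @ h + p["b2"]
    h = bn(h, p["m2"], p["v2"], p["g2"], p["beta2"])  # scalar gamma/beta
    return np.exp(h)  # per-dimension sigma^2, guaranteed positive
```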
Training. For the models trained on CASIA-WebFace [44], we train the uncertainty module for
Base Model                  Representation  LFW    YTF    CFP-FP  IJB-A
Softmax + Center Loss [36]  Original        97.70  92.56  91.13   63.93
                            PFE             97.89  93.10  91.36   64.33
Triplet [28]                Original        96.98  90.72  85.69   54.47
                            PFE             97.10  91.22  85.10   51.35
A-Softmax [21]              Original        97.12  92.38  89.31   54.48
                            PFE             97.92  91.78  89.96   58.09
AM-Softmax [34]             Original        98.32  93.50  90.24   71.28
                            PFE             98.63  94.00  90.50   75.92

Table 6: Results of CASIA-Net models trained on CASIA-WebFace. "Original" refers to the deterministic embeddings. The better performance for each base model is shown in bold. "PFE" uses the mutual likelihood score for matching. IJB-A results are verification rates at FAR = 0.1%.
Base Model                  Representation  LFW    YTF    CFP-FP  IJB-A
Softmax + Center Loss [36]  Original        97.77  92.34  90.96   60.42
                            PFE             98.28  93.24  92.29   62.41
Triplet [28]                Original        97.48  92.46  90.01   52.34
                            PFE             98.15  93.62  90.54   56.81
A-Softmax [21]              Original        98.07  92.72  89.34   63.21
                            PFE             98.47  93.44  90.54   71.96
AM-Softmax [34]             Original        98.68  93.78  90.59   76.50
                            PFE             98.95  94.34  91.26   80.00

Table 7: Results of Light-CNN models trained on CASIA-WebFace. "Original" refers to the deterministic embeddings. The better performance for each base model is shown in bold. "PFE" uses the mutual likelihood score for matching. IJB-A results are verification rates at FAR = 0.1%.
3,000 steps using an SGD optimizer with a momentum of 0.9. The learning rate starts at 0.001 and is decreased to 0.0001 after 2,000 steps. For the model trained on MS-Celeb-1M [8], we train the uncertainty module for 12,000 steps. The learning rate starts at 0.001 and is decreased to 0.0001 after 8,000 steps. The batch size in both cases is 256. For each mini-batch, we randomly select 4 images from each of 64 different subjects to compose the positive pairs (384 pairs in all). The weight decay is set to 0.0005 in all cases. A subset of the training data was separated as a validation set for choosing these hyper-parameters during the development phase.
Inference Speed. Feature extraction (passing through the whole network) using one GPU takes 1.5 ms per image. Note that given the small size of the uncertainty module, it has little impact on the feature extraction time. Matching a pair of images using cosine similarity and mutual likelihood score takes 4 ns and 15 ns, respectively. Both are negligible in comparison with the feature extraction time.
C. Results on Different Architectures
Throughout the main paper, we conducted the experiments using a 64-layer CNN [21]. Here, we evaluate the proposed method on two different network architectures for face recognition: CASIA-Net [44] and the 29-layer Light-CNN [39]. Notice that both networks require image shapes different from our preprocessed ones. Thus we pad our images with zero values and resize them to the target size. Since the main purpose of this experiment is to evaluate the efficacy of the uncertainty module rather than to compare with the original results of these networks, the difference in preprocessing should not affect a fair comparison. Besides, the original CASIA-Net does not converge for A-Softmax and AM-Softmax, so we add a bottleneck layer to output the embedding representation after the average pooling layer. We then conduct the experiments by comparing probabilistic embeddings with the base deterministic embeddings, similar to Section 5.1 in the main paper. The results are shown in Tables 6 and 7. Without tuning the architecture of the uncertainty module or the hyper-parameters, PFE still improves the performance in most cases.