Latent Zoning Network:
A Unified Principle for Generative Modeling,
Representation Learning, and Classification
Zinan Lin

Microsoft Research
Redmond, WA, USA
Enshu Liu
Tsinghua University
Beijing, China
Xuefei Ning
Tsinghua University
Beijing, China
Junyi Zhu

Samsung R&D Institute UK
London, UK
Wenyu Wang
Redmond, WA, USA
Sergey Yekhanin
Microsoft Research
Redmond, WA, USA
Abstract
Generative modeling, representation learning, and classification are three core
problems in machine learning (ML), yet their state-of-the-art (SoTA) solutions
remain largely disjoint. In this paper, we ask: is there a single principle that can unify all three? Such a principle could foster synergy across tasks. We introduce Latent Zoning Network (LZN) as a step toward
this goal. At its core, LZN creates a shared Gaussian latent space that encodes
information across all tasks. Each data type (e.g., images, text, labels) is equipped
with an encoder that maps samples to latents, and a decoder that maps
latents back to data. ML tasks are expressed as compositions of these encoders and
decoders: for example, label-conditional image generation uses a label encoder
and image decoder; image embedding uses an image encoder; classification uses
an image encoder and label decoder. We demonstrate the promise of LZN in
three increasingly complex scenarios: (1) LZN can enhance existing models
(image generation): LZN
improves FID on CIFAR10 from 2.76 to 2.59—without modifying the training
objective. (2) LZN can solve tasks independently (representation learning):
LZN can implement unsupervised representation learning without auxiliary loss
functions, outperforming the seminal MoCo and SimCLR methods by 9.3% and
0.2%, respectively, on downstream linear classification on ImageNet. (3) LZN can
solve multiple tasks simultaneously (joint generation and classification): with
image and label encoders/decoders, LZN performs both tasks jointly by design,
improving FID and achieving SoTA classification accuracy on CIFAR10. The
code and trained models are available at https://github.com/microsoft/latent-zoning-networks. The project website is at https://zinanlin.me/blogs/latent_zoning_networks.html.
1 Introduction
Generative modeling, representation learning, and classification are three of the most widely used
machine learning (ML) tasks. Generative models like DALLE [70,69,5] and GPT [67,68,7,1] power
applications such as question answering and content creation. Representation learning, exemplified
by CLIP [66], supports tasks like information retrieval. Classification is central to tasks such as object
recognition [17] and sentiment analysis [19].
Notably, the state-of-the-art (SoTA) techniques for these tasks differ. For example, SoTA generative
modeling relies on diffusion models [32,75,77] and auto-regressive transformers [67,68,7,1]; SoTA

Correspondence to: Zinan Lin (zinanlin@microsoft.com).

Junyi Zhu conducted this collaboration while at KU Leuven.
39th Conference on Neural Information Processing Systems (NeurIPS 2025).
arXiv:2509.15591v1 [cs.LG] 19 Sep 2025

Figure 1: Latent Zoning Network (LZN) connects multiple encoders and decoders through a shared latent space, enabling a wide range of ML tasks via different encoder-decoder combinations or standalone encoders/decoders. The figure illustrates eight example tasks: (1) unconditional image generation; (2) image embedding; (3) label-conditioned image generation; (4) image classification; (5) text embedding; (6) image-to-text generation (e.g., image captioning); (7) text-to-image generation; (8) text-to-text generation (e.g., chatbot). More could be supported. Only tasks 1-4 are evaluated in this paper, while the rest are for illustration.
Figure 2: The latent space of LZN has two key properties: (1) it follows a simple Gaussian prior, allowing easy sampling for generation tasks; and (2) it serves as a shared representation across all data types. Each data type induces a distinct partitioning of the latent space into latent zones, where each zone corresponds to a specific sample (e.g., an individual image or label). The latent space is shown as a closed circle for illustration, but it is unbounded in practice.
Figure 3: Training and inference in LZN rely on two atomic operations. (1) Latent computation (§ 2.2.1): computes latent zones for a data type by encoding samples into anchor points and applying flow matching (FM) [53,51] to partition the latent space. Conversely, any latent point can be mapped to a sample via the decoder (not shown). (2) Latent alignment (§ 2.2.2): aligns latent zones across data types by matching their FM processes. This figure also illustrates the approach for LZN in joint conditional generative modeling and classification (§ 5).
representation learning employs contrastive loss [28,10,26]; and SoTA classification uses dedicated
models trained with cross-entropy loss and its variants [46]. Although using distinct methods for these
tasks has long been established and widely accepted in the community, we revisit this methodology
from first principles and question its necessity. Specifically, we ask, out of curiosity:
Can a single principle unify generative modeling, representation learning, and classification?
Part of our motivation stems from Occam’s Razor [82], which favors simpler solutions when possible.
More importantly, while these tasks differ in formulation, they are fundamentally related and can
benefit from one another; a unified principle could facilitate such synergy.³
In this paper, we reflect on the strengths and limitations of existing techniques and propose a new
unified framework, Latent Zoning Network (LZN), illustrated in Fig. 1. At the core of our
design is a shared latent space: each data type is equipped with an encoder that maps a sample to a specific zone of the
latent space, and the corresponding decoder maps the latent back to the data. Different tasks can be interpreted as
performing “translations” within the latent space—either using encoder–decoder pairs or leveraging
a single encoder or decoder. Compared to popular representation learning methods, which place
no constraint on the latent distribution [10], LZN’s latent space is generative: it follows a simple prior
distribution for easy sampling. In contrast to modern conditional generative models, where
conditions (e.g., class labels, text) are treated as separate inputs [86], LZN maintains a unified latent
space that unifies different types of conditional information. Finally, unlike standard classifiers,
where class labels are model outputs, LZN treats “class labels” as a data type connected through the
latent space like any other data type.
³While auto-regressive (AR) transformers with large-scale pre-training provide one approach to unify these
tasks [67,68,7,1], SoTA transformer-based representation learning still relies on contrastive learning [45,88].
More importantly, our approach can be viewed as an orthogonal layer on top of transformers and should be seen
as complementary rather than competing. See § 2.4.
To train and perform inference with LZN, we rely on two atomic operations (Fig. 3). (1) Latent
computation: we first use the encoder to compute each sample’s anchor point, then apply flow matching [53,51]
to map these points to their corresponding latent zones. This procedure ensures that the resulting
zones collectively follow a simple Gaussian distribution, facilitating generation tasks, while also
guaranteeing that zones from different samples remain disjoint—allowing them to serve as unique
latents for classification and representation learning. (2) Latent alignment: aligns the latent zones
produced by two different encoders to facilitate the tasks that require translations between encoders
and decoders from different data types. This is a fundamentally challenging task due to its discrete
nature. To address it, we introduce a novel “soft” approximation that performs the alignment midway
through the flow matching process, enabling effective and tractable training.
We demonstrate that, despite its simplicity and reliance on just two atomic operations, LZN is capable
of supporting a wide range of seemingly diverse tasks. To illustrate its versatility and practical utility,
we present three case studies of increasing complexity:

• L1: Enhancing existing tasks (§ 3). Since LZN latents can be computed without supervision, they can be seamlessly integrated into existing models as an additional conditioning
signal—without requiring any changes to the training loss or methods. In this setup, LZN latents
can learn features that improve task performance. To demonstrate this, we incorporate LZN into
rectified flow models [53]—a state-of-the-art generative approach for images—and observe improved
sample quality across the CIFAR10, AFHQ-Cat, CelebA-HQ, and LSUN-Bedroom datasets. Specifically,
on CIFAR10, LZN closes the FID gap between conditional and unconditional generation by 59%.
• L2: Solving tasks independently (§ 4). LZN can also tackle tasks entirely on its own,
without relying on existing methods. As a case study, we use LZN to implement unsupervised
representation learning, a task traditionally addressed with contrastive loss. We find that LZN can
even outperform seminal methods such as MoCo [28] and SimCLR [10] by 9.3% and 0.2%,
respectively, on downstream linear classification on ImageNet.

• L3: Solving multiple tasks simultaneously (§ 5). LZN is capable of handling
multiple tasks at once. We equip it with two encoder-decoder pairs—one for images and
one for labels—enabling LZN to jointly support class-conditional generation and classification
within a single, unified framework. Built on rectified flow, this implementation outperforms the
baseline conditional rectified flow model in generation quality, while also achieving state-of-the-art
classification accuracy. Notably, the performance on both generation and classification exceeds
that of training each task in isolation. This supports our core motivating intuition: seemingly
distinct tasks can benefit from shared representations, and LZN provides a principled framework
for enabling such synergy.
While the early results are promising, many challenges, open questions, and exciting opportunities
remain unexplored. In principle, as more encoder–decoder pairs are added to the shared latent space,
the range of applications LZN can support should grow at least quadratically (Fig. 1). Whether LZN
can scale gracefully and realize this potential remains to be seen. We hope this work opens a new line
of research toward this ambitious vision. See more discussions in § 6.
2 Latent Zoning Network (LZN)
2.1 Overall Framework
Revisiting existing approaches. To motivate our design for unifying the three
tasks, we first analyze the strengths and limitations of existing approaches.

• Generative modeling. Given samples x from an unknown distribution p, generative models
aim to learn a decoder D_x such that D_x(z) approximates p, where z is random noise drawn
from a simple distribution.⁴ While z can carry useful information for representation learning
and classification [50], this pipeline has key limitations: (1) The mapping from z to x lacks
flexibility. For example, in diffusion models, the optimal mapping is fixed once the distributions
of x and z are fixed [77]. To introduce controllability, models often augment D with additional
condition inputs c_1, . . . , c_k: G(z, c_1, . . . , c_k) [86]. This is suboptimal–conditions may overlap
or conflict (e.g., text vs. label conditions in image generation), and the resulting representation
(z, c_1, . . . , c_k) becomes fragmented. (2) Inverting D to recover z from a sample x is non-trivial
for some SoTA generative models [36]. These issues limit the effectiveness of generative models
for representation learning, as also observed in prior work [24].

• Unsupervised representation learning.⁵ SoTA representation learning typically uses contrastive
loss [10], where an encoder E maps related sample pairs–either from the same modality (e.g.,
image augmentations) or across modalities (e.g., image–text pairs)–to similar embeddings, while
pushing unrelated samples apart. These embeddings can perform zero-shot classification by
comparing pairwise cosine similarities [66] or be adapted for classification using a linear head
trained with cross-entropy loss [10]. However, contrastive loss leads E to discard important
details (e.g., augmentations, modalities), making the representations unsuitable for standalone
generation. Moreover, with few exceptions [2], the representations lack distributional constraints,
making them hard to sample from for generative tasks unless training an additional generative
model [44].

• Classification. Classifiers are typically dedicated models trained
with cross-entropy loss to map inputs to class labels [19,54]. Intermediate layer outputs can be
used as representations [79]. However, because the objective focuses solely on classification,
these representations tend to discard class-irrelevant information, limiting their utility for general-
purpose representation or generation. As with contrastive learning, they also lack distributional
constraints for generative tasks.
While one could combine the above objectives and methods into a single training setup [60,43], our
focus is on designing a clean, unified framework that naturally integrates all these tasks.
Desiderata. All of the above tasks involve mappings between
data and a latent space. The main differences lie in: the mapping direction (e.g., latent-to-data for
generation, data-to-latent for representation/classification), constraints on the latent space (e.g., a
simple prior for generative models, none for others), and the amount of information encoded (e.g.,
class labels for classification tasks, detailed reconstructions for generative models). To support all
these tasks in a single framework, we seek: (1) a unified latent space that encodes the
information of all tasks; (2) a generative latent space that follows a simple prior for easy sampling; and (3)
easy mappings between data and latents in both directions.
Framework. Guided by these desiderata, our framework is designed as follows (Fig. 2):

• A unified latent space. The same data type can play different roles across tasks, e.g., text
appearing as input latent in text-to-image generation or as output in text generation. This makes it
hard to define a unified latent space across tasks.
We address this by introducing a hypothetical foundation latent space underlying all
samples in the world. Each foundation latent is an abstract entity that appears through observations
in different data types, such as images, text, and even class labels (e.g., “cat” or “dog”).
Importantly, different latents can share the same observation (e.g., multiple cat images all labeled
“cat” and described as “a cat image”). As a result, each observed sample defines a latent zone—a
subset of the latent space that produces the same observation in that data type. This provides a
unified way to represent and connect all data types within the same latent space.

• A generative latent space. The latent space follows a simple Gaussian prior, which makes
sampling from it easy for generation tasks.

• Easy mappings. Each data type has an encoder that maps a sample to its latent zone for representation learning and related tasks. Conversely, a latent point can be decoded into a data type using its decoder.
⁴For diffusion models [32,75,77], z is the initial Gaussian noise in the sampling process, plus intermediate
noise if using SDE sampling [78]. For AR transformers, z corresponds to the randomness used when sampling tokens.
⁵We use “latent”, “representation”, and “embedding” interchangeably in the paper.

Tasks. Different ML tasks are supported by composing these encoders and decoders (Fig. 1):


• Single-module tasks. Some tasks require only a single encoder or decoder.
For instance, the image encoder alone produces image embeddings (representations), while the
image decoder alone enables unconditional image generation.

• Cross-module tasks. Other tasks combine an encoder from one data type with a decoder from another. For example, label encoder
+ image decoder enables class-conditional image generation, and image encoder + label
decoder enables image classification.
We expect that tasks can benefit from each other through this unified framework (validated in § 5).
Each task contributes its core information to the latent space, making it increasingly expressive and
powerful. Conversely, since all tasks interface through the same latent space, improvements in the
latent representations can facilitate learning across tasks.
As the latent space partitions into zones, we name this framework Latent Zoning Network (LZN).
2.2 Implementation of Atomic Operations
Training and inference in LZN rely on two atomic operations (Fig. 3): latent computation and latent alignment.
We will see in the following sections how these two operations support a range of tasks.
2.2.1 Latent Computation
Desiderata. Latent computation is the first atomic operation in LZN. Given samples
X = {x_1, . . . , x_n} of the same data type (e.g., images, text, labels), the goal is to sample their latents
z_1, . . . , z_n = C^X with a mapping C such that: (1) the prior is Gaussian: z ∈ {z_1, . . . , z_n} follows the Gaussian distribution N(0, I); and
(2) the latent zones of different samples are disjoint: Supp(C^X_i) ∩ Supp(C^X_j) = ∅ for i ≠ j.
Approach. We use an encoder E_x to map each sample to an
anchor point a_i = E_x(x_i). We then apply the seminal flow matching (FM) method [53,51], which
establishes a one-to-one mapping between distributions, to transform these anchor points into latent
zones. Specifically, we define a family of distributions π_t with endpoints π_0 = N(0, I), the desired
prior latent distribution, and π_1(s) = (1/n) Σ_{i=1}^n δ(s − a_i), the distribution of the anchor points.⁶
The intermediate distribution π_t is induced by linearly interpolating between samples s_0 ∼ π_0 and
s_1 ∼ π_1 via φ(s_0, s_1, t) = (1 − t) s_0 + t s_1 (i.e., π_t is the distribution of (1 − t) s_0 + t s_1). The velocity field in FM
[53] can then be computed as

V(s, t) = E_{s_0∼π_0, s_1∼π_1}[ ∂φ(s_0, s_1, t)/∂t | φ(s_0, s_1, t) = s ]
        = ( Σ_{i=1}^n ((a_i − s)/(1 − t)) exp(−‖s − t a_i‖² / (2(1 − t)²)) ) / ( Σ_{i=1}^n exp(−‖s − t a_i‖² / (2(1 − t)²)) ).   (1)
We can obtain s_t by integrating along the FM trajectory: s_t = FM_x(s_0; t) ≜ s_0 + ∫_0^t V(s_τ, τ) dτ
for s_0 ∼ π_0. It has been shown [53] that the distribution of s_t is π_t. Similarly, integrating backward,
s_t = IFM_x(s_{1−g}; t) ≜ s_{1−g} + ∫_{1−g}^t V(s_τ, τ) dτ for s_{1−g} ∼ π_{1−g} (i.e., s_{1−g} = (1 − g)a + gϵ with a drawn uniformly from the anchor points and ϵ ∼ N(0, I))
also yields π_t, where g is a small constant.⁷ With a slight abuse of notation, we also write s_t =
IFM_x(a_i, ϵ_i; t) to represent IFM_x(s_{1−g}; t) with s_{1−g} = (1 − g) a_i + g ϵ_i. With these setups, we
define the latent computation as

z_i = C^X_i ≜ IFM_x(a_i, ϵ_i; 0),  where ϵ_i ∼ N(0, I).

Due to the discussed FM properties, this satisfies the two desiderata by construction (§ A.1).
Implementation. Computing C involves an integral, which we approximate using standard numerical
solvers such as Euler or DPM-Solver [55,56] that operate on a finite list of time steps t_1, . . . , t_r,
as in prior work on diffusion and RF models [53,78]. A key property of our approach is that all
operations are differentiable, allowing gradients to backpropagate all the way from the latent z_i to the
encoder E_x during training. More details are deferred to § A.2.
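To make this concrete, below is a minimal PyTorch sketch of latent computation for one minibatch with a plain Euler solver, following Eq. (1); the function names (velocity, compute_lzn_latents), the step count, and the value of g are illustrative choices, not the paper's actual implementation.

import torch

def velocity(s, t, anchors):
    # Marginal FM velocity V(s, t) induced by pi_0 = N(0, I) and the uniform
    # mixture of Dirac deltas at the anchor points (Eq. (1)).
    # s: (B, q), anchors: (n, q), t: scalar in [0, 1).
    diff = s.unsqueeze(1) - t * anchors.unsqueeze(0)                     # (B, n, q)
    w = torch.softmax(-diff.pow(2).sum(-1) / (2 * (1 - t) ** 2), dim=1)  # soft anchor weights
    cond_v = (anchors.unsqueeze(0) - s.unsqueeze(1)) / (1 - t)           # per-anchor conditional velocity
    return (w.unsqueeze(-1) * cond_v).sum(dim=1)                         # (B, q)

def compute_lzn_latents(x, encoder, g=1e-3, n_steps=32):
    # z_i = IFM_x(a_i, eps_i; 0): integrate the FM ODE backward from t = 1 - g to t = 0.
    a = encoder(x)                      # anchor points a_i = E_x(x_i), shape (B, q)
    eps = torch.randn_like(a)
    s = (1 - g) * a + g * eps           # s_{1-g}
    ts = torch.linspace(1 - g, 0.0, n_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        s = s + (t_next - t_cur) * velocity(s, t_cur.item(), a)  # Euler step, fully differentiable
    return s                            # latents z_i; gradients flow back into the encoder

Because every step is a plain differentiable tensor operation, the encoder receives gradients through the whole trajectory, as described above.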
Efficiency optimization. Computing V requires access to all samples, making the
memory and computation cost of C high. To address this, we introduce several optimization
techniques–latent parallelism, custom gradient checkpointing, and minibatch approximation–that
make the training of LZN scalable (§ A.3).
⁶δ denotes the Dirac delta distribution.
⁷FM is well-defined only when both π_0 and π_1 have full support. However, in our case, π_1 is a mixture of
Dirac deltas and lacks full support. Therefore, we use the full-support distribution π_{1−g} as the starting point.
2.2.2 Latent Alignment
Desiderata. As described in § 2.2.1, latent zones from different data types are computed independently,
which undermines the purpose of a shared latent space. Many applications require these zones to
be aligned. We consider two types of alignment: (1) Many-to-one alignment:
for example, the latent zone of the “cat” label should cover the latent zones of all cat images.
(2) One-to-one alignment: for example, two paired samples (such as two augmentations of the same image) should
share the same latent zone. Concrete examples will be shown in §§ 4 and 5.
Formally, let X = {x_1, . . . , x_n} and Y = {y_1, . . . , y_m} be two datasets from different data types.
The pairing is defined by k_i, where y_i (e.g., a cat image) is paired with x_{k_i} (e.g., the “cat” label).
We aim to ensure Supp(C^X_{k_i}) ⊇ Supp(C^Y_i) for all i ∈ [m], meaning the latent zone of
x_{k_i} covers that of y_i. This formulation supports many-to-one alignments directly. For one-to-one
alignment, a symmetric constraint can be added with the roles of X and Y swapped.
Approach. We align the zones by requiring that the latent zone of y_i,
when mapped via the FM trajectory, matches the anchor point of x_{k_i}: FM_x(C^Y_i; 1) = E_x(x_{k_i}).
Challenge: discrete assignment is non-differentiable. Before introducing our solution, we illustrate
why the problem is nontrivial by examining strawman approaches. A natural idea is to directly
minimize the distance d(FM_x(C^Y_i; 1), E_x(x_{k_i})), where d(·, ·) is a distance metric. This
approach fails because FM deterministically maps each latent to exactly one anchor point E_x(x_j), so
the above objective effectively becomes minimizing the distance between anchor points. However,
a latent zone is influenced by all anchor points, not just its own. Therefore, reducing the distance
between a pair of anchors does not necessarily improve zone-level alignment. More fundamentally,
the core challenge is that FM induces a discrete assignment of each latent to an
anchor. This discrete operation is non-differentiable and cannot be directly optimized during training.
Technique 1: Soft approximation of alignment. To address this issue, our key idea is to introduce a
soft approximation of the discrete anchor assignment process. Let us define s^i_t = FM_x(C^Y_i; t)
and a_l = E_x(x_l). By construction, the distribution π_t is a mixture of Gaussians: π_t = (1/n) Σ_{l=1}^n N(t a_l, (1 − t)² I),
where the l-th component corresponds to anchor a_l. We define the
(soft) probability that s^i_t is assigned to a_l as being proportional to the density of s^i_t under the l-th
Gaussian component:

P(a_l | s^i_t) = exp(−‖s^i_t − t a_l‖² / (2(1 − t)²)) / Σ_{j=1}^n exp(−‖s^i_t − t a_j‖² / (2(1 − t)²)).

This formulation provides a smooth, differentiable approximation of the otherwise discrete assignment.
When t = 0, the approximation is fully smooth, with P(a_l | s^i_0) = 1/n for all i, l, reflecting a uniform
assignment. As t increases toward 1, the assignment becomes sharper. In the limit as t → 1, it
converges to the true discrete assignment, where s^i_t deterministically maps to its assigned anchor.
From this, a straightforward idea is to maximize the assignment probability over all time steps, such
as Σ_{t∈{t_1,...,t_r}} P(a_{k_i} | s^i_t) (recall that the t_i are solver time steps; see § A.2). However, our ultimate
goal is only to ensure correct assignment at t = 1. Even if this is achieved, the above objective would
continue to push the intermediate states s_t toward a_{k_i}, which is unnecessary and potentially harmful.
Technique 2: Optimizing the maximum assignment probability. To avoid this, we propose to maximize
max_{t∈{t_1,...,t_r}} P(a_{k_i} | s^i_t). This ensures that once the trajectory reaches the correct anchor near
t = 1, the objective is maximized (i.e., equals 1) and no further gradient is applied, as desired.
Technique 3: Early step cutoff. However, this approach introduces a new issue: if s_t diverges from
a_{k_i} early on, the maximum probability remains at the constant 1/n (attained at t_1 = 0), yielding
no training signal. To mitigate this, we truncate the set of time steps used in the maximization,
restricting it to the later stages of the trajectory: {t_u, . . . , t_r}, where u is a hyperparameter that
excludes early time steps. Putting it all together, our proposed alignment objective is:

ℓ_Align(X, Y) = Σ_{i=1}^m max_{t∈{t_u,...,t_r}} P(a_{k_i} | s^i_t).
Please see App. B for more details.
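As a concrete illustration of Techniques 1–3, here is a minimal PyTorch sketch of the soft assignment probability and the resulting alignment objective (in the logarithmic form of App. B.1), assuming the forward FM states s^i_t of the y-samples have been recorded at the solver time steps; the names traj, times, and u are illustrative, not the paper's code.

import torch

def soft_assignment_logprob(s_t, anchors, t):
    # log P(a_l | s_t): softmax over the Gaussian log-densities of the
    # mixture components N(t * a_l, (1 - t)^2 I); requires t < 1.
    diff = s_t.unsqueeze(1) - t * anchors.unsqueeze(0)           # (m, n, q)
    logits = -diff.pow(2).sum(-1) / (2 * (1 - t) ** 2)           # (m, n)
    return torch.log_softmax(logits, dim=1)

def alignment_loss(traj, times, anchors, k, u=0):
    # traj: list of (m, q) states s^i_t along the forward FM trajectory of the y-samples;
    # times: matching list of scalar time steps (all < 1);
    # anchors: (n, q) anchor points of the x-samples; k: (m,) paired anchor indices.
    per_step = []
    for s_t, t in zip(traj[u:], times[u:]):                      # Technique 3: early-step cutoff
        logp = soft_assignment_logprob(s_t, anchors, t)          # Technique 1: soft assignment
        per_step.append(logp.gather(1, k.view(-1, 1)))           # log P(a_{k_i} | s^i_t)
    best = torch.cat(per_step, dim=1).max(dim=1).values          # Technique 2: max over time steps
    return -best.mean()                                          # minimize the negative objective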

Figure 4: LZN for unconditional generative modeling (§ 3). During training, the LZN latent of each
target image is fed as an extra condition to the rectified flow (RF) model [53], making the RF learn
to generate images conditioned on LZN latents. The objective remains the standard RF loss, and the LZN
encoder is trained end-to-end within it. During generation, we sample LZN latents from a standard
Gaussian and use them as the extra condition. We illustrate the approach with RF, but since LZN
latents require no supervision and are differentiable, the method could apply to other tasks by adding
a condition input for the LZN latents.
2.3 Decoder
The decoder D_x maps the LZN latent back to its corresponding sample: D_x(z_i) = x_i. As we will
describe in more detail later, it can be implemented using either a generative model (§ 3) or FM (§ 5).
2.4 Relationships to Alternatives
A prominent alternative that can also unify different ML tasks is the use of AR transformers with
large-scale pre-training, such as large language models (LLMs) [67,68,7,1]. These models unify
tasks by casting them as generation. Classification, for example, can be framed as answering
the sentence “Which animal is in this image?” [68]. Additionally, prior work has shown that the
intermediate outputs of these models can serve as strong representations [71].
As such, this approach is generation-centric: the core formulation remains a generative modeling
problem. Tasks that can be framed as generation–such as classification–can be unified naturally.
However, for other tasks like representation learning, this approach must rely on surrogate methods,
such as using intermediate model outputs. In contrast, LZN offers a new formulation that seamlessly
unifies all these tasks within a single framework, as we will demonstrate in the following sections.
More importantly, AR transformers (and other generative models) should be seen as orthogonal
and complementary to LZN, rather than as competitors. LZN decoders that map latents
to data samples can be instantiated using any generative model. We demonstrate this in § 3.
This allows LZN to leverage the strengths of existing generative models within its unified framework.
2.5 Scope of Experiments
While the framework is general, this paper focuses specifically on the first four tasks in Fig. 1 through
three case studies: (1) generation (§ 3), (2) unsupervised representation learning (§ 4), and (3) joint
classification and generation (§ 5). These case studies are arranged in order of increasing complexity:
the first enhances a single task, the second solves a task using only LZN without task-specific
objectives, and the third tackles multiple tasks simultaneously.
3 Case Study 1: Unconditional Generative Modeling
Approach (Fig. 4). Since LZN latents C^X can be computed using a standalone encoder without
introducing new training objectives (§ 2.2.1), they can be easily integrated into existing pipelines
as additional network inputs, without modifying the original loss function. We apply this idea
to rectified flow (RF) generative models, where the generator serves as the LZN decoder. We modify
the decoder by adding an extra input to accept the LZN latents. Both the encoder and decoder are
trained jointly using the original generative loss. In this setup, latent alignment (§ 2.2.2) is not used.
Importantly, the encoder is only used during training to compute C^X. During inference, latents
are sampled from the Gaussian prior and passed directly to the decoder, so LZN does not affect the
model’s inference efficiency. See App. C.1 for pseudocode of the training and generation
processes.
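Below is a minimal PyTorch sketch of one RF+LZN training step under this setup. It assumes a latent-computation routine like the compute_lzn_latents sketch in § 2.2.1 and a velocity network rf_net(x_t, t, z) that takes the LZN latent as an extra input; all names are illustrative, and the paper's own pseudocode is in App. C.1.

import torch

def rf_lzn_training_step(x, encoder, rf_net, compute_lzn_latents, optimizer):
    # One training step: the standard RF loss, with the LZN latent of the
    # target image fed to the velocity network as an extra condition.
    z = compute_lzn_latents(x, encoder)              # differentiable, so the encoder trains end-to-end
    eps = torch.randn_like(x)                        # noise endpoint of the RF interpolation
    t = torch.rand(x.shape[0], device=x.device)      # per-sample time steps in [0, 1]
    t_ = t.view(-1, *([1] * (x.dim() - 1)))
    x_t = (1 - t_) * eps + t_ * x                    # linear interpolation between noise and data
    loss = (rf_net(x_t, t, z) - (x - eps)).pow(2).mean()   # standard RF objective, unchanged
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# At sampling time, the extra condition is simply z ~ N(0, I) drawn from the prior,
# passed to rf_net while integrating the RF ODE as usual.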
Why it helps. LZN latents provide unique representations for each image. This pushes the generator
(decoder) to depend more directly on the latent for reconstruction, making the image distribution
conditioned on the latent more deterministic. As a result, the generative objective becomes easier to
optimize. Our experiments later confirm that LZN latents indeed capture useful features for generation.

Table 1: Unconditional image generation quality scores across four datasets. The best results are
in gray boxes. Applying LZN to generative models improves RF on most image quality metrics. RF
is a SoTA method; due to space constraints, we omit additional methods—see [53] for extensive
comparisons between RF and others. Note that Inception Score (IS) is best suited for natural images
like CIFAR10, though we report it for all datasets for completeness.

CIFAR10 (32×32)
Algo.   | FID↓ | sFID↓ | IS↑  | Prec↑ | Rec↑ | CMMD↓  | Recon↓
RF      | 2.76 | 4.05  | 9.51 | 0.70  | 0.59 | –      | –
RF+LZN  | 2.59 | 3.95  | 9.53 | 0.70  | 0.59 | 0.0355 | 0.41

AFHQ-Cat (256×256)
Algo.   | FID↓ | sFID↓ | IS↑  | Prec↑ | Rec↑ | CMMD↓  | Recon↓
RF      | 6.08 | 49.60 | 1.80 | 0.86  | 0.28 | 0.5145 | 17.92
RF+LZN  | 5.68 | 49.32 | 1.96 | 0.87  | 0.30 | 0.3376 | 10.29
CelebA-HQ (256×256) and LSUN-Bedroom (256×256); metrics per dataset: FID↓ sFID↓ IS↑ Prec↑ Rec↑ CMMD↓ Recon↓
RF      | 6.95  0.76  6.25  16.22  0.60
RF+LZN  | 7.17  10.33  2.92  0.76  0.45  0.4901  15.90  5.95  2.22  0.41  0.4843  37.01
Figure 5: Generated images of RF+LZN. More generated images are in App. C.4.
Related work. RCG [44] also aims to improve unconditional image generation with unsupervised
representations. However, RCG separates the process into distinct stages: it first trains a representation
model, then a generator on those representations, followed by a conditional generator conditioned
on the representations. In contrast, our approach requires no separate stages or extra loss terms;
everything is trained end-to-end with the original generative objective.
Results. We incorporate LZN latents into the seminal rectified flow (RF) models [53], which are closely
related to diffusion models [32,75,77]. RF achieved SoTA image generation performance on several
datasets [53,41], and has been used in some of the latest Stable Diffusion models [23] and AR models [52].
We follow the experimental setup of RF [53], evaluating on four datasets: CIFAR10, AFHQ-Cat,
CelebA-HQ, and LSUN-Bedroom. The latter three are high-resolution (256×256). In addition to
standard metrics—FID [31], sFID [58], IS [72], precision [40], recall [40], and CMMD [33]—we
also report the reconstruction error (the ℓ2 distance between an image and its reconstruction), which is
relevant for applications like image editing via latent manipulation [74,50]. Results are shown in
Tab. 1, with three key findings: (1) LZN latents improve image quality. The only
difference in RF+LZN is the inclusion of an additional latent drawn from a Gaussian distribution. This
improvement indicates that the decoder effectively leverages LZN latents for meaningful generative
features. (2) LZN significantly reduces reconstruction error across all datasets, indicating
that its latents capture essential image information. (3) Although unconditional generation is attractive,
its image quality often trails that of conditional generation [44]. Compared to RF’s CIFAR10 results in
Tab. 4, we find that LZN substantially reduces the FID gap between conditional and unconditional
generation by 59%, and even outperforms conditional RF in sFID and reconstruction error.
See App. C for more details on implementation, datasets, and metrics, and additional results such as more generated images
and ablation studies.
4 Case Study 2: Unsupervised Representation Learning
Related work. The dominant approach to unsupervised representation learning is contrastive learning
[10,11,28,12,14,26,13], which pulls similar images (e.g., augmentations of the same image)
together in the representation space. A central challenge is avoiding collapse, where all inputs
are mapped to the same representation. Common solutions include pushing apart the representations of
dissimilar samples from large batches [10] or memory banks [28], or adopting architectural
designs that prevent collapse [26,13,28]. Other unsupervised representation learning approaches
[73,42] include masked image modeling [43,4,27,63] and training on other auxiliary tasks [83].
Approach (Fig. 6). We apply the latent alignment objective ℓ_Align(X, Y)
(§ 2.2.2), where X = {x_i}_{i=1}^n and Y = {y_i}_{i=1}^n with mapping k_i = i contain image pairs (x_i, y_i)
of random augmentations of the same image. Unlike traditional contrastive methods, our approach
inherently avoids collapse: different images are mapped to distinct latent zones by design, eliminating
the need for large memory banks or specialized architectures. Notably, only a single LZN encoder

Figure 6: LZN for unsupervised representation learning (§ 4). During training, each image batch
undergoes two sets of data augmentations, and latent zones for each set are computed with a single shared
encoder. We then apply latent alignment (§ 2.2.2) to train the encoder. At inference, we can use the
LZN latents, the encoder outputs (i.e., anchor points), or intermediate encoder outputs (§ D.4). The
latter two options avoid the costly latent computation process.
Table 2: Classification accuracy on ImageNet by training a linear classifier on
the unsupervised representations. Methods marked with § are based on contrastive learning.⁸
The horizontal line separates baselines that perform worse or better than our LZN.
All methods use the ResNet-50 architecture [29] for fair comparison.⁹

Algorithm        | Top-1 Acc↑ | Top-5 Acc↑
InstDisc§ [84]   | 54.0 [28]  | -
BigBiGAN [21]    | 56.6 [28]  | -
LocalAgg§ [90]   | 58.8 [28]  | -
MoCo§ [28]       | 60.2 [10]  | -
PIRL§ [57]       | 63.6 [10]  | -
CPC v2§ [30]     | 63.8 [10]  | 85.3 [10]
CMC§ [80]        | 66.2 [26]  | 87.0 [26]
SimSiam§ [13]    | 68.1 [13]  | -
SimCLR§ [10]     | 69.3 [10]  | 89.0 [10]
-----------------+------------+-----------
MoCo v2§ [12]    | 71.7 [12]  | -
SimCLR v2§ [11]  | 71.7 [11]  | -
BYOL§ [26]       | 74.3 [26]  | 91.6 [26]
DINO§ [9]        | 75.3 [9]   | -
LZN              | 69.5       | 89.3
Table 3: Classification accuracy on CIFAR10. Baseline results are from [39]. “RF+LZN (no gen)” refers
to “RF+LZN” with the RF loss for generation disabled. The horizontal line separates baselines that perform
worse or better than our RF+LZN. Note that these results refer to training purely on the CIFAR10 dataset
(without pretraining or external data).

Algorithm          | Acc↑
VGG16              | 92.64%
ResNet18           | 93.02%
ResNet50           | 93.62%
ResNet101          | 93.75%
RegNetX_200MF      | 94.24%
RegNetY_400MF      | 94.29%
MobileNetV2        | 94.43%
ResNeXt29(32x4d)   | 94.73%
ResNeXt29(2x64d)   | 94.82%
SimpleDLA          | 94.89%
DenseNet121        | 95.04%
PreActResNet18     | 95.11%
DPN92              | 95.16%
DLA                | 95.47%
RF+LZN (no gen)    | 93.59%
RF+LZN             | 94.47%
for both X and Y is needed, and decoders are not required. See App. D.1 for pseudocode of the training
process.
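As an illustration, the following PyTorch sketch puts the pieces together for one representation-learning step, reusing the illustrative velocity, compute_lzn_latents, and alignment_loss helpers sketched in §§ 2.2.1 and 2.2.2; the augmentation routine, solver times, and cutoff u are placeholders. A symmetric term with the roles of the two views swapped can be added for one-to-one alignment.

import torch

def representation_learning_step(images, augment, encoder, optimizer,
                                 compute_lzn_latents, velocity, alignment_loss,
                                 solver_times, u=8):
    # Two augmented views of the same batch, one shared encoder, and the
    # latent-alignment objective; no contrastive loss and no decoder.
    x_view, y_view = augment(images), augment(images)   # paired views, k_i = i
    a_x = encoder(x_view)                                # anchors of view X (alignment targets)
    z_y = compute_lzn_latents(y_view, encoder)           # latents of view Y (samples from zones C^Y_i)
    traj, s = [], z_y                                    # forward FM trajectory of z_y under X's anchors
    for t_cur, t_next in zip(solver_times[:-1], solver_times[1:]):  # times in [0, 1)
        s = s + (t_next - t_cur) * velocity(s, t_cur, a_x)
        traj.append(s)
    k = torch.arange(len(images), device=a_x.device)     # one-to-one pairing
    loss = alignment_loss(traj, solver_times[1:], a_x, k, u=u)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()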
Results. We follow the standard linear evaluation protocol [10,26,28], where LZN is trained on the unlabelled
ImageNet dataset using ResNet-50 [29] to learn representations. A linear classifier is then trained on
top of these representations in a supervised manner, and its accuracy is evaluated on the ImageNet
test set. The results are shown in Tab. 2. Note that these results are obtained without any pretraining
or use of external data. LZN matches or outperforms several established methods
in the field, including seminal approaches such as MoCo [28] (LZN outperforms it by 9.3%)
and SimCLR [10] (LZN outperforms it by 0.2%). This is remarkable given that LZN is a new
framework capable of supporting not only unsupervised representation learning but also other tasks
(§ 3). However, there remains a significant gap to SoTA performance. We note that the
current results are not fully optimized: (1) Training iterations: LZN continues
to improve rapidly with ongoing training (§ D), so we expect the gap to SoTA to narrow with full
training. (2) Well-established training and architectural improvements from the literature could improve
the results significantly [14]. We leave these directions for future work.
Due to space constraints, please refer to App. D for more details on implementation, datasets, and metrics,
and additional results such as representation visualization and ablation studies.
⁸Note that we use the term broadly: it refers not only to methods trained with a
contrastive loss, but to all approaches that encourage relevant images to share similar representations; see App. F.
⁹It is known that better architectures [11] or training on larger datasets [62] can yield stronger results. To
ensure a fair comparison, we include only methods reporting results with ResNet-50 trained
on the ImageNet dataset. This excludes potentially stronger methods lacking ResNet-50 results on ImageNet,
such as DINOv2 [62]. See App. D for more details.

Table 4: Conditional image generation quality on CIFAR10. The best results are in gray boxes.
Applying LZN improves most metrics.

Algo.  | FID↓ | sFID↓ | IS↑  | Prec↑ | Rec↑ | CMMD↓  | Recon↓
RF     | 2.47 | 4.05  | 9.77 | 0.71  | 0.58 | –      | –
RF+LZN | 2.40 | 3.99  | 9.88 | 0.71  | 0.58 | 0.0229 | 0.38
5 Case Study 3: Conditional Generative Modeling and Classification
Approach (Fig. 3). Following the notation in § 2.2.2, we consider X = {x_i}_{i=1}^n and Y = {y_i}_{i=1}^m, where each x_i
is a class label and y_i is an image labeled as x_{k_i}. In addition to the image encoder and decoder,
we introduce a label encoder-decoder pair. Since labels come from a finite set, both modules share
a matrix A ∈ R^{q×c} of label anchor points, where q is the latent dimension and c is the number
of classes. The encoder maps a one-hot label h to its anchor via Ah. The decoder recovers the
class ID of a latent g by first applying FM to obtain its anchor FM_A(g; 1), and then computing its
corresponding class latent zone via arg max FM_A(g; 1)^T A, where FM_A denotes FM over the anchors in
A. The training objective extends that of § 3 with the alignment loss ℓ_Align(X, Y). After training, the model can
perform both conditional and unconditional generation, as well as classification, by design. See App. E.1
for detailed pseudocode of the training, generation, and classification processes.
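To make the label modules concrete, here is a minimal PyTorch sketch of the label encoder (anchor lookup in A) and the label decoder (classification of a latent), assuming a velocity helper like the one sketched in § 2.2.1; the class and argument names are illustrative.

import torch

class LabelAnchors(torch.nn.Module):
    # Shared anchor matrix A (q x c), used by both the label encoder and decoder.
    def __init__(self, latent_dim, num_classes):
        super().__init__()
        self.A = torch.nn.Parameter(torch.randn(latent_dim, num_classes))

    def encode(self, onehot):
        # Label encoder: map a one-hot label h to its anchor A h.
        return onehot @ self.A.t()                  # (B, c) @ (c, q) -> (B, q)

    def decode(self, g, velocity, n_steps=32):
        # Label decoder: run forward FM over the label anchors to map the latent g
        # to an anchor, then pick the class with the largest inner product,
        # i.e., argmax FM_A(g; 1)^T A.
        anchors = self.A.t()                        # (c, q) label anchor points
        s = g
        ts = torch.linspace(0.0, 1.0 - 1e-3, n_steps + 1)
        for t_cur, t_next in zip(ts[:-1], ts[1:]):
            s = s + (t_next - t_cur) * velocity(s, t_cur.item(), anchors)
        return (s @ self.A).argmax(dim=1)           # predicted class IDs

Because the same matrix A serves both directions, conditional generation (label to anchor to image decoder) and classification (image latent to class) share parameters by design.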
Related work. Prior approaches typically combine
generative models with classification losses or networks [60,72], treating label inputs to the generator
and outputs from the classifier as separate components. In contrast, LZN unifies label inputs and
outputs within a shared latent space.
Results. Following § 3, we conduct experiments on CIFAR10. Image quality metrics
are shown in Tab. 4, and classification accuracies are shown in Tab. 3. Key observations are:
(1) LZN improves both image quality and reconstruction error. As in § 3, this confirms that LZN
latents capture useful features for generation. (2) LZN achieves classification accuracy on par with
SoTA. The fact that LZN, which jointly performs generation and classification and differs significantly from
standard classification pipelines, can match SoTA performance is notable. Currently, LZN lags behind
the best CIFAR10 result by 1%, potentially due to architectural factors: we use the RF encoder (§ E)
without classification-specific optimization. With a better architecture design (as in other methods
in Tab. 3), LZN could likely improve further. (3) Joint training improves both tasks. This is evidenced by
(i) “RF+LZN” achieving better generation quality here than the generation-only LZN in
Tab. 1; and (ii) “RF+LZN” achieving higher classification accuracy than “RF+LZN (no gen)” in Tab. 3.
These results support our motivation from § 1
and demonstrate that seemingly distinct tasks can benefit from each other through the shared latent space.
Due to space constraints, please refer to App. E for more details on implementation, datasets, and metrics,
and additional results such as generated images and ablation studies.
6 Limitations and Future Work
(1) Training efficiency. LZN requires backpropagating through the FM trajectory (§ 2.2.1),
which is computationally expensive. In § A.3, we describe several optimization strategies we imple-
mented to mitigate this cost. To further improve efficiency, we observe an interesting parallel between
training LZN and training large language models (LLMs) (see App. G), suggesting that some efficient
training techniques developed for LLMs may be applicable here.
(2) Standalone generative modeling. While LZN is fundamentally capable of generative modeling without any auxiliary losses (see App. G),
in § 3 we only demonstrate how it can enhance existing generative models. Exploring how to fully
leverage LZN for standalone generative modeling remains an open direction for future work.
(3) Improving performance. While LZN achieves competitive results in unsupervised representation
learning (§ 4) and classification (§ 5), there remains a gap to the SoTA. Bridging this gap is an
interesting direction. One promising avenue is to incorporate well-established improvements from
the literature that we have not yet adopted, such as more advanced architectural designs, as discussed
in § 4.
(4) More modalities and tasks. So far, we have only explored a small number of
applications and at most two tasks simultaneously (§ 5). However, LZN is designed to be extensible:
by incorporating additional encoders and decoders, it can naturally support more modalities and
perform multiple tasks concurrently (Fig. 1). We leave this exploration to future work.

Acknowledgement
The authors would like to thank the anonymous reviewers for their helpful suggestions. Xuefei Ning
acknowledges the support by the National Natural Science Foundation of China (No. 62506197). In
addition, the authors gratefully acknowledge Cat Cookie and Cat Doudou for graciously lending their
adorable faces for the figures.¹⁰
References
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni
Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4
technical report., 2023.
[2]
Devansh Arpit, Aadyot Bhatnagar, Huan Wang, and Caiming Xiong. Momentum contrastive
autoencoder: Using contrastive learning for latent space distribution matching in wae.
preprint arXiv:2110.10303, 2021.
[3]
Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael
Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-
embedding predictive architecture. In
Vision and Pattern Recognition, pages 15619–15629, 2023.
[4]
Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image trans-
formers., 2021.
[5]
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang,
Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions.
Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf, 2(3):8, 2023.
[6]
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity
natural image synthesis., 2018.
[7]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners., 33:1877–1901, 2020.
[8]
Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering
for unsupervised learning of visual features. In
computer vision (ECCV), pages 132–149, 2018.
[9]
Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski,
and Armand Joulin. Emerging properties in self-supervised vision transformers. In
of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
[10]
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework
for contrastive learning of visual representations. In
learning, pages 1597–1607. PmLR, 2020.
[11]
Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. Big
self-supervised models are strong semi-supervised learners.
processing systems, 33:22243–22255, 2020.
[12]
Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum
contrastive learning., 2020.
[13]
Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In
of the IEEE/CVF conference on computer vision and pattern recognition, pages 15750–15758,
2021.
[14]
Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised
vision transformers. In
vision, pages 9640–9649, 2021.
¹⁰From [47].

[15]
Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Stargan v2: Diverse image
synthesis for multiple domains. In
and pattern recognition, pages 8188–8197, 2020.
[16]
Quan Dao, Hao Phung, Binh Nguyen, and Anh Tran. Flow matching in latent space.
preprint arXiv:2307.08698, 2023.
[17]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-
scale hierarchical image database. In
recognition, pages 248–255. Ieee, 2009.
[18]
Li Deng. The mnist database of handwritten digit images for machine learning research [best of
the web]., 29(6):141–142, 2012.
[19]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding. In
the North American chapter of the association for computational linguistics: human language
technologies, volume 1 (long and short papers), pages 4171–4186, 2019.
[20]
Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning
by context prediction. In,
pages 1422–1430, 2015.
[21]
Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning.
in neural information processing systems, 32, 2019.
[22]
Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discrimi-
native unsupervised feature learning with convolutional neural networks.
information processing systems, 27, 2014.
[23]
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini,
Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans-
formers for high-resolution image synthesis. In
learning, 2024.
[24]
Michael Fuest, Pingchuan Ma, Ming Gui, Johannes Schusterbauer, Vincent Tao Hu, and
Bjorn Ommer. Diffusion models and representation learning: A survey.
arXiv:2407.00783, 2024.
[25]
Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by
predicting image rotations., 2018.
[26]
Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena
Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar,
et al. Bootstrap your own latent-a new approach to self-supervised learning.
information processing systems, 33:21271–21284, 2020.
[27]
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked
autoencoders are scalable vision learners. In
computer vision and pattern recognition, pages 16000–16009, 2022.
[28]
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for
unsupervised visual representation learning. In
computer vision and pattern recognition, pages 9729–9738, 2020.
[29]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In,
pages 770–778, 2016.
[30]
Olivier Henaff. Data-efficient image recognition with contrastive predictive coding. In
tional conference on machine learning, pages 4182–4192. PMLR, 2020.
[31]
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
Gans trained by a two time-scale update rule converge to a local nash equilibrium.
neural information processing systems, 30, 2017.

[32]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.
in neural information processing systems, 33:6840–6851, 2020.
[33]
Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti,
and Sanjiv Kumar. Rethinking fid: Towards a better evaluation metric for image generation. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
9307–9315, 2024.
[34]
Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for
improved quality, stability, and variation., 2017.
[35]
Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative
adversarial networks. In
pattern recognition, pages 4401–4410, 2019.
[36]
Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila.
Analyzing and improving the image quality of stylegan. In
conference on computer vision and pattern recognition, pages 8110–8119, 2020.
[37]
Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions.
Advances in neural information processing systems, 31, 2018.
[38]
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images.
2009.
[39]
kuangliu. Train cifar10 with pytorch.https://github.com/kuangliu/pytorch-cifar,
2025.
[40]
Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved
precision and recall metric for assessing generative models.
processing systems, 32, 2019.
[41]
Sangyun Lee, Zinan Lin, and Giulia Fanti. Improving the training of rectified flows.
in Neural Information Processing Systems, 37:63082–63109, 2024.
[42]
Mufan Li, Mihai Nica, and Dan Roy. The neural covariance sde: Shaped infinite depth-
and-width networks at initialization.,
35:10795–10808, 2022.
[43]
Tianhong Li, Huiwen Chang, Shlok Mishra, Han Zhang, Dina Katabi, and Dilip Krishnan.
Mage: Masked generative encoder to unify representation learning and image synthesis. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
2142–2152, 2023.
[44]
Tianhong Li, Dina Katabi, and Kaiming He. Return of unconditional generation: A self-
supervised representation generation method.
Systems, 37:125441–125468, 2024.
[45]
Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang.
Towards general text embeddings with multi-stage contrastive learning.
arXiv:2308.03281, 2023.
[46]
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense
object detection. In,
pages 2980–2988, 2017.
[47]
Zinan Lin, Sivakanth Gopi, Janardhan Kulkarni, Harsha Nori, and Sergey Yekhanin. Dif-
ferentially private synthetic data via foundation model apis 1: Images.
arXiv:2305.15560, 2023.
[48]
Zinan Lin, Ashish Khetan, Giulia Fanti, and Sewoong Oh. Pacgan: The power of two samples
in generative adversarial networks., 31,
2018.

[49]
Zinan Lin, Vyas Sekar, and Giulia Fanti. Why spectral normalization stabilizes gans: Analysis
and improvements., 34:9625–9638, 2021.
[50]
Zinan Lin, Kiran Thekumparampil, Giulia Fanti, and Sewoong Oh. Infogan-cr and modelcen-
trality: Self-supervised model training and selection for disentangling gans. In
conference on machine learning, pages 6127–6139. PMLR, 2020.
[51]
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow
matching for generative modeling., 2022.
[52]
Enshu Liu, Xuefei Ning, Yu Wang, and Zinan Lin. Distilled decoding 1: One-step sampling of
image auto-regressive models with flow matching., 2024.
[53]
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate
and transfer data with rectified flow., 2022.
[54]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining
approach., 2019.
[55]
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver:
A fast ode solver for diffusion probabilistic model sampling in around 10 steps.
Neural Information Processing Systems, 35:5775–5787, 2022.
[56]
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-
solver++: Fast solver for guided sampling of diffusion probabilistic models.
arXiv:2211.01095, 2022.
[57]
Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant rep-
resentations. In
recognition, pages 6707–6717, 2020.
[58]
Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W Battaglia. Generating images with
sparse representations., 2021.
[59]
Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving
jigsaw puzzles. In, pages 69–84. Springer, 2016.
[60]
Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with
auxiliary classifier gans. In, pages 2642–2651.
PMLR, 2017.
[61]
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive
predictive coding., 2018.
[62]
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov,
Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning
robust visual features without supervision., 2023.
[63]
Druv Pai, Ziyang Wu Wu, Sam Buchanan, Yaodong Yu, and Yi Ma. Masked completion
via structured diffusion with white-box transformers. International Conference on Learning
Representations, 2023.
[64]
Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties
in gan evaluation. In
recognition, pages 11410–11420, 2022.
[65]
Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context
encoders: Feature learning by inpainting. In
vision and pattern recognition, pages 2536–2544, 2016.
[66]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal,
Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual
models from natural language supervision. In,
pages 8748–8763. PmLR, 2021.

[67]
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language
understanding by generative pre-training. 2018.
[68]
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al.
Language models are unsupervised multitask learners., 1(8):9, 2019.
[69]
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical
text-conditional image generation with clip latents., 1(2):3,
2022.
[70]
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark
Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In
machine learning, pages 8821–8831. Pmlr, 2021.
[71]
Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-
networks. In
Processing. Association for Computational Linguistics, 11 2019.
[72]
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen.
Improved techniques for training gans., 29,
2016.
[73]
Michael Sander, Pierre Ablin, and Gabriel Peyré. Do residual neural networks discretize
neural ordinary differential equations?,
35:36520–36532, 2022.
[74]
Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of gans for
semantic face editing. In
pattern recognition, pages 9243–9252, 2020.
[75]
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsuper-
vised learning using nonequilibrium thermodynamics. In
learning, pages 2256–2265. pmlr, 2015.
[76]
[77]
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data
distribution., 32, 2019.
[78]
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and
Ben Poole. Score-based generative modeling through stochastic differential equations.
preprint arXiv:2011.13456, 2020.
[79]
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Re-
thinking the inception architecture for computer vision. In
on computer vision and pattern recognition, pages 2818–2826, 2016.
[80]
Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In
puter Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020,
Proceedings, Part XI 16, pages 776–794. Springer, 2020.
[81]
Arash Vahdat and Jan Kautz. Nvae: A deep hierarchical variational autoencoder.
neural information processing systems, 33:19667–19679, 2020.
[82] Occam’s razor. In Encyclopedia of Machine Learning, pages 735–735. Springer US, Boston, MA, 2010.
[83]
Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, and Christoph Feichten-
hofer. Masked feature prediction for self-supervised visual pre-training. In
IEEE/CVF conference on computer vision and pattern recognition, pages 14668–14678, 2022.
[84]Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via
non-parametric instance discrimination. In
vision and pattern recognition, pages 3733–3742, 2018.

[85]
Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun:
Construction of a large-scale image dataset using deep learning with humans in the loop.
preprint arXiv:1506.03365, 2015.
[86]
Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image
diffusion models. In,
pages 3836–3847, 2023.
[87]
Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In
Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14,
2016, Proceedings, Part III 14, pages 649–666. Springer, 2016.
[88]
Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin,
Baosong Yang, Pengjun Xie, Fei Huang, et al. mgte: Generalized long-context text represen-
tation and reranking models for multilingual text retrieval.,
2024.
[89]
Lin Zhao, Tianchen Zhao, Zinan Lin, Xuefei Ning, Guohao Dai, Huazhong Yang, and Yu Wang.
Flasheval: Towards fast and accurate evaluation of text-to-image diffusion generative models.
In,
pages 16122–16131, 2024.
[90]
Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised
learning of visual embeddings. In
computer vision, pages 6002–6012, 2019.

Appendix Contents
A More Details on Latent Computation (§ 2.2.1)
A.1 The Desired Properties
A.2 More Implementation Details
A.3 Efficiency Optimizations
B More Details on Latent Alignment (§ 2.2.2)
B.1 More Implementation Details
B.2 Efficiency Optimizations
C More Details and Results on Case Study 1
C.1 Algorithm Pseudocode
C.2 More Implementation Details
C.3 More Experimental Settings
C.4 More Results
D More Details and Results on Case Study 2
D.1 Algorithm Pseudocode
D.2 More Implementation Details
D.3 More Experimental Settings
D.4 More Results
E More Details and Results on Case Study 3
E.1 Algorithm Pseudocode
E.2 More Implementation Details
E.3 More Experimental Settings
E.4 More Results
F Extended Discussions on Related Work
G Extended Discussions on Limitations and Future Work

A More Details on Latent Computation (§ 2.2.1)
A.1 The Desired Properties
In this section, we explain in more detail why the construction in § 2.2.1 satisfies the
two desired properties.

• Prior distribution is Gaussian. A latent z drawn uniformly from {z_1, . . . , z_n}
is induced by (1) drawing a ∼ Uniform{a_1, . . . , a_n} and ϵ ∼ N(0, I),
(2) computing s_{1−g} = (1 − g)a + gϵ, and (3) computing IFM_x(s_{1−g}; 0). In the above process,
s_{1−g} ∼ π_{1−g} by definition. Due to the property of FM discussed in § 2.2.1, IFM_x(s_{1−g}; 0) ∼ π_0
when s_{1−g} ∼ π_{1−g}. Therefore, the latent follows π_0 = N(0, I).

• Disjoint latent zones. Every z ∼ N(0, I) can be uniquely mapped to one of {a_1, . . . , a_n}
through the defined FM. We define the latent zone of the i-th sample as the set of latents that map to
a_i: Z_i = {z : FM_x(z; 1) = a_i}. The probability that the latent computed through C^X_i falls
in an incorrect latent zone can then be defined as P(IFM_x(a_i, ϵ; 0) ∈ Z_j) for i ≠ j, where the
probability is over the randomness of ϵ ∼ N(0, I). This probability can be made
arbitrarily small by choosing a sufficiently small g: for the latent to fall in
an incorrect latent zone, ‖ϵ‖ needs to be on the scale of ‖(1 − g)(a_i − a_j)‖ / g, whose probability
→ 0 when g → 0. To make this intuition more precise, we give the closed form of this probability
for a toy one-dimensional case below.
Theorem 1. Consider n = 2 samples x_1, x_2, with their anchor points a_1 = −1, a_2 = 1.
We have

P(IFM_x(a_1, ϵ; 0) ∈ Z_2) = P(IFM_x(a_2, ϵ; 0) ∈ Z_1) = Φ((g − 1)/g),

where Φ(·) is the CDF of the standard Gaussian distribution.

Proof. With a slight abuse of notation, we define FM_x(s; t_1, t_2) = s + ∫_{t_1}^{t_2} V(s_τ, τ) dτ as following
the FM trajectory from t_1 to t_2. We generalize the latent zone definition above to all time steps, as
the latents at time step t that map to the anchor point a_i: Z^t_i = {z : FM_x(z; t, 1) = a_i}. Due to
symmetry, we know that Z^t_1 = (−∞, 0) and Z^t_2 = (0, ∞). Therefore, we have

P(IFM_x(a_1, ϵ; 0) ∈ Z_2) = P(s_{1−g} ∈ Z^{1−g}_2) = P(−(1 − g) + gϵ > 0) = P(ϵ > (1 − g)/g) = Φ((g − 1)/g),

where Φ(·) is the CDF of the standard Gaussian distribution. Similarly, we can get that
P(IFM_x(a_2, ϵ; 0) ∈ Z_1) = Φ((g − 1)/g).
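As a quick numerical sanity check of Theorem 1, the snippet below evaluates Φ((g − 1)/g) for a few values of g, using the standard-normal CDF written via the error function; the chosen values of g are arbitrary illustrations.

import math

def misassignment_prob(g):
    # P(IFM_x(a_1, eps; 0) in Z_2) = Phi((g - 1) / g) for the toy 1-D case of Theorem 1.
    x = (g - 1.0) / g
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))   # standard-normal CDF

for g in (0.5, 0.1, 0.01):
    print(g, misassignment_prob(g))
# g = 0.5  -> Phi(-1)  ~ 0.16
# g = 0.1  -> Phi(-9)  ~ 1e-19
# g = 0.01 -> Phi(-99), numerically 0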
A.2 More Implementation Details
We find that the training or inference of some tasks benefits from using a more concentrated latent
distribution:

z_i = C^{x_1, . . . , x_n}_i ≜ IFM_x(a_i, αϵ_i; 0),

where ϵ_i ∼ N(0, I) and α ≤ 1 is a scaling factor. Similar techniques have been used in prior
generative models for improving sample quality [35].
A.3 Efficiency Optimizations
We introduce a series of efficiency optimization techniques so that the training of LZN can scale up to
large models and large batch sizes.
Minibatch approximation (reducing memory and computation cost). Exact latent computa-
tion requires using all samples in the dataset to compute the velocity field. We approximate
this by using only the current minibatch as x_1, . . . , x_n, which significantly reduces memory and
computation cost.
Note that this approximation has nuanced implications on the two desired properties discussed in
§ 2.2.1.


Minibatch approximation still preserves the Gaussian prior. LZN ensures that the latent
distribution within each minibatch is approximately $\mathcal{N}(0, I)$. As a result, the overall latent
distribution becomes a mixture of Gaussians with the same parameters, which is still $\mathcal{N}(0, I)$.
Therefore, the global prior remains valid under the minibatch approximation.

However, minibatch approximation violates the disjoint zone property. The
latent zone of each sample now depends on other samples in the same batch, which could change
across different batches. Despite this approximation, our experiments show it performs well.

•In generative modeling (§ 3), the latent does not need perfect zone disjointness: as long
as it provides some information about the input sample, it can help reduce the variance that the
generative model (rectified flow in our case) needs to learn, and thus improve the generation quality.
•In representation learning (§ 4), latent alignment occurs within each batch. Thus, inconsis-
tency across batches is irrelevant.

•In classification (§ 5), we only need to map samples to the correct
labels. Thus, inconsistency across batches is irrelevant.
That said, using a larger batch size during inference can further improve classification perfor-
mance (§ 5).
Custom gradient checkpointing (reducing memory cost). By default, the forward pass stores all interme-
diate results for use in backpropagation, incurring significant memory cost. Gradient checkpointing^{11}
reduces memory usage (at the cost of extra computation) by selectively discarding intermediates
in the forward pass and recomputing them during the backward pass. This technique is typically
applied within neural networks. In our case, we discover that the main memory bottleneck lies in
latent computation, which has memory complexity $O(n^2 q r)$, where $n$ is the number of samples, $q$
the latent dimension, and $r$ the number of solver steps. We design a custom strategy that skips storing velocity
computations and retains only the latent trajectories $s_t$. This reduces memory complexity to $O(nqr)$,
which makes the training far more manageable.
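The sketch below illustrates the idea with torch.utils.checkpoint: each solver step recomputes its velocity evaluation during the backward pass, so only the trajectory $s_t$ stays in memory. The function and argument names (velocity_fn, num_steps) are placeholders, not the released implementation.

# Sketch of the custom gradient checkpointing over the latent-computation ODE solve.
import torch
from torch.utils.checkpoint import checkpoint

def solve_with_checkpointing(s0, anchors, velocity_fn, num_steps, t0=1.0, t1=0.0):
    dt = (t1 - t0) / num_steps
    s, t = s0, t0
    for _ in range(num_steps):
        def step(s_in, anchors_in, t_in=t):
            # one Euler step; the velocity evaluation is recomputed in backward
            return s_in + dt * velocity_fn(s_in, anchors_in, t_in)
        s = checkpoint(step, s, anchors, use_reentrant=False)
        t = t + dt
    return s

Only the per-step states s (the latent trajectory) are kept alive, giving the O(nqr) memory footprint described above.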
Latent parallelism (making training scalable with multi-GPU). As discussed
above, the main computation overhead also lies in latent computation. A natural idea is to parallelize
it across multiple GPUs. We partition the data samples across GPUs, and each GPU computes anchor
points for its assigned subset. These anchor points are then broadcast to all GPUs, allowing each
to compute latents for its own samples using the complete set of anchors. To ensure that gradients
can propagate back correctly through the anchor points to the originating GPUs, we use the undoc-
umented PyTorch function torch.distributed.nn.functional.all_gather, which, unlike
the standard torch.distributed.all_gather, maintains gradient flow to the original sources.
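Below is a minimal sketch of this pattern (the encoder, latent_fn, and the process-group setup are placeholders; the real training code additionally handles sharding details):

# Sketch of latent parallelism: each GPU encodes its own shard into anchor
# points, gathers all anchors with a gradient-preserving all_gather, and then
# computes latents for its shard against the complete anchor set.
import torch
from torch.distributed.nn.functional import all_gather  # autograd-aware variant

def compute_latents_distributed(local_x, encoder, latent_fn):
    local_anchors = encoder(local_x)                            # [n_local, q]
    all_anchors = torch.cat(all_gather(local_anchors), dim=0)   # [n_total, q]
    # latent_fn stands in for the FM-based latent computation of § 2.2.1;
    # gradients w.r.t. remote anchors flow back to their originating GPUs.
    return latent_fn(local_anchors, all_anchors)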
B More Details on Latent Alignment (§ 2.2.2)
B.1 More Implementation Details
Optionally, we can apply a logarithm to the assignment probability to make the loss resemble a
standard cross-entropy formulation. In that case, our proposed alignment objective is:
$$\mathrm{Align}(X) = \sum_{i=1}^{m} \max_{t \in \{t_u, \ldots, t_r\}} \log \Gamma\left(a_{k_i} \mid s_i^t\right).$$
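For concreteness, a small sketch of this objective is given below; assign_prob is an assumed tensor holding, for each sample $i$ and retained time step $t$, the assignment probability $\Gamma(a_{k_i} \mid s_i^t)$, and the negation (to obtain a loss to minimize) is a convention we assume here rather than a detail taken from the released code.

# Sketch of the log-form alignment objective.
import torch

def log_alignment_loss(assign_prob: torch.Tensor) -> torch.Tensor:
    # assign_prob: [m, num_retained_steps], probabilities of the correct anchors.
    log_prob = torch.log(assign_prob.clamp_min(1e-12))
    # best (largest) log-probability over the retained time steps, per sample
    per_sample = log_prob.max(dim=1).values
    return -per_sample.sum()   # negate so the objective can be minimized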
B.2 Efficiency Optimizations
We apply the same efficiency optimizations as in § A.3.
C More Details and Results on Case Study 1
C.1 Algorithm Pseudocode
Figs. 7 and 8 illustrate the training and generation processes of the standard RF and of
RF+LZN.
11
https://pytorch.org/docs/stable/checkpoint.html

Algorithm 1: RF training
Input: Decoder: $D_x$
       Number of iterations: $T$
       Batch size: $B$
1 for iteration $= 1, \ldots, T$ do
2     $x_1, \ldots, x_B \leftarrow$ sample a batch of training samples
3     $\epsilon_1, \ldots, \epsilon_B \leftarrow$ sample Gaussian noise
4     $t_1, \ldots, t_B \leftarrow$ sample timesteps
5     $\xi_i \leftarrow (1 - t_i)\epsilon_i + t_i x_i$
6     Train with the RF loss on $D_x(\xi_i)$

Algorithm 2: RF+LZN training
Input: Decoder: $D_x$
       Encoder: $E_x$ (used by $C_x$)
       Number of iterations: $T$
       Batch size: $B$
1 for iteration $= 1, \ldots, T$ do
2     $x_1, \ldots, x_B \leftarrow$ sample a batch of training samples
3     $z_1, \ldots, z_B \leftarrow C_x(x_1, \ldots, x_B)$
4     $\epsilon_1, \ldots, \epsilon_B \leftarrow$ sample Gaussian noise
5     $t_1, \ldots, t_B \leftarrow$ sample timesteps
6     $\xi_i \leftarrow (1 - t_i)\epsilon_i + t_i x_i$
7     Train with the RF loss on $D_x(\xi_i; z_i)$
Figure 7: Training algorithms of RF and RF+LZN. Left:
illustration of the standard RF [53] training process. In each iteration, a batch of real samples and a
batch of Gaussian noise are drawn and interpolated to produce noisy inputs, which are then passed
through the decoder network to compute the loss. Right: illustration of the RF+LZN
training process. The key differences are highlighted in gray: we compute LZN latents for the samples
using the method in § 2.2.1, and provide these latents as an additional input to the RF decoder.
Algorithm 3: RF generation
Input: Decoder: $D_x$
1 $\xi \leftarrow$ sample Gaussian noise
2 Output $D_x(\xi)$

Algorithm 4: RF+LZN generation
Input: Decoder: $D_x$
1 $\xi \leftarrow$ sample Gaussian noise
2 $z \leftarrow$ sample Gaussian noise (the LZN latent)
3 Output $D_x(\xi; z)$
Figure 8: Generation algorithms of RF and RF+LZN. Left:
illustration of the standard RF generation process [53]. The decoder takes Gaussian noise as input
and generates a sample. The actual process is iterative, but we leave out the steps for simplicity and
only show the starting input (Gaussian noise) and the final output (the generated image). Right: a
simple illustration of the RF+LZN generation process. The main differences are shown in gray: we
sample extra LZN latents from Gaussian noise and use them as additional inputs to the RF decoder
during the iterative generation process.
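To make the pseudocode concrete, here is a minimal PyTorch sketch of one RF+LZN training step (Alg. 2). The decoder signature and latent_fn (standing in for $C_x$ from § 2.2.1) are illustrative assumptions, not the released API.

# Sketch of one RF+LZN training step.
import torch

def rf_lzn_step(x, decoder, latent_fn, optimizer):
    b = x.shape[0]
    z = latent_fn(x)                                   # LZN latents z_1, ..., z_B
    eps = torch.randn_like(x)                          # Gaussian noise
    t = torch.rand(b, device=x.device)
    t_b = t.view(-1, *([1] * (x.dim() - 1)))           # broadcast over image dims
    xi = (1.0 - t_b) * eps + t_b * x                   # noisy interpolation
    v_pred = decoder(xi, t, z)                         # LZN latent as extra input
    loss = ((v_pred - (x - eps)) ** 2).mean()          # RF velocity target: x - eps
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.detach()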
C.2 More Implementation Details
Architecture.

•Decoder. The only modification we make to the RF decoder architecture [53] is concatenating the LZN latent with the
timestep embedding.

•Encoder. We modify the architecture of [53] by connecting the output of each ResNet block
with a latent transformation block. The sum of the outputs of the latent transformation blocks
forms the LZN latent. Each latent transformation block consists of: (1) a 1×1 convolution that
projects the ResNet output to 20 channels, reducing dimensionality; and (2) a small MLP with a
200-dimensional hidden layer that outputs the latent from the flattened convolution output (a sketch is given after this list).
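A minimal sketch of one latent transformation block is shown below; the activation, the spatial-size handling, and the latent dimension are placeholders chosen for illustration.

# Sketch of a latent transformation block: 1x1 conv to 20 channels, then an
# MLP with a 200-dimensional hidden layer on the flattened features.
import torch
import torch.nn as nn

class LatentTransformBlock(nn.Module):
    def __init__(self, in_channels: int, spatial_size: int, latent_dim: int):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, 20, kernel_size=1)
        self.mlp = nn.Sequential(
            nn.Linear(20 * spatial_size * spatial_size, 200),
            nn.SiLU(),
            nn.Linear(200, latent_dim),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        h = self.proj(feat)                       # [B, 20, H, W]
        return self.mlp(h.flatten(start_dim=1))   # [B, latent_dim]

The LZN latent is then the sum of the outputs of all such blocks, one per ResNet block.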
C.3 More Experimental Settings
Datasets. We use the following datasets:

•CIFAR10 (32×32) [38]: natural images of 10 classes of objects. We only utilize the training set for this experiment.
•AFHQ-Cat (256×256) [15].
•CelebA-HQ (256×256) [34].
•LSUN-Bedroom (256×256) [85].
Metrics. We use the following metrics:

•FID [31] and sFID [58] measure the distance between the distributions of real and generated images by embedding
both into the latent space of a pretrained network (e.g., Inception-v3 [79]), fitting each set of latents
with Gaussian distributions, and computing their Wasserstein-2 distance. The key difference
between FID and sFID is the feature layer used: FID uses pooled features, while sFID uses
intermediate features, making it more sensitive to spatial details.

•Inception score (IS) [72] measures both per-image quality
(how confidently a classifier predicts a class) and diversity across all images (coverage over
different classes). Since the classifier is trained on ImageNet [17], IS is best suited for natural
image datasets like CIFAR10. We report IS for all datasets for completeness.

•Precision and recall [40] measure fidelity and coverage separately:
precision measures the fraction of generated images that are close to real ones, while recall
measures the fraction of real images that are close to the generated ones.

•CMMD [33] measures the maximum mean discrepancy between CLIP embeddings of real and
generated images. Compared to FID, it is reported to align better with human preference and have
better sample efficiency.

•Reconstruction error measures how well an image can be reconstructed from its latent representation.
This reflects the model’s representational power and is crucial for applications like image editing,
where edits are made by modifying the image’s latent representation. For RF, we first apply
the inverse ODE to map the image to its latent representation, then use the forward ODE to
reconstruct the image, and compute the $\ell_2$ distance between the original and reconstructed images.
For RF+LZN, we add an initial step: compute the image’s LZN latent $C_X$ and feed it into the
RF latent computation process as an additional input (a sketch is given after this list).
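A sketch of the reconstruction-error measurement is below, using a plain Euler solver for readability; velocity stands in for the trained RF decoder and lzn_latent for $C_X$ (used only in the RF+LZN variant). The RK45 solver and batching details of the actual evaluation are omitted.

# Sketch of the reconstruction-error round trip (inverse ODE, then forward ODE).
import torch

def euler_integrate(x, velocity, cond, t_start, t_end, steps=100):
    dt = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        x = x + dt * velocity(x, t, cond)
        t += dt
    return x

def reconstruction_error(images, velocity, lzn_latent=None):
    cond = None if lzn_latent is None else lzn_latent(images)
    noise = euler_integrate(images, velocity, cond, t_start=1.0, t_end=0.0)  # image -> latent
    recon = euler_integrate(noise, velocity, cond, t_start=0.0, t_end=1.0)   # latent -> image
    return (recon - images).flatten(1).norm(dim=1)  # per-image l2 distance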
Following the convention [53,76,48,89,49], the metrics are all computed using the training set of
the dataset.
For FID, sFID, IS, precision, recall, and CMMD, we subsample the training set and generate the
same number of samples to compute the metrics. The numbers of samples are:
•CIFAR10: 50000 (the whole training set).
•AFHQ-Cat: 5120, the largest multiple of the batch size (256) that is less than or equal to the
training set size (5153).
•CelebA-HQ: 29952, the largest multiple of the batch size (256) that is less than or equal to the
training set size (30000).
•LSUN-Bedroom: 29952, the largest multiple of the batch size (256) that is less than or equal to
30000. We limit the number of samples to 30000 so that the computation cost of the metrics is
reasonable.
For reconstruction error, we randomly sample a batch of images (2000 for CIFAR10 and 256 for the
other datasets) from the training set. Each image is reconstructed 20 times (note that the LZN latents
$C_X$ depend on the other samples in the batch due to the minibatch approximation), and the errors are averaged.
Note that for all the random subsampling procedures mentioned above, we ensure the sampled sets
are consistent between RF and RF+LZN, so that the resulting metrics are directly comparable.
Sampler. We use ODE samplers for
generation. For both RF and RF+LZN, we use the RK45 sampler from RF [53], which adaptively
determines the number of steps. In § C.4, we also analyze the effect of varying the number of
sampling steps using the Euler sampler.
Hyperparameters. …

Figure 9: Generated images of RF on CIFAR10.







Computation cost. Excluding evaluation (i.e., only counting the
computation cost of model training), each RF+LZN experiment takes:
•CIFAR10: 8 hours on 16 A100 (40 GB) GPUs.
•AFHQ-Cat: 10 hours on 32 A100 (40 GB) GPUs.
•CelebA-HQ: 58 hours on 32 A100 (40 GB) GPUs.
•LSUN-Bedroom: 341 hours on 32 A100 (40 GB) GPUs.
C.4 More Results
Generated images. Figs. 9 to 16 show the generated images of RF and RF+LZN on the four datasets.
Ablation studies on FID implementation. Different FID implementations
can result in different results [64]. In our main experiments, we use the implementation in consistency
models [76]. In Tab. 5, we additionally show the FID using two other implementations: RF [53] and
clean FID [64]. We can see that, while the numbers are different, the relative ranking across all three
implementations is consistent. In particular, RF+LZN achieves the best FID in three out of four datasets.
Ablation studies on sampling steps. We additionally use the Euler sampler with varying
numbers of sampling steps. As shown in Fig. 17, RF+LZN generally achieves better FID than the RF
baseline across most settings. Notably, in the only case where RF+LZN performs worse than RF in

Figure 10: Generated images of RF+LZN on CIFAR10.
Table 5: FID with different implementations for unconditional image generation. “CM” denotes
consistency models [76]; “RF” denotes Rectified Flow [53]; “clean” denotes clean FID [64]. The
best results are in gray boxes.

            CIFAR10 (32×32)                          AFHQ-Cat (256×256)
Algo.       FID (clean)↓  FID (RF)↓  FID (CM)↓       FID (clean)↓  FID (RF)↓  FID (CM)↓
RF              3.18         2.77       2.76             5.99         6.20       6.08
RF+LZN          3.05         2.61       2.59             5.66         5.69       5.68

            CelebA-HQ (256×256)                      LSUN-Bedroom (256×256)
Algo.       FID (clean)↓  FID (RF)↓  FID (CM)↓       FID (clean)↓  FID (RF)↓  FID (CM)↓
RF              7.10         7.00       6.95             6.39         6.25       6.25
RF+LZN          7.31         7.23       7.17             5.88         5.87       5.95
Tab. 1, we observe that the underperformance occurs only at the highest number of sampling steps in
the Euler sampler (Fig. 17c).

Figure 11: Generated images of RF on AFHQ-Cat. Figure 12: Generated images of RF+LZN on AFHQ-Cat.

Figure 13: Generated images of RF on CelebA-HQ. Figure 14: Generated images of RF+LZN on CelebA-HQ.

Figure 15: Generated images of RF on LSUN-Bedroom. Figure 16: Generated images of RF+LZN on LSUN-Bedroom.

Figure 17: FID vs. number of sampling steps (NFE) in the Euler sampler for Rectified Flow and Rectified Flow + LZN
on (a) CIFAR10, (b) AFHQ-Cat, (c) CelebA-HQ, and (d) LSUN-Bedroom. RF+LZN outperforms RF in most cases.

D More Details and Results on Case Study 2
D.1 Algorithm Pseudocode
Alg. 5 shows the training process of LZN for unsupervised representation learning.
After training, the encoder $E_x$ can be used to obtain image representations. We provide several
strategies for extracting these representations; please see § D.4 for details.
D.2 More Implementation Details
Architecture. Following prior work [28, 10, 26], we use the ResNet-50 architecture
[29] as the encoder for LZN. The only modification we make is replacing all batch normalization
layers with group normalization. However, our early experiments indicate that this change does not
lead to significant performance differences.
For the projection head following the ResNet-50 output, we use an MLP with one hidden layer, as in
[10, 11] (a sketch of this backbone and head is given below).
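The sketch below shows this backbone modification with torchvision (the group count and projection-head widths are illustrative choices, not necessarily the ones used in our experiments):

# Sketch of the encoder: ResNet-50 with GroupNorm instead of BatchNorm, plus a
# one-hidden-layer MLP projection head.
import torch.nn as nn
from torchvision.models import resnet50

def make_encoder(proj_dim: int = 128, hidden_dim: int = 2048):
    backbone = resnet50(norm_layer=lambda channels: nn.GroupNorm(32, channels))
    feat_dim = backbone.fc.in_features      # 2048 for ResNet-50
    backbone.fc = nn.Identity()             # expose backbone features
    head = nn.Sequential(
        nn.Linear(feat_dim, hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, proj_dim),
    )
    return backbone, head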
Data augmentation. We use the same data augmentation as SimCLR v2 [11].
Representation. Following [10, 11], after training the ResNet-50, we discard the
projection head and use only the ResNet-50 backbone to extract representations for training the linear
classifier. As a result, obtaining representations from LZN in this way does not require going through
the LZN latent computation process $C_x$, and thus has the same computational efficiency as baseline
methods.
Objective. We use the log version of the alignment objective (Eq. (5)).
D.3 More Experimental Settings
Datasets. We use the ImageNet dataset, which contains 1281167 training images and 50000
validation images. LZN is trained on the training set, and classification accuracy is evaluated on the
validation set.
Hyperparameters. …
Computation cost. Excluding evaluation (i.e., only counting the
computation cost of model training), each LZN experiment takes 1800 hours on 128 A100 (40 GB)
GPUs.
D.4 More Results
More baselines. Tab. 6 compares LZN against a broader set of baselines, including methods based on different encoder
architectures.
Visualizing the learned representations. We randomly selected 20 classes from
the validation set of ImageNet and computed the embeddings of their images using the trained LZN model. We
chose the validation set to ensure that the results are not influenced by training set overfitting. We
then projected these embeddings into a 2D space using t-SNE, a widely used method for visualizing
high-dimensional representations, following seminal works such as SimCLR [10]. The resulting
12
Note that we use the term “contrastive learning” to refer not only to methods with an explicit
contrastive loss, but to all approaches that encourage relevant images to share similar representations.

Algorithm 5: LZN training for unsupervised representation learning
Input: Encoder: $E_x$
       Number of iterations: $T$
       Batch size: $B$
1 for iteration $= 1, \ldots, T$ do
2     $x_1, \ldots, x_B \leftarrow$ sample a batch of training samples
3     $x'_i, x''_i \leftarrow$ two augmented views of $x_i$
4     Train $E_x$ using $\mathrm{Align}(\{x'_1, \ldots, x'_B\} \cup \{x''_1, \ldots, x''_B\})$
Table 6: Classification accuracy on ImageNet by training a linear classifier on the unsupervised
representations. Methods with § are based on contrastive learning.^{12} The horizontal line separates
baselines that perform worse or better than ours (LZN). “R” means “ResNet”.
Algorithm               Architecture    Top-1 Acc↑    Top-5 Acc↑
Colorization [87]       R101            39.6 [28]     -
Jigsaw [59]             R50w2×          44.6 [28]     -
Exemplar [22]           R50w3×          46.0 [28]     -
DeepCluster [8]         VGG             48.4 [28]     -
CPC v1§ [61]            R101            48.7 [28]     -
RelativePosition [20]   R50w2×          51.4 [28]     -
InstDisc§ [84]          R50             54.0 [28]     -
Rotation [25]           Rv50w4×         55.4 [28]     -
BigBiGAN [21]           R50             56.6 [28]     -
LocalAgg§ [90]          R50             58.8 [28]     -
MoCo§ [28]              R50             60.2 [10]     -
BigBiGAN [21]           Rv50w4×         61.3 [28]     81.9 [10]
PIRL§ [57]              R50             63.6 [10]     -
CPC v2§ [30]            R50             63.8 [10]     85.3 [10]
CMC§ [80]               R50             66.2 [26]     87.0 [26]
SimSiam§ [13]           R50             68.1 [13]     -
SimCLR§ [10]            R50             69.3 [10]     89.0 [10]
----------------------------------------------------------------
MoCo v2§ [12]           R50             71.7 [12]     -
SimCLR v2§ [11]         R50             71.7 [11]     -
BYOL§ [26]              R50             74.3 [26]     91.6 [26]
DINO§ [9]               R50             75.3 [9]      -
DINO§ [9]               ViT-S           77.0 [9]      -
DINO§ [9]               ViT-B/16        78.2 [9]      -
DINO§ [9]               ViT-S/8         79.7 [9]      -
DINO§ [9]               ViT-B/8         80.1 [9]      -
I-JEPA [3]              ViT-B/16        72.9 [3]      -
I-JEPA [3]              ViT-L/16        77.5 [3]      -
I-JEPA [3]              ViT-H/14        79.3 [3]      -
I-JEPA [3]              ViT-H/16 448    81.1 [3]      -
LZN                     R50             69.5          89.3
t-SNE plot is shown in Fig. 18. We can see that the samples from different classes are well-clustered. This
suggests LZN learns meaningful image representations.
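The visualization itself follows the standard t-SNE recipe; a sketch is below, where embeddings and labels are assumed to be precomputed from the trained encoder on the ImageNet validation set.

# Sketch of the t-SNE visualization in Fig. 18.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(embeddings: np.ndarray, labels: np.ndarray, num_classes: int = 20, seed: int = 0):
    rng = np.random.default_rng(seed)
    chosen = rng.choice(np.unique(labels), size=num_classes, replace=False)
    mask = np.isin(labels, chosen)
    coords = TSNE(n_components=2, random_state=seed).fit_transform(embeddings[mask])
    plt.figure(figsize=(6, 6))
    plt.scatter(coords[:, 0], coords[:, 1], c=labels[mask], cmap="tab20", s=4)
    plt.tight_layout()
    plt.savefig("tsne_lzn_representations.png", dpi=200)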
Ablation studies on representation choice. Prior work [10, 11] has shown that the choice of feature
extraction layer significantly affects downstream performance. In particular, removing the projection
head often improves results. Motivated by this, we explore various feature extraction strategies for
LZN, which offers more flexibility due to its unique latent computation process (§ 2.2.1). Specifically,
we compare the following methods:
•With latent: use the LZN latent (computed through the latent computation process, § 2.2.1) to train the classifier.
•With latent (α = 0): same as above, but compute the latent with the scaling factor α = 0 (§ A.2), i.e., the “center” of the latent zone.
•With head: use the output of the projection head to train the classifier.
•Without head: discard the projection head and use the ResNet-50 backbone output to train the classifier.

Figure 18: t-SNE visualization of LZN representations projected into 2D for 20 randomly selected
ImageNet validation classes. Images from the same class form distinct clusters, indicating that LZN
learns meaningful image representations.
The first two methods are specific to LZN, while the last two follow the design commonly used in
prior contrastive learning work [10, 11].
The results are shown in Fig. 19. We observe the following:

•Using the LZN latent directly (“with latent”) achieves the lowest prediction accuracy. As discussed in § A.3, the LZN latents are
reliable only when computed over the full dataset. However, for efficiency, both training and
inference rely on minibatches to approximate the latent representations. This approximation
increases the size of the latent zones, leading to potential overlap between the zones of different
samples across batches, which inevitably degrades downstream classification performance.

•In comparison, the LZN latent with α = 0 yields significantly higher accuracy. This improvement can be
attributed to the reduced likelihood of overlap between latent zones when α = 0, making the
resulting representations more distinct and less noisy.

The final two methods, “with head” and “without head”, do not involve latent computation and
are therefore more efficient. Consistent with findings from prior contrastive learning studies [10],
we observe that “without head” performs substantially better. As explained in [10], the projection
head often discards important information—such as types of data augmentation—in order to
minimize the training loss. In contrast, layers preceding the head might retain richer and more
discriminative features, which are more useful for downstream classification tasks.
Ablation studies on the number of training steps. Fig. 20 shows the linear classification accuracy at different
training iterations. Accuracy continues to improve rapidly at the end of training, suggesting that with
more training, the gap between LZN and the stronger baselines in Tab. 6 could be further reduced.

Figure 19: LZN’s linear classification accuracy with different feature extraction methods (with latent: 11.9;
with head: 54.9; with latent, α = 0: 55.9; without head: 65.6). Note that
this experiment uses fewer iterations (1060000) than the main experiment (5000000) and omits data
augmentation when training the linear classifier (used in the main experiment), so the accuracies are
lower than in the main experiment.
Figure 20:LZN’s linear classification accuracy vs. training iteration. The accuracy is still improving
at a fast rate at the end of training. More training might further improve the result.
E More Details and Results on Case Study 3
E.1 Algorithm Pseudocode
Algs. 6 to 9 show the training, generation (unconditional and conditional), and classifica-
tion processes.
E.2 More Implementation Details
Architecture.

•Image decoder. We modify the RF decoder architecture [53] to include a one-hot encoding
of the class label as an additional input, concatenated with the timestep embedding. For RF+LZN,
we apply the same modification on top of the architecture described in § C.2. For unconditional
generation, this one-hot encoding is deterministically derived from the LZN latent: given a LZN
latent, we use the class label decoder to predict the class and then encode it as a one-hot vector. As
a result, the decoder’s output remains fully determined by the LZN and RF latents, consistent with
§ 3. For conditional generation, this one-hot encoding is given as a condition, and LZN latents are
sampled from the corresponding latent zone (as described in Alg. 8).
•Label FM. The label FM is constructed analogously to the image FM (§ 2.2.1), with the label anchors as its targets.
However, due to sampling randomness during training, each batch may have an imbalanced class

Algorithm 6: RF+LZN training (with class labels)
Input: Image decoder: $D_y$
       Image encoder: $E_y$ (used by $C_y$)
       Label anchors
       Number of iterations: $T$
       Batch size: $B$
1 for iteration $= 1, \ldots, T$ do
2     $y_1, \ldots, y_B \leftarrow$ sample a batch of training images
3     $z_1, \ldots, z_B \leftarrow C_y(y_1, \ldots, y_B)$
4     $\epsilon_1, \ldots, \epsilon_B \leftarrow$ sample Gaussian noise
5     $t_1, \ldots, t_B \leftarrow$ sample timesteps
6     $\xi_i \leftarrow (1 - t_i)\epsilon_i + t_i y_i$
7     Train using $\mathrm{Align}$ on $\{y_1, \ldots, y_B\}$ with the label anchors, and the RF loss on $D_y(\xi_i; z_i)$ (a weighted loss between
the two)
Algorithm 7: LZN generation (unconditional)
Input: Image decoder: $D_y$
1 $\xi \leftarrow$ sample Gaussian noise
2 $z \leftarrow$ sample Gaussian noise (the LZN latent)
3 Output $D_y(\xi; z)$

Algorithm 8: LZN generation (conditional)
Input: Image decoder: $D_y$
       Label set: $c_1, \ldots, c_n$
       Class ID: $k$
1 $\xi \leftarrow$ sample Gaussian noise
2 $z \leftarrow$ sample from the latent zone of class $c_k$ (computed from $\{c_1, \ldots, c_n\}$)
3 Output $D_y(\xi; z)$

Figure 21: Generation processes of LZN (with class labels). LZN can
simultaneously support unconditional and conditional generation. Left: unconditional generation,
where the LZN latent is drawn from the prior Gaussian distribution, which is exactly the same as
Fig. 8. Right: conditional generation, where the LZN latent is drawn from the latent zone of the
corresponding class. The changes on top of unconditional generation are highlighted in gray.
distribution. To address this, we modify the $\pi_1$ distribution when computing $\mathrm{FM}_x(\cdot)$ to be a
mixture of Dirac delta functions centered at $E_x(x_i)$, with weights corresponding to the fraction of
class-$x_i$ samples in the batch. During testing, we revert to a uniform prior, as the true class distribution
of the batch is not available.
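A sketch of the batch-dependent weighting is below; it only shows how the mixture weights over the label anchors are derived from the labels in the current batch (the uniform prior used at test time corresponds to equal weights).

# Sketch of the class-frequency weights for the pi_1 mixture during training.
import torch

def batch_anchor_weights(batch_labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    counts = torch.bincount(batch_labels, minlength=num_classes).float()
    return counts / counts.sum()  # weight of each class anchor in pi_1

# Example: labels [0, 0, 1, 2] over 3 classes -> weights [0.50, 0.25, 0.25].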
Objective. We use the alignment objective without the logarithm (Eq. (2)). We also tried the version with the log (Eq. (5)) and
did not observe a large difference in results.
E.3 More Experimental Settings
Datasets. We use the CIFAR10 dataset.
Metrics. In addition to the generation metrics in § C.3, we evaluate CIFAR10 classification accuracy.
The accuracy is evaluated on the test set.
Sampler. Same as in § C.3.
Hyperparameters. …

Algorithm 9: LZN classification
Input: Image encoder: $E_y$ (used by $C_y$)
       Label decoder: $D_c$
       Images: $y_1, \ldots, y_B$
1 $z_1, \ldots, z_B \leftarrow C_y(y_1, \ldots, y_B)$
2 $\hat{c}_i \leftarrow D_c(z_i)$
Table 7: FID with different implementations for conditional image generation on CIFAR10. “CM”
denotes consistency models [76]; “RF” denotes Rectified Flow [53]; “clean” denotes clean FID [64].
The best results are in gray boxes.

Algo.       FID (clean)↓   FID (RF)↓   FID (CM)↓
RF              2.85          2.50        2.47
RF+LZN          2.70          2.42        2.40











Computation cost. Excluding evaluation (i.e., only counting the
computation cost of model training), each RF+LZN experiment takes 31 hours on 16 A100 (40 GB)
GPUs.
E.4 More Results
Generated images. Figs. 22 and 23 show the generated images of RF and RF+LZN for conditional generation on CIFAR10.
Ablation studies on FID implementation. Similar to § C.4, we present the FID scores using three
different implementations in Tab. 7. We see that, while the numbers are different, the relative ranking
across all three implementations is consistent. In particular, RF+LZN achieves the best FID in all
implementations.
Ablation studies on sampling steps. Similar to § C.4, we use the Euler
sampler with varying numbers of sampling steps. As shown in Fig. 24, RF+LZN generally achieves
better FID than the RF baseline across most settings.
Ablation studies on classification techniques. We use two techniques to improve
the classification results.

•Recall that latent computation (Eq. (1)) includes randomness from $\epsilon_i$ because each sample
corresponds to a latent zone, not a single point. Empirically, for classification tasks, using the
“center” of the latent zone yields better performance. Concretely, we set α = 0 (§ A.2) when
computing latents. This is intuitive, as the center is likely farther from zone boundaries and better
represents the sample.

•As discussed in § A.3, during training we use the minibatch approximation when computing
latents for efficiency. However, during inference, where gradient computation is unnecessary and
thus the overhead of large batch sizes is less critical, we can use a larger batch size to improve
performance.

Figure 22: Generated images of RF on CIFAR10 (conditional generation). Every 5 rows correspond
to one class in CIFAR10.
Fig. 25 shows that increasing the batch size and decreasing α improve the classification accuracy.
The best setting improves the default setting (batch size 2000, α = 1.0) by 2.9%.
Ablation studies on latent alignment hyperparameter. In the latent alignment objective (§ 2.2.2), we introduced
a hyperparameter $u$ that controls how many time steps are excluded from the latent alignment
objective. In our main experiments, we set $u = 20$ (out of a total of 100 steps). Here, we conduct
an ablation study by reducing $u$ to 5 (i.e., 4× smaller). The results are shown in Tab. 8. We can see
that $u$ does not affect the results much, and the performance remains better than the baseline RF
across most metrics. This is expected. Unlike common hyperparameters (such as loss weights) that
influence the optimal solution, $u$ does not alter the optimal solution, which is the perfect alignment

Figure 23: Generated images of RF+LZN on CIFAR10 (conditional generation). Every 5 rows corre-
spond to one class in CIFAR10.
between two latent zones. Instead, this parameter is introduced solely to help avoid getting stuck
in local optima (§ 2.2.2). We expect that any small but non-zero value of $u$ should be sufficient in
practice.

Figure 24: FID vs. number of sampling steps (NFE) in the Euler sampler on CIFAR10 (conditional
generation). RF+LZN outperforms RF in most cases.

Figure 25: Classification accuracy (Top-1, %) with different hyperparameters on CIFAR10. Generally, increasing
the batch size and decreasing α improve the accuracy.
                 α = 1.0     0.8      0.6      0.4      0.2      0.0
Batch 2000        91.57    93.35    94.16    94.30    94.29    94.33
Batch 4000        92.69    93.85    94.25    94.33    94.34    94.40
Batch 8000        93.70    94.24    94.36    94.31    94.40    94.45
Batch 10000       93.85    94.28    94.44    94.47    94.45    94.44
Table 8: Conditional image generation quality and classification accuracy on CIFAR10. The best
results are in gray boxes.

Algo.              FID↓   sFID↓   IS↑    Prec.↑   Rec.↑   CMMD↓   Acc. (%)↑
RF                 2.47   4.05    9.77    0.71     0.58     …        …
RF+LZN (u = 20)    2.40   3.99    9.88    0.71     0.58     …      94.47
RF+LZN (u = 5)     2.39   3.99     …      0.71     0.58    0.36      …

F Extended Discussions on Related Work
[16] proposes to conduct flow matching in the latent space. However, it has quite different goals and
techniques from ours.

•Goals. The goal of [16] is to improve generation. In contrast, our goal is more ambitious: to
develop a unified framework that supports generative modeling, representation learning, and classification.
This broader scope requires a different design philosophy and technical approach, as detailed
next.

•Techniques. [16] applies flow matching to the latent space of a pre-trained
autoencoder, which is reasonable when focusing solely on generation. However, such a latent
space is not structured around individual samples or labels, limiting its suitability for classification
and compact representation learning. To support our broader objectives, we introduce several
novel techniques:

–Match a discrete anchor distribution to the continuous Gaussian prior, as opposed to the continuous-
to-continuous distribution matching in [16].

–Use an evolving latent space, since our encoder and decoder are trained end-to-end, as opposed
to using a fixed pre-trained autoencoder and fixed latent space in [16].

–Numerically solve the FM ODE to compute latents during training, which is not needed
in [16].

–Latent alignment (§ 2.2.2), which has no counterpart in [16].
G Extended Discussions on Limitations and Future Work
Inference efficiency. While the training cost of LZN might be high, at
inference time, LZN is often as efficient as existing approaches.

For generative modeling (§ 3), we do not need to compute the latent during inference. Instead,
latents are sampled from the Gaussian prior and passed directly to the decoder, making the
generation speed comparable to the base model.

For representation learning (§ 4), we find that dropping the final encoder layers during inference
improves performance (§ D.4), similar to the observation in prior contrastive learning methods
[10]. In this case, inference involves simply passing an image through the encoder, without the
latent computation process (§ 2.2.1), just like in traditional contrastive learning methods.
Training efficiency. The training of LZN can be costly, as the latent computation scales quadratically with the
batch size. Notably, this is also the case for many contrastive learning methods, including the seminal
works MoCo [28] and SimCLR [10], which compute pairwise similarities between all examples in a
batch.
The parallel between LLM training and LZN training. We note an interesting parallel between
the training of LLMs and LZN. Specifically, in LLM training, computing attention weights requires
$O(c^2 d v)$, where $c$ is the context length, $d$ is the attention dimension, and $v$ is the number of layers.
In LZN, computing the latents (§ 2.2.1) requires $O(n^2 q r)$, where $n$ is the number of samples in a
batch, $q$ is the latent dimension, and $r$ is the number of solver steps. The two sets of parameters play analogous roles:
•the context length ($c$) corresponds to the batch size ($n$);
•the attention dimension ($d$) corresponds to the latent dimension ($q$);
•the number of layers ($v$) corresponds to the number of solver steps ($r$).
Not only do these parameter pairs affect the time complexity in similar ways, but their computation
flows are also analogous: in LLMs, the pairwise inner product of token features is computed to derive
attention weights, and these weights are computed sequentially across layers. Similarly, inLZN, the
pairwise distances between intermediate anchor points of samples are computed to derive velocity,
and this velocity is updated sequentially across solver steps.
While LLM training is known to be computationally expensive, recent advances have significantly
improved its efficiency. Given the structural similarities, we expect that such advances in LLM
training could be adapted to enhance the training efficiency of LZN.

UsingLZNsolely to implement generative modeling. LZNcan be used solely for generative
modeling. By construction (§ 3), if the decoder is trained to map latents to the corresponding data
perfectly, then the generative distribution ofLZNis exactly
$\frac{1}{n}\sum_{i=1}^{n} \delta_{x_i}$, i.e., the empirical
distribution of the training set. We explored this approach in our early experiments. It performs well
on simple datasets such asMNIST[18], but generates blurry images on more complex datasets such
asCIFAR10. We hypothesize that this may be due to the minibatch approximation (§ A.3), which can
break the disjoint latent property, and/or the strict requirement that latent zones have no gaps between
them. We leave a deeper exploration of this direction to future work.
Societal impacts. Since LZN can be used to improve ML models, it has the potential for both
beneficial and harmful applications. Positive use cases include creative content generation and
improved information retrieval, while negative applications may involve the creation of fake or
misleading content.