Extracted Text

2308.12350.pdf

Diffusion-based Image Translation with Label Guidance for Domain Adaptive
Semantic Segmentation
Duo Peng
1
Ping Hu
2
Qiuhong Ke
3
Jun Liu
1,*
1
Singapore University of Technology and Design
2
Boston University
3
Monash University
duopeng@mymail.sutd.edu.sg,pinghu@bu.edu,qiuhong.ke@monash.edu,junliu@sutd.edu.sg
Abstract
Translating images from a source domain to a target do-
main for learning target models is one of the most com-
mon strategies in domain adaptive semantic segmentation
(DASS). However, existing methods still struggle to preserve
semantically-consistent local details between the original
and translated images. In this work, we present an innova-
tive approach that addresses this challenge by using source-
domain labels as explicit guidance during image transla-
tion. Concretely, we formulate cross-domain image transla-
tion as a denoising diffusion process and utilize a novel Se-
mantic Gradient Guidance (SGG) method to constrain the
translation process, conditioning it on the pixel-wise source
labels. Additionally, a Progressive Translation Learning
(PTL) strategy is devised to enable the SGG method to work
reliably across domains with large gaps. Extensive exper-
iments demonstrate the superiority of our approach over
state-of-the-art methods.
1. Introduction
Semantic segmentation is a fundamental computer vision
task that aims to assign a semantic category to each pixel
in an image. In the past decade, fully supervised methods
[47,,,] have achieved remarkable progress in se-
mantic segmentation. However, the progress is at the cost
of a large number of dense pixel-level annotations, which
are significantly labor-intensive and time-consuming to ob-
tain [3,]. Facing this problem, one alternative is to lever-
age synthetic datasets, e.g., GTA5 [57] or SYNTHIA [60],
where the simulated images and corresponding pixel-level
labels can be generated by the computer engine freely. Al-
though the annotation cost is much cheaper than manual
labeling, segmentation models trained with such synthetic
datasets often do not work well on real images due to the
large domain discrepancy.
*Corresponding AuthorExisting Method
Ours
Source Semantics
①
②
③
①
①
②
②
③
③
① ②
②
②
①
①
③
③
③
RoadSide.WBuildWallFencePoleTr. LightTr. SignVeg.Terrain
SkyPersonRiderCarTruckBusTrainMotorBikeN/A
guide
Semantic
Consistency
(a) Architecture Comparison
(b) Results Comparison
Existing Methods Ours
Translation
Model
Translation
Model
Figure 1.(a)Architecture comparison between different image
translation methods for DASS. Existing methods mainly translate
images based on the input image. Our method further introduces
the source semantic label to guide the translation process, making
the translated content semantically consistent with the source la-
bel (as well as the source image).(b)Results comparison in detail.
First row: the source semantic label. Second row: the translated
image of an existing method [16], which is a state-of-the-art image
translation method based on GAN. Third row: the translated image
of our method. The translated image of the existing method shows
unsatisfactory local details (e.g., in case①of the second row, the
traffic light and its surrounding sky are translated into building).
In contrast, our method preserves details in a fine-grained manner.
The task of domain adaptive semantic segmentation
(DASS) aims to address the domain discrepancy problem
by transferring knowledge learned from a source domain
to a target one with labels provided for the source domain

only. It is a very challenging problem as there can be a large
discrepancy (e.g., between synthetic and real domains), re-
sulting in great difficulties in knowledge transfer. In DASS,
one type of methods [24,,,,,,,,,]
focuses on aligning the data distributions from different do-
mains in the feature space, which obtains promising perfor-
mance, while another type of approaches [86,,,,
38,] aims to generate pseudo labels for unlabeled tar-
get images, also making great achievements. Besides above
approaches, image translation [85] is also an effective and
important way to handle the DASS problem. It focuses on
translating the labeled source images into the target domain,
then using the translated images and source labels to train
a target-domain segmentation model. In the past few years,
lots of DASS studies [54,,,,] have been conducted
based on image translation. This is because image trans-
lation allows for easy visual inspection of the adaptation
method’s results, and the translated images can be saved as
a dataset for the future training of any model, which is con-
venient. Existing image translation methods mainly adopt
generative adversarial networks (GANs) [19] to translate
images across a large domain gap. However, due to the
intractable adversarial training, GAN models can be notori-
ously hard to train and prone to show unstable performance
[62,]. Furthermore, many studies [51,,] have also
shown that GAN-based methods struggle with preserving
local details, leading to semantic inconsistencies between
the translated images and corresponding labels, as shown
in the second row of Fig.
model on such mismatched image and label pairs will cause
sub-optimal domain adaptation performance. Facing these
inherent limitations, the development of GAN-based image
translation methods is increasingly encountering a bottle-
neck. Yet, there is currently little research exploring alter-
native approaches to improve image translation for DASS
beyond GANs.
Denoising Diffusion Probabilistic Models (DDPMs),
also known as diffusion models, have recently emerged as a
promising alternative to GANs for various tasks, such as im-
age generation [11,], image restoration [73,], and im-
age editing [1,], etc. Diffusion models have been shown
to have a more stable training process [11,,]. In-
spired by its strong capability, we propose to exploit the
diffusion model for image translation in DASS. Specifi-
cally, based on the diffusion technique, we propose a label-
guided image translation framework to preserve local de-
tails. We observe that existing image translation methods
[23,,,,] mainly focus on training a translation
model with image data alone, while neglecting the utiliza-
tion of source-domain labels during translation, as shown
in Fig.
the semantic category of each pixel, introducing them to
guide the translation process should improve the capability
of the translation model to preserve details. A straightfor-
ward idea of incorporating the source labels is to directly
train a conditional image translation model, where the trans-
lated results are semantically conditioned on the pixel-wise
source labels. However, it is non-trivial since the source
labels and target images are not paired, which cannot sup-
port the standard conditional training. Recent advances
[11,,] have shown that a pre-trained unconditional
diffusion model can become conditional with the help of
the gradient guidance [11]. The gradient guidance method
can guide the pre-trained unconditional diffusion model to
provide desired results by directly affecting the inference
process (without any training). In light of this, we propose
to first train an unconditional diffusion-based image trans-
lation model, and then apply gradient guidance to the im-
age translation process, making it conditioned on source la-
bels. However, we still face two challenges: (1) Traditional
gradient guidance methods generally focus on guiding the
diffusion model based on image-level labels (i.e., classifi-
cation labels), while the DASS problem requires the im-
age translation to be conditioned on pixel-level labels (i.e.,
segmentation labels). (2) The gradient guidance methods
typically work within a single domain, whereas the DASS
task requires it to guide the image translation across dif-
ferent domains. To address the first challenge, we pro-
pose a novel Semantic Gradient Guidance method (SGG),
which can precisely guide the image translation based on
fine-grained segmentation labels. To tackle the second chal-
lenge, we carefully design a Progressive Translation Learn-
ing strategy (PTL), which is proposed to drive the SGG
method to reliably work across a large domain gap. With
the help of SGG and PTL, our framework effectively han-
dles the image translation for DASS in a fine-grained man-
ner, which is shown in the third row of Fig.
In summary, our contributions are three-fold. (i) We pro-
pose a novel diffusion-based image translation framework
with the guidance of pixel-wise labels, which achieves im-
age translation with fine-grained semantic preservation. To
the best of our knowledge, this is the first study to exploit
the diffusion model for DASS. (ii) We devise a Seman-
tic Gradient Guidance (SGG) scheme accompanied with a
Progressive Translation Learning (PTL) strategy to guide
the translation process across a large domain gap in DASS.
(iii) Our method achieves the new state-of-the-art perfor-
mance on two widely used DASS settings. Remarkably, our
method outperforms existing GAN-based image translation
methods significantly on various backbones and settings.
2. Related Work
Domain adaptive semantic segmentation (DASS)is
a task that aims to improve the semantic segmentation
model’s adaptation performance in new target scenarios
without access to the target labels. Current DASS meth-

ods [24,,,,] have pursued three major direc-
tions to bridge the domain gap. The first direction aims
to align data distributions in the latent feature space. With
the advent of adversarial learning [14], many methods
[24,,,,,,,,,,,,] adopt the
adversarial learning to align feature distributions by fool-
ing a domain discriminator. However, the model based on
adversarial learning can be hard to train [82,,], as it
requires carefully balancing the capacity between the fea-
ture encoder (i.e., generator) and the domain discriminator,
which is intractable. The second direction considers explor-
ing the network’s potential of self-training on the target do-
main [81,,,,,,,]. Self-training methods
generally use the source-trained segmentation model to pre-
dict segmentation results for target-domain images. They
typically select the results with high confidence scores as
pseudo labels to fine-tune the segmentation model. These
works mainly focus on how to decide the threshold of con-
fidence scores for selecting highly reliable pseudo labels.
However, self-training is likely to introduce inevitable la-
bel noise [44,,], as we cannot fully guarantee that
the selected pseudo labels are correct especially when the
domain gap is large. The third direction is inspired by
the image translation technique [85], aiming to transfer the
style of source labeled images to that of the target do-
main [8,,,,,]. Inspired by the advances of
the generative adversarial networks (GANs), they mostly
adopt GANs as image translation models. While remark-
able progress has been achieved, many studies [51,,]
have indicated that GAN-based image translation methods
tend to show limited capacity for preserving semantic de-
tails. Unlike existing methods that handle image translation
based on the input source image, our method enables the
model to translate images further conditioned on the source
labels, which enables fine-grained semantic preservation.
Denoising Diffusion Probabilistic Models (DDPMs),
also namely diffusion models, are a class of generative mod-
els inspired by the nonequilibrium thermodynamics. Dif-
fusion models [64,] generally learn a denoising model
to gradually denoise from an original common distribution,
e.g., Gaussian noise, to a specific data distribution. It is first
proposed by Sohl-Dickstein et al. [64], and has recently
attracted much attention in a wide range of research fields
including computer vision [11,,,,,,,,,],
nature language processing [2,,], audio processing
[28,,,,] and multi-modality [33,,,].
Based on the diffusion technique, Dhariwal et al. proposed
a gradient guidance method [11] to enable pre-trained un-
conditional diffusion models to generate images based on a
specified class label. Although effective, the gradient guid-
ance method generally works with image-level labels and
is limited to a single domain, making it challenging to be
adopted for DASS. To address these issues, in this paper,
we propose a novel Semantic Gradient Guidance (SGG)
method and a Progressive Translation Learning (PTL) strat-
egy, enabling the utilization of pixel-level labels for label-
guided cross-domain image translation.
3. Revisiting Diffusion Models
As our method is proposed based on the Denoising Dif-
fusion Probabilistic Model (DDPM) [64,], here, we
briefly revisit DDPM. DDPM is a generative model that
converts data from noise to clean images by gradually de-
noising. A standard DDPM generally contains two pro-
cesses: a diffusion process and a reverse process. The
diffusion process contains no learnable parameters and it
just used to generate the training data for diffusion model’s
learning. Specifically, in the diffusion process, a slight
amount of Gaussian noise is repeatedly added to the clean
imagex0, making it gradually corrupted into a noisy image
xTafterT-step noise-adding operations. A single step of
diffusion process (fromxt−1toxt) is represented as fol-
lows:
ψψ♮labelψıeq.ddpm⌢q℘ψq↼x⌢tψȷψx⌢ıt-1℘↽ψ/ψ♮mathcalψıN℘↼♮sqrtψı1-Φetaψ⌢t℘x⌢ıt-1℘,ψΦetaψ⌢tψΨextbfψıI℘↽,ψ(1)
whereβtis the predefined variance schedule. Given a clean
imagex0, we can obtainxTby repeating the above diffu-
sion step fromt= 1tot=T. Particularly, an arbitrary
imagextin the diffusion process can also be directly calcu-
lated as:
ψψ♮labelψıeq.ddpm⌢diffuse℘ψx⌢ıt℘ψ/ψ♮sqrtψıΦarψı♮alphaψ℘⌢t℘ψx⌢0ψ⇁ψ♮sqrtψı1-Φarψı♮alphaψ℘⌢t℘ψ♮epsilonψ,ψ(2)
whereαt= 1−βt,¯αt=
Q
t
s=1
αsandϵ∼ N(0, I). Based
on this equation, we can generate images fromx0toxTfor
training the diffusion model.
The reverse process is opposite to the diffusion process,
which aims to train the diffusion model to gradually convert
data from the noisexTto the clean imagex0. One step of
the reverse process (fromxttoxt−1) can be formulated as:
ψψ♮labelψıeq.ddpm⌢reverse℘ψp⌢Ψhetaψ↼x⌢ıt-1℘ȷx⌢t↽ψ/ψ♮mathcalψıN℘↼ψ♮muψ⌢Ψhetaψ↼x⌢t,t↽,ψ♮Sigmaψ⌢Ψhetaψ↼x⌢t,t↽↽▷ψ(3)
Given the noisy imagexT, we can denoise it to the clean
imagex0by repeating the above equation fromt=Tto
t= 1. In Eq., the covariance Σθis generally fixed as
a predefined value, while the meanµθis estimated by the
diffusion model. Specifically, the diffusion model estimates
µθvia a parameterized noise estimatorϵθ(xt, t)which is
usually a U-Net [59]. Given xtas the input, the diffusion
model will output the value ofµθas follows:
ψψ♮labelψıeq.ddpm⌢mu℘ψ♮muψ⌢Ψhetaψ↼x⌢t,t↽ψ/ψβracψı1℘ı♮sqrtψı1-Φetaψ⌢t℘℘♮Bigψ↼x⌢tψ-ψβracψıΦetaψ⌢t℘ı♮sqrtψı1-Φarψı♮alphaψ℘⌢t℘℘♮epsilonψ⌢Ψhetaψ↼x⌢t,t↽♮Bigψ↽,ψ(4)
For training the diffusion model, the images generated in
the diffusion process are utilized to train the noise estimator

n-step Diffusion Process n-step Reverse Process
Source Image x0 xn0
x
 Target Image Figure 2. Illustration of the Diffusion-based Image Translation
Baseline. Starting from a source imagex0, the diffusion process
disturbs it into a noisy imagexn, which is then denoised into the
target imagex
ϕ
0
using the reverse process performed by a target-
pre-trained diffusion model.
ϵθto predict the noise added to the image, which can be
formulated as:
ψψ♮labelψıeq.ddpm⌢loss℘ψL⌢ıDM℘↼x↽/♮leftψ♮ȷ♮epsilonψ⌢Ψhetaψ↼x⌢t,t↽-ψ♮epsilonψγightψ♮ȷψ⌣2,ψ(5)
whereϵ∼ N(0,I)is the noise added to the imagext.
4. Method
We address the problem of domain adaptive semantic
segmentation (DASS), where we are provided the source
imagesX
S
with corresponding source labelsY
S
and the
target imagesX
T
(without target labels). Our goal is to train
a label-guided image translation model to translate source
images into target-like ones. Then we can use the trans-
lated data to train a target-domain segmentation model. To
achieve this goal, we first train a basic image translation
model based on the diffusion technique, which is the base-
line of our approach (Sec.). Then, we impose our SGG
method to the baseline model, making it conditioned on the
pixel-wise source labels (Sec.). We further propose a
PTL training strategy to enable our SGG to work across a
large domain gap (Sec.). Finally, we detail the training
and inference procedures (Sec.).
4.1. Diffusion-based Image Translation Baseline
We start with the baseline of our framework, which is
a basic image translation model based on diffusion mod-
els. Before translation, we use the (unlabeled) target do-
main imagesX
T
to pre-train a standard diffusion model
via Eq.. As the diffusion model is essentially a de-
noising model, we use the noise as an intermediary to link
the two large-gap domains. As shown in Fig., we first
leverage the diffusion process to addn-step noise onto the
source imagex0(x0∈ X
S
), making it corrupted intoxn,
and then use the pre-trained diffusion model to reverse (de-
noise) it fornsteps, obtainingx
ϕ
0
. Since the diffusion
model is pre-trained on the target domain, the reverse (de-
noising) process will convert the noise into target-domain
content, making the denoised imagex
ϕ
0
shown an appear-
ance closer to the target-domain images than the original
source-domain imagex0. Note that the diffusion model
solely trained on the target domain is able to denoise the
noisy image from a different source domain towards the
target domain, because the diffusion model’s training ob-
jective is derived from the variational bound on the negative
log-likelihoodE[−logpθ(X
T
)], indicating it learns to gen-
erate target-domain data from various distributions. This
property of diffusion model has been verified in [65] and
has been widely used in many studies [52,,].
4.2. Semantic Gradient Guidance (SGG)
The baseline model presented above handles image
translation in an unconditional manner, where denoising
from much noise however will freely generate new content
that may not be strictly consistent with the original image.
Inspired by the gradient guidance [11], we here propose a
novel SGG method to enable the unconditional baseline to
be conditioned on source labels.
Our SGG consists of two guidance modules, namely
the Local Class-regional Guidance (LCG) and the Global
Scene-harmonious Guidance (GSG). They are proposed to
affect the reverse process of the baseline model, aiming to
guide the denoising process to generate content semanti-
cally consistent with source labels. Specifically, the LCG
is proposed to preserve local details, while the GSG is de-
signed to enhance the global harmony of the denoised im-
ages. As shown in Fig.
and GSG modules at each reverse step, aiming to provide
both local-precision and global-harmony guidance.
(1) Local Class-regional Guidance (LCG)
The key idea of LCG is to use a pixel classifier to cal-
culate loss to measure if the generated (denoised) pixel has
the same semantic as the source semantic label, and then
utilize the loss gradient as the guidance signal to guide the
pixel to generate same-semantic content. Assuming that we
already have a pre-trained segmentation modelg(i.e., the
pixel classifier) that can perform pixel classification well
over domains (note that we will address this assumption in
Sec.), based on g, LCG can carry out guidance as fol-
lows.
Here, we usex
ϕ
k
to denote the input of the LCG module,
andx
ϕ
k−1
as its output, wherek∈ {n, n−2, n−4, ...}.
As shown in Fig. y∈ Y
S
,
we can leverageyto obtain a binary maskmcthat indicates
the region of thec-th class in the source image (c∈ Cand
Cdenotes all classes in the source dataset). Given the input
x
ϕ
k
, we can multiplyx
ϕ
k
withmcto obtain the pixels inside
thec-th class region, which can be fed into the pre-trained
segmentation modelgto predict the label probabilities of
the pixels. The output can be used to calculate the cross-
entropy loss with the source label, aiming to use the loss
gradient to guide the generation of pixels in thec-th class
region. As shown in Fig. gto

x0 xn
...
n-step diffusion process1n
x

− 2n
x

− 3n
x

− 0
x

LCG GSGk
x

Pre-trained
Diffusion
Model

Label y
mc
Pre-trained
Segmentation
Model
Llocal
ymc
Lglobal
Label yk
x
 1k
x

−
~
adjust
x Lglobal

adjust
x Llocal
~1,kc
x

−
Input:
Output:k
x

Input:
Output:k
x
 1k
x

−
LCG GSG
(b) Details of LCG (c) Details of GSG
(a) Baseline with SGG
Pre-trained
Segmentation
Model
Pre-trained
Diffusion
Model
one step of reverse process
g g1 1,k c k c
c
x m x

−−
= ˆ

 (,ˆ)

 ˆ

 (,ˆ)


GSG Figure 3.(a)The overall architecture of our baseline model with Semantic Gradient Guidance (SGG). SGG alternately adopts the Local
Class-regional Guidance (LCG) and the Global Scene-harmonious Guidance (GSG) at each reverse step of the baseline model.(b)In LCG,
we first obtain the loss gradient for class regionmc, and then use the gradient to adjust the diffusion model’s outputµθfor generating label-
related content. For clarity, we only show the workflow of one class region. We obtain the guided results for every class region and finally
combine them together viax
ϕ
k−1
=
P
c
mc⊙x
ϕ
k−1,c
.(c)In GSG, we calculate the gradient for the whole image and then use the gradient
to adjustµθfor a global-harmony appearance.
compute the cross-entropy lossLlocalfor pixels in the local
c-th class region, which is formulated as:
ψψL⌢ılocal℘ψ↼x⌣♮phiψ⌢ık℘,ψy,ψc↽ψ/ψL⌢ıce℘♮Bigψ↼gψ↼x⌣♮phiψ⌢ık℘ψ♮odotψm⌢c↽,ψyψ♮odotψm⌢c♮Bigψ↽,ψ♮labelψıeq.local⌢loss℘ψ(6)
whereLceis the cross-entropy loss,⊙denotes the element-
wise multiplication andy⊙mcis the source label of the
classc.
In the standard reverse process, given the noisy image
x
ϕ
k
, the diffusion model denoises it by estimating the mean
valueµθ(Eq.) and then using it to obtain the denoised
imagex
ϕ
k−1
via sampling (Eq.). Therefore, the mean µθ
will determine the content of the denoised image. Based on
this, we guide the reverse process by using the computed
loss gradient∇
x
ϕ
k
Llocalto adjustµθ, as shown in Fig.
(b). As the loss gradient is computed based on the source
label, it can enable the generated content to correspond to
the label class. We follow the previous gradient guidance
method [11] to formulate the adjustment operation as:
ψψ♮hatψı♮muψ℘⌢Ψhetaψ↼x⌣♮phiψ⌢k,ψk,ψc↽ψ/ψ♮muψ⌢Ψhetaψ↼x⌣♮phiψ⌢k,ψk↽ψ⇁ψ♮lambdaψ♮Sigmaψ⌢ΨhetaψΩablaψ⌢ıx⌣♮phiψ⌢k℘ψL⌢ılocal℘↼x⌣♮phiψ⌢ık℘,ψy,ψc↽,ψ♮labelψıeq.mu℘ψ(7)
whereµθis the original output of the pre-trained diffusion
model ,Σθis a fixed value already defined in Eq. λis
a hyper-parameter which controls the guidance scale.
Next, we adopt the adjusted mean valueˆµθto sample the
denoised image:
ψψx⌣♮phiψ⌢ık-1,c℘ψ♮simψ♮mathcalψıN℘↼♮hatψı♮muψ℘⌢Ψhetaψ↼x⌣♮phiψ⌢k,ψk,ψc↽,ψ♮Sigmaψ⌢Ψhetaψ↽▷ψ♮labelψıeq.fake⌢guided℘ψ(8)
With the above process, the guided denoised image
x
ϕ
k−1,c
can preserve the semantics for pixels in thec-th class
region. We parallelly perform the above operations for dif-
ferentc, and combine different guided regions into a com-
plete imagex
ϕ
k−1
:
ψψx⌣♮phiψ⌢ık-1℘/♮sumψ⌢ıc℘m⌢ıc℘♮odotψx⌣♮phiψ⌢ık-1,c℘,ψ♮labelψıeq.real⌢guided℘ψ(9)
wherex
ϕ
k−1
is the output of the LCG module.
(2) Global Scene-harmonious Guidance (GSG)
The LCG is performed locally, achieving a per-category
guidance for image translation. As a complement, we also
propose a global guidance module to enhance the global
harmony of the translated image, which further boosts the
translation quality.
As shown in Fig.
lows the similar operation flow as LCG, i.e., “computing
loss gradient→adjusting mean value→sampling new im-
age”. Givenx
ϕ
k
(k∈ {n−1, n−3, n−5, ...}) as the input
of the GSG module, the loss function is computed based on
the whole image to achieve the global guidance:
ψψL⌢ıglobal℘ψ↼x⌣♮phiψ⌢ık℘,ψy↽ψ/ψL⌢ıce℘Φigψ↼ψgψ↼x⌣♮phiψ⌢ık℘↽,ψyψΦigψ↽▷ψ♮labelψıeq.global⌢loss℘ψ(10)
In SGG, we apply the LCG or GSG at each reverse
(denoising) step of the baseline model. By doing so, we
can gradually apply the guidance in a step-wise manner,
which is suitable for the step-wise translation process of the
baseline model. To achieve both local-precise and global-
harmony guidance, we incorporate an equal number of LCG
and GSG modules into our approach. Specifically, since the
baseline model hasnreverse steps, we use
n
2
LCG modules

and
n
2
GSG modules. As for the arrangement of LCG and
GSG modules, we design several plausible options: (1) first
using
n
2
LCG modules and then
n
2
GSG modules, (2) first
using
n
2
GSG modules and then
n
2
LCG modules, (3) alter-
nating between LCG and GSG, and (4) randomly mixing
the two types of modules. We observe that all options per-
form at the same level. Since the alternating option shows a
slightly better performance than others, we adopt the alter-
nating scheme for SGG.
4.3. Progressive Translation Learning (PTL)
The proposed SGG method is based on the assumption
that we already have a segmentation model trained over do-
mains. That is, if we want to guide the translation towards
the target domain, we need to have a segmentation model
that is pre-trained on the target domain. However, in DASS,
we only have labeled images in the source domain for train-
ing a source-domain segmentation network. To enable the
SGG to reliably work for translation towards the target do-
main with the source-domain segmentation model, we pro-
pose a Progressive Translation Learning (PTL) strategy.
As mentioned before, our baseline model can directly
translate images by first diffusing fornsteps and then re-
versing fornsteps (see Fig.). We formulate this pro-
cess as:X
n
=ϕ
n
(X
S
), whereX
S
denotes all source-
domain images,ϕ
n
denotes the baseline model that han-
dlesn-step diffusion and reverse process,X
n
denotes the
translated output images which is related to the parameter
n. As the diffusion model can denoise the image towards
the target domain, the parameternessentially controls the
amount of target-domain information added to the source
image. Inspired by [15], we can flexibly change the do-
main of translated images by varying the value ofn. When
nis small, we can obtain translated imagesX
n
with mi-
nor differences to source-domain imagesX
S
. Conversely,
whennis large enough, the translated imagesX
n
can com-
pletely approach the target domain. By walking throughn
from 1 toN, our baseline model can generate a set of image
domains{X
n
|n= 1,2, ..., N}, whereX
N
approximately
follows the target-domain distribution. Since each step of
diffusion/reverse operation only adds/removes a very small
amount of noise, the translated images from neighboring
domains, i.e.,X
n
andX
n+1
, show very slight differences.
Specifically, we start with the source-domain segmenta-
tion modelg0, which is well-trained on source imagesX
S
with source labelsY
S
. As the domain gap between two
adjacent domains is very small, the segmentation network
trained on source domainX
S
can also work well for the ad-
jacent domainX
1
. Based on this property, we can useg0
to carry out SGG guidance for image translation of the next
domainX
1
. We formulate this operation in a general rep-
resentation, i.e., givengnwhich is well-trained onXn, we
can guide the translation ofXn+1as follows:
ψψ♮mathcalψıX℘⌣ın⇁1℘ψ/ψ♮mathrmψıSGG℘ψ↼♮phiψ⌣ın⇁1℘ψ↼♮mathcalψıX℘⌣S↽ψ,ψg⌢n↽,ψ♮labelψıeq.training⌢1℘ψ(11)
whereSGG(ϕ
n+1
, gn)denotes the SGG-guided baseline
model with(n+ 1)-step diffusion and reverse process,
which is equipped withgn.
Under theg0-based SGG guidance, the translated images
X
1
have fine-grained semantic consistency with source la-
belsY
S
. Then, we can use the guided imagesX
1
to fine-
tune the segmentation modelg0, making it further fit with
this domain. This operation is also formulated in a general
representation, i.e., given the guided imagesX
n+1
, we can
fine-tune the segmentation modelgnas follows:
ψψL⌢ıft℘/L⌢ıce℘Φigψ↼g⌢nψ↼♮mathcalψıX℘⌣ın⇁1℘ψ↽,ψ♮mathcalψıY℘⌣ıS℘ψΦigψ↽⇁L⌢ıce℘Φigψ↼g⌢nψ↼♮mathcalψıX℘⌣ıS℘ψ↽,ψ♮mathcalψıY℘⌣ıS℘ψΦigψ↽▷ψ♮labelψıeq.training⌢2℘ψ(12)
We name the fine-tuned segmentation model asgn+1. The
former loss component aims to use the guided imagesX
n+1
to adapt the segmentation model to the new domain. The
latter component is to train the segmentation model on the
source domain imagesX
S
. We empirically observe com-
bining both that augments the model’s learning scope leads
to slightly better results than using the former loss only.
After fine-tuning the segmentation modelg0onX
1
, we
can perform SGG-guided image translation again using the
fine-tuned model (namelyg1), obtaining guided imagesX
2
.
Then we can useX
2
to further fine-tune the segmentation
modelg1for adapting to a new domain. In this manner, we
can perform Eq. nincreasing
from1toN, finally obtaining the translated target-domain
imagesX
N
and target-fine-tuned segmentation modelgN.
We use the segmentation modelgNfor test-time inference.
Throughout the whole process of PTL, the segmentation
model always performs SGG guidance at the adjacent do-
main with little domain discrepancy, which is the key to
enabling the SGG to work across a large domain gap.
4.4. Training and Inference
Training.First, we train a standard diffusion model with
the unlabeled target-domain imagesX
T
via Eq.. After
that, we keep the pre-trained diffusion model frozen and use
it to denoise the source-diffused image, which forms our
basic image translation model, namely baseline modelϕ
n
(Sec.). Our baseline model ϕ
n
can translate source im-
agesX
S
into the imagesX
n
. The modelϕ
n
is flexible with
a variable parametern, which can control the domain of the
translated imagesX
n
. Whenn= 1, the translated images
are extremely close to the source domain. Whenn=N,
the translated images approximately approach the target do-
main. We use the source-domain labeled data{X
S
,Y
S
}to
train a segmentation modelg0. Then, we setn= 1for
the baseline modelϕ
n
and use it to translate source im-
agesX
S
into the adjacent-domain imagesX
1
. During the

translation process, we use the segmentation modelg0to
carry out SGG guidance (Eq.11), obtaining the guided im-
agesX
1
. Then, we use the guided images and source la-
bels{X
1
,Y
S
}to fine-tune the segmentation modelg0(Eq.
12), obtaining fine-tuned segmentation modelg1. We set
n= 2for the baseline modelϕ
n
, and useg1to carry out
SGG guidance, obtaining the guided imagesX
2
. Next, we
can use{X
2
,Y
S
}to execute the fine-tuning again. In this
way, we alternate the SGG guidance and the fine-tuning of
segmentation model withngrowing from 1 toN(i.e., PTL
strategy). Finally, we obtain the translated target-domain
imagesX
N
and the target-fine-tuned segmentation model
gN.Inference.After training, we directly use the segmen-
tation modelgNfor inference.
5. Experiment
5.1. Implementation Details
Datasets.As a common practice in DASS, we evaluate
our framework on two standard benchmarks (GTA5 [57] →
Cityscapes [10], and SYNTHIA [60] →Cityscapes [10]).
GTA5 and SYNTHIA provide 24,996 and 9400 labeled im-
ages, respectively. Cityscapes consists of 2975 and 500 im-
ages for training and validation respectively. Following the
standard protocol [24], we report the mean intersection over
Union (mIoU) on 19 classes for GTA5→Cityscapes and 16
classes for Synthia→Cityscapes.
Network Architecture.To make a fair and comprehen-
sive comparison, we test three typical types of networks as
the segmentation modelg: (1) DeepLab-V2 [3] architec-
ture with ResNet-101 [21] backbone; (2) FCN-8s [47] ar-
chitecture with VGG16 [63] backbone; (3) DAFormer [25]
architecture with SegFormer [75] backbone. All of them are
initialized with the network pre-trained on ImageNet [36].
For the diffusion model, we follow the previous work [22]
to adopt a U-Net [59] architecture with Wide-ResNet [80]
backbone. The diffusion model is trained from scratch with-
out loading any pre-trained parameters.
Parameter Setting.Following DDPM [22], we train the
diffusion model with a batch size of 4, using the Adam
[35] as the optimizer with learning rate as2×10
−5
and
momentum as 0.9 and 0.99. Specifically, for the hyper-
parameters of diffusion model, we still follow [22] to set
βtfromβ1= 10
−4
toβT= 0.02linearly andT= 1000.
The covarianceΣθis fixed and defined asΣθ=βtI. We
train the segmentation modelgby using the SGD optimizer
[32] with a batch size of 2, a learning rate of2.5×10
−4
and a momentum of 0.9. In SGG, we setλ= 80. In PTL,
we setN= 600. We pre-train the segmentation model on
the source domain for 20,000 iterations and fine-tune it on
other intermediate domains for 300 iterations each. Since
the discrepancy between adjacent domains is minor, fine-
tuning for 300 iterations is enough to effectively adapt to a
Table 1. Performance comparison in terms of mIoU (%). Three
backbone networks, ResNet-101 (Res), VGG16 (Vgg) and Seg-
Former (Seg), are used in our study. The results of the task “GTA5
→Cityscapes” (“SYNTHIA→Cityscapes”) are averaged over 19
(16) classes.∗denotes previous image translation methods.
Method Venue
GTA5→Cityscapes SYN.→Cityscapes
Res Vgg Seg Res Vgg Seg
Source only - 37.3 27.1 45.4 33.5 22.9 40.7
CyCADA [23] ∗ICML’18 42.7 35.4 - - - -
IIT [54] ∗ CVPR’18 - 35.7 - - - -
CrDoCo [6] ∗ CVPR’19 - 38.1 - - 38.2 -
BDL [44] ∗ CVPR’19 48.5 41.3 - - 33.2 -
CLAN [49] CVPR’19 43.2 36.6 - - - -
Intra [56] CVPR’20 45.3 34.1 - 41.7 33.8 -
SUIT [41] ∗ PR’20 45.3 40.6 - 40.9 38.1 -
LDR [76] ECCV’20 49.5 43.6 - - 41.1 -
CCM [40] ECCV’20 49.9 - - 45.2 - -
ProDA [81] CVPR’21 57.5 - - 55.5 - -
CTF [50] CVPR’21 56.1 - - 48.2 - -
CADA [77] ∗ WACV’21 49.2 44.9 - - 40.8 -
UPLR [71] ICCV’21 52.6 - - 48.0 - -
DPL [8] ICCV’21 53.3 46.5 - 47.0 43.0 -
BAPA [46] ICCV’21 57.4 - - 53.3 - -
CIR [16] ∗ AAAI’21 49.1 - - 43.9 - -
CRAM [82] TITS’22 48.6 41.8 - 45.8 38.7 -
ADAS [39] ∗ CVPR’22 45.8 - - - - -
CPSL [42] CVPR’22 55.7 - - 54.4 - -
FDA [78] CVPR’22 50.5 42.2 - - 40.0 -
DAP [30] CVPR’22 59.8 - - 59.8 - -
DAFormer [25] CVPR’22 - - 68.3 - - 60.9
HRDA [26] ECCV’22 - - 73.8 - - 65.8
ProCA [31] ECCV’22 56.3 - - 53.0 - -
Bi-CL [38] ECCV’22 57.1 - - 55.6 - -
Deco-Net [37] ECCV’22 59.1 - - 57.0 - -
Ours ICCV’23 61.9 48.1 75.3 61.0 44.2 66.5
new domain. As the SGG guidance needs the segmentation
modelgto classify for noisy and masked images, during the
training of the segmentation model, we also generate noisy
and masked images for data augmentation to train a robust
segmentation model.
5.2. Comparative Studies
We compared the results of our method with the state-
of-the-art methods on two DASS tasks, i.e., “GTA5→
Cityscapes” and “SYNTHIA→Cityscapes”, using three
different backbone networks: ResNet-101 [21], VGG16
[63] and SegFormer [75]. To ensure reliability, we run
our model five times and report the average accuracy. The
comparative results are reported in Tab.. Overall, our
method consistently gains the best performance across all
settings and backbones. Remarkably, when compared to
existing image translation methods (marked with∗), our
method brings 3.2%∼20.1% improvement on all back-
bones and settings, which demonstrates the superiority of
our diffusion-based approach.
5.3. Ablation Studies
Effect of Baseline model.In Sec., we propose a
Diffusion-based Image Translation Baseline, i,e, the base-
line modelϕ
n
. To study the effectiveness of the baseline
modelϕ
n
, we setn=Nto directly translate source images

Source-Domain
Image
Source Semantic
Label n=100 n=200 n=400 n=500
Images from Intermediate Domains
n=300
Translated Target-
Domain Image Figure 4. The translated images of our approach (including images from intermediate domains).ndenotes the serial number of intermediate
domains. We can see that the semantic knowledge is transferred smoothly and precisely.
Table 2. Ablation studies of different components on the task
“GTA5→Cityscapes” (G→C) and “SYNTHIA→Cityscapes”
(S→C) with backbone Res-101.
Model G →C S→C
(a) Baseline Model 48.3 45.1
(b) Ours w/o SGG 55.4 52.7
(c) Ours w/o PTL 58.3 56.5
(d) Ours w/o LCG 59.8 59.1
(e) Ours w/o GSG 60.5 59.4
(f) Ours (full) 61.9 61.0
into the target domain. After translation, we use the trans-
lated images and source images, along with the source la-
bels, to train a segmentation model for inference. As shown
in Tab.
mIoU on two DASS tasks, respectively. This performance
is comparable to the results of previous image translation
methods (marked with∗in Tab.), which demonstrates the
effectiveness of our baseline model.
Effect of SGG.To demonstrate the effectiveness of SGG,
during the PTL training, we remove the SGG guidance and
directly use the unguided images of each intermediate do-
main to fine-tune the segmentation model. By comparing
the results in Tab.
ing SGG, our framework shows a significant improvement
(more than 6%) on both tasks, clearly demonstrating the ef-
fectiveness of SGG in preserving semantics.
Effect of PTL.The key design of PTL is to generate in-
termediate domains to decompose the large gap into small
ones. To investigate the effectiveness of PTL, we construct
an ablation experiment by removing the generation of in-
termediate domains, i.e., we setn=Nfor the baseline
modelϕ
n
and use it to translate source images into the tar-
get domain, i.e.,X
N
=ϕ
N
(X
S
). During the translation,
we directly use the source-trained segmentation model to
execute SGG guidance to guide the translation process. Fi-
nally, we use the guided target-domain images to fine-tune
the segmentation model. From Tab.
see that our framework without PTL strategy consistently
incurs an obvious performance drop on both DASS tasks,
which demonstrates the effectiveness of PTL strategy. We
show in Fig.
corresponding intermediate domains. We can see that the
Figure 5. Parameter analysis onNandλ.
image domain is gradually transferred, and all images con-
tain semantics that are well preserved in a fine-grained man-
ner, which demonstrates that our PTL can enable the SGG
to reliably work on each intermediate domain.
Effect of LCG and GSG.In order to evaluate the impact
of the LCG and GSG modules, we conduct ablation exper-
iments by removing each module respectively. As shown
in Tab.
without LCG (or GSG) degrades, which demonstrates their
usefulness. Furthermore, to comprehensively demonstrate
the effect of LCG and GSG modules, we provide some qual-
itative ablation results inSupplementary.
Effect of the second loss component of fine-tuning loss
in PTL.As mentioned in Sec., in our PTL strategy,
we propose a fine-tuning lossLf tto progressively fine-tune
the segmentation model towards the target domain, which
is formulated as follows:
ψψ♮labelψıeq.second⌢loss℘ψL⌢ıft℘ψ/ψ♮underbraceψıL⌢ıce℘Φigψ↼g⌢nψ↼♮mathcalψıX℘⌣ın⇁1℘ψ↽,ψ♮mathcalψıY℘⌣ıS℘ψΦigψ↽℘⌢ı♮LargeψıL⌢ıada℘℘℘ψ⇁ψ♮underbraceψıL⌢ıce℘Φigψ↼g⌢nψ↼♮mathcalψıX℘⌣ıS℘ψ↽,ψ♮mathcalψıY℘⌣ıS℘ψΦigψ↽℘⌢ıψL⌢ısrc℘℘▷ψ(13)
The fine-tuning loss has two components. The former loss
componentLadaaims to adapt the segmentation model to a
new domain, which is closer to the target domain. The later
componentLsrcis to train the segmentation model on the
source domain. As shown in Tab., compared to using the
former lossLadaonly, the combination of both components
can augment the model’s learning scope, leading to better
results (i.e., a gain of 0.4% and 0.3%).
Parameter Analysis.For parametersβ,TandΣθ, we fol-
low the previous work DDPM [22] to set their values. In
this paper, we only need to study the impact ofN(the num-

Test Image Baseline Model Baseline+SGG Baseline+SGG+PTL Ground TruthSource Only Figure 6. Qualitative semantic segmentation results of our framework. It can be seen that the segmentation performance is gradually
improved by adding our proposed components one by one. Best viewed in color.
Table 3. Ablation study on the loss componentLsrcin fine-tuning
lossLf t.
Model G →C S→C
(a)Lada 61.5 60.7
(b)Lada+Lsrc61.9 61.0
Table 4. Comparison of training and inference time.
Methods Training Time Inference Time Acc.
BDL [44] 1.8 days 58.40 ms 48.5
ProCA [31] 1.9 days 57.48 ms 56.3
Bi-CL [38] 1.7 days 57.91 ms 57.1
Deco-Net [37] 1.9 days 59.67 ms 59.1
Ours 2.3 days 57.68 ms 61.9
ber of domains in PTL) andλ(the guidance scale param-
eter in SGG). As forN, we desire the number of domains
as large as possible, as we want to smoothly bridge the do-
main gap. On the other hand, we need to avoid inducing
too much training time, i.e., the number of domains should
be limited. We present the ablation results ofNin Fig.
(blue curve). We find the accuracy begins to level off when
N >600. We thus set the optimal value ofNto 600 with
the consideration of saving training time. As forλ, we study
its impact with different values and show the results in Fig.
5 λisλ= 80.
Training & Inference Time.We compare our method with
other methods in terms of training and inference time on the
task “GTA5→Cityscapes”. Although our model is trained
acrossNintermediate domains (N= 600), the training
time does not increase much, as we only need to fine-tune
the model for several iterations on each intermediate do-
main (more details in Sec.). As shown in Tab., com-
pared to recent state-of-the-art models, though our method
achieves a significant performance gain, it only requires
slightly more training time. Since our approach does not
change the segmentation model’s structure, our method per-
forms inference almost the same as others.
Ablation on different arrangement options of LCG and
GSG modules in SGG.As shown in Fig., in our frame-
work, the LCG and GSG modules are arranged as a se-
quence withnsteps. Here, we conduct ablation studies
to investigate the impact of different arrangement options.
Tab.
Table 5. Ablation studies on different arrangement options of LCG
and GSG modules in SGG.
Model G →C S→C
(a) LCG/GSG 61.4 60.5
(b) GSG/LCG 61.3 60.5
(c) Alternate61.9 61.0
(d) RandMix 61.6 60.7
notes first using
n
2
LCG modules and then
n
2
GSG mod-
ules, “GSG/LCG” represents first using
n
2
GSG modules
and then
n
2
LSG modules, “Alternate” refers to alternating
between LCG and GSG, and “RandMix” represents ran-
domly arranging the two types of modules. We can see that
the alternating option performs best, outperforming others
by 0.3%∼0.6%. Therefore, we adopt the alternating option
to arrange LCG and GSG modules in the SGG scheme.
Qualitative Segmentation Results.Fig.
qualitative segmentation results. The “Source Only” results
were obtained by directly applying the segmentation model
trained on the source domain to the target domain. We can
see that even only using our baseline model, the segmenta-
tion results are improved obviously (e.g., the road and car).
After adding the SGG scheme to our baseline, the boundary
of small objects (e.g., the person) becomes more precise,
which demonstrates the SGG’s effectiveness in preserving
details. When further using the PTL strategy, our approach
can provide more accurate segmentation results, indicating
that the PTL can help SGG to preserve details better.
6. Conclusion
In this paper, we have proposed a novel label-guided im-
age translation framework to tackle DASS. Our approach
leverages the diffusion model and incorporates a Semantic
Gradient Guidance (SGG) scheme to guide the image trans-
lation based on source labels. We have also proposed a Pro-
gressive Translation Learning (PTL) strategy to facilitate
our SGG in working across a large domain gap. Extensive
experiments have shown the effectiveness of our method.
AcknowledgementThis research is supported by the Na-
tional Research Foundation, Singapore under its AI Singa-
pore Programme (AISG Award No: AISG2-PhD-2022-01-
027).

References
[1]
age editing via multi-stage blended diffusion.arXiv preprint
arXiv:2210.12965, 2022.,
[2]
low, and Rianne van den Berg. Structured denoising diffu-
sion models in discrete state-spaces.Advances in Neural In-
formation Processing Systems (NeurIPS), 34:17981–17993,
2021.
[3]
Kevin Murphy, and Alan L Yuille. Deeplab: Semantic im-
age segmentation with deep convolutional nets, atrous con-
volution, and fully connected crfs.Transactions on Pattern
Analysis and Machine Intelligence (TPAMI), 40(4):834–848,
2017.,
[4]
siondet: Diffusion model for object detection.arXiv preprint
arXiv:2211.09788, 2022.
[5]
ented adaptation for semantic segmentation of urban scenes.
InProceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 7892–7901, 2018.
2,
[6]
Bin Huang. Crdoco: Pixel-level domain transfer with
cross-domain consistency. InProceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition
(CVPR), pages 1791–1800, 2019.,,
[7]
Yu-Chiang Frank Wang, and Min Sun. No more discrimina-
tion: Cross city adaptation of road scene segmenters. InPro-
ceedings of the IEEE International Conference on Computer
Vision (ICCV), pages 1992–2001, 2017.,
[8]
Wen, and Wenqiang Zhang. Dual path learning for do-
main adaptation of semantic segmentation. InProceedings of
the IEEE/CVF International Conference on Computer Vision
(ICCV), pages 9082–9091, 2021.,
[9]
Gwon, and Sungroh Yoon. Ilvr: Conditioning method for
denoising diffusion probabilistic models. InProceedings of
the IEEE/CVF International Conference on Computer Vision
(ICCV), pages 14367–14376, October 2021.,
[10]
Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe
Franke, Stefan Roth, and Bernt Schiele. The cityscapes
dataset for semantic urban scene understanding. InProceed-
ings of the IEEE conference on Computer Vision and Pattern
Recognition (CVPR), pages 3213–3223, 2016.
[11]
beat gans on image synthesis.Advances in Neural Informa-
tion Processing Systems (NeurIPS), 34:8780–8794, 2021.,
3,,
[12]
angyang Xue, Qibao Zheng, Xiaoqing Ye, and Xiaolin
Zhang. Ssf-dan: Separated semantic feature based domain
adaptation network for semantic segmentation. InProceed-
ings of the IEEE/CVF International Conference on Com-
puter Vision (ICCV), pages 982–991, 2019.,
[13]
Distribution-aligned diffusion for human mesh recovery. In
Proceedings of the IEEE/CVF International Conference on
Computer Vision (ICCV), 2023.
[14]
cal Germain, Hugo Larochelle, Franc¸ois Laviolette, Mario
Marchand, and Victor Lempitsky. Domain-adversarial train-
ing of neural networks.The journal of machine learning
research, 17(1):2096–2030, 2016.
[15]
Shelhamer, and Dequan Wang. Back to the source:
Diffusion-driven test-time adaptation.arXiv preprint
arXiv:2207.03442, 2022.,
[16]
gap via content invariant representation for semantic seg-
mentation. InProceedings of the AAAI Conference on Ar-
tificial Intelligence (AAAI), pages 7528–7536, 2021.,,,
7,
[17]
Rahmani, and Jun Liu. Diffpose: Toward more reliable 3d
pose estimation.arXiv preprint arXiv:2211.16940, 2022.
[18]
and LingPeng Kong. Diffuseq: Sequence to sequence
text generation with diffusion models.arXiv preprint
arXiv:2210.08933, 2022.
[19]
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. Generative adversarial nets. InAdvances
in Neural Information Processing Systems (NeurIPS), vol-
ume 27. Curran Associates, Inc., 2014.
[20]
ming Rao, Jie Zhou, and Jiwen Lu. Stochastic trajectory
prediction via motion indeterminacy diffusion. InProceed-
ings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), pages 17113–17122, 2022.
[21]
Deep residual learning for image recognition. InProceed-
ings of the IEEE conference on Computer Vision and Pattern
Recognition (CVPR), pages 770–778, 2016.
[22]
sion probabilistic models.Advances in Neural Information
Processing Systems (NeurIPS), 33:6840–6851, 2020.,,,
8
[23]
Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Dar-
rell. Cycada: Cycle-consistent adversarial domain adap-
tation. InInternational Conference on Machine Learning
(ICML), pages 1989–1998. Pmlr, 2018.,,
[24]
Fcns in the wild: Pixel-level adversarial and constraint-based
adaptation.arXiv preprint arXiv:1612.02649, 2016.,,
[25]
Improving network architectures and training strategies for
domain-adaptive semantic segmentation. InProceedings

of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 9924–9935, 2022.
[26]
Context-aware high-resolution domain-adaptive semantic
segmentation. InProceedings of the European conference
on computer vision (ECCV), pages 372–391. Springer, 2022.
7
[27]
Andy Ly, Nanxin Chen, Zhengdong Zhang, Zhishuai Zhang,
Jiahui Yu, Christian Frank, et al. Noise2music: Text-
conditioned music generation with diffusion models.arXiv
preprint arXiv:2302.03917, 2023.
[28]
Cui, and Yi Ren. Prodiff: Progressive fast diffusion model
for high-quality text-to-speech. InProceedings of the 30th
ACM International Conference on Multimedia (ACM MM),
pages 2595–2605, 2022.
[29]
Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross
attention for semantic segmentation. InProceedings of the
IEEE/CVF International Conference on Computer Vision
(ICCV), pages 603–612, 2019.
[30]
Houqiang Li, and Qi Tian. Domain-agnostic prior for trans-
fer semantic segmentation. InProceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition
(CVPR), pages 7075–7085, 2022.,
[31]
Wang, Ying Tai, and Chengjie Wang. Prototypical contrast
adaptation for domain adaptive semantic segmentation. In
Proceedings of the European conference on computer vision
(ECCV), pages 36–54. Springer, 2022.,
[32]
the maximum of a regression function.The Annals of Math-
ematical Statistics, pages 462–466, 1952.
[33]
fusionclip: Text-guided diffusion models for robust image
manipulation. InProceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), pages
2426–2435, 2022.
[34]
tts: A diffusion model for text-to-speech via classifier guid-
ance. InInternational Conference on Machine Learning
(ICML), pages 11119–11133. PMLR, 2022.
[35]
stochastic optimization.arXiv preprint arXiv:1412.6980,
2014.
[36]
Imagenet classification with deep convolutional neural net-
works.Communications of the ACM, 60(6):84–90, 2017.
[37]
Liu, Hengshuang Zhao, Liwei Wang, and Jiaya Jia. Decou-
plenet: Decoupled network for domain adaptive semantic
segmentation. InProceedings of the European conference
on computer vision (ECCV), pages 369–387. Springer, 2022.
3,,
[38]
Bumsub Ham. Bi-directional contrastive learning for domain
adaptive semantic segmentation. InProceedings of the Euro-
pean conference on computer vision (ECCV), pages 38–55.
Springer, 2022.,,,
[39]
Choi, and Sunghoon Im. Adas: A direct adaptation strategy
for multi-target domain adaptive semantic segmentation. In
Proceedings of the IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition (CVPR), pages 19196–19206,
2022.
[40]
Yi Yang. Content-consistent matching for domain adap-
tive semantic segmentation. InProceedings of the Euro-
pean conference on computer vision (ECCV), pages 440–
456. Springer, 2020.,,
[41]
Wong. Simplified unsupervised image translation for se-
mantic segmentation adaptation.Pattern Recognition (PR),
105:107343, 2020.,
[42]
and Lei Zhang. Class-balanced pixel-level self-labeling for
domain adaptive semantic segmentation. InProceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pages 11593–11603, 2022.,,
[43]
Liang, and Tatsunori B Hashimoto. Diffusion-lm im-
proves controllable text generation.arXiv preprint
arXiv:2205.14217, 2022.
[44]
learning for domain adaptation of semantic segmentation.
InProceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pages 6936–6945,
2019.,,,
[45]
Badour AlBahar, and Jia-Bin Huang. Text-driven vi-
sual synthesis with latent diffusion prior.arXiv preprint
arXiv:2302.08510, 2023.
[46]
Duan. Bapa-net: Boundary adaptation and prototype align-
ment for cross-domain semantic segmentation. InProceed-
ings of the IEEE/CVF International Conference on Com-
puter Vision (ICCV), pages 8801–8811, 2021.
[47]
convolutional networks for semantic segmentation. InPro-
ceedings of the IEEE conference on Computer Vision and
Pattern Recognition (CVPR), pages 3431–3440, 2015.,
[48]
Significance-aware information bottleneck for domain adap-
tive semantic segmentation. InProceedings of the IEEE/CVF
International Conference on Computer Vision (ICCV), pages
6778–6787, 2019.,
[49]
Yang. Taking a closer look at domain shift: Category-level
adversaries for semantics consistent domain adaptation. In
Proceedings of the IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition (CVPR), pages 2507–2516,
2019.,,
[50]
to-fine domain adaptive semantic segmentation with photo-

metric alignment and category-center regularization. InPro-
ceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), pages 4051–4060, 2021.
3,
[51]
Suzuki, and Takuya Narihira. Fine-grained image editing by
pixel-wise guidance using diffusion models.arXiv preprint
arXiv:2212.02024, 2022.,
[52]
jun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided
image synthesis and editing with stochastic differential equa-
tions. InInternational Conference on Learning Representa-
tions (ICLR), 2021.,,
[53]
Yuichi Yoshida. Spectral normalization for generative ad-
versarial networks. InInternational Conference on Learning
Representations (ICLR), 2018.,
[54]
mamoorthi, and Kyungnam Kim. Image to image translation
for domain adaptation. InProceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR),
pages 4500–4509, 2018.,,
[55]
denoising diffusion probabilistic models. InInternational
Conference on Machine Learning (ICML), pages 8162–
8171. PMLR, 2021.,
[56]
In So Kweon. Unsupervised intra-domain adaptation for se-
mantic segmentation through self-supervision. InProceed-
ings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), pages 3764–3773, 2020.,
[57]
Koltun. Playing for data: Ground truth from computer
games. InEuropean Conference on Computer Vision
(ECCV), pages 102–118. Springer, 2016.,
[58]
Patrick Esser, and Bj¨orn Ommer. High-resolution image
synthesis with latent diffusion models. InProceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pages 10684–10695, 2022.
[59]
net: Convolutional networks for biomedical image segmen-
tation. InProceedings of the International Conference on
Medical Image Computing and Computer-Assisted Interven-
tion (MICCAI), pages 234–241. Springer, 2015.,,
[60]
Vazquez, and Antonio M Lopez. The synthia dataset: A
large collection of synthetic images for semantic segmenta-
tion of urban scenes. InProceedings of the IEEE conference
on Computer Vision and Pattern Recognition (CVPR), pages
3234–3243, 2016.,
[61]
Ser Nam Lim, and Rama Chellappa. Unsupervised do-
main adaptation for semantic segmentation with gans.arXiv
preprint arXiv:1711.06969, 2(2):2, 2017.,
[62]
networks (gans) challenges, solutions, and future directions.
ACM Computing Surveys (CSUR), 54(3):1–42, 2021.,
[63]
lutional networks for large-scale image recognition.arXiv
preprint arXiv:1409.1556, 2014.
[64]
and Surya Ganguli. Deep unsupervised learning using
nonequilibrium thermodynamics. InInternational Con-
ference on Machine Learning (ICML), pages 2256–2265.
PMLR, 2015.
[65]
Dual diffusion implicit bridges for image-to-image transla-
tion. InInternational Conference on Learning Representa-
tions (ICLR), 2022.
[66]
ping Shi, and Lizhuang Ma. Not all areas are equal: Trans-
fer learning for semantic segmentation via hierarchical re-
gion selection. InProceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), pages
4360–4369, 2019.,
[67]
based editing for controllable text-to-speech.arXiv preprint
arXiv:2110.02584, 2021.
[68]
han Chandraker. Domain adaptation for structured output
via discriminative patch representations. InProceedings of
the IEEE/CVF International Conference on Computer Vision
(ICCV), pages 1456–1465, 2019.,
[69]
generative modeling in latent space.Advances in Neural In-
formation Processing Systems (NeurIPS), 34:11287–11302,
2021.,
[70]
Sketch-guided text-to-image diffusion models.arXiv
preprint arXiv:2211.13752, 2022.
[71]
Uncertainty-aware pseudo label refinery for domain
adaptive semantic segmentation. InProceedings of the
IEEE/CVF International Conference on Computer Vision
(ICCV), pages 9092–9101, 2021.,
[72]
jun Xiong, Wen-mei Hwu, Thomas S Huang, and Honghui
Shi. Differential treatment for stuff and things: A simple un-
supervised domain adaptation method for semantic segmen-
tation. InProceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), pages 12635–
12644, 2020.,
[73]
Driftrec: Adapting diffusion models to blind image restora-
tion tasks.arXiv preprint arXiv:2211.06757, 2022.
[74] ˆowave: Itˆo stochastic differen-
tial equation is all you need for wave generation. InICASSP
2022-2022 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 8422–8426.
IEEE, 2022.
[75]
Jose M Alvarez, and Ping Luo. Segformer: Simple and
efficient design for semantic segmentation with transform-
ers.Advances in Neural Information Processing Systems
(NeurIPS), 34:12077–12090, 2021.

[76]
Chaochao Yan, and Junzhou Huang. Label-driven recon-
struction for domain adaptation in semantic segmentation.
InEuropean Conference on Computer Vision (ECCV), pages
480–498. Springer, 2020.,
[77]
zhou Huang. Context-aware domain adaptation in semantic
segmentation. InProceedings of the IEEE/CVF Winter Con-
ference on Applications of Computer Vision (WACV), pages
514–524, 2021.
[78]
adaptation for semantic segmentation. InProceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pages 4085–4095, 2020.
[79]
Diffgar: Model-agnostic restoration from generative arti-
facts using image-to-image diffusion models.arXiv preprint
arXiv:2210.08573, 2022.
[80]
works.arXiv preprint arXiv:1605.07146, 2016.
[81]
and Fang Wen. Prototypical pseudo label denoising and tar-
get structure learning for domain adaptive semantic segmen-
tation. InProceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), pages 12414–
12424, 2021.,,
[82]
Haofeng Zhang, and Yudong Zhang. Confidence-and-
refinement adaptation model for cross-domain semantic seg-
mentation.IEEE Transactions on Intelligent Transportation
Systems (TITS), 23(7):9529–9542, 2022.,
[83]
Mei. Fully convolutional adaptation networks for seman-
tic segmentation. InProceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), pages
6810–6818, 2018.,
[84]
Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang
Huang, and Philip HS Torr. Conditional random fields as
recurrent neural networks. InProceedings of the IEEE In-
ternational Conference on Computer Vision (ICCV), pages
1529–1537, 2015.
[85]
Efros. Unpaired image-to-image translation using cycle-
consistent adversarial networks. InProceedings of the IEEE
International Conference on Computer Vision (ICCV), pages
2223–2232, 2017.,
[86]
Unsupervised domain adaptation for semantic segmentation
via class-balanced self-training. InProceedings of the Eu-
ropean conference on computer vision (ECCV), pages 289–
305, 2018.,
[87]
Jinsong Wang. Confidence regularized self-training. In
Proceedings of the IEEE/CVF International Conference on
Computer Vision (ICCV), pages 5982–5991, 2019.,

Source
Image
Source
Label
Ours w/o
GSG
Ours w/o
LCG
Ours (full) Figure 7. The visualization of translated images when removing
the LCG and GSG modules, respectively.
A. Effect of LCG and GSG modules
In the main paper, we already demonstrate the effective-
ness of LCG and GSG quantitatively. To provide a more
comprehensive ablation study, we present additional quali-
tative results to visually illustrate the effect of each module.
As shown in the third row of Fig., the model without
GSG produces translated images lacking global harmony,
where objects such as cars and trees appear independent
from their surroundings, indicating that GSG plays a key
role in harmonizing the entire scene. In contrast, the model
without LCG can ensure global harmony but cannot pre-
serve the details well. As shown in the fourth row of Fig.,
parts of the building and bridge are missing (marked with
red boxes). This observation demonstrates the importance
of LCG in preserving local details. By incorporating both
modules, our approach achieves both global-harmony and
local-precision image translation, as shown in the last row
of Fig.. These qualitative results demonstrate the comple-
mentary nature of LCG and GSG and show their effects in
achieving high-quality image translation results.
B. Training stability analysis
In Fig.
state-of-the-art GAN-based image translation method [16]
and our diffusion-based method, respectively. We can ob-
serve that our method exhibits a more stable training pro-
cess. During the training process, we also use the trans-(a) Training Loss
(b) Adaptation Performance
Figure 8. The loss and performance curves of the GAN-based im-
age translation method [16] and ours.
lated images to train a target-domain segmentation model
and then evaluate its adaptation performance on the target
domain. As shown in Fig.
achieves a more stable and higher adaptation performance
than the GAN-based method, suggesting that our method is
not only stable but also effective.
C. Ablation on data augmentation
As discussed in Section
age translation framework is designed to handle noisy and
masked images. To improve the model’s robustness, we
generate additional noisy and masked images for data aug-
mentation. Specifically, we follow the Eqn.
paper to add noise to the training data, and then mask the
noisy images using binary masks provided by source labels.
We randomly select 10% of the training data to execute the
data augmentation. As shown in Tab., the augmentation
can bring a slight improvement of 0.2%∼0.3%.
Table 6. Ablation study on data augmentation.
Model G →C S→C
(a) Ours w/o augmentation 61.7 60.7
(b) Ours 61.9 61.0