UDT is an unsupervised framework that uses parent-class-guided noise decomposition to achieve fine-grained class-to-class translation, such as breed-to-breed edits, across diverse domains including dogs, cats, birds, and flowers.
Diffusion models demonstrate excellent performance in image generation and synthesis.
Effective and controllable editing with these models requires a deep understanding of their latent spaces, a primary focus of prior research.
However, existing unsupervised exploration methods such as NoiseCLR are often limited to attribute-level edits and struggle with complex, fine-grained class transformations.
To address this, we propose UDT (Unsupervised Discovery of Transformations), a novel framework that, while operating in a fully unsupervised setting, supports fine-grained class transformations by structuring the latent space in a hierarchy-aware manner. UDT leverages hierarchy-informed contrastive learning to disentangle class-defining traits using parent-class guidance.
This systematic approach structures the latent space to support diverse and meaningful transformations while ensuring semantic consistency and pose preservation.
Experiments on dog, cat, bird, and flower datasets demonstrate UDT's superior performance over existing methods in generating coherent, diverse, and semantically accurate edits. These results suggest that UDT is a scalable and effective approach for semantic latent-space exploration in diffusion models.
The core of the framework is the decomposition of the predicted noise based on a parent-class prompt $p$ (e.g., dog), a process that isolates the class-specific signal $\Delta \mathcal{T}_k^n$ from the general parent-class attributes $\Delta \mathcal{P}_k^n$. The contrastive learning process then operates on these isolated $\Delta \mathcal{T}_k^n$ vectors. As illustrated in (a) and (b), the framework attracts positive pairs by pulling together $\Delta \mathcal{T}_k^n$ vectors that originate from different images but correspond to the same learnable direction (e.g., $c_1$ or $c_k$). Conversely, (c) shows the repulsion of negative pairs, where $\Delta \mathcal{T}_k^n$ vectors from the same source image but associated with different directions (e.g., $c_1$ vs. $c_k$) are pushed apart. This allows the framework to discover specific and consistent directions, such as $c_1$ for "Bull Mastiff" and $c_k$ for "Golden Retriever".
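To make the decomposition and the contrastive objective concrete, the following is a minimal PyTorch sketch, not the authors' implementation. It assumes a diffusers-style UNet call, that each learnable direction $c_k$ is an embedding added to the parent-prompt embedding for $p$, and that $\Delta \mathcal{T}_k^n$ is approximated by subtracting the parent-conditioned noise prediction; function names such as `direction_residuals` and `contrastive_direction_loss` are illustrative.

```python
import torch
import torch.nn.functional as F

def direction_residuals(unet, x_t, t, parent_emb, direction_embs):
    """Sketch: isolate class-specific signals (Delta_T analogues) by subtracting
    the parent-prompt noise prediction from predictions conditioned on the
    parent prompt plus each learnable direction c_k (assumed decomposition)."""
    # Noise prediction for the parent class alone (shared, Delta_P-like content)
    eps_parent = unet(x_t, t, encoder_hidden_states=parent_emb).sample
    residuals = []
    for c_k in direction_embs:  # each c_k: a learnable direction embedding
        eps_k = unet(x_t, t, encoder_hidden_states=parent_emb + c_k).sample
        residuals.append((eps_k - eps_parent).flatten())  # Delta_T_k for this image
    return torch.stack(residuals)  # (K, D)

def contrastive_direction_loss(delta_t, temperature=0.1):
    """delta_t: (N, K, D) Delta_T vectors for N images and K directions.
    Positives: same direction k, different images (pulled together).
    Negatives: same image, different directions (pushed apart)."""
    N, K, D = delta_t.shape
    z = F.normalize(delta_t.reshape(N * K, D), dim=-1)
    sim = (z @ z.t()) / temperature  # pairwise cosine similarities
    img_id = torch.arange(N, device=z.device).repeat_interleave(K)  # image index per row
    dir_id = torch.arange(K, device=z.device).repeat(N)             # direction index per row
    pos = (dir_id[:, None] == dir_id[None, :]) & (img_id[:, None] != img_id[None, :])
    neg = (img_id[:, None] == img_id[None, :]) & (dir_id[:, None] != dir_id[None, :])
    exp_sim = sim.exp()
    # InfoNCE-style objective over the positive/negative sets defined above
    numer = (exp_sim * pos).sum(dim=1)
    denom = numer + (exp_sim * neg).sum(dim=1)
    return -(numer / denom.clamp_min(1e-8)).clamp_min(1e-8).log().mean()
```

The positive/negative definitions mirror panels (a)-(c): rows sharing a direction index attract across images, while rows from the same source image repel across different directions.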
UDT discovers diverse and interpretable transformation directions within a single category and effectively generalizes this capability across various fine-grained domains such as dogs, cats, birds, and flowers. For instance, specific directions can transform a dog's breed into a Bulldog or a Golden Retriever with their distinct features, while others can accurately modify detailed attributes like a cat's fur color or a bird's plumage.
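As a rough illustration of how a discovered direction could be applied at inference, the sketch below assumes a diffusers-style UNet and scheduler and steers the noise prediction along the class-specific residual for the chosen direction $c_k$; this is a hedged example under those assumptions, and `edit_with_direction` and `strength` are illustrative names rather than the authors' API.

```python
import torch

@torch.no_grad()
def edit_with_direction(unet, scheduler, x_T, parent_emb, c_k, strength=1.0):
    """Hypothetical sketch: apply a discovered direction c_k during denoising.
    The edit signal is the residual between the direction-conditioned and the
    parent-conditioned noise predictions, scaled by `strength`."""
    x = x_T
    for t in scheduler.timesteps:
        eps_parent = unet(x, t, encoder_hidden_states=parent_emb).sample
        eps_dir = unet(x, t, encoder_hidden_states=parent_emb + c_k).sample
        # Move the prediction along the class-specific residual (Delta_T-like term)
        eps = eps_parent + strength * (eps_dir - eps_parent)
        x = scheduler.step(eps, t, x).prev_sample
    return x
```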
UDT demonstrates clear advantages in qualitative comparisons against recent methods. Unsupervised approaches like NoiseCLR and Concept Discovery often fail to find accurate transformation directions for specific dog breeds, a task where UDT consistently succeeds. Furthermore, UDT's performance is competitive with established editing methods. For instance, it renders the curly texture of a Toy Poodle more faithfully than LEDITS++ and excels at maintaining the original pose during edits, unlike Null-Text.
@inproceedings{choi2025udt,
  title={UDT: Unsupervised Discovery of Transformations between Fine-Grained Classes in Diffusion Models},
  author={Choi, Youngjae and Koh, Hyunseo and Jeong, Hojae and Chae, Byungkwan and Park, Sungyong and Kim, Heewon},
  booktitle={British Machine Vision Conference (BMVC)},
  year={2025}
}