UDT is an unsupervised framework that uses parent-class-guided noise decomposition to discover breed-to-breed translation directions across diverse domains such as dogs, cats, flowers, and birds.
Diffusion models achieve impressive image synthesis, yet unsupervised methods for exploring their latent space remain limited when it comes to fine-grained class translation: existing approaches often produce low-diversity outputs within a parent class or inconsistent child-class mappings across images.
We propose UDT (Unsupervised Discovery of Transformations), a framework that incorporates hierarchical structure into unsupervised direction discovery. UDT leverages parent-class prompts to decompose predicted noise into class-general and class-specific components, ensuring translations remain within the parent domain while enabling disentangled child-class transformations. A hierarchy-aware contrastive loss further enforces consistency, with each direction corresponding to a distinct child class.
Experiments on dogs, cats, birds, and flowers show that UDT outperforms state-of-the-art methods both qualitatively and quantitatively. Moreover, UDT supports controllable interpolation, allowing smooth generation of intermediate classes (e.g., mixed breeds). These results establish UDT as a general and effective solution for fine-grained image translation.
Overview of the UDT framework. (a) UDT decomposes predicted noise divergence $\Delta\epsilon_k^n$ into a parent-class component $\Delta\mathcal{P}_k^n$ (general attributes) and a child-class component $\Delta\mathcal{T}_k^n$ (fine-grained traits). (b) Contrastive learning is then applied only to the child-class vectors, ensuring each discovered direction corresponds to a consistent child class. This hierarchical formulation enables UDT to discover interpretable directions for fine-grained class translation, later used in the image translation pipeline (c).
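Below is a minimal PyTorch sketch of the decomposition in panel (a) and the contrastive objective in panel (b). The projection used here to split $\Delta\epsilon_k^n$ into $\Delta\mathcal{P}_k^n$ and $\Delta\mathcal{T}_k^n$, along with the names decompose, hierarchy_contrastive_loss, and parent_delta, are illustrative assumptions rather than the paper's exact formulation.

import torch
import torch.nn.functional as F

def decompose(delta_eps, parent_delta):
    # delta_eps:    [B, C, H, W] noise divergence caused by a learned direction (Δε_k^n)
    # parent_delta: [B, C, H, W] parent-class guidance, e.g. ε(parent prompt) − ε(null prompt)
    # Assumed split: the parent component is the part of Δε_k^n aligned with the
    # parent guidance, and the child component is the residual.
    b = delta_eps.shape[0]
    d = delta_eps.reshape(b, -1)
    p = F.normalize(parent_delta.reshape(b, -1), dim=-1)
    parent_comp = (d * p).sum(-1, keepdim=True) * p   # Δ𝒫_k^n: class-general attributes
    child_comp = d - parent_comp                      # Δ𝒯_k^n: fine-grained traits
    return parent_comp, child_comp

def hierarchy_contrastive_loss(child_vecs, direction_ids, tau=0.07):
    # NT-Xent-style loss applied only to the child-class vectors Δ𝒯.
    # child_vecs:    [N, D] flattened child components from several images and directions
    # direction_ids: [N]    index of the learned direction that produced each vector
    # Vectors from the same direction (across images) are positives; the rest are negatives.
    # Assumes every direction appears at least twice in the batch.
    z = F.normalize(child_vecs, dim=-1)
    sim = z @ z.t() / tau
    same_dir = direction_ids.unsqueeze(0) == direction_ids.unsqueeze(1)
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))            # exclude self-similarity
    pos = torch.logsumexp(sim.masked_fill(~same_dir, float('-inf')), dim=-1)
    total = torch.logsumexp(sim, dim=-1)
    return (total - pos).mean()

In this reading, minimizing the loss pulls together child-class vectors produced by the same direction on different images (consistent child-class mapping) while pushing apart vectors from different directions (diverse, disentangled breeds).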
UDT discovers semantically distinct and interpretable transformation directions across diverse domains, including animals, flowers, and human faces.
For instance, in the dog domain, it successfully discovers numerous distinct breed transformations, producing the characteristic wrinkled faces of a Bulldog or the flowing golden coats of a Golden Retriever.
Similar capabilities are observed in other categories: UDT accurately alters breed-defining traits in cats, such as fur color and ear length, and for human faces it can shift features toward either more Asian or more Western appearances.
These results demonstrate UDT's capability to discover semantically meaningful features and identify distinctive visual variations within each domain without any explicit labels during training.
UDT was qualitatively compared against state-of-the-art unsupervised, self-supervised, and image editing methods.
While competing unsupervised and self-supervised methods struggled to find adequate transformation directions for specific target dog breeds, UDT successfully transformed images to the intended breed.
Furthermore, UDT demonstrated superior performance over existing editing methods in certain aspects.
Specifically, it better represented the fine curly texture of a Toy Poodle than LEDITS++ and was more effective at maintaining the original pose during edits compared to Null-Text.
To quantitatively evaluate UDT's effectiveness in capturing breed-specific characteristics, 100 learned translation directions were applied to 100 Pug images.
The shifts in classification probability for these translated images were measured using a CLIP classifier, and the representative direction for each target breed was selected as the one yielding the highest CLIP score.
The results confirmed UDT's effectiveness, with diagonal entries in the results table showing substantial CLIP confidence boosts for the intended target breeds.
For instance, the +43.24 confidence boost for the 'Golden Retriever' transformation matches the ideal behavior: confidence in the targeted breed increases sharply while other semantics are altered only minimally.
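The protocol above can be summarized with a short CLIP sketch; the checkpoint name, the breed list, and the variables source_images and edited_images (PIL images before and after applying one direction) are placeholders, so this is an approximation of the evaluation rather than the exact script used.

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
breeds = ["Pug", "Golden Retriever", "Bulldog", "Toy Poodle"]  # illustrative subset
prompts = [f"a photo of a {b}" for b in breeds]

@torch.no_grad()
def breed_probs(images):
    inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True)
    return model(**inputs).logits_per_image.softmax(dim=-1)   # [num_images, num_breeds]

# Average confidence shift (in percentage points) over the source images caused
# by applying one translation direction; the direction with the largest shift for
# a breed is taken as that breed's representative direction.
shift = 100 * (breed_probs(edited_images) - breed_probs(source_images)).mean(dim=0)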
To compare the diversity of breed transformations, UDT and NoiseCLR were each used to generate images from 100 distinct translation directions, with a CLIP classifier predicting the resulting breed for each image.
UDT achieved an average of 50.43 distinct predicted breeds, substantially outperforming NoiseCLR's average of 15.57.
This result highlights UDT's superior capability in discovering a richer and more diverse set of breed-specific transformation directions.
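The diversity metric can be sketched in the same spirit, reusing the hypothetical breed_probs helper from the sketch above; images_per_direction is an assumed list holding one translated image per learned direction.

import torch

def count_distinct_breeds(images_per_direction):
    # Classify each translated image with CLIP and count how many different
    # breeds appear among the argmax predictions.
    preds = breed_probs(images_per_direction).argmax(dim=-1)
    return int(torch.unique(preds).numel())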
@inproceedings{choi2025udt,
title={UDT: Unsupervised Discovery of Transformations between Fine-Grained Classes in Diffusion Models},
author={Choi, Youngjae and Koh, Hyunseo and Jeong, Hojae and Chae, Byungkwan and Park, Sungyong and Kim, Heewon},
booktitle={British Machine Vision Conference (BMVC)},
year={2025}
}