UDT: Unsupervised Discovery of Transformations between Fine-Grained Classes in Diffusion Models

Soongsil University, Seoul, Republic of Korea
BMVC 2025

*Indicates Equal Contribution   Indicates Corresponding author

TL;DR

UDT is an unsupervised framework that uses parent-class guided noise decomposition to discover consistent breed-to-breed translation directions across diverse domains such as dogs, cats, flowers, and birds.

Abstract

Diffusion models achieve impressive image synthesis, yet unsupervised methods for latent space exploration remain limited in fine-grained class translation: existing approaches often produce low-diversity outputs within parent classes or inconsistent child-class mappings across images. We propose UDT (Unsupervised Discovery of Transformations), a framework that incorporates hierarchical structure into unsupervised direction discovery. UDT leverages parent-class prompts to decompose predicted noise into class-general and class-specific components, ensuring translations remain within the parent domain while enabling disentangled child-class transformations. A hierarchy-aware contrastive loss further enforces consistency, with each direction corresponding to a distinct child class. Experiments on dogs, cats, birds, and flowers show that UDT outperforms state-of-the-art methods both qualitatively and quantitatively. Moreover, UDT supports controllable interpolation, allowing for the smooth generation of intermediate classes (e.g., mixed breeds). These results demonstrate UDT as a general and effective solution for fine-grained image translation.
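As a rough illustration of the controllable interpolation mentioned above, the sketch below blends two discovered direction embeddings with a mixing weight alpha. Representing directions as vectors that can be combined linearly is an assumption made for illustration only; the embedding dimensionality and direction names are hypothetical.

import torch

def interpolate_directions(d_a: torch.Tensor, d_b: torch.Tensor, alpha: float) -> torch.Tensor:
    """Return a direction between d_a (alpha=0) and d_b (alpha=1),
    e.g. a Bulldog-to-Poodle mix at alpha=0.5."""
    return (1.0 - alpha) * d_a + alpha * d_b

# Example with random placeholder embeddings.
d_bulldog, d_poodle = torch.randn(768), torch.randn(768)
d_mixed = interpolate_directions(d_bulldog, d_poodle, alpha=0.5)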

Motivation

Motivation Figure

Recent unsupervised approaches for diffusion models attempt to discover semantic directions by exploring intermediate features of the U-Net or operating in the predicted noise space. Despite these advances, however, unsupervised methods face notable shortcomings when applied to fine-grained class translation. First, the diversity of generated outputs is insufficient; discovered directions tend to generate only low-diversity variations within a parent class (e.g., different types of dog) or drift into unrelated classes (e.g., cat, food). Second, discovered directions lack consistency across images: the same direction may correspond to different child classes depending on the input, requiring users to search for a desirable transformation manually. These shortcomings restrict their applicability in scenarios that require reliable, fine-grained control, such as breed-to-breed transformations.

Method

Method overview figure

Overview of the UDT framework. (a) UDT decomposes predicted noise divergence $\Delta\epsilon_k^n$ into a parent-class component $\Delta\mathcal{P}_k^n$ (general attributes) and a child-class component $\Delta\mathcal{T}_k^n$ (fine-grained traits). (b) Contrastive learning is then applied only to the child-class vectors, ensuring each discovered direction corresponds to a consistent child class. This hierarchical formulation enables UDT to discover interpretable directions for fine-grained class translation, later used in the image translation pipeline (c).
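Below is a minimal sketch of how the decomposition and the hierarchy-aware contrastive objective could be realized in PyTorch. Projecting the noise divergence onto the parent-class divergence to obtain the class-general part, and using an InfoNCE-style loss over direction identities, are illustrative assumptions; the paper's exact formulation may differ.

import torch
import torch.nn.functional as F

def decompose(delta_eps, delta_parent):
    """Split a noise divergence into a parent-aligned (class-general)
    component and its orthogonal (class-specific) residual.
    The projection-based split is an assumption for illustration."""
    p = delta_parent.flatten(1)                      # (B, D)
    d = delta_eps.flatten(1)                         # (B, D)
    coeff = (d * p).sum(-1, keepdim=True) / (p.norm(dim=-1, keepdim=True) ** 2 + 1e-8)
    parent_comp = coeff * p                          # class-general part
    child_comp = d - parent_comp                     # fine-grained part
    return parent_comp, child_comp

def hierarchy_contrastive_loss(child_vecs, temperature=0.1):
    """child_vecs: (K, B, D) child-class components for K directions applied to
    B images. Vectors from the same direction are pulled together and different
    directions pushed apart (InfoNCE over direction identity)."""
    K, B, D = child_vecs.shape
    z = F.normalize(child_vecs.reshape(K * B, D), dim=-1)
    labels = torch.arange(K).repeat_interleave(B)    # direction id per sample
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float("-inf"))                # exclude self-similarity
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1))
    pos_mask.fill_diagonal_(False)
    log_prob = sim - torch.logsumexp(sim, dim=-1, keepdim=True)
    return -(log_prob[pos_mask]).mean()

# Toy example with random tensors standing in for predicted noise maps.
B, C, H, W, K = 4, 4, 64, 64, 8
delta_parent = torch.randn(B, C, H, W)               # parent-prompt-guided divergence
child_batch = []
for _ in range(K):
    delta_eps = torch.randn(B, C, H, W)              # divergence for one direction
    _, child = decompose(delta_eps, delta_parent)
    child_batch.append(child)
loss = hierarchy_contrastive_loss(torch.stack(child_batch))
print(loss.item())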

Visualizing discovered transformations by UDT

Discovered transformations figure

UDT discovers semantically distinct and interpretable transformation directions across diverse domains, including animals, flowers, and human faces. For instance, in the dog domain, it successfully discovers numerous distinct breed transformations, producing the characteristic wrinkled faces of a Bulldog or the flowing golden coats of a Golden Retriever. Similar capabilities are observed in other categories; UDT accurately alters breed-defining traits in cats, such as fur color and ear length, and for human faces, it can produce either Asian or more Western facial features. These results demonstrate UDT's capability to discover semantically meaningful features and identify distinctive visual variations within each domain without any explicit labels during training.

Qualitative Comparison

Qualitative comparison with other methods

UDT was qualitatively compared against state-of-the-art unsupervised, self-supervised, and image editing methods. While competing unsupervised and self-supervised methods struggled to find adequate transformation directions for specific target dog breeds, UDT successfully transformed images to the intended breed. Furthermore, UDT demonstrated superior performance over existing editing methods in certain aspects. Specifically, it better represented the fine curly texture of a Toy Poodle than LEDITS++ and was more effective at maintaining the original pose during edits compared to Null-Text.

Classification Accuracy on Class Translation

Classification accuracy results figure

To quantitatively evaluate UDT's effectiveness in capturing breed-specific characteristics, 100 learned translation directions were applied to 100 Pug images. The shifts in classification probability for these translated images were measured using a CLIP classifier, and the representative direction for each target breed was selected as the one yielding the highest CLIP score. The results confirmed UDT's effectiveness, with diagonal entries in the results table showing substantial CLIP confidence boosts for the intended target breeds. For instance, a +43.24 confidence boost for the 'Golden Retriever' transformation aligns with the ideal expectation that targeted semantic confidence should increase significantly while other semantic alterations are minimal.
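Below is a minimal sketch of the CLIP-based probability-shift measurement, using the Hugging Face CLIPModel. The checkpoint, prompt template, breed subset, and file paths are illustrative assumptions rather than the paper's exact setup.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
breeds = ["Pug", "Golden Retriever", "Bulldog", "Toy Poodle"]   # subset for illustration
prompts = [f"a photo of a {b}" for b in breeds]

@torch.no_grad()
def breed_probs(image: Image.Image) -> torch.Tensor:
    """CLIP classification probabilities over the breed prompts."""
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image          # (1, num_breeds)
    return logits.softmax(dim=-1).squeeze(0)

# Probability shift: classify the source image and its translated version,
# then report the change in confidence for each breed (hypothetical file paths).
source = Image.open("pug.jpg")
translated = Image.open("pug_to_golden_retriever.jpg")
shift = (breed_probs(translated) - breed_probs(source)) * 100.0
print(dict(zip(breeds, shift.tolist())))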

Class Diversity on Class Translation

Class diversity results figure

To compare the diversity of breed transformations, UDT and NoiseCLR were used to generate images from 100 distinct translation directions, with a CLIP classifier predicting the resulting breed for each image. UDT achieved an average of 50.43 distinct predicted breeds, substantially outperforming NoiseCLR's average of 15.57. This result highlights UDT's superior capability in discovering a richer and more diverse set of breed-specific transformation directions.
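A minimal sketch of this diversity count is shown below; translate, breeds, and breed_probs are placeholders standing in for the actual translation pipeline and the CLIP classifier from the previous sketch, so the function is generic rather than the paper's implementation.

def count_distinct_breeds(image, directions, translate, breeds, breed_probs):
    """Apply each discovered direction to the image, classify the result,
    and count how many distinct breeds appear among the predictions.
    translate(image, direction) -> edited image; breed_probs(image) -> tensor
    of per-breed probabilities."""
    predicted = set()
    for d in directions:
        edited = translate(image, d)
        predicted.add(breeds[int(breed_probs(edited).argmax())])
    return len(predicted)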

BibTeX

@inproceedings{choi2025udt,
  title={UDT: Unsupervised Discovery of Transformations between Fine-Grained Classes in Diffusion Models},
  author={Choi, Youngjae and Koh, Hyunseo and Jeong, Hojae and Chae, Byungkwan and Park, Sungyong and Kim, Heewon},
  booktitle={British Machine Vision Conference (BMVC)},
  year={2025}
}