UDT is an unsupervised framework that uses parent-class-guided noise decomposition to discover breed-to-breed translation directions across diverse domains such as dogs, cats, flowers, and birds.
Diffusion models achieve impressive image synthesis, yet unsupervised methods for exploring their latent space remain limited when it comes to fine-grained class translation: existing approaches often produce low-diversity outputs within a parent class or inconsistent child-class mappings across images.
We propose UDT (Unsupervised Discovery of Transformations), a framework that incorporates hierarchical structure into unsupervised direction discovery. UDT leverages parent-class prompts to decompose predicted noise into class-general and class-specific components, ensuring translations remain within the parent domain while enabling disentangled child-class transformations. A hierarchy-aware contrastive loss further enforces consistency, with each direction corresponding to a distinct child class.
Experiments on dogs, cats, birds, and flowers show that UDT outperforms state-of-the-art methods both qualitatively and quantitatively. Moreover, UDT supports controllable interpolation, allowing smooth generation of intermediate classes (e.g., mixed breeds). These results establish UDT as a general and effective solution for fine-grained image translation.
Overview of the UDT framework. (a) UDT decomposes predicted noise divergence $\Delta\epsilon_k^n$ into a parent-class component $\Delta\mathcal{P}_k^n$ (general attributes) and a child-class component $\Delta\mathcal{T}_k^n$ (fine-grained traits). (b) Contrastive learning is then applied only to the child-class vectors, ensuring each discovered direction corresponds to a consistent child class. This hierarchical formulation enables UDT to discover interpretable directions for fine-grained class translation, later used in the image translation pipeline (c).
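Below is a minimal PyTorch sketch of the decomposition in panel (a) and the contrastive objective in panel (b). The projection used here to split $\Delta\epsilon_k^n$ into $\Delta\mathcal{P}_k^n$ and $\Delta\mathcal{T}_k^n$, along with the names decompose, hierarchy_contrastive_loss, and parent_delta, are illustrative assumptions rather than the paper's exact formulation.

import torch
import torch.nn.functional as F

def decompose(delta_eps, parent_delta):
    # delta_eps:    [B, C, H, W] noise divergence caused by a learned direction (Δε_k^n)
    # parent_delta: [B, C, H, W] parent-class guidance, e.g. ε(parent prompt) − ε(null prompt)
    # Assumed split: the parent component is the part of Δε_k^n aligned with the
    # parent guidance, and the child component is the residual.
    b = delta_eps.shape[0]
    d = delta_eps.reshape(b, -1)
    p = F.normalize(parent_delta.reshape(b, -1), dim=-1)
    parent_comp = (d * p).sum(-1, keepdim=True) * p   # Δ𝒫_k^n: class-general attributes
    child_comp = d - parent_comp                      # Δ𝒯_k^n: fine-grained traits
    return parent_comp, child_comp

def hierarchy_contrastive_loss(child_vecs, direction_ids, tau=0.07):
    # NT-Xent-style loss applied only to the child-class vectors Δ𝒯.
    # child_vecs:    [N, D] flattened child components from several images and directions
    # direction_ids: [N]    index of the learned direction that produced each vector
    # Vectors from the same direction (across images) are positives; the rest are negatives.
    # Assumes every direction appears at least twice in the batch.
    z = F.normalize(child_vecs, dim=-1)
    sim = z @ z.t() / tau
    same_dir = direction_ids.unsqueeze(0) == direction_ids.unsqueeze(1)
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))            # exclude self-similarity
    pos = torch.logsumexp(sim.masked_fill(~same_dir, float('-inf')), dim=-1)
    total = torch.logsumexp(sim, dim=-1)
    return (total - pos).mean()

In this reading, minimizing the loss pulls together child-class vectors produced by the same direction on different images (consistent child-class mapping) while pushing apart vectors from different directions (diverse, disentangled breeds).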
UDT discovers semantically distinct and interpretable transformation directions across diverse domains, including animals, flowers, and human faces.
For instance, in the dog domain, it successfully discovers numerous distinct breed transformations, producing the characteristic wrinkled faces of a Bulldog or the flowing golden coats of a Golden Retriever.
Similar capabilities are observed in other categories: UDT accurately alters breed-defining traits in cats, such as fur color and ear length, and for human faces it can shift features toward either more Asian or more Western appearances.
These results demonstrate UDT's capability to discover semantically meaningful features and identify distinctive visual variations within each domain without any explicit labels during training.
UDT was qualitatively compared against state-of-the-art unsupervised, self-supervised, and image editing methods.
While competing unsupervised and self-supervised methods struggled to find adequate transformation directions for specific target dog breeds, UDT successfully transformed images to the intended breed.
Furthermore, UDT demonstrated superior performance over existing editing methods in certain aspects.
Specifically, it better represented the fine curly texture of a Toy Poodle than LEDITS++ and was more effective at maintaining the original pose during edits compared to Null-Text.
To quantitatively evaluate UDT's effectiveness in capturing breed-specific characteristics, 100 learned translation directions were applied to 100 Pug images.
The shifts in classification probability for these translated images were measured using a CLIP classifier, and the representative direction for each target breed was selected as the one yielding the highest CLIP score.
The results confirmed UDT's effectiveness, with diagonal entries in the results table showing substantial CLIP confidence boosts for the intended target breeds.
For instance, the +43.24 confidence boost for the 'Golden Retriever' transformation matches the ideal behavior: confidence in the targeted breed increases sharply while other semantics are altered only minimally.
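The protocol above can be summarized with a short CLIP sketch; the checkpoint name, the breed list, and the variables source_images and edited_images (PIL images before and after applying one direction) are placeholders, so this is an approximation of the evaluation rather than the exact script used.

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
breeds = ["Pug", "Golden Retriever", "Bulldog", "Toy Poodle"]  # illustrative subset
prompts = [f"a photo of a {b}" for b in breeds]

@torch.no_grad()
def breed_probs(images):
    inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True)
    return model(**inputs).logits_per_image.softmax(dim=-1)   # [num_images, num_breeds]

# Average confidence shift (in percentage points) over the source images caused
# by applying one translation direction; the direction with the largest shift for
# a breed is taken as that breed's representative direction.
shift = 100 * (breed_probs(edited_images) - breed_probs(source_images)).mean(dim=0)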
To compare the diversity of breed transformations, UDT and NoiseCLR were each used to generate images from 100 distinct translation directions, with a CLIP classifier predicting the resulting breed for each image.
UDT achieved an average of 50.43 distinct predicted breeds, substantially outperforming NoiseCLR's average of 15.57.
This result highlights UDT's superior capability in discovering a richer and more diverse set of breed-specific transformation directions.
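The diversity metric can be sketched in the same spirit, reusing the hypothetical breed_probs helper from the sketch above; images_per_direction is an assumed list holding one translated image per learned direction.

import torch

def count_distinct_breeds(images_per_direction):
    # Classify each translated image with CLIP and count how many different
    # breeds appear among the argmax predictions.
    preds = breed_probs(images_per_direction).argmax(dim=-1)
    return int(torch.unique(preds).numel())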
@inproceedings{choi2025udt,
title={UDT: Unsupervised Discovery of Transformations between Fine-Grained Classes in Diffusion Models},
author={Choi, Youngjae and Koh, Hyunseo and Jeong, Hojae and Chae, Byungkwan and Park, Sungyong and Kim, Heewon},
booktitle={British Machine Vision Conference (BMVC)},
year={2025}
}