Show simple item record

dc.contributor.author: Xiao, Guangxuan
dc.contributor.author: Yin, Tianwei
dc.contributor.author: Freeman, William T.
dc.contributor.author: Durand, Frédo
dc.contributor.author: Han, Song
dc.date.accessioned: 2026-03-17T14:41:12Z
dc.date.available: 2026-03-17T14:41:12Z
dc.date.issued: 2024-09-19
dc.identifier.uri: https://hdl.handle.net/1721.1/165200
dc.description.abstract: Diffusion models excel at text-to-image generation, especially in subject-driven generation for personalized images. However, existing methods are inefficient because they require subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Moreover, existing methods struggle with multi-subject generation, as they often blend identities among subjects. We present FastComposer, which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation based on subject images and textual instructions with only forward passes. To address the identity-blending problem in multi-subject generation, FastComposer proposes cross-attention localization supervision during training, enforcing that the attention of reference subjects is localized to the correct regions in the target images. Naively conditioning on subject embeddings results in subject overfitting; FastComposer proposes delayed subject conditioning in the denoising step to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves a 300×–2500× speedup over fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. Code, model, and dataset are available at https://github.com/mit-han-lab/fastcomposer.
dc.publisher: Springer US
dc.relation.isversionof: https://doi.org/10.1007/s11263-024-02227-z
dc.rights: Creative Commons Attribution
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.source: Springer US
dc.title: FastComposer: Tuning-Free Multi-subject Image Generation with Localized Attention
dc.type: Article
dc.identifier.citation: Xiao, G., Yin, T., Freeman, W.T. et al. FastComposer: Tuning-Free Multi-subject Image Generation with Localized Attention. Int J Comput Vis 133, 1175–1194 (2025).
dc.relation.journal: International Journal of Computer Vision
dc.identifier.mitlicense: PUBLISHER_CC
dc.eprint.version: Final published version
dc.type.uri: http://purl.org/eprint/type/JournalArticle
eprint.status: http://purl.org/eprint/status/PeerReviewed
dc.date.updated: 2024-09-22T03:14:04Z
dc.language.rfc3066: en
dc.rights.holder: The Author(s)
dspace.embargo.terms: N
dspace.date.submission: 2024-09-22T03:14:04Z
mit.journal.volume: 133
mit.license: PUBLISHER_CC
mit.metadata.status: Authority Work and Publication Information Needed

