Show simple item record

dc.contributor.author: Xiao, Guangxuan
dc.contributor.author: Yin, Tianwei
dc.contributor.author: Freeman, William T.
dc.contributor.author: Durand, Frédo
dc.contributor.author: Han, Song
dc.date.accessioned: 2026-03-17T14:41:12Z
dc.date.available: 2026-03-17T14:41:12Z
dc.date.issued: 2024-09-19
dc.identifier.uri: https://hdl.handle.net/1721.1/165200
dc.description.abstract: Diffusion models excel at text-to-image generation, especially in subject-driven generation for personalized images. However, existing methods are inefficient because they require subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Moreover, existing methods struggle with multi-subject generation, as they often blend identities among subjects. We present FastComposer, which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation based on subject images and textual instructions with only forward passes. To address the identity-blending problem in multi-subject generation, FastComposer proposes cross-attention localization supervision during training, enforcing that the attention of reference subjects is localized to the correct regions in the target images. Naively conditioning on subject embeddings results in subject overfitting; FastComposer proposes delayed subject conditioning in the denoising step to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves a 300×–2500× speedup over fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. Code, model, and dataset are available at https://github.com/mit-han-lab/fastcomposer.
dc.publisher: Springer US
dc.relation.isversionof: https://doi.org/10.1007/s11263-024-02227-z
dc.rights: Creative Commons Attribution
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.source: Springer US
dc.title: FastComposer: Tuning-Free Multi-subject Image Generation with Localized Attention
dc.type: Article
dc.identifier.citation: Xiao, G., Yin, T., Freeman, W.T. et al. FastComposer: Tuning-Free Multi-subject Image Generation with Localized Attention. Int J Comput Vis 133, 1175–1194 (2025).
dc.relation.journal: International Journal of Computer Vision
dc.identifier.mitlicense: PUBLISHER_CC
dc.eprint.version: Final published version
dc.type.uri: http://purl.org/eprint/type/JournalArticle
eprint.status: http://purl.org/eprint/status/PeerReviewed
dc.date.updated: 2024-09-22T03:14:04Z
dc.language.rfc3066: en
dc.rights.holder: The Author(s)
dspace.embargo.terms: N
dspace.date.submission: 2024-09-22T03:14:04Z
mit.journal.volume: 133
mit.license: PUBLISHER_CC
mit.metadata.status: Authority Work and Publication Information Needed

