dc.contributor.advisor: Katabi, Dina
dc.contributor.author: Li, Tianhong
dc.date.accessioned: 2025-03-12T16:55:51Z
dc.date.available: 2025-03-12T16:55:51Z
dc.date.issued: 2024-09
dc.date.submitted: 2025-03-04T18:31:54.590Z
dc.identifier.uri: https://hdl.handle.net/1721.1/158500
dc.description.abstract: Recognition and generation are two key tasks in computer vision. However, recognition and generative models are typically trained independently, which ignores the complementary nature of the two tasks. In this thesis, we present a unified framework for visual data recognition and generation via masked generative modeling, and demonstrate its power to address challenges across a range of applications. We begin with MAGE, a novel framework that unifies image generation and recognition while achieving state-of-the-art performance on both tasks. We then extend it to vision-language multi-modal training through ITIT, which uses unpaired image and text data to train models capable of high-quality, bidirectional image-text generation: the recognition capability enables accurate image-to-text captioning, while the generation capability enables realistic text-to-image generation. Moreover, inspired by the synergy between image generation and recognition observed in MAGE, we introduce RCG, a framework that raises the quality of unconditional image generation to the level of class-conditional generation by using representations learned in a self-supervised manner to guide the generative process. Lastly, we introduce Reparo, which addresses packet loss in video conferencing by using masked generative modeling to reconstruct lost video data without traditional error-correction methods, ensuring high-quality communication even under substantial data loss. Together, these works demonstrate that the proposed unified framework not only pushes forward the state of the art in individual downstream applications but also provides robust, versatile solutions adaptable to a wide range of real-world problems in computer vision and beyond.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Towards a Unified Framework for Visual Recognition and Generation via Masked Generative Modeling
dc.type: Thesis
dc.description.degree: Ph.D.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Doctoral
thesis.degree.name: Doctor of Philosophy

