dc.contributor.advisor: Katabi, Dina
dc.contributor.author: Li, Tianhong
dc.date.accessioned: 2025-03-12T16:55:51Z
dc.date.available: 2025-03-12T16:55:51Z
dc.date.issued: 2024-09
dc.date.submitted: 2025-03-04T18:31:54.590Z
dc.identifier.uri: https://hdl.handle.net/1721.1/158500
dc.description.abstract: Recognition and generation are two key tasks in computer vision. However, recognition and generative models are typically trained independently, which ignores the complementary nature of the two tasks. In this thesis, we present a unified framework for visual data recognition and generation via masked generative modeling, and demonstrate its power to address challenges across a range of applications. We begin with MAGE, a novel framework that unifies image generation and recognition while achieving state-of-the-art performance on both tasks. We then extend it to vision-language multi-modal training through ITIT, which uses unpaired image and text data to train models capable of high-quality, bidirectional image-text generation: the recognition capability enables accurate image-to-text captioning, while the generation capability enables realistic text-to-image generation. Moreover, inspired by the synergy between image generation and recognition observed in MAGE, we introduce RCG, a framework that raises the quality of unconditional image generation to the level of class-conditional generation by using representations learned in a self-supervised manner to guide the generative process. Lastly, we introduce Reparo, which addresses packet loss in video conferencing by using masked generative modeling to reconstruct lost video data without traditional error-correction methods, ensuring high-quality communication even under substantial data loss. Together, these works demonstrate that the proposed unified framework not only pushes forward the state of the art in individual downstream applications but also provides robust, versatile solutions adaptable to a wide range of real-world problems in computer vision and beyond.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Towards a Unified Framework for Visual Recognition and Generation via Masked Generative Modeling
dc.type: Thesis
dc.description.degree: Ph.D.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Doctoral
thesis.degree.name: Doctor of Philosophy

