DSpace@MIT


Towards a Unified Framework for Visual Recognition and Generation via Masked Generative Modeling

Author(s)
Li, Tianhong
Download
Thesis PDF (65.01 MB)
Advisor
Katabi, Dina
Terms of use
In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Recognition and generation are two key tasks in computer vision. However, recognition and generative models are typically trained independently, which ignores the complementary nature of the two tasks. In this thesis, we present a unified framework for visual data recognition and generation via masked generative modeling, and further demonstrate its power to address challenges across various applications. We begin with MAGE, a novel framework that unifies image generation and recognition while achieving state-of-the-art performance on both tasks. We then extend it to vision-language multi-modal training through ITIT, which utilizes unpaired image and text data to train models capable of high-quality, bidirectional image-text generation: the recognition power enables accurate image-to-text captioning, while the generation power enables realistic text-to-image generation. Moreover, inspired by the synergy between image generation and recognition observed in MAGE, we introduce RCG, a framework that raises the quality of unconditional image generation to the level of class-conditional generation by using representations learned in a self-supervised manner to guide the generative process. Lastly, we introduce Reparo, which applies masked generative modeling to the challenge of packet loss in video conferencing, enabling the reconstruction of lost video data without traditional error correction methods and ensuring high-quality communication even under substantial data loss. These works demonstrate the power of the proposed unified framework not only to push forward the state of the art in individual downstream applications but also to provide robust, versatile solutions adaptable to a wide range of real-world problems in computer vision and beyond.
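The primitive shared by the methods summarized above is training a model to reconstruct randomly masked discrete tokens. The masking step can be sketched as follows (a minimal illustration only, not code from the thesis; the function name, the 50% mask ratio, and the mask id are assumptions for the example):

```python
import numpy as np

def mask_tokens(tokens, mask_ratio, mask_id, rng):
    """Randomly replace a fraction of tokens with a special mask id.

    Returns the corrupted sequence and a boolean array marking the
    positions a masked generative model would be trained to predict.
    """
    tokens = np.asarray(tokens)
    n_mask = int(round(mask_ratio * tokens.size))
    # Sample distinct positions to hide from the model.
    idx = rng.choice(tokens.size, size=n_mask, replace=False)
    masked = tokens.copy()
    masked[idx] = mask_id
    targets = np.zeros(tokens.size, dtype=bool)
    targets[idx] = True
    return masked, targets

# Example: a toy "sequence" of 8 discrete image tokens, 50% masking.
rng = np.random.default_rng(0)
tokens = np.array([3, 7, 1, 4, 9, 2, 8, 5])
masked, targets = mask_tokens(tokens, mask_ratio=0.5, mask_id=-1, rng=rng)
```

At a high mask ratio the task behaves like generation (most content must be synthesized); at a low ratio it behaves like representation learning, which is one way to read the recognition-generation synergy the abstract describes.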
Date issued
2024-09
URI
https://hdl.handle.net/1721.1/158500
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Doctoral Theses

Content created by the MIT Libraries, CC BY-NC unless otherwise noted. Notify us about copyright concerns.