DSpace@MIT


Towards a Unified Framework for Visual Recognition and Generation via Masked Generative Modeling

Author(s)
Li, Tianhong
Download
Thesis PDF (65.01 MB)
Advisor
Katabi, Dina
Terms of use
In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Recognition and generation are two key tasks in computer vision. However, recognition and generative models are typically trained independently, which ignores the complementary nature of the two tasks. In this thesis, we present a unified framework for visual data recognition and generation via masked generative modeling, and further demonstrate its power to address challenges across various applications. We begin with MAGE, a novel framework that unifies image generation and recognition while achieving state-of-the-art performance on both tasks. We then extend it to vision-language multi-modal training through ITIT, which utilizes unpaired image and text data to train models capable of high-quality, bidirectional image-text generation: the recognition power enables accurate image-to-text captioning, while the generation power enables realistic text-to-image generation. Moreover, inspired by the synergy between image generation and recognition observed in MAGE, we introduce RCG, a framework that raises the quality of unconditional image generation to the level of class-conditional generation by using representations learned in a self-supervised manner to guide the generative process. Lastly, we introduce Reparo, which applies masked generative modeling to the challenge of packet loss in video conferencing, enabling the reconstruction of lost video data without traditional error correction methods and ensuring high-quality communication even under substantial data loss. These works demonstrate the power of the proposed unified framework not only to push forward the state of the art in individual downstream applications but also to provide robust, versatile solutions adaptable to a wide range of real-world problems in computer vision and beyond.
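The primitive shared by the methods summarized above is training a model to reconstruct randomly masked discrete tokens. The masking step can be sketched as follows (a minimal illustration only, not code from the thesis; the function name, the 50% mask ratio, and the mask id are assumptions for the example):

```python
import numpy as np

def mask_tokens(tokens, mask_ratio, mask_id, rng):
    """Randomly replace a fraction of tokens with a special mask id.

    Returns the corrupted sequence and a boolean array marking the
    positions a masked generative model would be trained to predict.
    """
    tokens = np.asarray(tokens)
    n_mask = int(round(mask_ratio * tokens.size))
    # Sample distinct positions to hide from the model.
    idx = rng.choice(tokens.size, size=n_mask, replace=False)
    masked = tokens.copy()
    masked[idx] = mask_id
    targets = np.zeros(tokens.size, dtype=bool)
    targets[idx] = True
    return masked, targets

# Example: a toy "sequence" of 8 discrete image tokens, 50% masking.
rng = np.random.default_rng(0)
tokens = np.array([3, 7, 1, 4, 9, 2, 8, 5])
masked, targets = mask_tokens(tokens, mask_ratio=0.5, mask_id=-1, rng=rng)
```

At a high mask ratio the task behaves like generation (most content must be synthesized); at a low ratio it behaves like representation learning, which is one way to read the recognition-generation synergy the abstract describes.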
Date issued
2024-09
URI
https://hdl.handle.net/1721.1/158500
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Doctoral Theses

Content created by the MIT Libraries, CC BY-NC unless otherwise noted. Notify us about copyright concerns.