Show simple item record

dc.contributor.advisor: Glass, James
dc.contributor.advisor: Karlinsky, Leonid
dc.contributor.author: Hansen, Jacob A.
dc.date.accessioned: 2025-04-14T14:06:19Z
dc.date.available: 2025-04-14T14:06:19Z
dc.date.issued: 2025-02
dc.date.submitted: 2025-04-03T14:06:16.894Z
dc.identifier.uri: https://hdl.handle.net/1721.1/159112
dc.description.abstract: Visual Instruction Tuning (VisIT) data, commonly available as human-assistant conversations with images interleaved in the human turns, are currently the most widespread vehicle for aligning strong LLMs to understand visual inputs, converting them to strong LMMs. While many such VisIT datasets are available, most of them are constructed via ad hoc techniques, separately proposed by different groups, commonly poorly documented, without available (reproducible) code, and employing paid closed-source model APIs like GPT-4, Gemini, or Claude to convert image metadata (labels) to VisIT instructions. This incurs significant cost and makes it difficult to scale, improve quality, or produce VisIT data for new datasets. In this work, we address these challenges and propose an open and unified recipe and approach, Instructify, for converting available metadata to VisIT instructions using open LLMs. Our multi-stage Instructify features an efficient framework for metadata grouping, quality control, data and prompt organization, and conversation sampling. We show that our approach can reproduce or improve the data quality of the available VisIT datasets when applied to the same image data and metadata sources, improving GPT-4 generated VisIT instructions by ∼3% on average and up to 21% on individual benchmarks using open models, such as Gemma 2 27B and LLaMa 3.1 70B. We further show that our approach enables effective performance scaling (in terms of resulting LMM performance on a large variety of benchmarks) of the produced VisIT data both in terms of quantity and quality. In addition, we explore the impact of multiple factors, including conversation format, base model selection, and resampling strategies.
dc.publisher: Massachusetts Institute of Technology
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.title: Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion Supplementary Materials
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Electrical Engineering and Computer Science

