Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion Supplementary Materials

Author(s)
Hansen, Jacob A.
Advisor
Glass, James
Karlinsky, Leonid
Terms of use
Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0). Copyright retained by author(s). https://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract
Visual Instruction Tuning (VisIT) data, commonly available as human-assistant conversations with images interleaved in the human turns, are currently the most widespread vehicle for aligning strong LLMs to understand visual inputs, converting them into strong LMMs. While many such VisIT datasets are available, most are constructed via ad hoc techniques proposed separately by different groups; they are commonly poorly documented, lack available (reproducible) code, and rely on paid closed-source model APIs such as GPT-4, Gemini, or Claude to convert image metadata (labels) into VisIT instructions. This incurs significant cost and makes it difficult to scale, improve quality, or produce VisIT data for new datasets. In this work, we address these challenges and propose an open and unified recipe and approach, Instructify, for converting available metadata into VisIT instructions using open LLMs. Our multi-stage Instructify features an efficient framework for metadata grouping, quality control, data and prompt organization, and conversation sampling. We show that our approach can reproduce or improve the data quality of available VisIT datasets when applied to the same image data and metadata sources, improving on GPT-4-generated VisIT instructions by ∼3% on average and by up to 21% on individual benchmarks using open models such as Gemma 2 27B and LLaMA 3.1 70B. We further show that our approach enables effective scaling of the produced VisIT data in both quantity and quality, as measured by the resulting LMM performance across a large variety of benchmarks. In addition, we explore the impact of multiple factors, including conversation format, base model selection, and resampling strategies.
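
The abstract outlines the core conversion idea: group per-image metadata, organize it into a prompt for an open LLM, and sample a human-assistant conversation from the model's output under quality control. The minimal Python sketch below illustrates that idea only; the names (ImageMetadata, build_prompt, parse_conversation), the prompt wording, and the JSON turn format are hypothetical and are not taken from the thesis or its code.

# Hypothetical sketch of metadata -> VisIT conversion as described in the abstract
# (not the thesis implementation): group per-image metadata, render it into an
# LLM prompt, and parse the model's output into alternating human/assistant turns.
import json
import textwrap
from dataclasses import dataclass, field

@dataclass
class ImageMetadata:
    image_id: str
    captions: list = field(default_factory=list)   # e.g. COCO-style captions
    objects: list = field(default_factory=list)    # e.g. object / region labels

def build_prompt(meta: ImageMetadata) -> str:
    """Render grouped metadata into a single instruction prompt for an open LLM."""
    return textwrap.dedent(f"""\
        You are given metadata describing an image (you cannot see the image).
        Captions: {json.dumps(meta.captions)}
        Objects: {json.dumps(meta.objects)}
        Write a short multi-turn conversation between a Human asking about the
        image and an Assistant answering, grounded only in the metadata above.
        Return JSON: [{{"role": "human", "text": "..."}}, {{"role": "assistant", "text": "..."}}, ...]
        """)

def parse_conversation(llm_output: str) -> list:
    """Basic quality control: keep only well-formed, strictly alternating turns."""
    try:
        turns = json.loads(llm_output)
    except json.JSONDecodeError:
        return []
    roles = [t.get("role") for t in turns]
    expected = ["human", "assistant"] * (len(turns) // 2)
    return turns if turns and roles == expected and all(t.get("text") for t in turns) else []

# Example usage with a stubbed LLM response:
meta = ImageMetadata("img_001",
                     captions=["a dog chasing a ball in a park"],
                     objects=["dog", "ball", "tree"])
prompt = build_prompt(meta)
fake_llm_output = json.dumps([
    {"role": "human", "text": "What animal is in the image?"},
    {"role": "assistant", "text": "A dog, which appears to be chasing a ball in a park."},
])
print(parse_conversation(fake_llm_output))

In this sketch the LLM call itself is stubbed out; in the approach described above, the prompt would instead be sent to an open model such as Gemma 2 27B or LLaMA 3.1 70B, and the returned conversations would be filtered and resampled by the quality-control stage.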
Date issued
2025-02
URI
https://hdl.handle.net/1721.1/159112
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
