Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion Supplementary Materials

Author(s)
Hansen, Jacob A.
Advisor
Glass, James
Karlinsky, Leonid
Terms of use
Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0). Copyright retained by author(s). https://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract
Visual Instruction Tuning (VisIT) data, commonly available as human-assistant conversations with images interleaved in the human turns, are currently the most widespread vehicle for aligning strong LLMs to understand visual inputs, converting them into strong LMMs. While many such VisIT datasets are available, most are constructed via ad hoc techniques proposed separately by different groups; they are commonly poorly documented, lack available (reproducible) code, and rely on paid closed-source model APIs such as GPT-4, Gemini, or Claude to convert image metadata (labels) into VisIT instructions. This incurs significant cost and makes it difficult to scale, improve quality, or produce VisIT data for new datasets. In this work, we address these challenges and propose an open and unified recipe and approach, Instructify, for converting available metadata into VisIT instructions using open LLMs. Our multi-stage Instructify features an efficient framework for metadata grouping, quality control, data and prompt organization, and conversation sampling. We show that our approach can reproduce or improve the data quality of available VisIT datasets when applied to the same image data and metadata sources, improving on GPT-4-generated VisIT instructions by ∼3% on average and by up to 21% on individual benchmarks using open models such as Gemma 2 27B and LLaMA 3.1 70B. We further show that our approach enables effective scaling of the produced VisIT data in both quantity and quality, as measured by the resulting LMM performance across a large variety of benchmarks. In addition, we explore the impact of multiple factors, including conversation format, base model selection, and resampling strategies.
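
The abstract outlines the core conversion idea: group per-image metadata, organize it into a prompt for an open LLM, and sample a human-assistant conversation from the model's output under quality control. The minimal Python sketch below illustrates that idea only; the names (ImageMetadata, build_prompt, parse_conversation), the prompt wording, and the JSON turn format are hypothetical and are not taken from the thesis or its code.

# Hypothetical sketch of metadata -> VisIT conversion as described in the abstract
# (not the thesis implementation): group per-image metadata, render it into an
# LLM prompt, and parse the model's output into alternating human/assistant turns.
import json
import textwrap
from dataclasses import dataclass, field

@dataclass
class ImageMetadata:
    image_id: str
    captions: list = field(default_factory=list)   # e.g. COCO-style captions
    objects: list = field(default_factory=list)    # e.g. object / region labels

def build_prompt(meta: ImageMetadata) -> str:
    """Render grouped metadata into a single instruction prompt for an open LLM."""
    return textwrap.dedent(f"""\
        You are given metadata describing an image (you cannot see the image).
        Captions: {json.dumps(meta.captions)}
        Objects: {json.dumps(meta.objects)}
        Write a short multi-turn conversation between a Human asking about the
        image and an Assistant answering, grounded only in the metadata above.
        Return JSON: [{{"role": "human", "text": "..."}}, {{"role": "assistant", "text": "..."}}, ...]
        """)

def parse_conversation(llm_output: str) -> list:
    """Basic quality control: keep only well-formed, strictly alternating turns."""
    try:
        turns = json.loads(llm_output)
    except json.JSONDecodeError:
        return []
    roles = [t.get("role") for t in turns]
    expected = ["human", "assistant"] * (len(turns) // 2)
    return turns if turns and roles == expected and all(t.get("text") for t in turns) else []

# Example usage with a stubbed LLM response:
meta = ImageMetadata("img_001",
                     captions=["a dog chasing a ball in a park"],
                     objects=["dog", "ball", "tree"])
prompt = build_prompt(meta)
fake_llm_output = json.dumps([
    {"role": "human", "text": "What animal is in the image?"},
    {"role": "assistant", "text": "A dog, which appears to be chasing a ball in a park."},
])
print(parse_conversation(fake_llm_output))

In this sketch the LLM call itself is stubbed out; in the approach described above, the prompt would instead be sent to an open model such as Gemma 2 27B or LLaMA 3.1 70B, and the returned conversations would be filtered and resampled by the quality-control stage.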
Date issued
2025-02
URI
https://hdl.handle.net/1721.1/159112
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
