Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion Supplementary Materials
Author(s)
Hansen, Jacob A.
Advisor
Glass, James
Karlinsky, Leonid
Abstract
Visual Instruction Tuning (VisIT) data, commonly available as human-assistant conversations with images interleaved in the human turns, are currently the most widespread vehicle for aligning strong LLMs to understand visual inputs, converting them to strong LMMs. While many such VisIT datasets are available, most of them are constructed via ad hoc techniques that are proposed separately by different groups, often poorly documented, released without reproducible code, and reliant on paid closed-source model APIs such as GPT-4, Gemini, or Claude to convert image metadata (labels) to VisIT instructions. This incurs significant cost and makes it difficult to scale, improve quality, or produce VisIT data for new datasets. In this work, we address these challenges and propose an open and unified recipe and approach, Instructify, for converting available metadata to VisIT instructions using open LLMs. Our multi-stage Instructify features an efficient framework for metadata grouping, quality control, data and prompt organization, and conversation sampling. We show that our approach can reproduce or improve the data quality of the available VisIT datasets when applied to the same image data and metadata sources, improving on GPT-4-generated VisIT instructions by ∼3% on average and up to 21% on individual benchmarks using open models, such as Gemma 2 27B and LLaMa 3.1 70B. We further show that our approach enables effective performance scaling (in terms of resulting LMM performance on a large variety of benchmarks) of the produced VisIT data both in terms of quantity and quality. In addition, we explore the impact of multiple factors, including conversation format, base model selection, and resampling strategies.
Date issued
2025-02
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology