Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion Supplementary Materials

Hansen, Jacob A.

dc.contributor.advisor	Glass, James
dc.contributor.advisor	Karlinsky, Leonid
dc.contributor.author	Hansen, Jacob A.
dc.date.accessioned	2025-04-14T14:06:19Z
dc.date.available	2025-04-14T14:06:19Z
dc.date.issued	2025-02
dc.date.submitted	2025-04-03T14:06:16.894Z
dc.identifier.uri	https://hdl.handle.net/1721.1/159112
dc.description.abstract	Visual Instruction Tuning (VisIT) data, commonly available as human-assistant conversations with images interleaved in the human turns, are currently the most widespread vehicle for aligning strong LLMs to understand visual inputs, converting them to strong LMMs. While many such VisIT datasets are available, most of them are constructed via ad hoc techniques, separately proposed by different groups, commonly poorly documented, without available (reproducible) code, and employing paid closed-source model APIs like GPT-4, Gemini, or Claud to convert image metadata (labels) to VisIT instructions. This incurs significant cost and difficulty to scale, improve quality, or produce VisIT data for new datasets. In this work, we address these challenges and propose an open and unified recipe and approach, Instructify, for converting available metadata to VisIT instructions using open LLMs. Our multi-stage Instructify features an efficient framework for metadata grouping, quality control, data and prompt organization, and conversation sampling. We show that our approach can reproduce or improve the data quality of the available VisIT datasets when applied to the same image data and metadata sources, improving GPT-4 generated VisIT instructions by ∼3% on average and up to 21% on individual benchmarks using open models, such as Gemma 2 27B and LLaMa 3.1 70B. We further show that our approach enables effective performance scaling (in terms of resulting LMM performance on a large variety of benchmarks) of the produced VisIT data both in terms of quantity and quality. In addition, we explore the impact of multiple factors, including conversation format, base model selection, and resampling strategies.
dc.publisher	Massachusetts Institute of Technology
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
dc.rights	Copyright retained by author(s)
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.title	Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion Supplementary Materials
dc.type	Thesis
dc.description.degree	M.Eng.
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree	Master
thesis.degree.name	Master of Engineering in Electrical Engineering and Computer Science

Files in this item

Name:: hansen-jahansen-meng-eecs-2025 ...
Size:: 5.219Mb
Format:: PDF
Description:: Thesis PDF

View/Open

This item appears in the following Collection(s)

Graduate Theses

Show simple item record