Augmenting Inputs using a Novel Figure-to-Text Pipeline to Assist Visual Language Models in Answering Scientific Domain Queries

Gupta, Sejal

dc.contributor.advisor	Cafarella, Michael
dc.contributor.author	Gupta, Sejal
dc.date.accessioned	2024-09-16T13:51:22Z
dc.date.available	2024-09-16T13:51:22Z
dc.date.issued	2024-05
dc.date.submitted	2024-07-11T14:37:05.851Z
dc.identifier.uri	https://hdl.handle.net/1721.1/156824
dc.description.abstract	Recent advancements in visual language models (VLMs) have transformed the way we interpret and interact with digital imagery, bridging the gap between visual and textual data. However, these models, like Bard, GPT4-v, and LLava, often struggle with specialized fields, particularly when processing scientific imagery such as plots and graphs in scientific literature. In this thesis, we discuss the development of a pioneering reconstruction pipeline to extract metadata, regenerate plot data, and filter out extraneous noise like legends from plot images. Ultimately, the collected information is presented to the VLM in structured, textual manner to assist in answering domain specific queries. The efficacy of this pipeline is evaluated using a novel dataset comprised of scientific plots extracted from battery domain literature, alongside the existing benchmark datasets including PlotQA and ChartQA. Results about the component accuracy, task accuracy, and question-answering with augmented inputs to a VLM show promise in the future capabilities of this work. By assisting VLMs with scientific imagery, we aim to not only enhance the capabilities of VLMs in specialized scientific areas but also to transform the performance of VLMs in domain specific areas as a whole. This thesis provides a detailed overview of the work, encompassing a literature review, methodology, results, and recommendations for future work.
dc.publisher	Massachusetts Institute of Technology
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
dc.rights	Copyright retained by author(s)
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.title	Augmenting Inputs using a Novel Figure-to-Text Pipeline to Assist Visual Language Models in Answering Scientific Domain Queries
dc.type	Thesis
dc.description.degree	M.Eng.
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree	Master
thesis.degree.name	Master of Engineering in Electrical Engineering and Computer Science

Files in this item

Name:: gupta-sejalg-meng-eecs-2024-th ...
Size:: 6.907Mb
Format:: PDF
Description:: Thesis PDF

View/Open

This item appears in the following Collection(s)

Graduate Theses

Show simple item record