Domain Adaptation of VLM for Soccer Video Understanding
Author(s)
Jiang, Tiancheng (Tony)
Advisor
Zarandi, Mohammad Fazel
Williams, John
Chuang, Isaac
Abstract
Vision Language Models (VLMs) have demonstrated strong performance in multi-modal tasks by effectively aligning visual and textual representations. However, most VLM research on video understanding has been domain-agnostic, leaving the models' ability to transfer to specialized domains underexplored. In this work, we address this gap by exploring the adaptability of open-source VLMs to specific domains, focusing on soccer as an initial case study. Our approach uses large-scale soccer datasets and an LLM to create instruction-following data, then uses that data to iteratively fine-tune a general-domain VLM in a curriculum learning fashion (first teaching the model key soccer concepts, then moving on to question-answering tasks). The final adapted model, trained on a curated dataset of 20k video clips, exhibits significant improvement on soccer-specific tasks compared to the base model, with a 37.5% relative improvement on the visual question-answering task and an accuracy improvement from 11.8% to 63.5% on the downstream soccer action classification task.
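The abstract reports one result as a relative improvement and the other as a before/after accuracy pair, and the two are easy to conflate. The short sketch below shows how each kind of figure is computed; the function names are illustrative and do not come from the thesis. Only the classification accuracies (11.8% and 63.5%) are taken from the abstract. The base score for the visual question-answering task is not stated there, so it is not reconstructed here.

```python
def absolute_gain(before: float, after: float) -> float:
    """Gain in percentage points: the plain difference of two percentages."""
    return after - before


def relative_gain(before: float, after: float) -> float:
    """Gain expressed as a percentage of the starting value."""
    return (after - before) / before * 100.0


# Action classification accuracies quoted in the abstract (percent).
base_acc, adapted_acc = 11.8, 63.5

print(f"absolute gain: {absolute_gain(base_acc, adapted_acc):.1f} percentage points")
print(f"relative gain: {relative_gain(base_acc, adapted_acc):.1f}% over the base model")
```

Keeping the two measures in separate functions makes it explicit which one a reported number refers to when comparing results across tasks.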
Date issued
2025-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science; Sloan School of Management
Publisher
Massachusetts Institute of Technology