Domain Adaptation of VLM for Soccer Video Understanding

Author(s)

Jiang, Tiancheng (Tony)

Advisor

Zarandi, Mohammad Fazel

Williams, John

Chuang, Isaac

Terms of use

In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/

Abstract

Vision Language Models (VLMs) have demonstrated strong performance in multi-modal tasks by effectively aligning visual and textual representations. However, most VLM research on video understanding has been domain-agnostic, leaving the models' capacity for transfer learning to specialized domains underexplored. In this work, we address this gap by exploring the adaptability of open-source VLMs to specific domains, focusing on soccer as an initial case study. Our approach uses large-scale soccer datasets and an LLM to create instruction-following data, then uses that data to iteratively fine-tune a general-domain VLM in a curriculum learning fashion (first teaching the model key soccer concepts, then moving on to question-answering tasks). The final adapted model, trained on a curated dataset of 20k video clips, shows significant improvement on soccer-specific tasks compared to the base model, with a 37.5% relative improvement on the visual question-answering task and an accuracy improvement from 11.8% to 63.5% on the downstream soccer action classification task.
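
The abstract reports one result as a relative improvement and the other as a change in accuracy, and the two are not directly comparable. The short Python sketch below shows how the two measures differ, using only the classification accuracies quoted above (11.8% and 63.5%). The function names are illustrative assumptions and are not taken from the thesis.

```python
def absolute_gain(base: float, adapted: float) -> float:
    """Difference between two scores, in percentage points."""
    return adapted - base


def relative_gain(base: float, adapted: float) -> float:
    """Gain expressed as a percentage of the base score."""
    if base == 0:
        raise ValueError("relative gain is undefined for a base score of 0")
    return (adapted - base) / base * 100.0


# Action classification accuracies quoted in the abstract (percent).
BASE_ACC = 11.8
ADAPTED_ACC = 63.5

abs_points = absolute_gain(BASE_ACC, ADAPTED_ACC)
rel_percent = relative_gain(BASE_ACC, ADAPTED_ACC)

print(f"absolute gain: {abs_points:.1f} percentage points")
print(f"relative gain: {rel_percent:.1f}%")
```

For the classification task this works out to a gain of about 51.7 percentage points, or roughly 438% relative to the base model. The same calculation cannot be run for the visual question-answering task, because the abstract gives only the 37.5% relative figure and not the underlying scores.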

Date issued

2025-05

URI

https://hdl.handle.net/1721.1/163257

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science; Sloan School of Management

Publisher

Massachusetts Institute of Technology

Collections

Graduate Theses