Text-Free Audio Captions of Short Videos from Latent Space Representation
Author(s)
Agarwal, Anisha
Advisor
Oliva, Aude
Abstract
In this thesis, we re-implement previous work on image-to-speech captioning and extend it to video-to-speech captioning. Specifically, we implement a text-free image-to-speech captioning pipeline that integrates four distinct machine learning models. We then adapt the models to process video data rather than image data and analyze the resulting speech captions. We also conduct experiments with the Wav2Vec2 and HuBERT automatic speech recognition models and identify which works better with synthesized speech.
Date issued
2022-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology