| Field | Value | Language |
| --- | --- | --- |
| dc.contributor.advisor | Oliva, Aude | |
| dc.contributor.author | Agarwal, Anisha | |
| dc.date.accessioned | 2022-08-29T16:17:43Z | |
| dc.date.available | 2022-08-29T16:17:43Z | |
| dc.date.issued | 2022-05 | |
| dc.date.submitted | 2022-05-27T16:18:36.566Z | |
| dc.identifier.uri | https://hdl.handle.net/1721.1/144873 | |
| dc.description.abstract | In this thesis, we re-implement prior work on image-to-speech captioning and extend it to video-to-speech captioning. Specifically, we implement a text-free image-to-speech captioning pipeline that integrates four distinct machine learning models. We adapt the models to process video data rather than image data and analyze the resulting speech captions. We conduct experiments on the Wav2Vec2 and HuBERT automatic speech recognition models and identify which works best with synthesized speech. | |
| dc.publisher | Massachusetts Institute of Technology | |
| dc.rights | In Copyright - Educational Use Permitted | |
| dc.rights | Copyright MIT | |
| dc.rights.uri | http://rightsstatements.org/page/InC-EDU/1.0/ | |
| dc.title | Text-Free Audio Captions of Short Videos from Latent Space Representation | |
| dc.type | Thesis | |
| dc.description.degree | M.Eng. | |
| dc.contributor.department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science | |
| mit.thesis.degree | Master | |
| thesis.degree.name | Master of Engineering in Electrical Engineering and Computer Science | |