dc.contributor.advisor: Oliva, Aude
dc.contributor.author: Agarwal, Anisha
dc.date.accessioned: 2022-08-29T16:17:43Z
dc.date.available: 2022-08-29T16:17:43Z
dc.date.issued: 2022-05
dc.date.submitted: 2022-05-27T16:18:36.566Z
dc.identifier.uri: https://hdl.handle.net/1721.1/144873
dc.description.abstract: In this thesis, we re-implement prior work on image-to-speech captioning and extend it to video-to-speech captioning. Specifically, we implement a text-free image-to-speech captioning pipeline that integrates four distinct machine learning models, adapt the models to process video data rather than image data, and analyze the resulting speech captions. We also conduct experiments on the Wav2Vec2 and HuBERT automatic speech recognition models and identify which performs best on synthesized speech.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright MIT
dc.rights.uri: http://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Text-Free Audio Captions of Short Videos from Latent Space Representation
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Electrical Engineering and Computer Science

