| Field | Value | Language |
| --- | --- | --- |
| dc.contributor.advisor | Oliva, Aude | |
| dc.contributor.author | Agarwal, Anisha | |
| dc.date.accessioned | 2022-08-29T16:17:43Z | |
| dc.date.available | 2022-08-29T16:17:43Z | |
| dc.date.issued | 2022-05 | |
| dc.date.submitted | 2022-05-27T16:18:36.566Z | |
| dc.identifier.uri | https://hdl.handle.net/1721.1/144873 | |
| dc.description.abstract | In this thesis, we re-implement prior work on image-to-speech captioning and extend it to video-to-speech captioning. Specifically, we implement a text-free image-to-speech captioning pipeline that integrates four distinct machine learning models. We adapt the models to process video data rather than image data and analyze the resulting speech captions. We conduct experiments on the Wav2Vec2 and HuBERT automatic speech recognition models and identify which works best with synthesized speech. | |
| dc.publisher | Massachusetts Institute of Technology | |
| dc.rights | In Copyright - Educational Use Permitted | |
| dc.rights | Copyright MIT | |
| dc.rights.uri | http://rightsstatements.org/page/InC-EDU/1.0/ | |
| dc.title | Text-Free Audio Captions of Short Videos from Latent Space Representation | |
| dc.type | Thesis | |
| dc.description.degree | M.Eng. | |
| dc.contributor.department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science | |
| mit.thesis.degree | Master | |
| thesis.degree.name | Master of Engineering in Electrical Engineering and Computer Science | |