DSpace@MIT

Self-Supervised Audio-Visual Speech Diarization and Recognition

Author(s)
Wongprommoon, Arun
Advisor
Glass, James
Terms of use
Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) Copyright retained by author(s) https://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract
Many real-world use cases of automatic speech recognition (ASR) involve video and multiple speakers, such as TV broadcasts and video conferences. However, state-of-the-art end-to-end multimodal ASR models generally do not support diarization. This thesis extends one such model, AV-HuBERT, to address the diarization problem while maintaining word recognition accuracy. The proposed Audio-Visual Cocktail (AVC) HuBERT model extends video input dimensions, increases feature size, and adds projection layers to split outputs by speaker. A complementary synthesized dataset, LRS3Mix, is constructed by mixing audio and video samples from LRS3 at varying overlap thresholds. This dataset is used to train the model, whose weights are transferred from AV-HuBERT. Several word error rate (WER) metrics, computed across several versions of AVC-HuBERT to measure recognition and diarization performance, demonstrate that the method improves diarization, albeit with a small tradeoff in word recognition. Augmenting the synthesized mixed dataset with the original clean single-speaker dataset boosts recognition ability, and the same effect is observed when the dataset size increases.
Date issued
2024-05
URI
https://hdl.handle.net/1721.1/156767
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
