dc.contributor.author | Northcutt, Curtis George | |
dc.contributor.author | Zha, Shengxin | |
dc.contributor.author | Lovegrove, Steven | |
dc.contributor.author | Newcombe, Richard | |
dc.date.accessioned | 2021-06-07T14:17:15Z | |
dc.date.available | 2021-06-07T14:17:15Z | |
dc.date.issued | 2020-09 | |
dc.identifier.issn | 0162-8828 | |
dc.identifier.issn | 2160-9292 | |
dc.identifier.issn | 1939-3539 | |
dc.identifier.uri | https://hdl.handle.net/1721.1/130907 | |
dc.description.abstract | Multi-modal datasets in artificial intelligence (AI) often capture a third-person perspective, but our embodied human intelligence evolved with sensory input from the egocentric, first-person perspective. Towards embodied AI, we introduce the Egocentric Communications (EgoCom) dataset to advance the state-of-the-art in conversational AI, natural language, audio speech analysis, computer vision, and machine learning. EgoCom is a first-of-its-kind natural conversations dataset containing multi-modal human communication data captured simultaneously from the participants' egocentric perspectives. EgoCom includes 38.5 hours of synchronized embodied stereo audio and egocentric video, with 240,000 ground-truth, time-stamped, word-level transcriptions and speaker labels from 34 diverse speakers. We study baseline performance on two novel applications that benefit from embodied data: (1) predicting turn-taking in conversations and (2) multi-speaker transcription. For (1), we investigate Bayesian baselines that predict turn-taking within 5% of human performance. For (2), we use simultaneous egocentric capture to combine Google speech-to-text outputs, improving global transcription by 79% relative to a single perspective. Both applications exploit EgoCom's synchronous multi-perspective data to augment performance on embodied AI tasks. | en_US |
dc.publisher | Institute of Electrical and Electronics Engineers (IEEE) | en_US |
dc.relation.isversionof | http://dx.doi.org/10.1109/tpami.2020.3025105 | en_US |
dc.rights | Creative Commons Attribution 4.0 International license | en_US |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | en_US |
dc.source | Curtis Northcutt | en_US |
dc.title | EgoCom: A Multi-person Multi-modal Egocentric Communications Dataset | en_US |
dc.type | Article | en_US |
dc.identifier.citation | Northcutt, Curtis G., et al. "EgoCom: A Multi-person Multi-modal Egocentric Communications Dataset." Forthcoming in IEEE Transactions on Pattern Analysis and Machine Intelligence (September 2020): dx.doi.org/10.1109/tpami.2020.3025105. | en_US |
dc.contributor.department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science | en_US |
dc.contributor.approver | Northcutt, Curtis George | en_US |
dc.relation.journal | IEEE Transactions on Pattern Analysis and Machine Intelligence | en_US |
dc.eprint.version | Final published version | en_US |
dc.type.uri | http://purl.org/eprint/type/JournalArticle | en_US |
eprint.status | http://purl.org/eprint/status/PeerReviewed | en_US |
dspace.date.submission | 2021-06-05T19:58:03Z | |
mit.license | PUBLISHER_CC | |
mit.metadata.status | Complete | |