DSpace@MIT

Expectation-based comprehension of linguistic input: facilitation from visual context

Author(s)
Pushpita, Subha Nawer
Thesis PDF (20.39 MB)
Advisor
Levy, Roger
Terms of use
In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Context fundamentally shapes real-time human language processing, creating linguistic expectations that drive efficient processing and accurate disambiguation (Kuperberg and Jaeger, 2016). In naturalistic language understanding, the visual scene often provides crucial context (Ferreira et al., 2013; Huettig et al., 2011). We know that visual context guides spoken word recognition (Allopenna et al., 1998), syntactic disambiguation (Tanenhaus et al., 1995), and prediction (Altmann and Kamide, 1999), but much about how visual context shapes real-time language comprehension remains unknown. In this project, we investigate how visual information penetrates the language processing system and shapes real-time language understanding. Here we show that relevant visual context significantly facilitates reading comprehension, with the amount of facilitation modulated by a word's degree of grounding in that visual context (in our case, an image). Our results also demonstrate that this facilitation is largely mediated by multimodal surprisal: the relative entropy, induced by the word, between the distributions over interpretations given the previous words in the sentence and given the image. We also found that the errors people are prone to make in reading comprehension tasks are largely predicted by multimodal surprisal. Finally, the results highlight a strong correlation between a word's degree of grounding and the reduction in its surprisal in the presence of an image. Our work opens new possibilities for using multimodal large language models in psycholinguistic research to investigate how visual context affects language processing.
This work also pioneers questions about how information processed in other modalities, such as audio, video, or structured visuals like graphs and diagrams, shapes upcoming linguistic comprehension and even language generation, offering fundamental theoretical insight into how we use language to navigate a complex world.
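The facilitation effect the abstract describes can be illustrated with the standard surprisal quantity, -log P(word | context). The following is a minimal sketch, not the thesis's actual models or data: the probabilities are invented for illustration, and the comparison simply contrasts a word's surprisal under a text-only context against a (hypothetically lower) surprisal once an image grounds the word.

```python
import math

def surprisal(prob: float) -> float:
    """Surprisal in bits: -log2 P(word | context)."""
    return -math.log2(prob)

# Hypothetical next-word probabilities for a grounded word given a
# sentence prefix, with and without an accompanying image. The numbers
# are illustrative placeholders, not measured values from the thesis.
p_text_only = 0.05    # P(word | linguistic context alone)
p_multimodal = 0.40   # P(word | linguistic context + image)

s_text = surprisal(p_text_only)
s_multi = surprisal(p_multimodal)

# A grounding image makes the word more predictable, lowering its
# surprisal; the size of this reduction is the facilitation effect.
reduction = s_text - s_multi
print(f"text-only surprisal:  {s_text:.2f} bits")
print(f"multimodal surprisal: {s_multi:.2f} bits")
print(f"reduction:            {reduction:.2f} bits")
```

Under these toy numbers the image cuts the word's surprisal by 3 bits; in the thesis's framing, larger reductions should accompany words that are more strongly grounded in the image.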
Date issued
2024-05
URI
https://hdl.handle.net/1721.1/156826
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
