DSpace@MIT

Expectation-based comprehension of linguistic input: facilitation from visual context

Author(s)
Pushpita, Subha Nawer
Thesis PDF (20.39 MB)
Advisor
Levy, Roger
Terms of use
In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Context fundamentally shapes real-time human language processing, creating linguistic expectations that drive efficient processing and accurate disambiguation (Kuperberg and Jaeger, 2016). In naturalistic language understanding, the visual scene often provides crucial context (Ferreira et al., 2013; Huettig et al., 2011). We know that visual context guides spoken word recognition (Allopenna et al., 1998), syntactic disambiguation (Tanenhaus et al., 1995), and prediction (Altmann and Kamide, 1999), but much about how visual context shapes real-time language comprehension remains unknown. In this project, we investigate how visual information penetrates the language processing system and shapes real-time language understanding. Here we show that relevant visual context significantly facilitates reading comprehension, with the amount of facilitation modulated by a word's degree of grounding in that visual context (in our case, an image). Our results also demonstrate that this facilitation is largely mediated by multimodal surprisal: the relative entropy, induced by the word, between the distributions over interpretations given the previous words in the sentence and given the image. We also found that the errors people are prone to make in reading comprehension tasks are largely predicted by multimodal surprisal. Finally, the results highlight a strong correlation between a word's degree of grounding and the reduction in its surprisal in the presence of an image. Our work opens new possibilities for using multimodal large language models in psycholinguistic research to investigate how visual context affects language processing.
This work also pioneers questions about how information processed in other modalities, such as audio, video, or structured visuals like graphs and diagrams, shapes upcoming linguistic comprehension and even language generation, offering fundamental theoretical insight into how we use language to navigate a complex world.
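The facilitation effect the abstract describes can be illustrated with the standard surprisal quantity, -log P(word | context). The following is a minimal sketch, not the thesis's actual models or data: the probabilities are invented for illustration, and the comparison simply contrasts a word's surprisal under a text-only context against a (hypothetically lower) surprisal once an image grounds the word.

```python
import math

def surprisal(prob: float) -> float:
    """Surprisal in bits: -log2 P(word | context)."""
    return -math.log2(prob)

# Hypothetical next-word probabilities for a grounded word given a
# sentence prefix, with and without an accompanying image. The numbers
# are illustrative placeholders, not measured values from the thesis.
p_text_only = 0.05    # P(word | linguistic context alone)
p_multimodal = 0.40   # P(word | linguistic context + image)

s_text = surprisal(p_text_only)
s_multi = surprisal(p_multimodal)

# A grounding image makes the word more predictable, lowering its
# surprisal; the size of this reduction is the facilitation effect.
reduction = s_text - s_multi
print(f"text-only surprisal:  {s_text:.2f} bits")
print(f"multimodal surprisal: {s_multi:.2f} bits")
print(f"reduction:            {reduction:.2f} bits")
```

Under these toy numbers the image cuts the word's surprisal by 3 bits; in the thesis's framing, larger reductions should accompany words that are more strongly grounded in the image.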
Date issued
2024-05
URI
https://hdl.handle.net/1721.1/156826
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
