Grounded SCAN Human: A Benchmark for Zero-Shot Generalizations
Author(s)
Sleeper, Dylan
Advisor
Katz, Boris
Abstract
In this work, we collect a new human-annotated dataset called Grounded SCAN Human (gSCAN Human) as an extension of the original Grounded SCAN (gSCAN) dataset. The original gSCAN dataset was created to test various compositional generalizations by holding out certain examples during training. At test time, models must zero-shot execute commands that require the agent to move in new directions, commands that contain novel combinations of objects and adjectives, and other such generalizations, organized into different test sets called splits. However, gSCAN does not contain splits that test zero-shot generalization to new sentence structures and an entirely new vocabulary. The gSCAN Human dataset was created to test these generalizations: can a model trained on a simple grammar generalize to human annotations? We collected and verified a total of 1,391 human annotations across all of the gSCAN splits (excluding the test and dev splits) and evaluated various models on each split. We test the original gSCAN baseline along with several modifications, including a variant that replaces the encoder with a transformer and one that performs early multimodal fusion of the sentence encoding with the visual embedding. We also test a multimodal transformer similar to ViLBERT, which is the state of the art on the original gSCAN splits. We find that the models are somewhat robust to varying sentence structure and new vocabulary; however, they are far less successful on a combination of the two, as evaluated by the human data.
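The "early multimodal fusion of the sentence encoding with the visual embedding" mentioned above can be illustrated with a minimal sketch. The Python (PyTorch) module below is a hypothetical illustration only, not code from the thesis or the gSCAN codebase: the names EarlyFusion, cmd_dim, vis_dim, and fused_dim are assumptions, and it shows just one common way to mix a pooled command encoding with per-cell grid embeddings before any decoding happens.

    # Minimal sketch (hypothetical, not thesis code) of early multimodal fusion:
    # project both modalities to a shared width and combine them before decoding.
    import torch
    import torch.nn as nn

    class EarlyFusion(nn.Module):
        def __init__(self, cmd_dim: int, vis_dim: int, fused_dim: int):
            super().__init__()
            self.cmd_proj = nn.Linear(cmd_dim, fused_dim)   # project command encoding
            self.vis_proj = nn.Linear(vis_dim, fused_dim)   # project per-cell visual embedding
            self.mix = nn.Sequential(nn.ReLU(), nn.Linear(fused_dim, fused_dim))

        def forward(self, cmd_enc: torch.Tensor, vis_emb: torch.Tensor) -> torch.Tensor:
            # cmd_enc: (batch, cmd_dim)         pooled sentence encoding of the command
            # vis_emb: (batch, cells, vis_dim)  embedding of each grid-world cell
            cmd = self.cmd_proj(cmd_enc).unsqueeze(1)   # (batch, 1, fused_dim)
            vis = self.vis_proj(vis_emb)                # (batch, cells, fused_dim)
            return self.mix(cmd + vis)                  # fused per-cell features for the decoder

    # Example usage with toy shapes (batch of 2, 6x6 grid = 36 cells):
    fusion = EarlyFusion(cmd_dim=128, vis_dim=64, fused_dim=128)
    fused = fusion(torch.randn(2, 128), torch.randn(2, 36, 64))
    print(fused.shape)  # torch.Size([2, 36, 128])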
Date issued
2022-02
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology