Analyzing Inconsistent Results of Table Transformer for Improved Data Extraction in Childhood Obesity Intervention Literature
Author(s)
Neupane, Pragya
DownloadThesis PDF (8.659Mb)
Advisor
Hosoi, Anette E. "Peko"
Terms of use
Metadata
Show full item recordAbstract
Tables in scientific literature are rich sources of structured data, yet their complex and variable formats pose challenges for automated extraction. This thesis focuses on improving the reliability of Table Structure Recognition (TSR) using the Table Transformer (TATR) model, with a specific application to childhood obesity intervention studies. While fine-tuning TATR on a domain-specific dataset improves detection metrics, persistent errors such as overlapping rows and misclassified header columns remain. Through a systematic post-hoc error analysis of 175 scientific tables, we identify these dominant failure modes and develop lightweight post-processing modules: an overlap-aware row filtering algorithm and an OCR-enhanced column boundary correction method. Importantly, instead of relying on computationally expensive large language models (LLMs), this approach leverages efficient, interpretable techniques tailored to the domain-specific structure of public health tables. Our combined method reduces the proportion of structurally erroneous tables from 46.3% to an estimated 9.7–12.6%, improving the semantic alignment and interpretability of model outputs. This work contributes a transparent and scalable pipeline that enhances the trustworthiness of automated table extraction systems, with direct relevance to evidence-based decision-making in public health.
Date issued
2025-05Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science; Massachusetts Institute of Technology. Institute for Data, Systems, and SocietyPublisher
Massachusetts Institute of Technology