Injection of Domain-Specific Knowledge for Enterprise Text-to-SQL
Author(s)
Choi, Justin J.
Advisor
Stonebraker, Michael R.
Abstract
This work examines the current state of using large language models (LLMs) to solve Text-to-SQL tasks on enterprise databases. Benchmarks on publicly available datasets do not fully capture the difficulty and complexity of this task in a real-world enterprise setting. This study examines the critical steps needed to work with enterprise data and the use of knowledge-injection to enhance the performance of LLMs on Text-to-SQL tasks. We begin by evaluating the baseline performance of LLMs on enterprise databases, revealing that a predominant source of failure is a lack of domain-specific knowledge. To improve performance, we explore knowledge-injection: the process of incorporating internal and external knowledge. Internal knowledge consists of database-specific information such as join logic, while external knowledge refers to institutional context such as acronyms or group names. We present a hybrid retrieval pipeline that combines embedding-based and text-based search with LLM-guided ranking to supply models with relevant external knowledge during Text-to-SQL generation. We evaluate the impact of knowledge-injection by testing the performance of LLMs on the table retrieval task after they are augmented with appropriate external knowledge. Using BEAVER, an enterprise-level Text-to-SQL benchmark, we demonstrate that knowledge-injection significantly improves accuracy on table retrieval. Our findings highlight the importance of domain-specific knowledge-injection and retrieval augmentation in bringing LLMs closer to deployment in enterprise-grade database systems, and they identify common failure modes that occur when executing enterprise Text-to-SQL.
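The hybrid retrieval idea described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: the glossary entries, the function names, the bag-of-words stand-in for an embedding model, the reciprocal rank fusion step, and the optional `rerank` hook (where an LLM-guided ranker could be plugged in) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical external-knowledge glossary; entries are invented for illustration.
GLOSSARY = {
    "EECS": "EECS is the Department of Electrical Engineering and Computer Science",
    "FY": "FY means fiscal year, which runs from July through June",
    "PI": "PI is the principal investigator responsible for a research grant",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[tok] for tok, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def embedding_rank(question, glossary):
    """Rank entries by vector similarity (bag-of-words stands in for a real embedding model)."""
    q = Counter(tokenize(question))
    scores = {k: cosine(q, Counter(tokenize(v))) for k, v in glossary.items()}
    return sorted(glossary, key=lambda k: -scores[k])

def keyword_rank(question, glossary):
    """Rank entries whose key appears verbatim in the question ahead of the rest."""
    q = set(tokenize(question))
    return sorted(glossary, key=lambda k: k.lower() not in q)

def fuse(rankings, k=60):
    """Combine several rankings with reciprocal rank fusion (an assumed fusion rule)."""
    fused = Counter()
    for ranking in rankings:
        for rank, key in enumerate(ranking, start=1):
            fused[key] += 1.0 / (k + rank)
    return [key for key, _ in fused.most_common()]

def retrieve(question, glossary, top_k=2, rerank=None):
    """Hybrid retrieval; `rerank` is a hook where an LLM-guided ranker could reorder candidates."""
    candidates = fuse([embedding_rank(question, glossary), keyword_rank(question, glossary)])
    if rerank is not None:
        candidates = rerank(question, candidates)
    return candidates[:top_k]

def build_prompt(question, glossary, keys):
    """Prepend the retrieved knowledge to the question before Text-to-SQL generation."""
    notes = "\n".join(f"- {glossary[k]}" for k in keys)
    return f"Relevant knowledge:\n{notes}\n\nQuestion: {question}"
```

Calling `retrieve("total grants per PI in EECS", GLOSSARY)` and passing the result to `build_prompt` yields a prompt that carries the two acronym definitions alongside the question. In a real system the similarity function and the `rerank` hook would be backed by an embedding model and an LLM respectively.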
Date issued
2025-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology