DSpace@MIT


MOBLLM: Model Building LLMs via Symbolic Regression and Experimental Design

Author(s)
Binbas, Berkin
Advisor
Englund, Dirk
Terms of use
In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Large language models (LLMs) have recently entered daily use and are already applied extensively to a wide range of tasks. They have been shown to carry out increasingly complex tasks, including those that require a high level of formal or mathematical reasoning at human or superhuman levels. In particular, their in-context learning capabilities, the domain-specific knowledge acquired from their vast pretraining corpora, and their fine-tunability for specific tasks have drawn considerable attention and research in the field. However, the application of LLMs to the frontiers of scientific research remains an underexplored direction. In this work, we investigate how LLMs can be leveraged to aid in building compact mathematical models and in experimental design. Specifically, we propose a framework that uses LLMs as a guide to concurrently handle the experimental design and symbolic regression tasks for data obtained from 1) a black-box 1D function and 2) a black-box physical system. We propose further modifications to our base framework and perform experiments to analyze how it performs under different experiment variants and across different LLM tiers. Our experiments reveal that while larger models (around 70B parameters) do not always achieve better downstream performance than smaller models (around 8B parameters), they are able to utilize the given information and/or physical context when designing experiments and proposing symbolic expressions, and they perform better than random-design baselines. We also observe that natural language constraints do not consistently improve symbolic regression accuracy. These results underscore both the challenges and the potential of integrating LLM agents into the scientific discovery process, particularly as proposers of experiments and symbolic expressions.
Date issued
2025-05
URI
https://hdl.handle.net/1721.1/162509
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
