MOBLLM: Model Building LLMs via Symbolic Regression and Experimental Design

Binbas, Berkin

Author(s)

Binbas, Berkin

DownloadThesis PDF (1.793Mb)

Advisor

Englund, Dirk

Terms of use

In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/

Metadata

Show full item record

Abstract

Large language models (LLMs) have recently emerged for daily use and have already been extensively utilized for various tasks. They are shown to be able to carry out more and more complex tasks every day, including those that require a high level of formal/mathematical reasoning at human or superhuman levels. In particular, their in-context learning capabilities and the domain-specific knowledge they have via their vast pretraining corpus, as well as their fine-tunability for specific tasks drove a lot of attention and research in the field. However, applications of LLMs to the frontiers of scientific research remains an underexplored direction. In this work, we investigate how one can leverage LLMs to aid with building compact mathematical models and experimental design. Specifically, we propose a framework for using LLMs as a guide to concurrently handle the experimental design and symbolic regression tasks for data obtained from 1) a black box 1D function and 2) a black box physical system. We propose further modifications to our base framework, and perform experiments to analyze how it performs under different experiment variants, across different LLM tiers. Our experiments reveal that while larger models (of around 70b parameters) do not always achieve better downstream performance compared to smaller models (of around 8b parameters), they are able to utilize the given information and/or physical context when designing experiments and proposing symbolic expressions, and perform better than random-design baselines. We also observe that natural language constraints do not consistently improve symbolic regression accuracy. These results underscore both the challenges and the potential of integrating LLM agents into the scientific discovery process, particularly as proposers of experiments and symbolic expressions.

Date issued

2025-05

URI

https://hdl.handle.net/1721.1/162509

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Graduate Theses