Advances in Symbolic Regression: From Generalized Formulation to Density Estimation and Inverse Problem

Tohme, Tony

Author(s)

Tohme, Tony

Advisor

Youcef-Toumi, Kamal

Terms of use

In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/

Abstract

In this thesis, we explore the field of Symbolic Regression (SR), a middle ground between simple linear regression and complex, inscrutable black-box regressors such as neural networks. In essence, SR searches the space of mathematical expressions to find a model that best captures the relationship between the inputs and outputs of a given dataset. While SR has not gained mainstream popularity due to its computational intricacy and reliance on heuristics, its potential for generating explicit, concise, and interpretable mathematical models deserves further attention. This work presents a series of advancements in Symbolic Regression, extending its applicability and demonstrating its potential across diverse domains and problem settings.

First, we introduce GSR, a Generalized Symbolic Regression method that redefines the traditional SR optimization problem to discover analytical mappings from the input space to a transformed output space. The proposed GSR approach achieves promising performance compared to existing SR methods across established benchmark datasets, as well as on a more challenging dataset introduced in this study, called SymSet.

Next, we turn to the task of recovering underlying partial differential equations (PDEs) from data using the adjoint method. We begin by considering a family of parameterized PDEs encompassing linear, nonlinear, and spatial derivative candidate terms. We then formulate a PDE-constrained optimization problem that minimizes the error between the PDE solution and the data, and derive the corresponding adjoint equations. We demonstrate the efficacy of the proposed approach in selecting the appropriate candidate terms, thereby discovering the governing PDEs from data. We also compare its performance with a commonly employed method for PDE discovery.

Furthermore, we introduce MESSY Estimation, a Maximum-Entropy based Stochastic and Symbolic densitY estimation method. The proposed approach infers probability density functions symbolically from samples by leveraging the Maximum Entropy Distribution (MED) principle. We highlight three key contributions: (i) the Lagrange multipliers inherent in the MED ansatz can be computed efficiently by solving a linear system of equations, (ii) the density recovery task is enhanced by matching unconventional low-order (symbolic) moments, rather than necessarily matching higher-order (raw) moments, and (iii) the proposed symbolic density estimation framework leads to increased interpretability and better conditioning.

Finally, we introduce ISR, an Invertible Symbolic Regression approach that bridges the concepts of SR and invertible maps. Specifically, ISR combines the principles of Invertible Neural Networks (INNs) and the Equation Learner (EQL), a neural network-based symbolic architecture for function learning. ISR also serves as a symbolic normalizing flow for density estimation tasks. Additionally, we showcase its applicability to inverse problems, including a benchmark inverse kinematics problem and, notably, a geoacoustic inversion problem in oceanography aimed at inferring posterior distributions of underlying seabed parameters from acoustic signals.

The findings of this thesis not only contribute to advancing the field of Symbolic Regression, but also underscore its versatility and potential across various domains. A shift to explicit symbolic models, as demonstrated in this thesis, could unveil hidden patterns within the plethora of datasets available today, offering new insights and directions in the evolving field of machine learning and data analysis.

Date issued

2024-05

URI

https://hdl.handle.net/1721.1/155864

Department

Massachusetts Institute of Technology. Department of Mechanical Engineering; Massachusetts Institute of Technology. Center for Computational Science and Engineering

Publisher

Massachusetts Institute of Technology

Collections

Doctoral Theses