Efficient Prediction of Quantum Chemical Properties with Multitask Gaussian Process Regression

Author(s)

Fisher, Katharine

Advisor

Marzouk, Youssef

Terms of use

In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/

Abstract

Multitask inference offers an efficient way to combine multiple sources of information when training a surrogate model to predict chemical properties. In this thesis, we explore the task of inferring probability distributions on quantities of interest when we have access to a limited amount of highly accurate CCSD(T) data as well as data obtained using a range of approximations to the exchange-correlation functional in density functional theory (DFT). A CCSD(T) calculation can incur a thousand to a million times the computational cost of a DFT calculation, so an inference model that leverages both types of predictions can benefit from the accuracy of CCSD(T) and the relative efficiency of DFT. We specifically focus on inference methods based on Gaussian process (GP) regression. One example of such an approach, the Delta method, uses GP regression to model the difference between two observational data sets, in our case CCSD(T) and DFT. The multitask method, by contrast, poses a regression problem for each observational data set and assumes some relationship between the problems so that all relevant data sets can support the primary regression task. We test the performance of the Delta and multitask methods in predicting the ionization potential of small organic molecules and the interaction energies of water dimers. The Delta method outperforms the multitask approach for data sets where it can be applied, but it requires the CCSD(T) and DFT data sets to cover the same set of molecules, and it needs DFT data for the target molecules to make final predictions. The multitask method can use information from CCSD(T) and DFT data sets that correspond to different molecules and can be applied without any DFT insight into the target molecule. For a given training set generation cost, the multitask method produces more accurate predictions than a GP regression model trained only on CCSD(T) data.
The true training set generation cost may be smaller than the listed cost, since the flexibility of the multitask method allows it to make use of already existing data sets. Additionally, we find that we can increase accuracy at low computational cost by increasing the number of DFT observational data sets used to inform the model. Finally, we consider the accuracy of the variances of the distributions predicted by GP inference methods as uncertainty indicators for the models. Though these indicators can capture uncertainty due to limited data set size and extrapolation, they are not designed to capture the impact of the disparity between modeling assumptions and reality. Future work may seek to better understand and represent this disparity.
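
As an illustration of the Delta method described in the abstract, the following is a minimal sketch, not the thesis implementation. It fits a zero-mean GP to the difference between an accurate and a cheap data source, then adds the cheap value back at the target inputs. Everything specific here is an assumption made for the example: the one-dimensional input, the squared-exponential kernel with fixed hyperparameters, and the functions cheap_model and accurate_model, which are hypothetical stand-ins for DFT and CCSD(T) and do not represent real chemistry data.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.5, variance=1.0):
    """Squared-exponential kernel between two 1-D input arrays (assumed kernel)."""
    diff = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_train, y_train, x_test, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP at x_test."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    L = np.linalg.cholesky(K)
    # Solve (K + noise * I) alpha = y using the Cholesky factor.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha
    var = np.diag(rbf_kernel(x_test, x_test)) - np.sum(v ** 2, axis=0)
    return mean, var


def cheap_model(x):
    """Hypothetical stand-in for a low-cost (DFT-like) prediction."""
    return np.sin(3.0 * x)


def accurate_model(x):
    """Hypothetical stand-in for a high-cost (CCSD(T)-like) prediction."""
    return np.sin(3.0 * x) + 0.3 * x


x_train = np.linspace(0.0, 2.0, 8)   # inputs with both cheap and accurate data
x_test = np.linspace(0.0, 2.0, 50)   # target inputs

# Delta method: regress on the difference, then add the cheap value back.
# Note that this needs cheap data at every training AND target input.
delta_mean, delta_var = gp_posterior(
    x_train, accurate_model(x_train) - cheap_model(x_train), x_test
)
delta_pred = cheap_model(x_test) + delta_mean

# Baseline: a GP trained only on the accurate data.
direct_pred, direct_var = gp_posterior(x_train, accurate_model(x_train), x_test)

truth = accurate_model(x_test)
delta_rmse = float(np.sqrt(np.mean((delta_pred - truth) ** 2)))
direct_rmse = float(np.sqrt(np.mean((direct_pred - truth) ** 2)))
print(f"Delta RMSE:  {delta_rmse:.4g}")
print(f"Direct RMSE: {direct_rmse:.4g}")

# Predictive variance inside the training range (x = 1.0) and far outside it (x = 5.0).
_, var_probe = gp_posterior(x_train, accurate_model(x_train), np.array([1.0, 5.0]))
print(f"Variance at x=1.0: {var_probe[0]:.4g}, at x=5.0: {var_probe[1]:.4g}")
```

The last lines probe the predictive variance at one point inside the training range and one far outside it, which reflects the abstract's remark that GP variances can signal extrapolation. The sketch does not cover the multitask method, which would instead place a joint covariance over inputs and data sources so that the cheap and accurate data need not share the same inputs.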

Date issued

2023-06

URI

https://hdl.handle.net/1721.1/151484

Department

Massachusetts Institute of Technology. Center for Computational Science and Engineering

Publisher

Massachusetts Institute of Technology

Collections

Graduate Theses