Characterizing Uncertainty in Machine Learning for Chemistry

Heid, Esther; McGill, Charles J; Vermeire, Florence H; Green, William H

dc.contributor.author	Heid, Esther
dc.contributor.author	McGill, Charles J
dc.contributor.author	Vermeire, Florence H
dc.contributor.author	Green, William H
dc.date.accessioned	2025-07-08T19:28:01Z
dc.date.available	2025-07-08T19:28:01Z
dc.date.issued	2023-06-20
dc.identifier.uri	https://hdl.handle.net/1721.1/159977
dc.description.abstract	Characterizing uncertainty in machine learning models has recently gained interest in the context of machine learning reliability, robustness, safety, and active learning. Here, we separate the total uncertainty into contributions from noise in the data (aleatoric) and shortcomings of the model (epistemic), further dividing epistemic uncertainty into model bias and variance contributions. We systematically address the influence of noise, model bias, and model variance in the context of chemical property predictions, where the diverse nature of target properties and the vast chemical space give rise to many distinct sources of prediction error. We demonstrate that different sources of error can each be significant in different contexts and must be individually addressed during model development. Through controlled experiments on data sets of molecular properties, we show important trends in model performance associated with the level of noise in the data set, size of the data set, model architecture, molecule representation, ensemble size, and data set splitting. In particular, we show that 1) noise in the test set can limit a model's observed performance when the actual performance is much better, 2) using size-extensive model aggregation structures is crucial for extensive property prediction, and 3) ensembling is a reliable tool for uncertainty quantification and improvement specifically for the contribution of model variance. We develop general guidelines on how to improve an underperforming model when falling into different uncertainty contexts.	en_US
dc.language.iso	en
dc.publisher	American Chemical Society	en_US
dc.relation.isversionof	10.1021/acs.jcim.3c00373	en_US
dc.rights	Creative Commons Attribution	en_US
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/	en_US
dc.source	American Chemical Society	en_US
dc.title	Characterizing Uncertainty in Machine Learning for Chemistry	en_US
dc.type	Article	en_US
dc.identifier.citation	Heid, Esther, McGill, Charles J, Vermeire, Florence H and Green, William H. 2023. "Characterizing Uncertainty in Machine Learning for Chemistry." Journal of Chemical Information and Modeling, 63 (13).
dc.contributor.department	Massachusetts Institute of Technology. Department of Chemical Engineering	en_US
dc.relation.journal	Journal of Chemical Information and Modeling	en_US
dc.eprint.version	Final published version	en_US
dc.type.uri	http://purl.org/eprint/type/JournalArticle	en_US
eprint.status	http://purl.org/eprint/status/PeerReviewed	en_US
dc.date.updated	2025-07-08T19:02:11Z
dspace.orderedauthors	Heid, E; McGill, CJ; Vermeire, FH; Green, WH	en_US
dspace.date.submission	2025-07-08T19:02:13Z
mit.journal.volume	63	en_US
mit.journal.issue	13	en_US
mit.license	PUBLISHER_CC
mit.metadata.status	Authority Work and Publication Information Needed	en_US

Files in this item

Name:: heid-et-al-2023-characterizing ...
Size:: 3.709Mb
Format:: PDF
Description:: Published version

This item appears in the following Collection(s)

MIT Open Access Articles
