What are the most informative data points for predicting extreme events?

Champenois, Bianca; Sapsis, Themistoklis P.

dc.contributor.author	Champenois, Bianca
dc.contributor.author	Sapsis, Themistoklis P.
dc.date.accessioned	2025-11-17T16:30:27Z
dc.date.available	2025-11-17T16:30:27Z
dc.date.issued	2025-09-22
dc.identifier.uri	https://hdl.handle.net/1721.1/163673
dc.description.abstract	The growing availability of large datasets that describe complex dynamical systems, such as climate models and turbulence simulations, has made machine learning an increasingly popular tool for modeling and analysis, but the inherent low representation of extreme events poses a major challenge for model accuracy in the tails of the distribution. This raises a fundamental question: Given a large dataset, which data points should we use to train machine learning models that effectively learn extremes? To address this question, we study a likelihood-weighted active data selection framework that identifies the most informative data points for model training. The framework improves predictions of extreme values of a target observable, scales to high-dimensional systems, and is model-agnostic. Unlike traditional active learning, which assumes the ability to query new data, our method is designed for problems where the dataset is fixed but vast, focusing on selection rather than acquisition. Points are scored using a likelihood-weighted uncertainty sampling criterion that prioritizes samples expected to reduce model uncertainty and improve predictions in the tails of the distribution for systems with non-Gaussian statistics. When applied to a machine learning climate model with input dimensionality on the order of tens of thousands, we find that the likelihood-weighted active data selection algorithm most accurately captures the statistics of extreme events using only a fraction of the original dataset. We also introduce analysis techniques to further interpret the optimally selected points. Looking ahead, the approach can serve as a compression algorithm that preserves information associated with extreme events in vast datasets.	en_US
dc.publisher	Springer Netherlands	en_US
dc.relation.isversionof	https://doi.org/10.1007/s11071-025-11825-6	en_US
dc.rights	Creative Commons Attribution	en_US
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/	en_US
dc.source	Springer Netherlands	en_US
dc.title	What are the most informative data points for predicting extreme events?	en_US
dc.type	Article	en_US
dc.identifier.citation	Champenois, B., Sapsis, T.P. What are the most informative data points for predicting extreme events?. Nonlinear Dyn 113, 34167–34189 (2025).	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Mechanical Engineering	en_US
dc.contributor.department	Massachusetts Institute of Technology. Center for Computational Science and Engineering	en_US
dc.relation.journal	Nonlinear Dynamics	en_US
dc.identifier.mitlicense	PUBLISHER_CC
dc.eprint.version	Final published version	en_US
dc.type.uri	http://purl.org/eprint/type/JournalArticle	en_US
eprint.status	http://purl.org/eprint/status/PeerReviewed	en_US
dc.date.updated	2025-11-16T04:43:52Z
dc.language.rfc3066	en
dc.rights.holder	The Author(s)
dspace.embargo.terms	N
dspace.date.submission	2025-11-16T04:43:52Z
mit.journal.volume	113	en_US
mit.license	PUBLISHER_CC
mit.metadata.status	Authority Work and Publication Information Needed	en_US
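
The abstract above describes scoring the points of a fixed dataset with a likelihood-weighted uncertainty sampling criterion. The following is a minimal sketch of one way such a score could be computed, not the authors' implementation. It assumes the score is the model's predictive variance multiplied by the ratio of the input density to the density of the predicted output, a form common in the likelihood-weighted sampling literature but not stated in this record. The function names (`gaussian_kde_at_samples`, `likelihood_weighted_scores`, `select_top_k`), the kernel density estimate, and the bandwidth rule are all illustrative assumptions.

```python
import numpy as np


def gaussian_kde_at_samples(values, bandwidth=None):
    """Evaluate a 1-D Gaussian kernel density estimate at its own sample points.

    Builds an n-by-n matrix, so it is only suitable for modest pool sizes.
    """
    values = np.asarray(values, dtype=float)
    n = values.size
    if bandwidth is None:
        # Silverman-style rule of thumb: an illustrative default, not a tuned choice.
        bandwidth = 1.06 * values.std() * n ** (-1.0 / 5.0)
    bandwidth = max(bandwidth, 1e-12)  # guard against a constant sample
    diffs = (values[:, None] - values[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernel.mean(axis=1)


def likelihood_weighted_scores(pred_mean, pred_var, input_density=None):
    """Score each candidate as predictive variance times a likelihood ratio.

    ASSUMPTION: weight = input_density / density of the predicted output.
    Rare predicted outputs have low output density and so receive large weights.
    """
    pred_mean = np.asarray(pred_mean, dtype=float)
    pred_var = np.asarray(pred_var, dtype=float)
    if input_density is None:
        # ASSUMPTION: treat the input density as constant when none is supplied.
        input_density = np.ones_like(pred_mean)
    output_density = gaussian_kde_at_samples(pred_mean)
    weights = np.asarray(input_density, dtype=float) / output_density
    return pred_var * weights


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring candidates, best first."""
    order = np.argsort(scores)[::-1]
    return order[:k]


# Toy pool: 50 ordinary predictions plus one extreme prediction at index 50.
pred_mean = np.concatenate([np.linspace(-1.0, 1.0, 50), [8.0]])
pred_var = np.ones_like(pred_mean)  # equal uncertainty everywhere
scores = likelihood_weighted_scores(pred_mean, pred_var)
chosen = select_top_k(scores, 5)
```

In this toy pool every candidate has the same predictive variance, so the ranking is driven entirely by the weight, and the candidate with the rare predicted output is ranked first. In practice the predictive mean and variance would come from whatever surrogate model is being trained, which is consistent with the abstract's description of the framework as model-agnostic.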

Files in this item

Name: 11071_2025_Article_11825.pdf
Size: 6.011Mb
Format: PDF


This item appears in the following Collection(s)

MIT Open Access Articles
