What are the most informative data points for predicting extreme events?
Author(s)
Champenois, Bianca; Sapsis, Themistoklis P.
Download: 11071_2025_Article_11825.pdf (6.011Mb)
Publisher with Creative Commons License
Creative Commons Attribution
Terms of use
Abstract
The growing availability of large datasets that describe complex dynamical systems, such as climate models and turbulence simulations, has made machine learning an increasingly popular tool for modeling and analysis, but the inherent low representation of extreme events poses a major challenge for model accuracy in the tails of the distribution. This raises a fundamental question: Given a large dataset, which data points should we use to train machine learning models that effectively learn extremes? To address this question, we study a likelihood-weighted active data selection framework that identifies the most informative data points for model training. The framework improves predictions of extreme values of a target observable, scales to high-dimensional systems, and is model-agnostic. Unlike traditional active learning, which assumes the ability to query new data, our method is designed for problems where the dataset is fixed but vast, focusing on selection rather than acquisition. Points are scored using a likelihood-weighted uncertainty sampling criterion that prioritizes samples expected to reduce model uncertainty and improve predictions in the tails of the distribution for systems with non-Gaussian statistics. When applied to a machine learning climate model with input dimensionality on the order of tens of thousands, we find that the likelihood-weighted active data selection algorithm most accurately captures the statistics of extreme events using only a fraction of the original dataset. We also introduce analysis techniques to further interpret the optimally selected points. Looking ahead, the approach can serve as a compression algorithm that preserves information associated with extreme events in vast datasets.
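The scoring step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the criterion's formula, so the form used here is an assumption borrowed from the likelihood-weighted (output-weighted) sampling literature: score = predictive variance × input density / density of the predicted output. The function names, the toy data, and the hand-rolled kernel density estimate are all hypothetical and are not taken from the paper.

```python
import numpy as np


def gaussian_kde_1d(samples, query):
    """Gaussian kernel density estimate of 1-D samples, evaluated at query."""
    samples = np.asarray(samples, dtype=float)
    query = np.asarray(query, dtype=float)
    # Normal-reference rule of thumb for the bandwidth (an illustrative choice).
    bandwidth = 1.06 * samples.std() * samples.size ** (-0.2)
    z = (query[:, None] - samples[None, :]) / bandwidth
    norm = samples.size * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm


def likelihood_weighted_scores(pred_mean, pred_var, input_density, eps=1e-12):
    """Assumed criterion: var(x) * p_x(x) / p_mu(mu(x)).

    pred_mean, pred_var: model prediction and predictive variance per candidate.
    input_density: density of each candidate input (supplied by the caller).
    """
    pred_mean = np.asarray(pred_mean, dtype=float)
    # Density of the predicted outputs; rare (tail) predictions get a small
    # density and therefore a large weight.
    p_mu = gaussian_kde_1d(pred_mean, pred_mean)
    return np.asarray(pred_var) * np.asarray(input_density) / (p_mu + eps)


def select_points(pred_mean, pred_var, input_density, n_select):
    """Indices of the n_select highest-scoring candidates, best first."""
    scores = likelihood_weighted_scores(pred_mean, pred_var, input_density)
    return np.argsort(scores)[::-1][:n_select]


if __name__ == "__main__":
    # Toy candidates: four predictions near zero and one in the tail.
    means = np.array([0.0, 0.1, -0.1, 0.05, 5.0])
    variances = np.ones(5)   # equal model uncertainty
    densities = np.ones(5)   # equal input likelihood
    print(select_points(means, variances, densities, n_select=2))
```

With equal uncertainty and equal input density, the candidate whose prediction lies in the tail receives the largest weight and is ranked first. In a fixed-dataset setting such as the one the abstract describes, a loop of this kind would rescore the remaining pool after each retraining rather than query new data.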
Date issued
2025-09-22
Department
Massachusetts Institute of Technology. Department of Mechanical Engineering; Massachusetts Institute of Technology. Center for Computational Science and Engineering
Journal
Nonlinear Dynamics
Publisher
Springer Netherlands
Citation
Champenois, B., Sapsis, T.P. What are the most informative data points for predicting extreme events?. Nonlinear Dyn 113, 34167–34189 (2025).
Version: Final published version