| dc.contributor.author | Champenois, Bianca | |
| dc.contributor.author | Sapsis, Themistoklis P. | |
| dc.date.accessioned | 2025-11-17T16:30:27Z | |
| dc.date.available | 2025-11-17T16:30:27Z | |
| dc.date.issued | 2025-09-22 | |
| dc.identifier.uri | https://hdl.handle.net/1721.1/163673 | |
| dc.description.abstract | The growing availability of large datasets that describe complex dynamical systems, such as climate models and turbulence simulations, has made machine learning an increasingly popular tool for modeling and analysis, but the inherent low representation of extreme events poses a major challenge for model accuracy in the tails of the distribution. This raises a fundamental question: Given a large dataset, which data points should we use to train machine learning models that effectively learn extremes? To address this question, we study a likelihood-weighted active data selection framework that identifies the most informative data points for model training. The framework improves predictions of extreme values of a target observable, scales to high-dimensional systems, and is model-agnostic. Unlike traditional active learning, which assumes the ability to query new data, our method is designed for problems where the dataset is fixed but vast, focusing on selection rather than acquisition. Points are scored using a likelihood-weighted uncertainty sampling criterion that prioritizes samples expected to reduce model uncertainty and improve predictions in the tails of the distribution for systems with non-Gaussian statistics. When applied to a machine learning climate model with input dimensionality on the order of tens of thousands, we find that the likelihood-weighted active data selection algorithm most accurately captures the statistics of extreme events using only a fraction of the original dataset. We also introduce analysis techniques to further interpret the optimally selected points. Looking ahead, the approach can serve as a compression algorithm that preserves information associated with extreme events in vast datasets. | en_US |
| dc.publisher | Springer Netherlands | en_US |
| dc.relation.isversionof | https://doi.org/10.1007/s11071-025-11825-6 | en_US |
| dc.rights | Creative Commons Attribution | en_US |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | en_US |
| dc.source | Springer Netherlands | en_US |
| dc.title | What are the most informative data points for predicting extreme events? | en_US |
| dc.type | Article | en_US |
| dc.identifier.citation | Champenois, B., Sapsis, T.P. What are the most informative data points for predicting extreme events? Nonlinear Dyn 113, 34167–34189 (2025). | en_US |
| dc.contributor.department | Massachusetts Institute of Technology. Department of Mechanical Engineering | en_US |
| dc.contributor.department | Massachusetts Institute of Technology. Center for Computational Science and Engineering | en_US |
| dc.relation.journal | Nonlinear Dynamics | en_US |
| dc.identifier.mitlicense | PUBLISHER_CC | |
| dc.eprint.version | Final published version | en_US |
| dc.type.uri | http://purl.org/eprint/type/JournalArticle | en_US |
| eprint.status | http://purl.org/eprint/status/PeerReviewed | en_US |
| dc.date.updated | 2025-11-16T04:43:52Z | |
| dc.language.rfc3066 | en | |
| dc.rights.holder | The Author(s) | |
| dspace.embargo.terms | N | |
| dspace.date.submission | 2025-11-16T04:43:52Z | |
| mit.journal.volume | 113 | en_US |
| mit.license | PUBLISHER_CC | |
| mit.metadata.status | Authority Work and Publication Information Needed | en_US |