dc.contributor.author: Champenois, Bianca
dc.contributor.author: Sapsis, Themistoklis P.
dc.date.accessioned: 2025-11-17T16:30:27Z
dc.date.available: 2025-11-17T16:30:27Z
dc.date.issued: 2025-09-22
dc.identifier.uri: https://hdl.handle.net/1721.1/163673
dc.description.abstract: The growing availability of large datasets that describe complex dynamical systems, such as climate models and turbulence simulations, has made machine learning an increasingly popular tool for modeling and analysis, but the inherent low representation of extreme events poses a major challenge for model accuracy in the tails of the distribution. This raises a fundamental question: Given a large dataset, which data points should we use to train machine learning models that effectively learn extremes? To address this question, we study a likelihood-weighted active data selection framework that identifies the most informative data points for model training. The framework improves predictions of extreme values of a target observable, scales to high-dimensional systems, and is model-agnostic. Unlike traditional active learning, which assumes the ability to query new data, our method is designed for problems where the dataset is fixed but vast, focusing on selection rather than acquisition. Points are scored using a likelihood-weighted uncertainty sampling criterion that prioritizes samples expected to reduce model uncertainty and improve predictions in the tails of the distribution for systems with non-Gaussian statistics. When applied to a machine learning climate model with input dimensionality on the order of tens of thousands, we find that the likelihood-weighted active data selection algorithm most accurately captures the statistics of extreme events using only a fraction of the original dataset. We also introduce analysis techniques to further interpret the optimally selected points. Looking ahead, the approach can serve as a compression algorithm that preserves information associated with extreme events in vast datasets. [en_US]
dc.publisher: Springer Netherlands [en_US]
dc.relation.isversionof: https://doi.org/10.1007/s11071-025-11825-6 [en_US]
dc.rights: Creative Commons Attribution [en_US]
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/ [en_US]
dc.source: Springer Netherlands [en_US]
dc.title: What are the most informative data points for predicting extreme events? [en_US]
dc.type: Article [en_US]
dc.identifier.citation: Champenois, B., Sapsis, T.P. What are the most informative data points for predicting extreme events?. Nonlinear Dyn 113, 34167–34189 (2025). [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Department of Mechanical Engineering [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Center for Computational Science and Engineering [en_US]
dc.relation.journal: Nonlinear Dynamics [en_US]
dc.identifier.mitlicense: PUBLISHER_CC
dc.eprint.version: Final published version [en_US]
dc.type.uri: http://purl.org/eprint/type/JournalArticle [en_US]
eprint.status: http://purl.org/eprint/status/PeerReviewed [en_US]
dc.date.updated: 2025-11-16T04:43:52Z
dc.language.rfc3066: en
dc.rights.holder: The Author(s)
dspace.embargo.terms: N
dspace.date.submission: 2025-11-16T04:43:52Z
mit.journal.volume: 113 [en_US]
mit.license: PUBLISHER_CC
mit.metadata.status: Authority Work and Publication Information Needed [en_US]

