Machine Learning Validation via Rational Dataset Sampling with astartes
Author(s)
Burns, Jackson W; Spiekermann, Kevin A; Bhattacharjee, Himaghna; Vlachos, Dionisios G; Green, William H
Publisher with Creative Commons License
Creative Commons Attribution
Terms of use
Abstract
Machine Learning (ML) has become an increasingly popular tool to accelerate traditional
workflows. Critical to the use of ML is the process of splitting datasets into training, validation,
and testing subsets that are used to develop and evaluate models. Common practice in the
literature is to assign these subsets randomly. Although this approach is fast and efficient, it
only measures a model’s capacity to interpolate. Testing errors from random splits may be
overly optimistic when the model is applied to new data that is dissimilar to the training set; thus,
there is a growing need to easily measure performance for extrapolation tasks. To address this
issue, we report astartes, an open-source Python package that implements many similarity- and
distance-based algorithms to partition data into more challenging splits. Separate from
astartes, users can then use these splits to better assess out-of-sample performance with any
ML model of choice. This publication focuses on use cases within cheminformatics. However,
astartes operates on arbitrary vector inputs, so its principles and workflow are generalizable
to other ML domains as well. astartes is available via the Python package managers pip
and conda and is publicly hosted on GitHub (github.com/JacksonBurns/astartes).
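The contrast between interpolative and extrapolative splits described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the astartes API and not one of its samplers: the rule used here (holding out the points farthest from the dataset centroid) and all function names are hypothetical, chosen only to show how a distance-based split places test points farther from the training data than a random split does. The synthetic Gaussian data is likewise an assumption for illustration.

```python
import math
import random


def random_split(n_samples, train_fraction, seed=0):
    """Assign indices to train/test uniformly at random (interpolative)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = round(train_fraction * n_samples)
    return idx[:n_train], idx[n_train:]


def distance_split(X, train_fraction):
    """Hypothetical distance-based split: the points farthest from the
    dataset centroid are held out as the test set (extrapolative)."""
    n = len(X)
    dim = len(X[0])
    centroid = [sum(row[d] for row in X) / n for d in range(dim)]
    # Sort indices from closest to farthest from the centroid.
    order = sorted(range(n), key=lambda i: math.dist(X[i], centroid))
    n_train = round(train_fraction * n)
    return order[:n_train], order[n_train:]


def mean_nearest_train_distance(X, train_idx, test_idx):
    """Average distance from each test point to its nearest training point.
    Larger values mean the test set is less similar to the training set."""
    total = 0.0
    for t in test_idx:
        total += min(math.dist(X[t], X[r]) for r in train_idx)
    return total / len(test_idx)


# Synthetic 2-D feature vectors (illustrative data only).
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]

rand_train, rand_test = random_split(len(X), 0.8)
dist_train, dist_test = distance_split(X, 0.8)

rand_gap = mean_nearest_train_distance(X, rand_train, rand_test)
dist_gap = mean_nearest_train_distance(X, dist_train, dist_test)

print(f"random split:   mean nearest-train distance = {rand_gap:.3f}")
print(f"distance split: mean nearest-train distance = {dist_gap:.3f}")
```

A model evaluated on the second test set is asked to predict in regions it has not seen, so its error is a closer proxy for out-of-sample performance. The index lists returned by either function can be passed to any ML model of choice, in keeping with the separation between splitting and modeling that the abstract describes.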
Date issued
2023-11-05
Department
Massachusetts Institute of Technology. Center for Computational Science and Engineering; Massachusetts Institute of Technology. Department of Chemical Engineering
Journal
Journal of Open Source Software
Publisher
The Open Journal
Citation
Burns et al., (2023). Machine Learning Validation via Rational Dataset Sampling with astartes. Journal of Open Source Software, 8(91), 5996.
Version: Final published version