dc.contributor.author: Burns, Jackson W
dc.contributor.author: Spiekermann, Kevin A
dc.contributor.author: Bhattacharjee, Himaghna
dc.contributor.author: Vlachos, Dionisios G
dc.contributor.author: Green, William H
dc.date.accessioned: 2025-07-08T18:44:53Z
dc.date.available: 2025-07-08T18:44:53Z
dc.date.issued: 2023-11-05
dc.identifier.uri: https://hdl.handle.net/1721.1/159975
dc.description.abstract: Machine Learning (ML) has become an increasingly popular tool to accelerate traditional workflows. Critical to the use of ML is the process of splitting datasets into training, validation, and testing subsets that are used to develop and evaluate models. Common practice in the literature is to assign these subsets randomly. Although this approach is fast and efficient, it only measures a model’s capacity to interpolate. Testing errors from random splits may be overly optimistic if given new data that is dissimilar to the scope of the training set; thus, there is a growing need to easily measure performance for extrapolation tasks. To address this issue, we report astartes, an open-source Python package that implements many similarity- and distance-based algorithms to partition data into more challenging splits. Separate from astartes, users can then use these splits to better assess out-of-sample performance with any ML model of choice. This publication focuses on use cases within cheminformatics. However, astartes operates on arbitrary vector inputs, so its principles and workflow are generalizable to other ML domains as well. astartes is available via the Python package managers pip and conda and is publicly hosted on GitHub (github.com/JacksonBurns/astartes). [en_US]
dc.language.iso: en
dc.publisher: The Open Journal [en_US]
dc.relation.isversionof: 10.21105/joss.05996 [en_US]
dc.rights: Creative Commons Attribution [en_US]
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/ [en_US]
dc.source: The Open Journal [en_US]
dc.title: Machine Learning Validation via Rational Dataset Sampling with astartes [en_US]
dc.type: Article [en_US]
dc.identifier.citation: Burns et al., (2023). Machine Learning Validation via Rational Dataset Sampling with astartes. Journal of Open Source Software, 8(91), 5996. [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Center for Computational Science and Engineering [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Department of Chemical Engineering [en_US]
dc.relation.journal: Journal of Open Source Software [en_US]
dc.eprint.version: Final published version [en_US]
dc.type.uri: http://purl.org/eprint/type/JournalArticle [en_US]
eprint.status: http://purl.org/eprint/status/PeerReviewed [en_US]
dc.date.updated: 2025-07-08T18:35:05Z
dspace.orderedauthors: Burns, JW; Spiekermann, KA; Bhattacharjee, H; Vlachos, DG; Green, WH [en_US]
dspace.date.submission: 2025-07-08T18:35:06Z
mit.journal.volume: 8 [en_US]
mit.journal.issue: 91 [en_US]
mit.license: PUBLISHER_CC
mit.metadata.status: Authority Work and Publication Information Needed [en_US]


