
dc.contributor.author: Ai, Qianxiang
dc.contributor.author: Meng, Fanwang
dc.contributor.author: Shi, Jiale
dc.contributor.author: Pelkie, Brenden
dc.contributor.author: Coley, Connor W
dc.date.accessioned: 2024-11-04T20:37:19Z
dc.date.available: 2024-11-04T20:37:19Z
dc.date.issued: 2024-09-11
dc.identifier.uri: https://hdl.handle.net/1721.1/157469
dc.description.abstract: The popularity of data-driven approaches and machine learning (ML) techniques in organic chemistry and its subfields has increased the value of structured reaction data. Most data in chemistry is represented by unstructured text, and despite the vastness of the organic chemistry literature (papers, patents), conversion from unstructured text to structured data remains a largely manual endeavor. Software tools for this task would facilitate downstream applications such as reaction prediction and condition recommendation. In this study, we fine-tune a large language model (LLM) to extract reaction information from organic synthesis procedure text into structured data following the Open Reaction Database (ORD) schema, a comprehensive data structure designed for organic reactions. The fine-tuned model produces syntactically correct ORD records with an average accuracy of 91.25% for ORD “messages” (e.g., full compound, workups, or condition definitions) and 92.25% for individual data fields (e.g., compound identifiers, mass quantities), with the ability to recognize compound-referencing tokens and to infer reaction roles. We investigate its failure modes and evaluate performance on specific subtasks such as reaction role classification. [en_US]
dc.language.iso: en
dc.publisher: Royal Society of Chemistry [en_US]
dc.relation.isversionof: 10.1039/d4dd00091a [en_US]
dc.rights: Creative Commons Attribution [en_US]
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/ [en_US]
dc.source: Royal Society of Chemistry [en_US]
dc.title: Extracting structured data from organic synthesis procedures using a fine-tuned large language model [en_US]
dc.type: Article [en_US]
dc.identifier.citation: Digital Discovery, 2024, 3, 1822-1831 [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Department of Chemical Engineering [en_US]
dc.relation.journal: Digital Discovery [en_US]
dc.type.uri: http://purl.org/eprint/type/JournalArticle [en_US]
eprint.status: http://purl.org/eprint/status/PeerReviewed [en_US]
dc.date.updated: 2024-11-04T20:27:16Z
dspace.orderedauthors: Ai, Q; Meng, F; Shi, J; Pelkie, B; Coley, CW [en_US]
dspace.date.submission: 2024-11-04T20:27:19Z
mit.journal.volume: 3 [en_US]
mit.journal.issue: 9 [en_US]
mit.license: PUBLISHER_CC
mit.metadata.status: Authority Work and Publication Information Needed [en_US]

