Show simple item record

dc.contributor.author: Kevlishvili, Ilia
dc.contributor.author: St. Michel, Roland G
dc.contributor.author: Garrison, Aaron G
dc.contributor.author: Toney, Jacob W
dc.contributor.author: Adamji, Husain
dc.contributor.author: Jia, Haojun
dc.contributor.author: Román-Leshkov, Yuriy
dc.contributor.author: Kulik, Heather J
dc.date.accessioned: 2024-10-30T20:03:36Z
dc.date.available: 2024-10-30T20:03:36Z
dc.date.issued: 2024-09-20
dc.identifier.uri: https://hdl.handle.net/1721.1/157447
dc.description.abstract: The breadth of transition metal chemical space covered by databases such as the Cambridge Structural Database and the derived computational database tmQM is not conducive to application-specific modeling and the development of structure–property relationships. Here, we employ both supervised and unsupervised natural language processing (NLP) techniques to link experimentally synthesized compounds in the tmQM database to their respective applications. Leveraging NLP models, we curate four distinct datasets: tmCAT for catalysis, tmPHOTO for photophysical activity, tmBIO for biological relevance, and tmSCO for magnetism. Analyzing the chemical substructures within each dataset reveals common chemical motifs in each of the designated applications. We then use these common chemical structures to augment our initial datasets for each application, yielding a total of 21 631 compounds in tmCAT, 4599 in tmPHOTO, 2782 in tmBIO, and 983 in tmSCO. These datasets are expected to accelerate the more targeted computational screening and development of refined structure–property relationships with machine learning. [en_US]
dc.language.iso: en
dc.publisher: Royal Society of Chemistry [en_US]
dc.relation.isversionof: 10.1039/d4fd00087k [en_US]
dc.rights: Creative Commons Attribution-Noncommercial [en_US]
dc.rights.uri: http://creativecommons.org/licenses/by-nc/4.0/ [en_US]
dc.source: Royal Society of Chemistry [en_US]
dc.title: Leveraging natural language processing to curate the tmCAT, tmPHOTO, tmBIO, and tmSCO datasets of functional transition metal complexes [en_US]
dc.type: Article [en_US]
dc.identifier.citation: Kevlishvili, Ilia, St. Michel, Roland G, Garrison, Aaron G, Toney, Jacob W, Adamji, Husain et al. 2024. "Leveraging natural language processing to curate the tmCAT, tmPHOTO, tmBIO, and tmSCO datasets of functional transition metal complexes." Faraday Discussions.
dc.contributor.department: Massachusetts Institute of Technology. Department of Chemical Engineering [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Department of Materials Science and Engineering [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Department of Chemistry [en_US]
dc.relation.journal: Faraday Discussions [en_US]
dc.eprint.version: Final published version [en_US]
dc.type.uri: http://purl.org/eprint/type/JournalArticle [en_US]
eprint.status: http://purl.org/eprint/status/PeerReviewed [en_US]
dc.date.updated: 2024-10-30T19:53:46Z
dspace.orderedauthors: Kevlishvili, I; St. Michel, RG; Garrison, AG; Toney, JW; Adamji, H; Jia, H; Román-Leshkov, Y; Kulik, HJ [en_US]
dspace.date.submission: 2024-10-30T19:53:52Z
mit.license: PUBLISHER_CC
mit.metadata.status: Authority Work and Publication Information Needed [en_US]

