Leveraging natural language processing to curate the tmCAT, tmPHOTO, tmBIO, and tmSCO datasets of functional transition metal complexes

Kevlishvili, Ilia; St. Michel, Roland G; Garrison, Aaron G; Toney, Jacob W; Adamji, Husain; Jia, Haojun; Román-Leshkov, Yuriy; Kulik, Heather J

dc.contributor.author	Kevlishvili, Ilia
dc.contributor.author	St. Michel, Roland G
dc.contributor.author	Garrison, Aaron G
dc.contributor.author	Toney, Jacob W
dc.contributor.author	Adamji, Husain
dc.contributor.author	Jia, Haojun
dc.contributor.author	Román-Leshkov, Yuriy
dc.contributor.author	Kulik, Heather J
dc.date.accessioned	2024-10-30T20:03:36Z
dc.date.available	2024-10-30T20:03:36Z
dc.date.issued	2024-09-20
dc.identifier.uri	https://hdl.handle.net/1721.1/157447
dc.description.abstract	The breadth of transition metal chemical space covered by databases such as the Cambridge Structural Database and the derived computational database tmQM is not conducive to application-specific modeling and the development of structure–property relationships. Here, we employ both supervised and unsupervised natural language processing (NLP) techniques to link experimentally synthesized compounds in the tmQM database to their respective applications. Leveraging NLP models, we curate four distinct datasets: tmCAT for catalysis, tmPHOTO for photophysical activity, tmBIO for biological relevance, and tmSCO for magnetism. Analyzing the chemical substructures within each dataset reveals common chemical motifs in each of the designated applications. We then use these common chemical structures to augment our initial datasets for each application, yielding a total of 21 631 compounds in tmCAT, 4599 in tmPHOTO, 2782 in tmBIO, and 983 in tmSCO. These datasets are expected to accelerate the more targeted computational screening and development of refined structure–property relationships with machine learning.	en_US
dc.language.iso	en
dc.publisher	Royal Society of Chemistry	en_US
dc.relation.isversionof	10.1039/d4fd00087k	en_US
dc.rights	Creative Commons Attribution-Noncommercial	en_US
dc.rights.uri	http://creativecommons.org/licenses/by-nc/4.0/	en_US
dc.source	Royal Society of Chemistry	en_US
dc.title	Leveraging natural language processing to curate the tmCAT, tmPHOTO, tmBIO, and tmSCO datasets of functional transition metal complexes	en_US
dc.type	Article	en_US
dc.identifier.citation	Kevlishvili, Ilia, St. Michel, Roland G, Garrison, Aaron G, Toney, Jacob W, Adamji, Husain et al. 2024. "Leveraging natural language processing to curate the tmCAT, tmPHOTO, tmBIO, and tmSCO datasets of functional transition metal complexes." Faraday Discussions.
dc.contributor.department	Massachusetts Institute of Technology. Department of Chemical Engineering	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Materials Science and Engineering	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Chemistry	en_US
dc.relation.journal	Faraday Discussions	en_US
dc.eprint.version	Final published version	en_US
dc.type.uri	http://purl.org/eprint/type/JournalArticle	en_US
eprint.status	http://purl.org/eprint/status/PeerReviewed	en_US
dc.date.updated	2024-10-30T19:53:46Z
dspace.orderedauthors	Kevlishvili, I; St. Michel, RG; Garrison, AG; Toney, JW; Adamji, H; Jia, H; Román-Leshkov, Y; Kulik, HJ	en_US
dspace.date.submission	2024-10-30T19:53:52Z
mit.license	PUBLISHER_CC
mit.metadata.status	Authority Work and Publication Information Needed	en_US

Files in this item

Name: d4fd00087k.pdf
Size: 2.703Mb
Format: PDF
Description: Published version


This item appears in the following Collection(s)

MIT Open Access Articles
