Automated patent extraction powers generative modeling in focused chemical spaces

Subramanian, Akshay; P. Greenman, Kevin; Gervaix, Alexis; Yang, Tzuhsiung; Gómez-Bombarelli, Rafael

dc.contributor.author	Subramanian, Akshay
dc.contributor.author	P. Greenman, Kevin
dc.contributor.author	Gervaix, Alexis
dc.contributor.author	Yang, Tzuhsiung
dc.contributor.author	Gómez-Bombarelli, Rafael
dc.date.accessioned	2024-09-20T18:31:33Z
dc.date.available	2024-09-20T18:31:33Z
dc.date.issued	2023
dc.identifier.uri	https://hdl.handle.net/1721.1/156922
dc.description.abstract	Deep generative models have emerged as an exciting avenue for inverse molecular design, with progress coming from the interplay between training algorithms and molecular representations. One of the key challenges in their applicability to materials science and chemistry has been the lack of access to sizeable training datasets with property labels. Published patents contain the first disclosure of new materials prior to their publication in journals, and are a vast source of scientific knowledge that has remained relatively untapped in the field of data-driven molecular design. Because patents are filed seeking to protect specific uses, molecules in patents can be considered to be weakly labeled into application classes. Furthermore, patents published by the US Patent and Trademark Office (USPTO) are downloadable and have machine-readable text and molecular structures. In this work, we train domain-specific generative models using patent data sources by developing an automated pipeline to go from USPTO patent digital files to the generation of novel candidates with minimal human intervention. We test the approach on two in-class extracted datasets, one in organic electronics and another in tyrosine kinase inhibitors. We then evaluate the ability of generative models trained on these in-class datasets on two categories of tasks (distribution learning and property optimization), identify strengths and limitations, and suggest possible explanations and remedies that could be used to overcome these in practice.	en_US
dc.language.iso	en
dc.publisher	Royal Society of Chemistry	en_US
dc.relation.isversionof	10.1039/d3dd00041a	en_US
dc.rights	Creative Commons Attribution-NonCommercial 3.0 Unported	en_US
dc.rights.uri	http://creativecommons.org/licenses/by-nc/3.0/	en_US
dc.source	Royal Society of Chemistry	en_US
dc.title	Automated patent extraction powers generative modeling in focused chemical spaces	en_US
dc.type	Article	en_US
dc.identifier.citation	Digital Discovery, 2023, 2, 1006-1015	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Materials Science and Engineering	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Chemical Engineering	en_US
dc.relation.journal	Digital Discovery	en_US
dc.eprint.version	Final published version	en_US
dc.type.uri	http://purl.org/eprint/type/JournalArticle	en_US
eprint.status	http://purl.org/eprint/status/PeerReviewed	en_US
dc.date.updated	2024-09-20T18:16:38Z
dspace.orderedauthors	Subramanian, A; P. Greenman, K; Gervaix, A; Yang, T; Gómez-Bombarelli, R	en_US
dspace.date.submission	2024-09-20T18:16:40Z
mit.journal.volume	2	en_US
mit.journal.issue	4	en_US
mit.license	PUBLISHER_CC
mit.metadata.status	Authority Work and Publication Information Needed	en_US

Files in this item

Name: d3dd00041a.pdf
Size: 1.250Mb
Format: PDF
Description: Published version

This item appears in the following Collection(s)

MIT Open Access Articles
