Generative pretrained autoregressive transformer graph neural network applied to the analysis and discovery of novel proteins

Buehler, Markus J

dc.contributor.author	Buehler, Markus J
dc.date.accessioned	2024-09-18T18:33:59Z
dc.date.available	2024-09-18T18:33:59Z
dc.date.issued	2023-08-28
dc.identifier.uri	https://hdl.handle.net/1721.1/156895
dc.description.abstract	We report a flexible language-model-based deep learning strategy, applied here to solve complex forward and inverse problems in protein modeling, based on an attention neural network that integrates transformer and graph convolutional architectures in a causal multi-headed graph mechanism, to realize a generative pretrained model. The model is applied to predict the secondary structure content (per-residue level and overall content), protein solubility, and sequencing tasks. Further trained on inverse tasks, the model is rendered capable of designing proteins with these properties as target features. The model is formulated as a general framework, completely prompt-based, and can be adapted for a variety of downstream tasks. We find that adding further tasks yields emergent synergies that the model exploits in improving overall performance, beyond what would be possible by training a model on each dataset alone. Case studies are presented to validate the method, yielding protein designs specifically focused on structural materials, but also exploring the applicability in the design of soluble, antimicrobial biomaterials. While our model is trained to ultimately perform eight distinct tasks, with available datasets, it can be extended to solve additional problems. In a broader sense, this study illustrates a form of multiscale modeling that relates a set of ultimate building blocks (here, byte-level UTF-8 characters that define the nature of the physical system at hand) to complex output.
This materiomic scheme captures complex emergent relationships between universal building blocks and resulting properties, via a synergizing learning capacity, to express a set of potentialities embedded in the knowledge used in training via the interplay of universality and diversity. Significance statement: Predicting the properties of materials based on a flexible description of their structure, environment, or process, is a long-standing challenge in multiscale modeling. Our MaterioFormer language model, trained to solve forward and inverse tasks, incorporates a deep learning capacity through attention and graph strategies to yield a multimodal approach to model and design materials. Since our model is prompt-based and information is encoded consistently via byte-level UTF-8 tokenization, it can process diverse modalities of information, such as sequence data, description of tasks, and numbers, and offers a flexible workflow that integrates human intelligence and artificial intelligence. Autoregressive training, using pre-training against a large unlabeled dataset, allows for straightforward adjustment of specific objectives.	en_US
dc.language.iso	en
dc.publisher	AIP Publishing	en_US
dc.relation.isversionof	10.1063/5.0157367	en_US
dc.rights	Creative Commons Attribution	en_US
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/	en_US
dc.source	AIP Publishing	en_US
dc.title	Generative pretrained autoregressive transformer graph neural network applied to the analysis and discovery of novel proteins	en_US
dc.type	Article	en_US
dc.identifier.citation	Markus J. Buehler; Generative pretrained autoregressive transformer graph neural network applied to the analysis and discovery of novel proteins. J. Appl. Phys. 28 August 2023; 134 (8): 084902.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Laboratory for Atomistic and Molecular Mechanics	en_US
dc.contributor.department	Massachusetts Institute of Technology. Center for Computational Science and Engineering	en_US
dc.relation.journal	Journal of Applied Physics	en_US
dc.eprint.version	Final published version	en_US
dc.type.uri	http://purl.org/eprint/type/JournalArticle	en_US
eprint.status	http://purl.org/eprint/status/PeerReviewed	en_US
dc.date.updated	2024-09-18T18:23:06Z
dspace.orderedauthors	Buehler, MJ	en_US
dspace.date.submission	2024-09-18T18:23:09Z
mit.journal.volume	134	en_US
mit.journal.issue	8	en_US
mit.license	PUBLISHER_CC
mit.metadata.status	Authority Work and Publication Information Needed	en_US
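
Illustrative note on the abstract: the abstract states that all inputs (sequence data, task descriptions, and numbers) are encoded through byte-level UTF-8 tokenization. The following is a minimal sketch of what such an encoding could look like, not the author's implementation. The function names and the prompt layout are hypothetical and chosen only for illustration; the only fact taken from the record is that tokens correspond to UTF-8 bytes.

```python
from typing import List


def encode_bytes(text: str) -> List[int]:
    """Map a string to byte-level token ids (0..255), one id per UTF-8 byte."""
    return list(text.encode("utf-8"))


def decode_bytes(token_ids: List[int]) -> str:
    """Invert encode_bytes: rebuild the string from its UTF-8 byte ids."""
    return bytes(token_ids).decode("utf-8")


# Hypothetical prompt mixing a task description, a sequence, and a number.
# The layout is an assumption for illustration, not the paper's prompt format.
prompt = "task: solubility | sequence: MKTAYIAKQR | value: 0.42"

tokens = encode_bytes(prompt)

# Every token id fits in one byte, so the vocabulary has at most 256 entries.
assert all(0 <= t <= 255 for t in tokens)

# The encoding is lossless: decoding recovers the original prompt.
assert decode_bytes(tokens) == prompt
```

Because text, amino-acid letters, and digits all reduce to the same small byte vocabulary, a single model input stream can carry every modality the abstract lists without a task-specific tokenizer.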

Files in this item

Name: 084902_1_5.0157367.pdf
Size: 4.900Mb
Format: PDF
Description: Published version

This item appears in the following Collection(s)

MIT Open Access Articles