Data-mining natural language materials syntheses
Author(s)Kim, Edward Soo.
Massachusetts Institute of Technology. Department of Materials Science and Engineering.
MetadataShow full item record
Discovering, designing, and developing a novel material is an arduous task, involving countless hours of human effort and ingenuity. While some aspects of this process have been vastly accelerated by the advent of first-principles-based computational techniques and high throughput experimental methods, a vast ocean of untapped historical knowledge lies dormant in the scientific literature. Namely, the precise methods by which many inorganic compounds are synthesized are recorded only as text within journal articles. This thesis aims to realize the potential of this data for informing the syntheses of inorganic materials through the use of data-mining algorithms. Critically, the methods used and produced in this thesis are fully automated, thus maximizing the impact for accelerated synthesis planning by human researchers.There are three primary objectives of this thesis: 1) aggregate and codify synthesis knowledge contained within scientific literature, 2) identify synthesis "driving factors" for different synthesis outcomes (e.g., phase selection) and 3) autonomously learn synthesis hypotheses from the literature and extend these hypotheses to predicted syntheses for novel materials. Towards the first goal of this thesis, a pipeline of algorithms is developed in order to extract and codify materials synthesis information from journal articles into a structured, machine readable format, analogous to existing databases for materials structures and properties. To efficiently guide the extraction of materials data, this pipeline leverages domain knowledge regarding the allowable relations between different types of information (e.g., concentrations often correspond to solutions).Both unsupervised and supervised machine learning algorithms are also used to rapidly extract synthesis information from the literature. To examine the autonomous learning of driving factors for morphology selection during hydrothermal syntheses, TiO₂ nanotube formation is found to be correlated with NaOH concentrations and reaction temperatures, using models that are given no internal chemistry knowledge. Additionally, the capacity for transfer learning is shown by predicting phase symmetry in materials systems unseen by models during training, outperforming heuristic physically-motivated baseline stratgies, and again with chemistry-agnostic models. These results suggest that synthesis parameters possess some intrinsic capability for predicting synthesis outcomes. The nature of this linkage between synthesis parameters and synthesis outcomes is then further explored by performing virtual synthesis parameter screening using generative models.Deep neural networks (variational autoencoders) are trained to learn low-dimensional representations of synthesis routes on augmented datasets, created by aggregated synthesis information across materials with high structural similarity. This technique is validated by predicting ion-mediated polymorph selection effects in MnO₂, using only data from the literature (i.e., without knowledge of competing free energies). This method of synthesis parameter screening is then applied to suggest a new hypothesis for solvent-driven formation of the rare TiO₂ phase, brookite. To extend the capability of synthesis planning with literature-based generative models, a sequence-based conditional variational autoencoder (CVAE) neural network is developed. The CVAE allows a materials scientist to query the model for synthesis suggestions of arbitrary materials, including those that the model has not observed before.In a demonstrative experiment, the CVAE suggests the correct precursors for literature-reported syntheses of two perovskite materials using training data published more than a decade prior to the target syntheses. Thus, the CVAE is used as an additional materials synthesis screening utility that is complementary to techniques driven by density functional theory calculations. Finally, this thesis provides a broad commentary on the status quo for the reporting of written materials synthesis methods, and suggests a new format which improves both human and machine readability. The thesis concludes with comments on promising future directions which may build upon the work described in this document.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.Thesis: Ph. D., Massachusetts Institute of Technology, Department of Materials Science and Engineering, 2019Cataloged from student-submitted PDF version of thesis.Includes bibliographical references.
DepartmentMassachusetts Institute of Technology. Department of Materials Science and Engineering
Massachusetts Institute of Technology
Materials Science and Engineering.