Scalable Structure Learning, Inference, and Analysis with Probabilistic Programs
Author(s): Saad, Feras Ahmad Khaled
Mansinghka, Vikash K.
How can we automate and scale up the processes of learning accurate probabilistic models of complex data and obtaining principled solutions to probabilistic inference and analysis queries? This thesis addresses these fundamental challenges with efficient techniques grounded in probabilistic programming, that is, in representing probabilistic models as computer programs in specialized programming languages. First, I introduce scalable methods for real-time synthesis of probabilistic programs in domain-specific data modeling languages, by performing Bayesian structure learning over hierarchies of symbolic program representations. These methods let us automatically discover accurate and interpretable models in a variety of settings, including cross-sectional data, relational data, and univariate and multivariate time series data, as well as models whose structures are generated by probabilistic context-free grammars. Second, I describe SPPL, a probabilistic programming language that integrates knowledge compilation and symbolic analysis to compute sound exact answers to many Bayesian inference queries about both hand-written and machine-synthesized probabilistic programs. Third, I present fast algorithms for analyzing statistical properties of probabilistic programs in cases where exact inference is intractable. These algorithms operate entirely through black-box computational interfaces to probabilistic programs and solve challenging problems such as estimating bounds on the information flow between arbitrary sets of program variables and testing the convergence of sampling-based algorithms for approximate posterior inference.
A large collection of empirical evaluations establishes that, taken together, these techniques can outperform multiple state-of-the-art systems across diverse real-world data science problems, which include adapting to extreme novelty in streaming time series data; imputing and forecasting sparse multivariate flu rates; discovering commonsense clusters in relational and temporal macroeconomic data; generating synthetic satellite records with realistic orbital physics; finding information-theoretically optimal medical tests for liver disease and diabetes; and verifying the fairness of machine learning classifiers.
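As a rough illustration of what analyzing a probabilistic program through a black-box interface can look like, the sketch below estimates the mutual information between two variables using only sampling and log-density evaluations. It is not the thesis's algorithm: it assumes a toy bivariate Gaussian model whose marginal densities are available in closed form, whereas the abstract concerns estimating bounds in cases where exact quantities are intractable. All names (`simulate`, `logpdf_joint`, `RHO`, and so on) are hypothetical and chosen only for this example.

```python
import math
import random

# Assumed correlation of the toy model (illustrative value, not from the thesis).
RHO = 0.8


def normal_logpdf(v, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - 0.5 * ((v - mean) / std) ** 2


def simulate(rng):
    """Black-box sampler: draw (x, y) from a standard bivariate Gaussian with correlation RHO."""
    x = rng.gauss(0.0, 1.0)
    y = RHO * x + math.sqrt(1 - RHO ** 2) * rng.gauss(0.0, 1.0)
    return x, y


def logpdf_joint(x, y):
    """Black-box joint log density, factored as log p(x) + log p(y | x)."""
    return normal_logpdf(x, 0.0, 1.0) + normal_logpdf(y, RHO * x, math.sqrt(1 - RHO ** 2))


def logpdf_x(x):
    """Marginal log density of x (assumed tractable in this toy model)."""
    return normal_logpdf(x, 0.0, 1.0)


def logpdf_y(y):
    """Marginal log density of y (assumed tractable in this toy model)."""
    return normal_logpdf(y, 0.0, 1.0)


def estimate_mutual_information(n, seed=0):
    """Monte Carlo average of log p(x, y) - log p(x) - log p(y) over n joint samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x, y = simulate(rng)
        total += logpdf_joint(x, y) - logpdf_x(x) - logpdf_y(y)
    return total / n


# Closed-form mutual information of a bivariate Gaussian, used only as a check.
exact = -0.5 * math.log(1 - RHO ** 2)
estimate = estimate_mutual_information(20000)
print(f"estimate={estimate:.4f} exact={exact:.4f}")
```

The estimator touches the model only through `simulate` and the log-density functions, so the same loop would apply to any model exposing those operations. When the marginal densities are not available, a plain average like this no longer applies, which is the harder setting the abstract refers to.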
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology