dc.contributor.author: Shafto, Patrick
dc.contributor.author: Jonas, Eric
dc.contributor.author: Petschulat, Cap
dc.contributor.author: Gasner, Max
dc.contributor.author: Mansinghka, Vikash K
dc.contributor.author: Tenenbaum, Joshua B
dc.date.accessioned: 2017-12-07T15:45:12Z
dc.date.available: 2017-12-07T15:45:12Z
dc.date.issued: 2016-01
dc.identifier.issn: 1532-4435
dc.identifier.issn: 1533-7928
dc.identifier.uri: http://hdl.handle.net/1721.1/112621
dc.description.abstract: There is a widespread need for statistical methods that can analyze high-dimensional datasets without imposing restrictive or opaque modeling assumptions. This paper describes a domain-general data analysis method called CrossCat. CrossCat infers multiple non-overlapping views of the data, each consisting of a subset of the variables, and uses a separate nonparametric mixture to model each view. CrossCat is based on approximately Bayesian inference in a hierarchical, nonparametric model for data tables. This model consists of a Dirichlet process mixture over the columns of a data table in which each mixture component is itself an independent Dirichlet process mixture over the rows; the inner mixture components are simple parametric models whose form depends on the types of data in the table. CrossCat combines strengths of mixture modeling and Bayesian network structure learning. Like mixture modeling, CrossCat can model a broad class of distributions by positing latent variables, and produces representations that can be efficiently conditioned and sampled from for prediction. Like Bayesian networks, CrossCat represents the dependencies and independencies between variables, and thus remains accurate when there are multiple statistical signals. Inference is done via a scalable Gibbs sampling scheme; this paper shows that it works well in practice. This paper also includes empirical results on heterogeneous tabular data of up to 10 million cells, such as hospital cost and quality measures, voting records, unemployment rates, gene expression measurements, and images of handwritten digits. CrossCat infers structure that is consistent with accepted findings and common-sense knowledge in multiple domains and yields predictive accuracy competitive with generative, discriminative, and model-free alternatives. [en_US]
dc.publisher: MIT Press [en_US]
dc.relation.isversionof: https://dl.acm.org/citation.cfm?id=3007091 [en_US]
dc.rights: Creative Commons Attribution-Noncommercial-Share Alike [en_US]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/4.0/ [en_US]
dc.source: arXiv [en_US]
dc.title: CrossCat: A fully Bayesian nonparametric method for analyzing heterogeneous, high dimensional data [en_US]
dc.type: Article [en_US]
dc.identifier.citation: Mansinghka, Vikash et al. "CrossCat: A fully Bayesian nonparametric method for analyzing heterogeneous, high dimensional data." Journal of Machine Learning Research 17, 1 (January 2016): 4760-4808 © 2016 Vikash Mansinghka, Patrick Shafto, Eric Jonas, Cap Petschulat, Max Gasner, and Joshua B. Tenenbaum [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory [en_US]
dc.contributor.mitauthor: Mansinghka, Vikash K
dc.contributor.mitauthor: Tenenbaum, Joshua B
dc.relation.journal: Journal of Machine Learning Research [en_US]
dc.eprint.version: Original manuscript [en_US]
dc.type.uri: http://purl.org/eprint/type/JournalArticle [en_US]
eprint.status: http://purl.org/eprint/status/NonPeerReviewed [en_US]
dc.date.updated: 2017-12-06T14:44:48Z
dspace.orderedauthors: Mansinghka, Vikash; Shafto, Patrick; Jonas, Eric; Petschulat, Cap; Gasner, Max; Tenenbaum, Joshua B. [en_US]
dspace.embargo.terms: N [en_US]
dc.identifier.orcid: https://orcid.org/0000-0002-1925-2035
mit.license: OPEN_ACCESS_POLICY [en_US]
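
The abstract in this record describes a nested structure: a Dirichlet process mixture over the columns of a table, where each component (a "view") holds its own independent Dirichlet process mixture over the rows. As a rough illustration of that structure only, the sketch below draws the two levels of partitions from a Chinese restaurant process, the standard sequential representation of the partition induced by a Dirichlet process. The function names, the concentration parameters `alpha_cols` and `alpha_rows`, and their default values are assumptions made for this example and are not taken from the paper; the per-cluster parametric data models and the Gibbs sampler are omitted entirely.

```python
import random


def crp_partition(n_items, alpha, rng):
    """Draw a partition of n_items from a Chinese restaurant process.

    Returns a list where entry i is the cluster index of item i.
    """
    assignments = []
    counts = []  # counts[k] = number of items already in cluster k
    for _ in range(n_items):
        # Join existing cluster k with weight counts[k],
        # or open a new cluster with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments


def sample_crosscat_structure(n_rows, n_cols, alpha_cols=1.0, alpha_rows=1.0, seed=0):
    """Sample a column-to-view partition and one row partition per view.

    Hypothetical helper for illustration: it samples only the latent
    partitions, not the data or the component parameters.
    """
    rng = random.Random(seed)
    # Outer level: partition the columns into views.
    col_to_view = crp_partition(n_cols, alpha_cols, rng)
    n_views = max(col_to_view) + 1
    # Inner level: each view gets its own independent partition of the rows.
    row_clusters = [crp_partition(n_rows, alpha_rows, rng) for _ in range(n_views)]
    return col_to_view, row_clusters


if __name__ == "__main__":
    views, rows_by_view = sample_crosscat_structure(n_rows=6, n_cols=5, seed=42)
    print("column -> view:", views)
    for v, rows in enumerate(rows_by_view):
        print("view", v, "row -> cluster:", rows)
```

Because every view carries its own row partition, two columns assigned to different views can group the same rows differently, which is one way to read the abstract's "multiple non-overlapping views of the data".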

