Show simple item record

dc.contributor.authorCastro Fernandez, Raul
dc.contributor.authorMadden, Samuel R
dc.date.accessioned2021-12-20T16:52:10Z
dc.date.available2021-11-05T15:21:26Z
dc.date.available2021-12-20T16:52:10Z
dc.date.issued2019
dc.identifier.urihttps://hdl.handle.net/1721.1/137523.2
dc.description.abstract© 2019 ACM. Data-driven analysis is important in virtually every modern organization. Yet, most data is underutilized because it remains locked in silos inside of organizations; large organizations have thousands of databases, and billions of files that are not integrated together in a single, queryable repository. Despite 40+ years of continuous effort by the database community, data integration still remains an open challenge. In this paper, we advocate a different approach: rather than trying to infer a common schema, we aim to find another common representation for diverse, heterogeneous data. Specifically, we argue for an embedding (i.e., a vector space) in which all entities, rows, columns, and paragraphs are represented as points. In the embedding, the distance between points indicates their degree of relatedness. We present Termite, a prototype we have built to learn the best embedding from the data. Because the best representation is learned, this allows Termite to avoid much of the human effort associated with traditional data integration tasks. On top of Termite, we have implemented a Termite-Join operator, which allows people to identify related concepts, even when these are stored in databases with different schemas and in unstructured data such as text files, webpages, etc. Finally, we show preliminary evaluation results of our prototype via a user study, and describe a list of future directions we have identified.en_US
dc.language.isoen
dc.publisherAssociation for Computing Machinery (ACM)en_US
dc.relation.isversionof10.1145/3329859.3329877en_US
dc.rightsCreative Commons Attribution-Noncommercial-Share Alikeen_US
dc.rights.urihttp://creativecommons.org/licenses/by-nc-sa/4.0/en_US
dc.sourcearXiven_US
dc.titleTermite: a system for tunneling through heterogeneous dataen_US
dc.typeArticleen_US
dc.identifier.citationFernandez, Raul Castro and Madden, Samuel. 2019. "Termite: a system for tunneling through heterogeneous data." Proceedings of the ACM SIGMOD International Conference on Management of Data.en_US
dc.contributor.departmentMassachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratoryen_US
dc.contributor.departmentMassachusetts Institute of Technology. Department of Electrical Engineering and Computer Scienceen_US
dc.relation.journalProceedings of the ACM SIGMOD International Conference on Management of Dataen_US
dc.eprint.versionAuthor's final manuscripten_US
dc.type.urihttp://purl.org/eprint/type/ConferencePaperen_US
eprint.statushttp://purl.org/eprint/status/NonPeerRevieweden_US
dc.date.updated2021-01-29T19:04:37Z
dspace.orderedauthorsFernandez, RC; Madden, Sen_US
dspace.date.submission2021-01-29T19:04:39Z
mit.licenseOPEN_ACCESS_POLICY
mit.metadata.statusPublication Information Neededen_US


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record

VersionItemDateSummary

*Selected version