Notice

This is not the latest version of this item. The latest version can be found at: https://dspace.mit.edu/handle/1721.1/132281.2

dc.contributor.author: Hulsebos, Madelon
dc.contributor.author: Hu, Kevin
dc.contributor.author: Bakker, Michiel
dc.contributor.author: Zgraggen, Emanuel
dc.contributor.author: Satyanarayan, Arvind
dc.contributor.author: Kraska, Tim
dc.contributor.author: Demiralp, Çagatay
dc.contributor.author: Hidalgo, César
dc.date.accessioned: 2021-09-20T18:21:39Z
dc.date.available: 2021-09-20T18:21:39Z
dc.identifier.uri: https://hdl.handle.net/1721.1/132281
dc.description.abstract: © 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM. Correctly detecting the semantic type of data columns is crucial for data science tasks such as automated data cleaning, schema matching, and data discovery. Existing data preparation and analysis systems rely on dictionary lookups and regular expression matching to detect semantic types. However, these matching-based approaches often are not robust to dirty data and only detect a limited number of types. We introduce Sherlock, a multi-input deep neural network for detecting semantic types. We train Sherlock on 686,765 data columns retrieved from the VizNet corpus by matching 78 semantic types from DBpedia to column headers. We characterize each matched column with 1,588 features describing the statistical properties, character distributions, word embeddings, and paragraph vectors of column values. Sherlock achieves a support-weighted F1 score of 0.89, exceeding that of machine learning baselines, dictionary and regular expression benchmarks, and the consensus of crowdsourced annotations. [en_US]
dc.language.iso: en
dc.publisher: Association for Computing Machinery (ACM) [en_US]
dc.relation.isversionof: 10.1145/3292500.3330993 [en_US]
dc.rights: Creative Commons Attribution-Noncommercial-Share Alike [en_US]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/4.0/ [en_US]
dc.source: arXiv [en_US]
dc.title: Sherlock: A Deep Learning Approach to Semantic Data Type Detection [en_US]
dc.type: Article [en_US]
dc.relation.journal: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining [en_US]
dc.eprint.version: Original manuscript [en_US]
dc.type.uri: http://purl.org/eprint/type/ConferencePaper [en_US]
eprint.status: http://purl.org/eprint/status/NonPeerReviewed [en_US]
dc.date.updated: 2021-01-11T16:43:42Z
dspace.orderedauthors: Hulsebos, M; Hu, K; Bakker, M; Zgraggen, E; Satyanarayan, A; Kraska, T; Demiralp, Ç; Hidalgo, C [en_US]
dspace.date.submission: 2021-01-11T16:43:48Z
mit.license: OPEN_ACCESS_POLICY
mit.metadata.status: Authority Work and Publication Information Needed

