Sherlock: A Deep Learning Approach to Semantic Data Type Detection

Name
1905.10688.pdf
Description
Submitted version
Size
2.21 MB
Format
Unknown
Checksum (MD5)
54e98423eca623a27288a4a1c3af213e

sword-2021-01-11T16:43:42.original.xml (130 B)
Original SWORD entry document
Author(s)
Hulsebos, Madelon
Hu, Kevin
Bakker, Michiel A
Zgraggen, Emanuel
Satyanarayan, Arvind
Kraska, Tim
Demiralp, Cagatay
Hidalgo, Cesar Augusto
Date Issued
2019
Journal
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Publisher
Association for Computing Machinery (ACM)
Version
Original manuscript
Abstract
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
Correctly detecting the semantic type of data columns is crucial for data science tasks such as automated data cleaning, schema matching, and data discovery. Existing data preparation and analysis systems rely on dictionary lookups and regular expression matching to detect semantic types. However, these matching-based approaches are often not robust to dirty data and detect only a limited number of types. We introduce Sherlock, a multi-input deep neural network for detecting semantic types. We train Sherlock on 686,765 data columns retrieved from the VizNet corpus by matching 78 semantic types from DBpedia to column headers. We characterize each matched column with 1,588 features describing the statistical properties, character distributions, word embeddings, and paragraph vectors of column values. Sherlock achieves a support-weighted F1 score of 0.89, exceeding that of machine learning baselines, dictionary and regular expression benchmarks, and the consensus of crowdsourced annotations.
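The abstract reports a support-weighted F1 score. As a minimal sketch of how that metric is commonly defined (per-class F1 averaged with weights proportional to each class's number of true instances), the following Python may help. It is not taken from the paper's code: the function names and the toy labels are assumptions made for illustration, and the definition shown is the standard one rather than a confirmed detail of the authors' evaluation.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, label):
    """F1 score for a single class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def support_weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's support
    (the number of true instances of that class)."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        per_class_f1(y_true, y_pred, label) * count / total
        for label, count in support.items()
    )


# Hypothetical toy example: four columns with two semantic types.
y_true = ["city", "city", "city", "name"]
y_pred = ["city", "city", "name", "name"]
score = support_weighted_f1(y_true, y_pred)
print(round(score, 4))
```

In this toy case the "city" class has F1 = 0.8 with support 3 and the "name" class has F1 = 2/3 with support 1, so frequent types dominate the average. Weighting by support keeps rare types from swinging the overall score, which matters when 78 types are unevenly represented.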
MIT Department
Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology. Media Laboratory
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Terms of Use
Creative Commons Attribution-Noncommercial-Share Alike
http://creativecommons.org/licenses/by-nc-sa/4.0/
Persistent DSpace Link
https://hdl.handle.net/1721.1/132281.2
DOI of Published Version
10.1145/3292500.3330993