Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery

Castro Fernandez, Raul; Mansour, Essam; Qahtan, Abdulhakim A.; Elmagarmid, Ahmed; Ilyas, Ihab; Madden, Samuel; Ouzzani, Mourad; Stonebraker, Michael; Tang, Nan

dc.contributor.author	Castro Fernandez, Raul
dc.contributor.author	Mansour, Essam
dc.contributor.author	Qahtan, Abdulhakim A.
dc.contributor.author	Elmagarmid, Ahmed
dc.contributor.author	Ilyas, Ihab
dc.contributor.author	Madden, Samuel
dc.contributor.author	Ouzzani, Mourad
dc.contributor.author	Stonebraker, Michael
dc.contributor.author	Tang, Nan
dc.date.accessioned	2021-11-09T12:48:12Z
dc.date.available	2021-11-09T12:48:12Z
dc.date.issued	2018-04
dc.identifier.uri	https://hdl.handle.net/1721.1/137849
dc.description.abstract	© 2018 IEEE. Employees who spend more time finding relevant data than analyzing it suffer from a data discovery problem. The large volume of data in enterprises, and sometimes a lack of knowledge of the schemas, aggravates this problem. Similar to how we navigate the Web, we propose to identify semantic links that assist analysts in their discovery tasks. These links relate tables to each other to facilitate navigating the schemas. They also relate data to external data sources, such as ontologies and dictionaries, to help explain the schema meaning. We materialize the links in an enterprise knowledge graph, where they become available to analysts. The main challenge is how to find pairs of objects that are semantically related. We propose SEMPROP, a DAG of different components that find links based on syntactic and semantic similarities. SEMPROP is commanded by a semantic matcher that leverages word embeddings to find objects that are semantically related. We introduce coherent groups, a technique to combine word embeddings that works better than other state-of-the-art combination alternatives. We implement SEMPROP as part of Aurum, a data discovery system we are building, and conduct user studies, real deployments and a quantitative evaluation to understand the benefits of links for data discovery tasks, as well as the benefits of SEMPROP and coherent groups to find those links.	en_US
dc.language.iso	en
dc.publisher	IEEE	en_US
dc.relation.isversionof	10.1109/icde.2018.00093	en_US
dc.rights	Creative Commons Attribution-Noncommercial-Share Alike	en_US
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/4.0/	en_US
dc.source	website	en_US
dc.title	Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery	en_US
dc.type	Article	en_US
dc.identifier.citation	Castro Fernandez, Raul, Mansour, Essam, Qahtan, Abdulhakim A., Elmagarmid, Ahmed, Ilyas, Ihab et al. 2018. "Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery."
dc.contributor.department	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
dc.eprint.version	Author's final manuscript	en_US
dc.type.uri	http://purl.org/eprint/type/ConferencePaper	en_US
eprint.status	http://purl.org/eprint/status/NonPeerReviewed	en_US
dc.date.updated	2019-06-18T17:15:24Z
dspace.date.submission	2019-06-18T17:15:25Z
mit.license	OPEN_ACCESS_POLICY
mit.metadata.status	Authority Work and Publication Information Needed	en_US

Files in this item

Name: icde2018semantic.pdf
Size: 486.5Kb
Format: PDF
Description: Accepted version

This item appears in the following Collection(s)

MIT Open Access Articles