Notice

This is not the latest version of this item. The latest version can be found at:https://dspace.mit.edu/handle/1721.1/133951.2

Show simple item record

dc.contributor.authorQahtan, Abdulhakim
dc.contributor.authorTang, Nan
dc.contributor.authorOuzzani, Mourad
dc.contributor.authorCao, Yang
dc.contributor.authorStonebraker, Michael
dc.date.accessioned2021-10-27T19:57:20Z
dc.date.available2021-10-27T19:57:20Z
dc.date.issued2020
dc.identifier.urihttps://hdl.handle.net/1721.1/133951
dc.description.abstractPatterns (or regex-based expressions) are widely used to constrain the format of a domain (or a column), e.g., a Year column should contain only four digits, and thus a value like “1980-“ might be a typo. Moreover, integrity constraints (ICs) defined over multiple columns, such as (conditional) functional dependencies and denial constraints, e.g., a ZIP code uniquely determines a city in the UK, have been widely used in data cleaning. However, a promising, but not yet explored, direction is to combine regex- and IC-based theories to capture data dependencies involving partial attribute values. For example, in an employee ID such as“F-9-107“, “F“ is sufficient to determine the finance department. Inspired by the above observation, we propose a novel class of ICs, called pattern functional dependencies (PFDs), to model fine-grained data dependencies gleaned from partial attribute values. These dependencies cannot be modeled using traditional ICs, such as (conditional) functional dependencies, which work on entire attribute values. We also present a set of axioms for the inference of PFDs, analogous to Armstrong's axioms for FDs, and study the complexity of consistency and implication analysis of PFDs. Moreover, we devise an effective algorithm to automatically discover PFDs even in the presence of errors in the data. Our extensive experiments on 15 real-world datasets show that our approach can effectively discover valid and useful PFDs over dirty data, which can then be used to detect data errors that are hard to capture by other types of ICs.en_US
dc.language.isoen
dc.publisherVLDB Endowmenten_US
dc.relation.isversionof10.14778/3377369.3377377en_US
dc.rightsCreative Commons Attribution-NonCommercial-NoDerivs Licenseen_US
dc.rights.urihttp://creativecommons.org/licenses/by-nc-nd/4.0/en_US
dc.sourceVLDB Endowmenten_US
dc.titlePattern functional dependencies for data cleaningen_US
dc.typeArticleen_US
dc.relation.journalProceedings of the VLDB Endowmenten_US
dc.eprint.versionFinal published versionen_US
dc.type.urihttp://purl.org/eprint/type/ConferencePaperen_US
eprint.statushttp://purl.org/eprint/status/NonPeerRevieweden_US
dc.date.updated2021-09-24T13:57:25Z
dspace.orderedauthorsQahtan, A; Tang, N; Ouzzani, M; Cao, Y; Stonebraker, Men_US
dspace.date.submission2021-09-24T13:57:26Z
mit.journal.volume13en_US
mit.journal.issue5en_US
mit.licensePUBLISHER_CC
mit.metadata.statusAuthority Work and Publication Information Neededen_US
mit.metadata.statusAuthority Work and Publication Information Needed


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record

VersionItemDateSummary

*Selected version