Outlier Detection in Heterogeneous Datasets using Automatic Tuple Expansion

Pit-Claudel, Clément; Mariet, Zelda; Harding, Rachael; Madden, Sam

dc.contributor.advisor	Adam Chlipala
dc.contributor.author	Pit-Claudel, Clément	fr_FR
dc.contributor.author	Mariet, Zelda	en_US
dc.contributor.author	Harding, Rachael	en_US
dc.contributor.author	Madden, Sam	en_US
dc.contributor.other	Programming Languages and Verification	en
dc.date.accessioned	2016-02-10T18:45:06Z
dc.date.available	2016-02-10T18:45:06Z
dc.date.issued	2016-02-08
dc.identifier.uri	http://hdl.handle.net/1721.1/101150
dc.description.abstract	Rapidly developing areas of information technology are generating massive amounts of data. Human errors, sensor failures, and other unforeseen circumstances unfortunately tend to undermine the quality and consistency of these datasets by introducing outliers: data points that exhibit surprising behavior when compared to the rest of the data. Characterizing, locating, and in some cases eliminating these outliers offers interesting insight into the data under scrutiny and reinforces the confidence that one may have in conclusions drawn from otherwise noisy datasets. In this paper, we describe a tuple expansion procedure which reconstructs rich information from semantically poor SQL data types such as strings, integers, and floating-point numbers. We then use this procedure as the foundation of a new user-guided outlier detection framework, dBoost, which relies on inference and statistical modeling of heterogeneous data to flag suspicious fields in database tuples. We show that this novel approach achieves good classification performance, both in traditional numerical datasets and in highly non-numerical contexts such as mostly textual datasets. Our implementation is publicly available, under version 3 of the GNU General Public License.	en_US
dc.format.extent	12 p.	en_US
dc.relation.ispartofseries	MIT-CSAIL-TR-2016-002
dc.title	Outlier Detection in Heterogeneous Datasets using Automatic Tuple Expansion	en_US
dc.date.updated	2016-02-10T18:45:07Z

Files in this item

Name: MIT-CSAIL-TR-2016-002.pdf
Size: 2.583Mb
Format: PDF


This item appears in the following Collection(s)

CSAIL Technical Reports (July 1, 2003 - present)
