Human-in-the-loop Outlier Detection

Chai, Chengliang; Cao, Lei; Li, Guoliang; Li, Jian; Luo, Yuyu; Madden, Samuel R

Notice

This is not the latest version of this item. The latest version can be found at: https://dspace.mit.edu/handle/1721.1/130072.2

dc.contributor.author	Chai, Chengliang
dc.contributor.author	Cao, Lei
dc.contributor.author	Li, Guoliang
dc.contributor.author	Li, Jian
dc.contributor.author	Luo, Yuyu
dc.contributor.author	Madden, Samuel R
dc.date.accessioned	2021-03-03T23:03:12Z
dc.date.available	2021-03-03T23:03:12Z
dc.date.issued	2020-05
dc.identifier.isbn	9781450367356
dc.identifier.uri	https://hdl.handle.net/1721.1/130072
dc.description.abstract	Outlier detection is critical to many applications, from financial fraud detection to health care. Although numerous approaches have been proposed to detect outliers automatically, outliers detected on the basis of statistical rarity do not necessarily correspond to the true outliers of interest to an application. In this work, we propose a human-in-the-loop outlier detection approach, HOD, that effectively leverages human intelligence to discover the true outliers. There are two main challenges in HOD. The first is to design human-friendly questions that humans can easily understand even if they know nothing about outlier detection techniques. The second is to minimize the number of questions. To address the first challenge, we design a clustering-based method to effectively discover a small number of objects that are unlikely to be outliers (i.e., inliers) and yet effectively represent the typical characteristics of the given dataset. HOD then leverages this set of inliers (called context inliers) to help humans understand the context in which the outliers occur. This ensures that humans can easily identify the true outliers among the outlier candidates produced by machine-based outlier detection techniques. To address the second challenge, we propose a bipartite graph-based question selection strategy that is theoretically proven to minimize the number of questions needed to cover all outlier candidates. Our experimental results on real data sets show that HOD significantly outperforms state-of-the-art methods in both human effort and the quality of the discovered outliers.	en_US
dc.publisher	Association for Computing Machinery (ACM)	en_US
dc.relation.isversionof	http://dx.doi.org/10.1145/3318464.3389772	en_US
dc.rights	Creative Commons Attribution-Noncommercial-Share Alike	en_US
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/4.0/	en_US
dc.source	Lei Cao	en_US
dc.title	Human-in-the-loop Outlier Detection	en_US
dc.type	Article	en_US
dc.identifier.citation	Chai, Chengliang et al. "Human-in-the-loop Outlier Detection." Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, May 2020, Portland, Oregon, Association for Computing Machinery, May 2020. © 2020 Association for Computing Machinery	en_US
dc.contributor.department	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science	en_US
dc.relation.journal	Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data	en_US
dc.eprint.version	Author's final manuscript	en_US
dc.type.uri	http://purl.org/eprint/type/ConferencePaper	en_US
eprint.status	http://purl.org/eprint/status/NonPeerReviewed	en_US
dspace.date.submission	2021-02-25T21:47:03Z
mit.license	OPEN_ACCESS_POLICY
mit.metadata.status	Complete
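
The abstract describes selecting questions over a bipartite graph so that all outlier candidates are covered with as few questions as possible. This record does not give the algorithm itself, so the following is only a minimal sketch of the general covering idea, using a standard greedy set-cover heuristic rather than the paper's proven strategy. The function name `greedy_question_cover`, the question labels `q1`..`q4`, and the candidate labels `o1`..`o5` are hypothetical and chosen purely for illustration.

```python
from typing import Dict, Hashable, List, Set


def greedy_question_cover(covers: Dict[Hashable, Set[Hashable]]) -> List[Hashable]:
    """Greedily pick questions until every outlier candidate is covered.

    `covers` maps each question (one side of the bipartite graph) to the set
    of outlier candidates (the other side) that the question would resolve.
    This is a generic set-cover heuristic, NOT the strategy from the paper.
    """
    # Every candidate that appears on at least one edge must be covered.
    uncovered: Set[Hashable] = set().union(*covers.values()) if covers else set()
    chosen: List[Hashable] = []

    while uncovered:
        # Pick the question that covers the most still-uncovered candidates.
        # Ties are broken by dictionary insertion order.
        best = max(covers, key=lambda q: len(covers[q] & uncovered))
        chosen.append(best)
        uncovered -= covers[best]

    return chosen


# Hypothetical toy instance: four questions, five outlier candidates.
covers = {
    "q1": {"o1", "o2", "o3"},
    "q2": {"o3", "o4"},
    "q3": {"o4", "o5"},
    "q4": {"o1"},
}
selected = greedy_question_cover(covers)
print(selected)
```

On this toy instance the heuristic first takes `q1` (three new candidates) and then `q3` (the remaining two). A greedy choice like this is only an approximation in general; the abstract claims a minimality guarantee for the paper's own selection strategy, which this sketch does not reproduce.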