Learning Approximate Sequential Patterns for Classification

Syed, Zeeshan; Indyk, Piotr; Guttag, John

dc.contributor.author	Syed, Zeeshan
dc.contributor.author	Indyk, Piotr
dc.contributor.author	Guttag, John V.
dc.date.accessioned	2011-06-09T18:10:07Z
dc.date.available	2011-06-09T18:10:07Z
dc.date.issued	2009-08
dc.date.submitted	2009-03
dc.identifier.issn	1532-4435
dc.identifier.uri	http://hdl.handle.net/1721.1/63807
dc.description.abstract	In this paper, we present an automated approach to discover patterns that can distinguish between sequences belonging to different labeled groups. Our method searches for approximately conserved motifs that occur with varying statistical properties in positive and negative training examples. We propose a two-step process to discover such patterns. Using locality sensitive hashing (LSH), we first estimate the frequency of all subsequences and their approximate matches within a given Hamming radius in labeled examples. The discriminative ability of each pattern is then assessed from the estimated frequencies by concordance and rank sum testing. The use of LSH to identify approximate matches for each candidate pattern helps reduce the runtime of our method. Space requirements are reduced by decomposing the search problem into an iterative method that uses a single LSH table in memory. We propose two further optimizations to the search for discriminative patterns. Clustering with redundancy based on a 2-approximate solution of the k-center problem decreases the number of overlapping approximate groups while providing exhaustive coverage of the search space. Sequential statistical methods allow the search process to use data from only as many training examples as are needed to assess significance. We evaluated our algorithm on data sets from different applications to discover sequential patterns for classification. On nucleotide sequences from the Drosophila genome compared with random background sequences, our method was able to discover approximate binding sites that were preserved upstream of genes. We observed a similar result in experiments on ChIP-on-chip data. For cardiovascular data from patients admitted with acute coronary syndromes, our pattern discovery approach identified approximately conserved sequences of morphology variations that were predictive of future death in a test population. Our data showed that the use of LSH, clustering, and sequential statistics improved the running time of the search algorithm by an order of magnitude without any noticeable effect on accuracy. These results suggest that our methods may allow for an unsupervised approach to efficiently learn interesting dissimilarities between positive and negative examples that may have a functional role.	en_US
dc.description.sponsorship	Center for Integration of Medicine and Innovative Technology	en_US
dc.description.sponsorship	Harvard University--MIT Division of Health Sciences and Technology	en_US
dc.description.sponsorship	Industrial Technology Research Institute	en_US
dc.description.sponsorship	Texas Instruments Incorporated	en_US
dc.language.iso	en_US
dc.publisher	MIT Press	en_US
dc.relation.isversionof	http://portal.acm.org/citation.cfm?id=1755849	en_US
dc.rights	Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use.	en_US
dc.source	MIT web domain	en_US
dc.title	Learning Approximate Sequential Patterns for Classification	en_US
dc.type	Article	en_US
dc.identifier.citation	Syed, Zeeshan, Piotr Indyk and John Guttag. "Learning Approximate Sequential Patterns for Classification." Journal of Machine Learning Research, Volume 10 (2009) 1913-1936.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science	en_US
dc.contributor.approver	Indyk, Piotr
dc.contributor.mitauthor	Indyk, Piotr
dc.contributor.mitauthor	Guttag, John V.
dc.relation.journal	Journal of Machine Learning Research	en_US
dc.eprint.version	Final published version	en_US
dc.type.uri	http://purl.org/eprint/type/JournalArticle	en_US
eprint.status	http://purl.org/eprint/status/PeerReviewed	en_US
dspace.orderedauthors	Syed, Zeeshan; Indyk, Piotr; Guttag, John
dc.identifier.orcid	https://orcid.org/0000-0003-0992-0906
dc.identifier.orcid	https://orcid.org/0000-0002-7983-9524
mit.license	PUBLISHER_POLICY	en_US
mit.metadata.status	Complete
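
The abstract mentions clustering with redundancy based on a 2-approximate solution of the k-center problem under Hamming distance. The sketch below illustrates one standard 2-approximation, greedy farthest-first traversal; it is not taken from the paper. The function names, the choice of index 0 as the first center, and the reading of "redundancy" as letting a sequence belong to every cluster whose center lies within a given radius are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def farthest_first_centers(seqs: Sequence[str], k: int) -> List[int]:
    """Greedy farthest-first traversal, a standard 2-approximation for k-center.

    Returns the indices of the chosen centers. The first center is
    arbitrarily index 0 (an assumption of this sketch).
    """
    if not seqs or k <= 0:
        return []
    centers = [0]
    # dist[i] = distance from sequence i to its nearest chosen center
    dist = [hamming(s, seqs[0]) for s in seqs]
    while len(centers) < min(k, len(seqs)):
        nxt = max(range(len(seqs)), key=dist.__getitem__)
        if dist[nxt] == 0:
            break  # every sequence already coincides with some center
        centers.append(nxt)
        dist = [min(d, hamming(s, seqs[nxt])) for d, s in zip(dist, seqs)]
    return centers


def assign_with_redundancy(
    seqs: Sequence[str], centers: Sequence[int], radius: int
) -> Dict[int, List[int]]:
    """Map each center to all sequences within `radius` of it.

    A sequence may appear under several centers (hypothetical reading
    of "clustering with redundancy").
    """
    return {
        c: [i for i, s in enumerate(seqs) if hamming(s, seqs[c]) <= radius]
        for c in centers
    }


if __name__ == "__main__":
    seqs = ["AAAA", "AAAT", "TTTT", "TTTA", "AATT"]
    centers = farthest_first_centers(seqs, k=2)
    print(centers)
    print(assign_with_redundancy(seqs, centers, radius=2))
```

In this toy input, the sequence "AATT" lies within Hamming distance 2 of both chosen centers, so it is assigned to both clusters.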

Files in this item

Name: Indyk_Learning approximate.pdf
Size: 195.2 KB
Format: PDF

This item appears in the following Collection(s)

MIT Open Access Articles
