Hopper: a mathematically optimal algorithm for sketching biological data

DeMeo, Benjamin; Berger, Bonnie

dc.contributor.author	DeMeo, Benjamin
dc.contributor.author	Berger, Bonnie
dc.date.accessioned	2022-09-27T18:42:27Z
dc.date.available	2022-09-27T18:42:27Z
dc.date.issued	2020
dc.identifier.uri	https://hdl.handle.net/1721.1/145594
dc.description.abstract	Motivation: Single-cell RNA-sequencing has grown massively in scale since its inception, presenting substantial analytic and computational challenges. Even simple downstream analyses, such as dimensionality reduction and clustering, require days of runtime and hundreds of gigabytes of memory for today’s largest datasets. In addition, current methods often favor common cell types, and miss salient biological features captured by small cell populations. Results: Here we present Hopper, a single-cell toolkit that both speeds up the analysis of single-cell datasets and highlights their transcriptional diversity by intelligent subsampling, or sketching. Hopper realizes the optimal polynomial-time approximation of the Hausdorff distance between the full and downsampled dataset, ensuring that each cell is well-represented by some cell in the sample. Unlike prior sketching methods, Hopper adds points iteratively and allows for additional sampling from regions of interest, enabling fast and targeted multi-resolution analyses. In a dataset of over 1.3 million mouse brain cells, Hopper detects a cluster of just 64 macrophages expressing inflammatory genes (0.004% of the full dataset) from a Hopper sketch containing just 5000 cells, and several other small but biologically interesting immune cell populations invisible to analysis of the full data. On an even larger dataset consisting of ∼2 million developing mouse organ cells, we show Hopper’s even representation of important cell types in small sketches, in contrast with prior sketching methods. We also introduce Treehopper, which uses spatial partitioning to speed up Hopper by orders of magnitude with minimal loss in performance. By condensing transcriptional information encoded in large datasets, Hopper and Treehopper grant the individual user with a laptop the analytic capabilities of a large consortium. Availability and implementation: The code for Hopper is available at https://github.com/bendemeo/hopper. In addition, we have provided sketches of many of the largest single-cell datasets, available at http://hopper.csail.mit.edu.	en_US
dc.language.iso	en
dc.publisher	Oxford University Press (OUP)	en_US
dc.relation.isversionof	10.1093/BIOINFORMATICS/BTAA408	en_US
dc.rights	Creative Commons Attribution NonCommercial License 4.0	en_US
dc.rights.uri	https://creativecommons.org/licenses/by-nc/4.0/	en_US
dc.source	Oxford University Press	en_US
dc.title	Hopper: a mathematically optimal algorithm for sketching biological data	en_US
dc.type	Article	en_US
dc.identifier.citation	DeMeo, Benjamin and Berger, Bonnie. 2020. "Hopper: a mathematically optimal algorithm for sketching biological data." Bioinformatics, 36 (Supplement_1).
dc.contributor.department	Massachusetts Institute of Technology. Department of Mathematics	en_US
dc.contributor.department	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory	en_US
dc.relation.journal	Bioinformatics	en_US
dc.eprint.version	Final published version	en_US
dc.type.uri	http://purl.org/eprint/type/JournalArticle	en_US
eprint.status	http://purl.org/eprint/status/PeerReviewed	en_US
dc.date.updated	2022-09-27T18:31:29Z
dspace.orderedauthors	DeMeo, B; Berger, B	en_US
dspace.date.submission	2022-09-27T18:31:30Z
mit.journal.volume	36	en_US
mit.journal.issue	Supplement_1	en_US
mit.license	PUBLISHER_CC
mit.metadata.status	Authority Work and Publication Information Needed	en_US
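
Illustrative note on the abstract: the abstract says that Hopper adds points iteratively and bounds the Hausdorff distance between the sketch and the full dataset, but this record does not spell out the procedure. A standard technique that fits that description is greedy farthest-first traversal, and the minimal sketch below assumes it. The function name farthest_first_sketch, its parameters, and the use of Euclidean distance are hypothetical choices made for illustration; they are not taken from the Hopper code base.

```python
import numpy as np


def farthest_first_sketch(points, k, seed=0):
    """Pick k row indices by greedy farthest-first traversal.

    Hypothetical illustration, not the Hopper API. Returns the chosen
    indices and the covering radius, i.e. the largest distance from any
    point to its nearest chosen point. Because the sketch is a subset of
    the data, this radius equals the Hausdorff distance between the two.
    """
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    # Assumption: the first point is chosen uniformly at random.
    rng = np.random.default_rng(seed)
    first = int(rng.integers(n))
    chosen = [first]

    # min_dist[i] is the distance from point i to its nearest chosen point.
    min_dist = np.linalg.norm(points - points[first], axis=1)

    while len(chosen) < k:
        # Add the point that is currently worst represented by the sketch.
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        # Only distances to the newly added point can shrink min_dist.
        new_dist = np.linalg.norm(points - points[nxt], axis=1)
        np.minimum(min_dist, new_dist, out=min_dist)

    return np.array(chosen), float(min_dist.max())
```

Each step adds the point farthest from the current sketch, so an isolated rare population is picked up early rather than in proportion to its size. Farthest-first traversal is known to give a 2-approximation to the k-center objective, which is consistent with the approximation guarantee the abstract describes. Each iteration scans all points once, so a sketch of k points from n costs on the order of n times k distance evaluations.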

Files in this item

Name: btaa408.pdf
Size: 583.3Kb
Format: PDF
Description: Published version

This item appears in the following Collection(s)

MIT Open Access Articles
