MIT Libraries logoDSpace@MIT

MIT
View Item 
  • DSpace@MIT Home
  • MIT Open Access Articles
  • MIT Open Access Articles
  • View Item
  • DSpace@MIT Home
  • MIT Open Access Articles
  • MIT Open Access Articles
  • View Item
JavaScript is disabled for your browser. Some features of this site may not work without it.

Hopper: a mathematically optimal algorithm for sketching biological data

Author(s)
DeMeo, Benjamin; Berger, Bonnie
Thumbnail
DownloadPublished version (583.3Kb)
Publisher with Creative Commons License

Publisher with Creative Commons License

Creative Commons Attribution

Terms of use
Creative Commons Attribution NonCommercial License 4.0 https://creativecommons.org/licenses/by-nc/4.0/
Metadata
Show full item record
Abstract
<jats:title>Abstract</jats:title> <jats:sec> <jats:title>Motivation</jats:title> <jats:p>Single-cell RNA-sequencing has grown massively in scale since its inception, presenting substantial analytic and computational challenges. Even simple downstream analyses, such as dimensionality reduction and clustering, require days of runtime and hundreds of gigabytes of memory for today’s largest datasets. In addition, current methods often favor common cell types, and miss salient biological features captured by small cell populations.</jats:p> </jats:sec> <jats:sec> <jats:title>Results</jats:title> <jats:p>Here we present Hopper, a single-cell toolkit that both speeds up the analysis of single-cell datasets and highlights their transcriptional diversity by intelligent subsampling, or sketching. Hopper realizes the optimal polynomial-time approximation of the Hausdorff distance between the full and downsampled dataset, ensuring that each cell is well-represented by some cell in the sample. Unlike prior sketching methods, Hopper adds points iteratively and allows for additional sampling from regions of interest, enabling fast and targeted multi-resolution analyses. In a dataset of over 1.3 million mouse brain cells, Hopper detects a cluster of just 64 macrophages expressing inflammatory genes (0.004% of the full dataset) from a Hopper sketch containing just 5000 cells, and several other small but biologically interesting immune cell populations invisible to analysis of the full data. On an even larger dataset consisting of ∼2 million developing mouse organ cells, we show Hopper’s even representation of important cell types in small sketches, in contrast with prior sketching methods. We also introduce Treehopper, which uses spatial partitioning to speed up Hopper by orders of magnitude with minimal loss in performance. By condensing transcriptional information encoded in large datasets, Hopper and Treehopper grant the individual user with a laptop the analytic capabilities of a large consortium.</jats:p> </jats:sec> <jats:sec> <jats:title>Availability and implementation</jats:title> <jats:p>The code for Hopper is available at https://github.com/bendemeo/hopper. In addition, we have provided sketches of many of the largest single-cell datasets, available at http://hopper.csail.mit.edu.</jats:p> </jats:sec>
Date issued
2020
URI
https://hdl.handle.net/1721.1/145594
Department
Massachusetts Institute of Technology. Department of Mathematics; Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Journal
Bioinformatics
Publisher
Oxford University Press (OUP)
Citation
DeMeo, Benjamin and Berger, Bonnie. 2020. "Hopper: a mathematically optimal algorithm for sketching biological data." Bioinformatics, 36 (Supplement_1).
Version: Final published version

Collections
  • MIT Open Access Articles

Browse

All of DSpaceCommunities & CollectionsBy Issue DateAuthorsTitlesSubjectsThis CollectionBy Issue DateAuthorsTitlesSubjects

My Account

Login

Statistics

OA StatisticsStatistics by CountryStatistics by Department
MIT Libraries
PrivacyPermissionsAccessibilityContact us
MIT
Content created by the MIT Libraries, CC BY-NC unless otherwise noted. Notify us about copyright concerns.