MIT Libraries logoDSpace@MIT

MIT
View Item 
  • DSpace@MIT Home
  • MIT Libraries
  • MIT Theses
  • Graduate Theses
  • View Item
  • DSpace@MIT Home
  • MIT Libraries
  • MIT Theses
  • Graduate Theses
  • View Item
JavaScript is disabled for your browser. Some features of this site may not work without it.

CHuff: Conditional Huffman String Compression

Author(s)
Nagda, Bhavik
Thumbnail
DownloadThesis PDF (3.758Mb)
Advisor
Kraska, Tim
Terms of use
In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/
Metadata
Show full item record
Abstract
Columnar databases have become ubiquitous in recent years due to their performance for analytical processing applications. Data storage in columnar form benefits from opportunities for improved compression performance as compared to row-oriented systems. For common string data, dictionary encoding is a light-weight compression scheme that replaces string tokens with fixed-size integers. In performing dictionary compression on a given column, database systems initially build a table of distinct values and then compress tokens into their corresponding table indices. This work focuses on optimizing compression for strings in columnar database stores. We introduce Conditional Huffman (CHuff) compression, a novel approach leveraging longstanding Huffman encoding and recent advances in hashing and storage paradigms. CHuff relies on low-entropy conditional relationships between consecutive characters in textual data to construct and apply Huffman-based compression models. The system additionally auto-tunes parameters for various corpus workloads, optimizing the compression rate while avoiding over-fitting. We demonstrate that on real-world data, CHuff performs favorably compared to similar string compressors, achieving an average 24% improvement in compression rate on our diverse experimental corpora.
Date issued
2021-09
URI
https://hdl.handle.net/1721.1/139926
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses

Browse

All of DSpaceCommunities & CollectionsBy Issue DateAuthorsTitlesSubjectsThis CollectionBy Issue DateAuthorsTitlesSubjects

My Account

Login

Statistics

OA StatisticsStatistics by CountryStatistics by Department
MIT Libraries
PrivacyPermissionsAccessibilityContact us
MIT
Content created by the MIT Libraries, CC BY-NC unless otherwise noted. Notify us about copyright concerns.