CHuff: Conditional Huffman String Compression

Author(s)

Nagda, Bhavik

Advisor

Kraska, Tim

Terms of use

In Copyright - Educational Use Permitted. Copyright MIT. http://rightsstatements.org/page/InC-EDU/1.0/

Abstract

Columnar databases have become ubiquitous in recent years due to their performance for analytical processing applications. Data storage in columnar form benefits from opportunities for improved compression performance as compared to row-oriented systems. For common string data, dictionary encoding is a light-weight compression scheme that replaces string tokens with fixed-size integers. In performing dictionary compression on a given column, database systems initially build a table of distinct values and then compress tokens into their corresponding table indices. This work focuses on optimizing compression for strings in columnar database stores. We introduce Conditional Huffman (CHuff) compression, a novel approach leveraging longstanding Huffman encoding and recent advances in hashing and storage paradigms. CHuff relies on low-entropy conditional relationships between consecutive characters in textual data to construct and apply Huffman-based compression models. The system additionally auto-tunes parameters for various corpus workloads, optimizing the compression rate while avoiding over-fitting. We demonstrate that on real-world data, CHuff performs favorably compared to similar string compressors, achieving an average 24% improvement in compression rate on our diverse experimental corpora.
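
The two ideas named in the abstract can be illustrated with a small sketch. This is not the thesis implementation: it only shows dictionary encoding (distinct values mapped to table indices) next to a plain order-1 conditional Huffman coder, in which each character is coded with a Huffman table selected by the preceding character. The hashing, storage layout, and auto-tuning the abstract mentions are not reproduced, and every name here (`dictionary_encode`, `train`, `encode`, `decode`, the `START` and `END` sentinels) is hypothetical.

```python
import heapq
from collections import Counter, defaultdict
from itertools import count

START = ""    # assumed context for the first character of each string
END = "\0"    # assumed end-of-string symbol; must not occur in the data


def dictionary_encode(column):
    """Replace each string with its index in a table of distinct values."""
    table = sorted(set(column))
    index = {value: i for i, value in enumerate(table)}
    return table, [index[value] for value in column]


def huffman_code(freqs):
    """Build a prefix code {symbol: bitstring} from {symbol: count}."""
    if len(freqs) == 1:
        # A lone symbol still needs one bit so the decoder can advance.
        return {next(iter(freqs)): "0"}
    tiebreak = count()  # unique counter so the dicts are never compared
    heap = [(n, next(tiebreak), {sym: ""}) for sym, n in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {s: "0" + bits for s, bits in left.items()}
        merged.update({s: "1" + bits for s, bits in right.items()})
        heapq.heappush(heap, (n1 + n2, next(tiebreak), merged))
    return heap[0][2]


def train(column):
    """Build one Huffman table per preceding character (order-1 context)."""
    counts = defaultdict(Counter)
    for s in column:
        prev = START
        for ch in s + END:
            counts[prev][ch] += 1
            prev = ch
    return {ctx: huffman_code(freqs) for ctx, freqs in counts.items()}


def encode(s, tables):
    """Encode s as a bitstring. Raises KeyError for a character pair that
    was never seen in training; a real system would need a fallback."""
    bits, prev = [], START
    for ch in s + END:
        bits.append(tables[prev][ch])
        prev = ch
    return "".join(bits)


def decode(bits, tables):
    """Invert encode() by walking the bitstring one context at a time."""
    inverse = {ctx: {b: s for s, b in code.items()} for ctx, code in tables.items()}
    out, prev, buf = [], START, ""
    for bit in bits:
        buf += bit
        if buf in inverse[prev]:
            ch = inverse[prev][buf]
            if ch == END:
                break
            out.append(ch)
            prev, buf = ch, ""
    return "".join(out)
```

As a usage example, calling `tables = train(column)` on a small column such as `["apple", "apply", "ample", "maple"]` and then `decode(encode(s, tables), tables)` returns each original string `s`. Because each context sees only a few distinct successors, the per-context codes are short, which is the low-entropy conditional relationship the abstract refers to.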

Date issued

2021-09

URI

https://hdl.handle.net/1721.1/139926

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Graduate Theses