
dc.contributor.advisor: Kraska, Tim
dc.contributor.author: Nagda, Bhavik
dc.date.accessioned: 2022-02-07T15:13:10Z
dc.date.available: 2022-02-07T15:13:10Z
dc.date.issued: 2021-09
dc.date.submitted: 2021-11-03T19:25:32.845Z
dc.identifier.uri: https://hdl.handle.net/1721.1/139926
dc.description.abstract: Columnar databases have become ubiquitous in recent years due to their performance for analytical processing applications. Data storage in columnar form benefits from opportunities for improved compression performance as compared to row-oriented systems. For common string data, dictionary encoding is a lightweight compression scheme that replaces string tokens with fixed-size integers. In performing dictionary compression on a given column, database systems initially build a table of distinct values and then compress tokens into their corresponding table indices. This work focuses on optimizing compression for strings in columnar database stores. We introduce Conditional Huffman (CHuff) compression, a novel approach leveraging longstanding Huffman encoding and recent advances in hashing and storage paradigms. CHuff relies on low-entropy conditional relationships between consecutive characters in textual data to construct and apply Huffman-based compression models. The system additionally auto-tunes parameters for various corpus workloads, optimizing the compression rate while avoiding over-fitting. We demonstrate that on real-world data, CHuff performs favorably compared to similar string compressors, achieving an average 24% improvement in compression rate on our diverse experimental corpora.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright MIT
dc.rights.uri: http://rightsstatements.org/page/InC-EDU/1.0/
dc.title: CHuff: Conditional Huffman String Compression
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Electrical Engineering and Computer Science
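
The abstract describes dictionary encoding as building a table of distinct values and then replacing each token with its table index. The following is a minimal Python sketch of that baseline scheme only, not of CHuff itself. The function names `dictionary_encode` and `dictionary_decode` and the sample column are illustrative assumptions and do not come from the thesis.

```python
from typing import Dict, List, Tuple


def dictionary_encode(column: List[str]) -> Tuple[List[str], List[int]]:
    """Dictionary-encode a string column (hypothetical helper, not from the thesis).

    Returns (dictionary, codes), where dictionary holds the distinct values in
    first-seen order and codes[i] is the index of column[i] in dictionary.
    """
    index_of: Dict[str, int] = {}
    dictionary: List[str] = []
    codes: List[int] = []
    for token in column:
        if token not in index_of:
            # First occurrence: assign the next free table index.
            index_of[token] = len(dictionary)
            dictionary.append(token)
        codes.append(index_of[token])
    return dictionary, codes


def dictionary_decode(dictionary: List[str], codes: List[int]) -> List[str]:
    """Invert dictionary_encode by looking each code up in the table."""
    return [dictionary[code] for code in codes]


# Illustrative column; the values are made up for this example.
column = ["red", "blue", "red", "green", "blue"]
dictionary, codes = dictionary_encode(column)
restored = dictionary_decode(dictionary, codes)
```

In this sketch the codes are ordinary Python integers. A real column store would typically pack them into fixed-width integers sized to the number of distinct values, which is where the space saving comes from.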


