Data formats in analytical DBMSs: performance trade-offs and future directions

Liu, Chunwei; Pavlenko, Anna; Interlandi, Matteo; Haynes, Brandon

Publisher with Creative Commons License

Terms of use

Creative Commons Attribution https://creativecommons.org/licenses/by/4.0/

Abstract

This paper evaluates the suitability of Apache Arrow, Parquet, and ORC as formats for subsumption in an analytical DBMS. We systematically identify and explore the high-level features that are important to support efficient querying in modern OLAP DBMSs and evaluate the ability of each format to support these features. We find that each format has trade-offs that make it more or less suitable for use as a format in a DBMS and identify opportunities to more holistically co-design a unified in-memory and on-disk data representation. Notably, for certain popular machine learning tasks, none of these formats performs optimally, highlighting significant opportunities for advancing format design. Our hope is that this study can serve as a guide for system developers designing and using these formats, and that it provides the community with directions to pursue for improving these common open formats.

Date issued

2025-03-19

URI

https://hdl.handle.net/1721.1/162354

Department

Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory

Journal

The VLDB Journal

Publisher

Springer Berlin Heidelberg

Citation

Liu, C., Pavlenko, A., Interlandi, M. et al. Data formats in analytical DBMSs: performance trade-offs and future directions. The VLDB Journal 34, 30 (2025).

Version: Final published version

Collections

MIT Open Access Articles