Parallel load and query processing in a distributed array database
Author(s): Long, Qian, M. Eng. Massachusetts Institute of Technology
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Samuel R. Madden.
Scientists across many research domains collect large amounts of multi-dimensional data in their day-to-day work. They require high-performance, scalable systems to manage and process their data. Often, the underlying distribution of these types of data is skewed and sparse, rather than dense and uniform. As input data sizes continue to grow at a rapid rate, main memory and storage capacity become bottlenecks on single machines. Thus, we look to distributed array databases as a long-term solution for managing and querying this type of data. This thesis presents Multinode-TileDB, a distributed framework that extends TileDB, a new array database management system designed from the ground up to handle skewed and sparse arrays. We design the overall distributed architecture and propose and implement parallel algorithms for load, join, subarray, and filter, focusing on load balance and performance. Our experiments show speedup gains as cluster size increases and show how different data partitioning schemes benefit different parallel queries.
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2015. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (pages 63-64).
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Massachusetts Institute of Technology
Electrical Engineering and Computer Science.