Parallel load and query processing in a distributed array database
Author(s)
Long, Qian, M. Eng. Massachusetts Institute of Technology
Other Contributors
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Advisor
Samuel R. Madden.
Abstract
Scientists across many research domains collect large amounts of multi-dimensional data in their day-to-day work. They require high-performance, scalable systems to manage and process their data. Oftentimes, the underlying distribution of these types of data is skewed and sparse, rather than dense and uniform. As input data sizes continue to grow at a rapid rate, main memory and storage capacity become bottlenecks on single machines. Thus, we look to distributed array databases as a long-term solution for managing and querying this type of data. This thesis presents Multinode-TileDB, a distributed framework that extends TileDB, a new array database management system designed, from the ground up, to handle skewed and sparse arrays. We design the overall distributed architecture and propose and implement parallel algorithms for load, join, subarray, and filter while focusing on load balance and performance. Our experiments show speedup gains as cluster size increases and how different data partitioning schemes benefit the different parallel queries.
Description
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2015. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (pages 63-64).
Date issued
2015
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.