AdaptDB: Adaptive Partitioning for Distributed Joins

Lu, Yi; Shanbhag, Anil; Jindal, Alekh; Madden, Samuel

dc.contributor.author	Jindal, Alekh
dc.contributor.author	Lu, Yi
dc.contributor.author	Shanbhag, Anil Atmanand
dc.contributor.author	Madden, Samuel R
dc.date.accessioned	2018-06-18T13:28:40Z
dc.date.available	2018-06-18T13:28:40Z
dc.date.issued	2017-01
dc.identifier.issn	2150-8097
dc.identifier.uri	http://hdl.handle.net/1721.1/116354
dc.description.abstract	Big data analytics often involves complex join queries over two or more tables. Such join processing is expensive in a distributed setting both because large amounts of data must be read from disk and because data must be shuffled across the network. Many techniques based on data partitioning have been proposed to reduce the amount of data that must be accessed, often focusing on finding the best partitioning scheme for a particular workload, rather than adapting to changes in the workload over time. In this paper, we present AdaptDB, an adaptive storage manager for analytical database workloads in a distributed setting. It works by partitioning datasets across a cluster and incrementally refining data partitioning as queries are run. AdaptDB introduces a novel hyper-join that avoids expensive data shuffling by identifying storage blocks of the joining tables that overlap on the join attribute, and only joining those blocks. Hyper-join performs well when each block in one table overlaps with few blocks in the other table, since that will minimize the number of blocks that have to be accessed. To minimize the number of overlapping blocks for common join queries, AdaptDB uses smooth repartitioning to repartition small portions of the tables on join attributes as queries run. A prototype of AdaptDB running on top of Spark improves query performance by 2-3x on TPC-H as well as on a real-world dataset, versus a system that employs scans and shuffle-joins.	en_US
dc.language.iso	en_US
dc.publisher	Association for Computing Machinery (ACM)	en_US
dc.relation.isversionof	http://www.vldb.org/pvldb/vol10.html	en_US
dc.rights	Creative Commons Attribution-NonCommercial-NoDerivs License	en_US
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/	en_US
dc.source	Proceedings of the VLDB Endowment	en_US
dc.title	AdaptDB: Adaptive Partitioning for Distributed Joins	en_US
dc.type	Article	en_US
dc.identifier.citation	Lu, Yi, Anil Shanbhag, Alekh Jindal and Samuel Madden. "AdaptDB: Adaptive Partitioning for Distributed Joins." Proceedings of the VLDB Endowment 10, no. 5 (2017): 589-600.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science	en_US
dc.contributor.mitauthor	Lu, Yi
dc.contributor.mitauthor	Shanbhag, Anil Atmanand
dc.contributor.mitauthor	Madden, Samuel R
dc.relation.journal	Proceedings of the VLDB Endowment	en_US
dc.eprint.version	Final published version	en_US
dc.type.uri	http://purl.org/eprint/type/JournalArticle	en_US
eprint.status	http://purl.org/eprint/status/PeerReviewed	en_US
dspace.orderedauthors	Lu, Yi; Shanbhag, Anil; Jindal, Alekh; Madden, Samuel	en_US
dspace.embargo.terms	N	en_US
dc.identifier.orcid	https://orcid.org/0000-0002-2718-9443
dc.identifier.orcid	https://orcid.org/0000-0002-0925-1354
dc.identifier.orcid	https://orcid.org/0000-0002-7470-3265
mit.license	PUBLISHER_CC	en_US

Files in this item

Name: AdaptDB.pdf
Size: 527.3Kb
Format: PDF

This item appears in the following Collection(s)

MIT Open Access Articles