dc.contributor.author Jindal, Alekh
dc.contributor.author Lu, Yi
dc.contributor.author Shanbhag, Anil Atmanand
dc.contributor.author Madden, Samuel R
dc.date.accessioned 2018-06-18T13:28:40Z
dc.date.available 2018-06-18T13:28:40Z
dc.date.issued 2017-01
dc.identifier.issn 2150-8097
dc.identifier.uri http://hdl.handle.net/1721.1/116354
dc.description.abstract Big data analytics often involves complex join queries over two or more tables. Such join processing is expensive in a distributed setting both because large amounts of data must be read from disk, and because of data shuffling across the network. Many techniques based on data partitioning have been proposed to reduce the amount of data that must be accessed, often focusing on finding the best partitioning scheme for a particular workload, rather than adapting to changes in the workload over time. In this paper, we present AdaptDB, an adaptive storage manager for analytical database workloads in a distributed setting. It works by partitioning datasets across a cluster and incrementally refining data partitioning as queries are run. AdaptDB introduces a novel hyper-join that avoids expensive data shuffling by identifying storage blocks of the joining tables that overlap on the join attribute, and only joining those blocks. Hyper-join performs well when each block in one table overlaps with few blocks in the other table, since that will minimize the number of blocks that have to be accessed. To minimize the number of overlapping blocks for common join queries, AdaptDB uses smooth repartitioning to repartition small portions of the tables on join attributes as queries run. A prototype of AdaptDB running on top of Spark improves query performance by 2-3x on TPC-H as well as on a real-world dataset, versus a system that employs scans and shuffle-joins. en_US
dc.language.iso en_US
dc.publisher Association for Computing Machinery (ACM) en_US
dc.relation.isversionof http://www.vldb.org/pvldb/vol10.html en_US
dc.rights Creative Commons Attribution-NonCommercial-NoDerivs License en_US
dc.rights.uri http://creativecommons.org/licenses/by-nc-nd/4.0/ en_US
dc.source Proceedings of the VLDB Endowment en_US
dc.title AdaptDB: Adaptive Partitioning for Distributed Joins en_US
dc.type Article en_US
dc.identifier.citation Lu, Yi, Anil Shanbhag, Alekh Jindal and Samuel Madden. "AdaptDB: Adaptive Partitioning for Distributed Joins." Proceedings of the VLDB Endowment 10, no. 5 (2017): 589-600. en_US
dc.contributor.department Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science en_US
dc.contributor.mitauthor Lu, Yi
dc.contributor.mitauthor Shanbhag, Anil Atmanand
dc.contributor.mitauthor Madden, Samuel R
dc.relation.journal Proceedings of the VLDB Endowment en_US
dc.eprint.version Final published version en_US
dc.type.uri http://purl.org/eprint/type/JournalArticle en_US
eprint.status http://purl.org/eprint/status/PeerReviewed en_US
dspace.orderedauthors Lu, Yi; Shanbhag, Anil; Jindal, Alekh; Madden, Samuel en_US
dspace.embargo.terms N en_US
dc.identifier.orcid https://orcid.org/0000-0002-2718-9443
dc.identifier.orcid https://orcid.org/0000-0002-0925-1354
dc.identifier.orcid https://orcid.org/0000-0002-7470-3265
mit.license PUBLISHER_CC en_US
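
The abstract describes hyper-join as finding storage blocks of the two tables that overlap on the join attribute and joining only those blocks. The following is a minimal single-machine sketch of that idea, not the AdaptDB implementation. The names `Block`, `overlapping_pairs`, and `hyper_join_sketch` are hypothetical, blocks are assumed to be non-empty lists of `(key, payload)` rows, and overlap is approximated by comparing each block's minimum and maximum join-key values.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical storage block: a non-empty list of (key, payload) rows."""
    rows: list

    @property
    def lo(self):
        # Smallest join-key value stored in this block.
        return min(key for key, _ in self.rows)

    @property
    def hi(self):
        # Largest join-key value stored in this block.
        return max(key for key, _ in self.rows)


def overlapping_pairs(left_blocks, right_blocks):
    """Return index pairs (i, j) whose join-key ranges intersect."""
    return [
        (i, j)
        for i, a in enumerate(left_blocks)
        for j, b in enumerate(right_blocks)
        if a.lo <= b.hi and b.lo <= a.hi
    ]


def hyper_join_sketch(left_blocks, right_blocks):
    """Equi-join the two tables, touching only overlapping block pairs."""
    result = []
    for i, j in overlapping_pairs(left_blocks, right_blocks):
        # Build a small hash index on the right block, then probe it.
        index = {}
        for key, payload in right_blocks[j].rows:
            index.setdefault(key, []).append(payload)
        for key, payload in left_blocks[i].rows:
            for match in index.get(key, []):
                result.append((key, payload, match))
    return result


left = [Block([(1, "a"), (2, "b")]), Block([(10, "c"), (12, "d")])]
right = [Block([(2, "x"), (3, "y")]), Block([(11, "z"), (12, "w")])]
pairs = overlapping_pairs(left, right)
joined = hyper_join_sketch(left, right)
```

In this toy input only two of the four possible block pairs overlap, so only those two are read and joined. This mirrors the abstract's point that hyper-join benefits when each block overlaps with few blocks in the other table.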

