The SPRAWL distributed stream dissemination system
Author(s)
Mei, Yuan, Ph. D. Massachusetts Institute of Technology
Other Contributors
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Advisor
Samuel R. Madden.
Abstract
Many large financial, news, and social media companies process and stream large quantities of data to customers, either through the public Internet or on their own internal networks. These customers often depend on that data being delivered in a timely and resource-efficient manner. In addition, many customers subscribe to the same or similar data products (e.g., particular types of financial feeds, or feeds of specific social media users). A naive implementation of a data dissemination network like this will cause redundant data to be processed and delivered repeatedly, wasting CPU and bandwidth, increasing network delays, and driving up costs. In this dissertation, we present SPRAWL, a distributed stream processing layer to address the wide-area data processing and dissemination problem. SPRAWL provides two key functions. First, it generates a shared and distributed multi-query plan that transmits records through the network just once, and shares the computation of streaming operators that operate on the same subset of data. Second, it computes an in-network placement of complex queries (each with dozens of operators) in wide-area networks (consisting of thousands of nodes). This placement is optimal, and is computed with polynomial time and memory complexity, when there are no resource (CPU, bandwidth) or query (latency) constraints. In addition, we develop several heuristics to guarantee the placement is near optimal when constraints are violated, and experimentally evaluate the performance of our algorithms versus an exhaustive algorithm. We also design and implement a distributed version of the SPRAWL placement algorithm in order to support wide-area networks consisting of thousands of nodes, which centralized algorithms cannot handle.
Finally, we show that SPRAWL can make complex query placement decisions on wide-area networks within seconds, and the placement can increase throughput by up to a factor of 5 and reduce dollar costs by a factor of 6 on a financial data stream processing task.
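As a rough illustration of the redundancy the abstract describes, the following Python sketch counts how many times a single record crosses network links under naive per-subscriber delivery versus shared delivery, where each link carries the record once. This is not SPRAWL's planning or placement algorithm; the tree topology, the node names, and the functions `path_to_source`, `naive_link_transmissions`, and `shared_link_transmissions` are all hypothetical and chosen only for this example.

```python
# Hypothetical toy model: a dissemination tree stored as child -> parent.
# It only illustrates why sending a record once per link saves bandwidth;
# it does not reproduce anything from the thesis itself.

def path_to_source(node, parent):
    """Return the links (child, parent) on the path from node up to the source."""
    links = []
    while parent[node] is not None:
        links.append((node, parent[node]))
        node = parent[node]
    return links


def naive_link_transmissions(subscribers, parent):
    """One separate copy per subscriber: every link on every path is charged."""
    return sum(len(path_to_source(s, parent)) for s in subscribers)


def shared_link_transmissions(subscribers, parent):
    """Shared delivery: each link that any subscriber needs carries the record once."""
    used_links = set()
    for s in subscribers:
        used_links.update(path_to_source(s, parent))
    return len(used_links)


# Assumed example topology: three subscribers behind one relay node.
parent = {
    "source": None,
    "relay": "source",
    "a": "relay",
    "b": "relay",
    "c": "relay",
}
subscribers = ["a", "b", "c"]

print("naive:", naive_link_transmissions(subscribers, parent))
print("shared:", shared_link_transmissions(subscribers, parent))
```

In this toy topology the naive scheme sends the record over the source-to-relay link three times, while the shared scheme sends it once and fans it out at the relay.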
Description
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2015. Cataloged from PDF version of thesis. Includes bibliographical references (pages 125-130).
Date issued
2015
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.