Sifter : a generalized, efficient, and scalable big data corpus generator
Author(s)
Wu, Sherwin Zhang
Alternative title
Generalized, efficient, and scalable big data corpus generator
Other Contributors
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Advisor
Samuel Madden.
Abstract
Big data has reached the point where the volume, velocity, and variety of data place significant limitations on the computer systems that process and analyze them. Working with very large data sets has become increasingly unwieldy. Therefore, our goal was to create a system that can support efficient extraction of data subsets to a size that can be manipulated on a single machine. Sifter was developed as a big data corpus generator for scientists to generate these smaller datasets from an original larger one. Sifter's three-layer architecture allows client users to easily create their own custom data corpus jobs, while allowing administrative users to easily integrate additional core data sets into Sifter. This thesis presents the implemented Sifter system deployed on an initial Twitter dataset. We further show how we added support for a secondary MIMIC medical dataset, and demonstrate the scalability of Sifter with very large datasets.
Description
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2015. Cataloged from PDF version of thesis. Includes bibliographical references (page 61).
Date issued
2015
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.