Show simple item record

dc.contributor.advisor: Samuel Madden. [en_US]
dc.contributor.author: Wu, Sherwin Zhang [en_US]
dc.contributor.other: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science. [en_US]
dc.date.accessioned: 2016-01-04T20:53:14Z
dc.date.available: 2016-01-04T20:53:14Z
dc.date.copyright: 2015 [en_US]
dc.date.issued: 2015 [en_US]
dc.identifier.uri: http://hdl.handle.net/1721.1/100684
dc.description: Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2015. [en_US]
dc.description: Cataloged from PDF version of thesis. [en_US]
dc.description: Includes bibliographical references (page 61). [en_US]
dc.description.abstract: Big data has reached the point where the volume, velocity, and variety of data place significant limitations on the computer systems that process and analyze it. Working with very large data sets has become increasingly unwieldy. Therefore, our goal was to create a system that can support efficient extraction of data subsets small enough to be manipulated on a single machine. Sifter was developed as a big data corpus generator that lets scientists generate these smaller datasets from an original larger one. Sifter's three-layer architecture allows client users to easily create their own custom data corpus jobs, while allowing administrative users to easily integrate additional core data sets into Sifter. This thesis presents the implemented Sifter system deployed on an initial Twitter dataset. We further show how we added support for a secondary MIMIC medical dataset, and demonstrate the scalability of Sifter with very large datasets. [en_US]
dc.format.extent: 61 pages [en_US]
dc.language.iso: eng [en_US]
dc.publisher: Massachusetts Institute of Technology [en_US]
dc.rights: M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. [en_US]
dc.rights.uri: http://dspace.mit.edu/handle/1721.1/7582 [en_US]
dc.subject: Electrical Engineering and Computer Science. [en_US]
dc.title: Sifter : a generalized, efficient, and scalable big data corpus generator [en_US]
dc.title.alternative: Generalized, efficient, and scalable big data corpus generator [en_US]
dc.type: Thesis [en_US]
dc.description.degree: M. Eng. [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.oclc: 933231825 [en_US]

