
dc.contributor.advisor	John R. Williams.	en_US
dc.contributor.author	Sindi, Mohamad (Mohamad Othman)	en_US
dc.contributor.other	Massachusetts Institute of Technology. Department of Civil and Environmental Engineering.	en_US
dc.date.accessioned	2020-03-23T18:10:40Z
dc.date.available	2020-03-23T18:10:40Z
dc.date.copyright	2019	en_US
dc.date.issued	2019	en_US
dc.identifier.uri	https://hdl.handle.net/1721.1/124188
dc.description	Thesis: Ph. D., Massachusetts Institute of Technology, Department of Civil and Environmental Engineering, 2019	en_US
dc.description	Cataloged from PDF version of thesis.	en_US
dc.description	Includes bibliographical references (pages 122-130).	en_US
dc.description.abstract	According to the latest list of the world's top 500 supercomputers, ~90% of the top High Performance Computing (HPC) systems are based on commodity hardware clusters, which are typically designed for performance rather than reliability. The Mean Time Between Failures (MTBF) for some current petascale systems has been reported to be several days, while studies estimate it may be less than 60 minutes for future exascale systems. One of the largest studies on HPC system failures showed that more than 50% of failures were due to hardware, and that failure rates grew with system size. Hence, running extended workloads on such systems is becoming more challenging as system sizes grow. In this work, we design and implement a lightweight fault tolerance framework to improve the sustainability of running workloads on HPC clusters. The framework mainly includes a fault prediction component and a remedy component.	en_US
dc.description.abstract	The fault prediction component is implemented using a parallel algorithm that proactively predicts hardware issues with no overhead. This allows remedial actions to be taken before failures impact workloads. The algorithm applies machine learning to supercomputer system logs. We test it on actual logs from three supercomputers at Sandia National Laboratories (SNL); these massive logs comprise ~750 million messages (~86 GB of data). The algorithm is also tested online on our test cluster. We demonstrate the algorithm's high accuracy and performance in predicting cluster nodes with potential issues. The remedy component is implemented using Linux container technology. Container technology has proven successful in the microservices domain, and we adapt it to HPC workloads to make use of its resilience potential.	en_US
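As a rough illustration of the general idea summarized in the abstract above (machine learning applied to per-node system logs to flag nodes with potential issues), the following is a minimal sketch only. It is not the thesis's actual parallel algorithm; the scikit-learn featurization, classifier choice, data layout, and names (log_texts, labels, node_windows) are assumptions made for illustration.

    # Hypothetical sketch of log-based node failure prediction, assuming
    # scikit-learn is available; NOT the thesis's actual parallel algorithm.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def train_node_failure_model(log_texts, labels):
        # log_texts: one string of recent log messages per node/time window
        # labels: 1 if that node later exhibited a hardware issue, else 0
        model = make_pipeline(
            TfidfVectorizer(max_features=50000, ngram_range=(1, 2)),
            LogisticRegression(max_iter=1000, class_weight="balanced"),
        )
        model.fit(log_texts, labels)
        return model

    def flag_suspect_nodes(model, node_windows, threshold=0.8):
        # node_windows: list of (node_id, log_text) pairs for the current window
        probs = model.predict_proba([text for _, text in node_windows])[:, 1]
        return [node for (node, _), p in zip(node_windows, probs) if p >= threshold]

Nodes flagged by a predictor of this kind would then be handed to the remedy component described in the next paragraph of the abstract.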
dc.description.abstract	By running workloads inside containers, we are able to migrate workloads from nodes predicted to have hardware issues to healthy nodes while the workloads are running. This introduces no major interruption or performance overhead to the workload, nor does it require application modification. We test with multiple real HPC applications that use the Message Passing Interface (MPI) standard. Tests are performed on various cluster platforms using different MPI implementations. Results demonstrate successful migration of HPC workloads while maintaining the integrity of the results produced.	en_US
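The record does not specify which container runtime or migration mechanism the thesis uses. As a hedged sketch of what container-based migration of a running workload can look like, the following assumes Docker's experimental CRIU checkpoint/restore support, a shared checkpoint directory, and a container that already exists on both nodes; hostnames, container names, and paths are illustrative, and coordinating a multi-rank MPI job is deliberately simplified to a single container.

    # Hypothetical orchestration sketch of live container migration, assuming
    # Docker's experimental checkpoint/restore (CRIU) support; names and paths
    # are illustrative, not the setup used in the thesis.
    import subprocess

    def ssh(host, *cmd):
        # Run a command on a remote cluster node and raise if it fails.
        subprocess.run(["ssh", host, *cmd], check=True)

    def migrate_container(suspect_node, healthy_node, container,
                          ckpt="pre-failure", ckpt_dir="/shared/checkpoints"):
        # Freeze the running workload's state into a checkpoint on shared storage.
        ssh(suspect_node, "docker", "checkpoint", "create",
            "--checkpoint-dir", ckpt_dir, container, ckpt)
        # Resume the workload from that checkpoint on the healthy node.
        ssh(healthy_node, "docker", "start",
            "--checkpoint-dir", ckpt_dir, "--checkpoint", ckpt, container)

    # Example (illustrative node and container names):
    # migrate_container("node07", "node12", "mpi_job_container")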
dc.description.statementofresponsibility	by Mohamad Sindi.	en_US
dc.format.extent	130 pages	en_US
dc.language.iso	eng	en_US
dc.publisher	Massachusetts Institute of Technology	en_US
dc.rights	MIT theses are protected by copyright. They may be viewed, downloaded, or printed from this source but further reproduction or distribution in any format is prohibited without written permission.	en_US
dc.rights.uri	http://dspace.mit.edu/handle/1721.1/7582	en_US
dc.subject	Civil and Environmental Engineering.	en_US
dc.title	A container-based lightweight fault tolerance framework for high performance computing workloads	en_US
dc.type	Thesis	en_US
dc.description.degree	Ph. D.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Civil and Environmental Engineering	en_US
dc.identifier.oclc	1144931624	en_US
dc.description.collection	Ph.D. Massachusetts Institute of Technology, Department of Civil and Environmental Engineering	en_US
dspace.imported	2020-03-23T18:10:40Z	en_US
mit.thesis.degree	Doctoral	en_US
mit.thesis.department	CivEng	en_US

