Cooperative checkpointing for supercomputing systems

Oliner, Adam Jamison

dc.contributor.advisor	José E. Moreira.	en_US
dc.contributor.author	Oliner, Adam Jamison	en_US
dc.contributor.other	Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.	en_US
dc.date.accessioned	2006-03-28T19:51:36Z
dc.date.available	2006-03-28T19:51:36Z
dc.date.copyright	2005	en_US
dc.date.issued	2005	en_US
dc.identifier.uri	http://hdl.handle.net/1721.1/32102
dc.description	Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005.	en_US
dc.description	This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.	en_US
dc.description	Includes bibliographical references (p. 91-94).	en_US
dc.description.abstract	A system-level checkpointing mechanism, with global knowledge of the state and health of the machine, can improve performance and reliability by dynamically deciding when to skip checkpoint requests made by applications. This thesis presents such a technique, called cooperative checkpointing, and models its behavior as an online algorithm. Where C is the checkpoint overhead and I is the request interval, a worst-case analysis proves a lower bound of (2 + [C/I])-competitiveness for deterministic cooperative checkpointing algorithms, and proves that a number of simple algorithms meet this bound. Using an expected-case analysis, this thesis proves that an optimal periodic checkpointing algorithm that assumes an exponential failure distribution may be arbitrarily bad relative to an optimal cooperative checkpointing algorithm that permits a general failure distribution. Calculations suggest that, under realistic conditions, an application using cooperative checkpointing may make progress 4 times faster than one using periodic checkpointing. Finally, the thesis suggests an embodiment of cooperative checkpointing for a large-scale high performance computer system and presents the results of some preliminary simulations. These results show that, in extreme cases, cooperative checkpointing improved system utilization by more than 25% and reduced bounded slowdown by a factor of 9, while simultaneously reducing the amount of work lost due to failures by 30%. This thesis contributes a unique approach to providing large-scale system reliability through cooperative checkpointing, techniques for analyzing the approach, and blueprints for implementing it in practice.	en_US
dc.description.statementofresponsibility	by Adam Jamison Oliner.	en_US
dc.format.extent	94 p.	en_US
dc.format.extent	2455146 bytes
dc.format.extent	2616682 bytes
dc.format.mimetype	application/pdf
dc.format.mimetype	application/pdf
dc.language.iso	eng	en_US
dc.publisher	Massachusetts Institute of Technology	en_US
dc.rights	M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission.	en_US
dc.rights.uri	http://dspace.mit.edu/handle/1721.1/7582
dc.subject	Electrical Engineering and Computer Science.	en_US
dc.title	Cooperative checkpointing for supercomputing systems	en_US
dc.type	Thesis	en_US
dc.description.degree	M.Eng.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.oclc	62323950	en_US

Files in this item

Name: 62323950-MIT.pdf
Size: 2.495 MB
Format: PDF
Description: Full printable version

This item appears in the following Collection(s)

Graduate Theses
