Fault-Tolerance Techniques for High-Performance Computing / edited by Thomas Herault, Yves Robert.

SpringerLink Books Computer Science (2011-2024) Available online

Format:
Book
Contributor:
Herault, Thomas, editor.
Robert, Yves, editor.
SpringerLink (Online service)
Series:
Computer Science (Springer-11645)
Computer Communications and Networks, 1617-7975
Language:
English
Subjects (All):
Computer system failures.
Computer software--Reusability.
Computer software.
Numerical analysis.
System Performance and Evaluation.
Performance and Reliability.
Numeric Computing.
Local Subjects:
System Performance and Evaluation.
Performance and Reliability.
Numeric Computing.
Physical Description:
1 online resource (IX, 320 pages) : 113 illustrations.
Edition:
First edition 2015.
Contained In:
Springer eBooks
Place of Publication:
Cham : Springer International Publishing : Imprint: Springer, 2015.
System Details:
text file PDF
Summary:
This timely text/reference presents a comprehensive overview of fault tolerance techniques for high-performance computing (HPC). The text opens with a detailed introduction to the concepts of checkpoint protocols and scheduling algorithms, prediction, replication, silent error detection and correction, together with some application-specific techniques such as algorithm-based fault tolerance. Emphasis is placed on analytical performance models. This is then followed by a review of general-purpose techniques, including several checkpoint and rollback recovery protocols. Relevant execution scenarios are also evaluated and compared through quantitative models.

Topics and features:
Includes self-contained contributions from an international selection of preeminent experts
Provides a survey of resilience methods and performance models
Examines the various sources for errors and faults in large-scale systems, detailing their characteristics, with a focus on modeling, detection and prediction
Reviews the spectrum of techniques that can be applied to design a fault-tolerant message passing interface
Investigates different approaches to replication, comparing these to the traditional checkpoint-recovery approach
Discusses the challenge of energy consumption of fault-tolerance methods in extreme-scale systems, proposing a methodology to estimate such energy consumption

This authoritative volume is essential reading for all researchers and graduate students involved in high-performance computing.

Dr. Thomas Herault is a Research Scientist in the Innovative Computing Laboratory (ICL) at the University of Tennessee Knoxville, TN, USA. Dr. Yves Robert is a Professor in the Laboratory of Parallel Computing at the Ecole Normale Supérieure de Lyon, France, and a Visiting Research Scholar in the ICL.
Contents:
Part I: General Overview
Fault-Tolerance Techniques for High-Performance Computing
Part II: Technical Contributions
Errors and Faults
Fault-Tolerant MPI
Using Replication for Resilience on Exascale Systems
Energy-Aware Checkpointing Strategies.
Other Format:
Printed edition:
ISBN:
978-3-319-20943-2
9783319209432
Access Restriction:
Restricted for use by site license.
