Fault-Tolerance Techniques for High-Performance Computing / edited by Thomas Herault, Yves Robert.
- Format:
- Book
- Series:
- Computer Science (Springer-11645)
- Computer Communications and Networks, 1617-7975
- Language:
- English
- Subjects (All):
- Computer system failures.
- Computer software--Reusability.
- Computer software.
- Numerical analysis.
- System Performance and Evaluation.
- Performance and Reliability.
- Numeric Computing.
- Local Subjects:
- System Performance and Evaluation.
- Performance and Reliability.
- Numeric Computing.
- Physical Description:
- 1 online resource (IX, 320 pages) : 113 illustrations.
- Edition:
- First edition 2015.
- Contained In:
- Springer eBooks
- Place of Publication:
- Cham : Springer International Publishing : Imprint: Springer, 2015.
- System Details:
- text file PDF
- Summary:
- This timely text/reference presents a comprehensive overview of fault tolerance techniques for high-performance computing (HPC). The text opens with a detailed introduction to the concepts of checkpoint protocols and scheduling algorithms, prediction, replication, silent error detection and correction, together with some application-specific techniques such as algorithm-based fault tolerance. Emphasis is placed on analytical performance models. This is then followed by a review of general-purpose techniques, including several checkpoint and rollback recovery protocols. Relevant execution scenarios are also evaluated and compared through quantitative models.
- Topics and features:
- Includes self-contained contributions from an international selection of preeminent experts.
- Provides a survey of resilience methods and performance models.
- Examines the various sources for errors and faults in large-scale systems, detailing their characteristics, with a focus on modeling, detection and prediction.
- Reviews the spectrum of techniques that can be applied to design a fault-tolerant message passing interface.
- Investigates different approaches to replication, comparing these to the traditional checkpoint-recovery approach.
- Discusses the challenge of energy consumption of fault-tolerance methods in extreme-scale systems, proposing a methodology to estimate such energy consumption.
- This authoritative volume is essential reading for all researchers and graduate students involved in high-performance computing.
- Dr. Thomas Herault is a Research Scientist in the Innovative Computing Laboratory (ICL) at the University of Tennessee, Knoxville, TN, USA. Dr. Yves Robert is a Professor in the Laboratory of Parallel Computing at the Ecole Normale Supérieure de Lyon, France, and a Visiting Research Scholar in the ICL.
- Contents:
- Part I: General Overview
- Fault-Tolerance Techniques for High-Performance Computing
- Part II: Technical Contributions
- Errors and Faults
- Fault-Tolerant MPI
- Using Replication for Resilience on Exascale Systems
- Energy-Aware Checkpointing Strategies.
- Other Format:
- Printed edition:
- ISBN:
- 978-3-319-20943-2
- 9783319209432
- Access Restriction:
- Restricted for use by site license.