Keynote: On Determining a Viable Path to Resilience at Exascale

Speaker:

Prof. Frank Mueller

North Carolina State University, USA

Abstract

Exascale computing is projected to feature billion-core parallelism. At such large processor counts, faults will become more commonplace. Current techniques to tolerate faults focus on reactive schemes for recovery and generally rely on a simple checkpoint/restart mechanism. Yet these schemes have a number of shortcomings. (1) They do not scale and require complete job restarts. (2) Projections indicate that the mean time between failures is approaching the overhead required for checkpointing. (3) Existing approaches are application-centric, which increases the burden on application programmers and reduces portability.
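Shortcoming (2) can be illustrated with a rough calculation. The sketch below is not taken from the talk: it assumes Young's classic first-order approximation for the checkpoint interval, and the function names and the numbers (a 10-minute checkpoint, MTBF values in minutes) are hypothetical, chosen only to show the trend.

```python
import math

def young_interval(checkpoint_cost, mtbf):
    """Young's first-order estimate of the compute time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

def wasted_fraction(checkpoint_cost, mtbf):
    """First-order fraction of time lost to checkpointing plus rework.

    Ignores restart time and is capped at 1.0, where the model predicts
    that the job makes no useful progress.
    """
    tau = young_interval(checkpoint_cost, mtbf)
    waste = checkpoint_cost / tau + tau / (2.0 * mtbf)
    return min(1.0, waste)

# Hypothetical numbers: a 10-minute checkpoint under a shrinking MTBF (minutes).
for mtbf in (1440.0, 240.0, 60.0, 20.0):
    print(f"MTBF {mtbf:6.0f} min -> wasted fraction {wasted_fraction(10.0, mtbf):.2f}")
```

Under this simplified model, the wasted fraction grows as the mean time between failures shrinks toward the checkpoint cost, which is the regime the projections point to.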

To address these problems, we discuss a number of techniques and their level of maturity (or lack thereof): (a) Scalable network overlays track node failures and recoveries, lifting part of the burden from programmers. (b) Mechanisms for on-the-fly recovery without a need to restart compute jobs conserve large-scale resources, in sharp contrast to today’s techniques. (c) An approach for proactive fault tolerance that complements reactive schemes further reduces resource requirements. (d) Redundant computing allows forward computation in the presence of failures. (e) Minimal API support for fault tolerance increases portability without requiring vendors to implement extensive functionality.

We discuss the advantages of process-level virtualization and integration into MPI message passing. These and further advances provide scalability, transparent recovery, portability, and reduced checkpoint frequencies in large-scale clusters. We also discuss shortcomings in standardization and in existing software stacks at HPC centers, as well as challenges in fault tolerance for exascale computing.

About the Keynote Speaker

Frank Mueller is a Professor in Computer Science and a member of multiple research centers at North Carolina State University. Previously, he held positions at Lawrence Livermore National Laboratory and Humboldt University Berlin, Germany. He received his Ph.D. from Florida State University in 1994. He has published papers in the areas of parallel and distributed systems, embedded and real-time systems and compilers. He is a member of ACM SIGPLAN, ACM SIGBED and a senior member of the ACM and IEEE Computer Societies as well as an ACM Distinguished Scientist. He is a recipient of an NSF Career Award, an IBM Faculty Award, a Google Research Award and a Fellowship from the Humboldt Foundation.


Copyright of PDSEC. Created and Maintained by PDSEC Team.