Authors: Zhengji Zhao and Rebecca Hartman-Baker (National Energy Research Scientific Computing Center (NERSC)) and Gene Cooperman (Northeastern University, Khoury College of Computer Sciences)
Abstract: Checkpoint/restart (C/R) is a critical component of fault-tolerant computing, and it gives computing centers the scheduling flexibility to support diverse workloads with different priorities. Because existing C/R tools are often research-oriented, there is a gap to close before they can be used reliably with production workloads, especially on cutting-edge HPC systems. In this talk, we present our strategy to enable C/R capabilities on NERSC production workloads, which are dominated by MPI and hybrid MPI+OpenMP applications. We share our journey to prepare a production-ready MPI-Agnostic Network-Agnostic (MANA) Distributed Multi-Threaded CheckPointing (DMTCP) tool for NERSC. We also present variable-time job scripts to automate preempted job submissions, the queue policies and configurations we have adopted to incentivize C/R usage, our user training efforts to increase NERSC users' uptake of C/R, and our work to build an active C/R community. Finally, we showcase some applications enabled by C/R.
Extended Abstract: pdf
Presentation: pdf