Result: Task-level checkpointing system for task-based parallel workflows

Title:

Task-level checkpointing system for task-based parallel workflows

Authors:

Vergés Boncompte, Pere, Lordan Gomis, Francesc, Ejarque Artigas, Jorge, Badia Sala, Rosa Maria

Contributors:

Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center

Publisher Information:

Springer Nature

Publication Year:

2022

Collection:

Universitat Politècnica de Catalunya, BarcelonaTech: UPCommons - Global access to UPC knowledge

Subject Terms:

Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors, High performance computing, Fault-tolerant computing, Parallel processing (Electronic computers), Checkpointing, Task-based programming model, Recovery system, Fault tolerance, Càlcul intensiu (Informàtica), Tolerància als errors (Informàtica), Processament en paral·lel (Ordinadors)

Document Type:

Conference object

File Description:

12 p.; application/pdf

Language:

English

Relation:

https://link.springer.com/chapter/10.1007/978-3-031-31209-0_19; http://hdl.handle.net/2117/387402

DOI:

10.1007/978-3-031-31209-0_19

Availability:

http://hdl.handle.net/2117/387402
https://doi.org/10.1007/978-3-031-31209-0_19

Rights:

Open Access

Accession Number:

edsbas.90764BC6

Database:

BASE

Further Information

Scientific applications are large and complex; task-based programming models are a popular approach to developing these applications due to their ease of programming and ability to handle complex workflows and distribute their workload across large infrastructures. In these environments, either the hardware or the software may lead to failures from a myriad of origins: application logic, system software, memory, network, or disk. Re-executing a failed application can take hours, days, or even weeks, thus dragging out the research. This article proposes a recovery system for dynamic task-based models to reduce the re-execution time of failed runs. The design encapsulates in a checkpointing manager the automatic checkpointing of the execution, leveraging different mechanisms that can be arbitrarily defined and tuned to fit the needs of each performance. Additionally, it offers an API call to establish snapshots of the execution from the application code. The experiments executed on a prototype implementation have reached a speedup of 1.9× after re-execution and shown no overhead on the execution time on successful first runs of specific applications.

This work has been supported by the Spanish Government (PID2019-107255GB), by Generalitat de Catalunya (contract 2017-SGR-01414), and by the European Commission through the Horizon 2020 Research and Innovation program under Grant Agreement No. 955558 (eFlows4HPC project). This work has partially been co-funded at 50% by the European Regional Development Fund under the framework of the ERDF Operative Programme for Catalunya 2014-2020.

Peer Reviewed

Postprint (author's final draft)
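
The recovery idea summarized in the abstract (a checkpointing manager that saves finished tasks automatically under a tunable policy, plus a call that lets application code force a snapshot) can be illustrated with a minimal sketch. This is not the authors' implementation or API: the class `CheckpointManager`, its `run` and `snapshot` methods, the `every_n_tasks` policy, and the pickle file format are all assumptions chosen for illustration.

```python
import os
import pickle
import tempfile


class CheckpointManager:
    """Hypothetical task-level checkpoint store (illustrative only)."""

    def __init__(self, path, every_n_tasks=2):
        self.path = path
        # Assumed policy: persist after every N newly finished tasks.
        self.every_n_tasks = every_n_tasks
        self.results = {}
        self._unsaved = 0
        # On re-execution, reload whatever the previous run managed to save.
        if os.path.exists(path):
            with open(path, "rb") as fh:
                self.results = pickle.load(fh)

    def run(self, task_id, func, *args):
        """Return the saved result for task_id, or execute func and record it."""
        if task_id in self.results:
            return self.results[task_id]  # task recovered, not re-executed
        value = func(*args)
        self.results[task_id] = value
        self._unsaved += 1
        if self._unsaved >= self.every_n_tasks:
            self.snapshot()
        return value

    def snapshot(self):
        """Persist all finished task results; also callable from application code."""
        tmp = self.path + ".tmp"
        with open(tmp, "wb") as fh:
            pickle.dump(self.results, fh)
        # Rename into place so a reader does not see a half-written file.
        os.replace(tmp, self.path)
        self._unsaved = 0


def demo():
    calls = []  # records which task inputs were actually executed

    def square(x):
        calls.append(x)
        return x * x

    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "ckpt.pkl")
        first = CheckpointManager(path)
        first.run("t1", square, 2)
        first.run("t2", square, 3)  # second finished task triggers a checkpoint
        first.run("t3", square, 4)  # finished but never persisted

        # Simulated failure and re-execution: a fresh manager reloads the file.
        second = CheckpointManager(path)
        values = [second.run(f"t{i}", square, i + 1) for i in (1, 2, 3)]
    return values, calls
```

In this sketch, the second run recovers `t1` and `t2` from the checkpoint and re-executes only `t3`, which is the mechanism by which a task-level checkpoint shortens re-execution. A policy based on elapsed time or on task groups could replace the task counter without changing the interface.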
