Result: Optimizing computation-communication overlap in asynchronous task-based programs

Title:
Optimizing computation-communication overlap in asynchronous task-based programs
Contributors:
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Publisher Information:
Association for Computing Machinery (ACM)
Publication Year:
2019
Collection:
Universitat Politècnica de Catalunya, BarcelonaTech: UPCommons - Global access to UPC knowledge
Document Type:
Conference object
File Description:
12 p.; application/pdf
Language:
English
Relation:
https://dl.acm.org/citation.cfm?id=3330379; info:eu-repo/grantAgreement/AEI/RYC-2016-21104; info:eu-repo/grantAgreement/AGAUR/2017 SGR 1414; info:eu-repo/grantAgreement/MINECO//TIN2015-65316-P/ES/COMPUTACION DE ALTAS PRESTACIONES VII/; info:eu-repo/grantAgreement/AGAUR/2017-SGR-1328; http://hdl.handle.net/2117/177259
DOI:
10.1145/3330345.3330379
Rights:
Open Access
Accession Number:
edsbas.B1C62C63
Database:
BASE

Further information

Asynchronous task-based programming models are gaining popularity to address the programmability and performance challenges in high performance computing. One of the main attractions of these models and runtimes is their potential to automatically expose and exploit overlap of computation with communication. However, we find that inefficient interactions between these programming models and the underlying messaging layer (in most cases, MPI) limit the achievable computation-communication overlap and negatively impact the performance of parallel programs. We address this challenge by exposing and exploiting information about MPI internals in a task-based runtime system to make better task-creation and scheduling decisions. In particular, we present two mechanisms for exchanging information between MPI and a task-based runtime, and analyze their trade-offs. Further, we present a detailed evaluation of the proposed mechanisms implemented in MPI and a task-based runtime. We show performance improvements of up to 16.3% and 34.5% for proxy applications with point-to-point and collective communication, respectively.

Peer Reviewed; Postprint (author's final draft)