Result: Accelerating many-core, heterogeneous, and distributed architectures with hardware runtimes and programming models

Title:
Accelerating many-core, heterogeneous, and distributed architectures with hardware runtimes and programming models
Contributors:
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Álvarez Martínez, Carlos, Jiménez González, Daniel
Publisher Information:
Universitat Politècnica de Catalunya
Publication Year:
2025
Collection:
Universitat Politècnica de Catalunya, BarcelonaTech: UPCommons - Global access to UPC knowledge
Document Type:
Doctoral or postdoctoral thesis (dissertation)
File Description:
220 p.; application/pdf
Language:
English
DOI:
10.5821/dissertation-2117-442722
Rights:
http://creativecommons.org/licenses/by/4.0/ ; Open Access ; Attribution 4.0 International
Accession Number:
edsbas.4F2DBA96
Database:
BASE

Further Information

(English) Driven by increasing concern about energy efficiency and the current trend of scaling out HPC systems to many computing nodes, this thesis tackles both problems with the help of hardware acceleration and programming models. Regarding the first topic, FPGAs are the target of study because of their flexibility to adapt to any computing workload and their high energy efficiency. We present extensions to the OmpSs@FPGA framework, which provides a high-level task-based programming interface to non-FPGA experts. These extensions include compiler directives to automatically optimize FPGA code, a hardware task-scheduling runtime with dependence analysis called POM, and a multi-FPGA MPI-like API and runtime called OMPIF. In addition, we present the Implicit Message Passing (IMP) model, which combines task-based and message-passing programming models, leveraging dependence information and a static data distribution. IMP automatically communicates data between nodes when the data dependencies of a task require it, so the user does not need to write any MPI or OMPIF calls in the code; this is handled by IMP. We evaluate this model on both FPGA and CPU clusters, with hardware acceleration for task scheduling and message passing using the POM and OMPIF runtimes. For CPU clusters, we study several ways to incorporate POM into an SoC: first with an embedded FPGA, then designed as an ASIC for a RISC-V core, and finally in an FPGA softcore also based on RISC-V. In the last case, we use both POM and OMPIF to evaluate distributed applications with a cluster of FPGAs that emulates a CPU cluster. We evaluate IMP and regular MPI+tasks programming with several benchmarks: Matrix Multiply, Spectra, N-body, Heat, and Cholesky. With these contributions, we achieve several objectives. First, we demonstrate that with OmpSs@FPGA we can achieve absolute performance similar to a CPU node for some benchmarks, like N-body, and outperform, in energy efficiency, similar CPU and GPU ...
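To make the programming style described in the abstract more concrete, the following is a minimal, hedged sketch in C of a blocked matrix multiply written as dependence-annotated tasks. The "#pragma oss task"/"taskwait" spelling follows OmpSs-2 conventions; the shaping expressions, block size, and buffer names are illustrative assumptions, not the exact OmpSs@FPGA or IMP syntax used in the thesis. The point it illustrates is that the loop nest contains no MPI or OMPIF call: under IMP, inter-node transfers would be derived by the runtime from the in/inout dependences and a static data distribution.

/* Hedged sketch: blocked matrix multiply in a task-based, dependence-
 * annotated style (OmpSs-like).  The pragmas are illustrative; a compiler
 * that does not know them ignores them and the program runs sequentially,
 * so the file stays self-contained and runnable. */
#include <stdio.h>
#include <stdlib.h>

#define N  1024          /* matrix dimension               */
#define BS 256           /* block (tile) size              */
#define NB (N / BS)      /* number of blocks per dimension */

/* One tile-level multiply-accumulate: C += A * B.
 * The in/inout clauses declare the data each task reads and writes;
 * a runtime such as POM builds the dependence graph from them. */
#pragma oss task in([BS*BS]a, [BS*BS]b) inout([BS*BS]c)
void gemm_block(const float *a, const float *b, float *c)
{
    for (int i = 0; i < BS; i++)
        for (int k = 0; k < BS; k++)
            for (int j = 0; j < BS; j++)
                c[i * BS + j] += a[i * BS + k] * b[k * BS + j];
}

int main(void)
{
    /* Tiled storage: one contiguous BS*BS buffer per block. */
    float *A[NB][NB], *B[NB][NB], *C[NB][NB];

    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++) {
            A[i][j] = malloc(sizeof(float) * BS * BS);
            B[i][j] = malloc(sizeof(float) * BS * BS);
            C[i][j] = calloc(BS * BS, sizeof(float));
            for (int k = 0; k < BS * BS; k++) {
                A[i][j][k] = 1.0f;
                B[i][j][k] = 2.0f;
            }
        }

    /* No MPI/OMPIF calls here: with IMP, tiles owned by remote nodes
     * would be fetched automatically before the task executes. */
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++)
                gemm_block(A[i][k], B[k][j], C[i][j]);

    #pragma oss taskwait      /* wait for all outstanding tasks */

    /* Each element should equal 2*N with the all-ones/all-twos inputs. */
    printf("C[0][0][0] = %.1f (expected %.1f)\n", C[0][0][0], 2.0f * N);

    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++) {
            free(A[i][j]);
            free(B[i][j]);
            free(C[i][j]);
        }
    return 0;
}

In an MPI+tasks version of the same benchmark, the user would additionally partition the tiles across ranks and insert explicit send/receive (or OMPIF) calls around the task loop; under IMP that communication code disappears, which is the contrast the thesis evaluates.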