A symbolic emulator for shuffle synthesis on the NVIDIA PTX code

Title:
A symbolic emulator for shuffle synthesis on the NVIDIA PTX code
Contributors:
Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Barcelona Supercomputing Center
Publisher Information:
Association for Computing Machinery (ACM)
Publication Year:
2023
Collection:
Universitat Politècnica de Catalunya, BarcelonaTech: UPCommons - Global access to UPC knowledge
Document Type:
Conference object
File Description:
12 p.; application/pdf
Language:
English
Relation:
info:eu-repo/grantAgreement/EC/H2020/801051/EU/European joint Effort toward a Highly Productive Programming Environment for Heterogeneous Exascale Computing (EPEEC)/EPEEC; info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2019-107255GB-C21/ES/BSC - COMPUTACION DE ALTAS PRESTACIONES VIII/; http://hdl.handle.net/2117/384604
DOI:
10.1145/3578360.3580253
Rights:
Attribution 4.0 International ; http://creativecommons.org/licenses/by/4.0/ ; Open Access
Accession Number:
edsbas.68249573
Database:
BASE

Further Information

Many kinds of applications take advantage of GPUs through automation tools that attempt to exploit the available performance of the GPU's parallel architecture automatically. Directive-based programming models, such as OpenACC, are one such method: they enable parallel computing simply by attaching annotations to code loops. Such abstract models, however, often prevent programmers from making additional low-level optimizations that take advantage of the advanced architectural features of GPUs, because the actual generated computation is hidden from the application developer. This paper describes and implements a novel, flexible optimization technique that operates by inserting a code emulator phase at the tail end of the compilation pipeline. Our tool emulates the generated code using symbolic analysis by substituting dynamic information, thus allowing further low-level code optimizations to be applied. We implement our tool to support both CUDA and OpenACC directives as the frontend of the compilation pipeline, thus enabling low-level GPU optimizations for OpenACC that were not previously possible. We demonstrate the capabilities of our tool by automating warp-level shuffle instructions, which are difficult even for advanced GPU programmers to use. Lastly, evaluating our tool with a benchmark suite and complex application code, we provide a detailed study to assess the benefits of shuffle instructions across four generations of GPU architectures.

We are funded by the EPEEC project from the European Union’s Horizon 2020 research and innovation program under grant agreement No. 801051 and by the Ministerio de Ciencia e Innovación-Agencia Estatal de Investigación (PID2019-107255GB-C21/AEI/10.13039/501100011033). This work has been partially carried out on the ACME cluster owned by CIEMAT and funded by the Spanish Ministry of Economy and Competitiveness project CODEC-OSE (RTI2018-096006-B-I00).

Peer Reviewed

Postprint (published version)