Treffer: RICH: implementing reductions in the cache hierarchy

Title:
RICH: implementing reductions in the cache hierarchy
Contributors:
Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Publisher Information:
Association for Computing Machinery (ACM)
Publication Year:
2020
Collection:
Universitat Politècnica de Catalunya, BarcelonaTech: UPCommons - Global access to UPC knowledge
Document Type:
Konferenz conference object
File Description:
13 p.; application/pdf
Language:
English
Relation:
info:eu-repo/grantAgreement/MINECO//TIN2015-65316-P/ES/COMPUTACION DE ALTAS PRESTACIONES VII/; info:eu-repo/grantAgreement/AGAUR/2017 SGR 1414; info:eu-repo/grantAgreement/AGAUR/2017-SGR-1328; info:eu-repo/grantAgreement/AEI/RYC-2016-21104; info:eu-repo/grantAgreement/MINECO/2PE/RYC-2017-23269; info:eu-repo/grantAgreement/EC/FP7/321253/EU/Riding on Moore's Law/ROMOL; http://hdl.handle.net/2117/192495
DOI:
10.1145/3392717.3392736
Rights:
Open Access
Accession Number:
edsbas.98A61DA5
Database:
BASE

Weitere Informationen

Reductions constitute a frequent algorithmic pattern in high-performance and scientific computing. Sophisticated techniques are needed to ensure their correct and scalable concurrent execution on modern processors. Reductions on large arrays represent the most demanding case where traditional approaches are not always applicable due to low performance scalability. To address these challenges, we propose RICH, a runtime-assisted solution that relies on architectural and parallel programming model extensions. RICH updates the reduction variable directly in the cache hierarchy with the help of added in-cache functional units. Our programming model extensions fit with the most relevant parallel programming solutions for shared memory environments like OpenMP. RICH does not modify the ISA, which allows the use of algorithms with reductions from pre-compiled external libraries. Experiments show that our solution achieves the performance improvements of 11.2% on average, compared to the state-of-the-art hardware-based approaches, while it introduces 2.4% area and 3.8% power overhead. ; This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), and by Generalitat de Catalunya (contracts 2017- SGR-1414 and 2017-SGR-1328). V. Dimić has been partially supported by the Agency for Management of University and Research Grants (AGAUR) of the Government of Catalonia under Ajuts per a la contractació de personal investigador novell fellowship number 2017 FI_B 00855. M. Moretó has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramón y Cajal fellowship number RYC-2016-21104. M. Casas has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2017-23269. This manuscript has been co-authored by National Technology & Engineering Solutions of Sandia, LLC. under ...