Result: Boosting HPC data analysis performance with the ParSoDA-Py library

Title:
Boosting HPC data analysis performance with the ParSoDA-Py library
Contributors:
Barcelona Supercomputing Center
Publisher Information:
Springer
Publication Year:
2024
Collection:
Universitat Politècnica de Catalunya, BarcelonaTech: UPCommons - Global access to UPC knowledge
Document Type:
Academic journal; article in journal/newspaper
File Description:
application/pdf
Language:
English
Relation:
https://link.springer.com/article/10.1007/s11227-023-05883-z; info:eu-repo/grantAgreement/EC/H2020/955558/EU/Enabling dynamic and Intelligent workflows in the future EuroHPCecosystem/eFlows4HPC; info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PCI2021-121957/ES/ENABLING DYNAMIC AND INTELLIGENT WORKFLOWS IN THE FUTURE EUROHPCECOSYSTEM/; info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2019-107255GB-C22/ES/UPC-COMPUTACION DE ALTAS PRESTACIONES VIII/; info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2019-107255GB-C21/ES/BSC - COMPUTACION DE ALTAS PRESTACIONES VIII/; http://hdl.handle.net/2117/404582
DOI:
10.1007/s11227-023-05883-z
Rights:
Attribution 4.0 International ; http://creativecommons.org/licenses/by/4.0/ ; Open Access
Accession Number:
edsbas.66AD7F6E
Database:
BASE

Further Information

Developing and executing large-scale data analysis applications in parallel and distributed environments can be a complex and time-consuming task. Developers often find themselves diverted from their application logic to handle technical details about the underlying runtime and related issues. To simplify this process, ParSoDA, a Java library, has been proposed to facilitate the development of parallel data mining applications executed on HPC systems. It provides built-in scalability mechanisms that rely on the Hadoop and Spark frameworks. This paper presents ParSoDA-Py, the Python version of the ParSoDA library, which extends support to commonly used runtimes and libraries for big data analysis. After a complete library redesign, ParSoDA can now be easily integrated with other Python-based distributed runtimes for HPC systems, such as COMPSs and Apache Spark, and with the large ecosystem of Python-based data processing libraries. The paper discusses the adaptation process, which takes into consideration the new technical requirements, and evaluates both usability and scalability through some case study applications. ; This work has been partially funded by the European Commission’s Horizon 2020 Framework program and the European High-Performance Computing Joint Undertaking (JU) under Grant agreement No 955558 and by MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR (PCI2021-121957), project eFlows4HPC. It has also been supported by the Spanish Government (PID2019-107255GB) and by the Departament de Recerca i Universitats de la Generalitat de Catalunya to the Research Group MPiEDist (2021 SGR 00412). We also acknowledge financial support from “National Centre for HPC, Big Data and Quantum Computing,” CN00000013 - CUP H23C22000360005, and from “FAIR - Future Artificial Intelligence Research” Project - CUP H23C22000860006. ; Peer Reviewed ; Postprint (published version)