Result: Programming models to support data science workflows

Title:
Programming models to support data science workflows
Contributors:
Badia Sala, Rosa M. (Rosa Maria), Ejarque Artigas, Jorge, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
Source:
TDX (Tesis Doctorals en Xarxa)
Publisher Information:
Universitat Politècnica de Catalunya
Publication Year:
2020
Collection:
Universitat Politècnica de Catalunya (UPC): Tesis Doctorals en Xarxa (TDX) / Theses and Dissertations Online
Time:
004
Document Type:
Dissertation doctoral or postdoctoral thesis
File Description:
201 p.; application/pdf
Language:
English
DOI:
10.5821/dissertation-2117-330142
Rights:
Access to the contents of this thesis is subject to acceptance of the terms of use established by the following Creative Commons licence: http://creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/openAccess
Accession Number:
edsbas.3C336297
Database:
BASE

Further information

Data Science workflows have become essential for progress in many scientific areas such as life, health, and earth sciences. In contrast to traditional HPC workflows, they are more heterogeneous, combining binary executions, MPI simulations, multi-threaded applications, custom analysis (possibly written in Java, Python, C/C++ or R), and real-time processing. Furthermore, in the past, field experts were capable of programming and running small simulations. Nowadays, however, simulations requiring hundreds or thousands of cores are widely used and, at this scale, programming them efficiently becomes a challenge even for computer scientists. Thus, programming languages and models make a considerable effort to ease programmability while maintaining acceptable performance. This thesis contributes to the adaptation of High-Performance frameworks to support the needs and challenges of Data Science workflows by extending COMPSs, a mature, general-purpose, task-based, distributed programming model. First, we enhance our prototype to orchestrate different frameworks inside a single programming model so that non-expert users can build complex workflows where some steps require highly optimised state-of-the-art frameworks. This extension includes the @binary, @OmpSs, @MPI, @COMPSs, and @MultiNode annotations for both Java and Python workflows. Second, we integrate container technologies to enable developers to easily port, distribute, and scale their applications to distributed computing platforms. This combination provides a straightforward methodology to parallelise applications from sequential codes, along with efficient image management and application deployment that ease the packaging and distribution of applications. We distinguish between static, HPC, and dynamic container management and provide representative use cases for each scenario using Docker, Singularity, and Mesos.
Third, we design, implement and integrate AutoParallel, a Python module to automatically find an appropriate task-based parallelisation of affine ...