Result: Programming models to support data science workflows

Title:

Programming models to support data science workflows

Authors:

Contributors:

Badia Sala, Rosa M. (Rosa Maria), Ejarque Artigas, Jorge, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors

Source:

TDX (Tesis Doctorals en Xarxa)

Publisher Information:

Universitat Politècnica de Catalunya

Publication Year:

2020

Collection:

Universitat Politècnica de Catalunya (UPC): Tesis Doctorals en Xarxa (TDX) / Theses and Dissertations Online

Subject Terms:

Distributed computing, High-performance computing, Data science pipelines, Task-based workflows, Dataflows, Containers (Computer science), COMPSs, PyCOMPSs, AutoParallel, Docker, Àrees temàtiques de la UPC::Informàtica

Time:

004

Document Type:

Dissertation doctoral or postdoctoral thesis

File Description:

201 p.; application/pdf

Language:

English

Relation:

http://hdl.handle.net/10803/669728

DOI:

10.5821/dissertation-2117-330142

Availability:

http://hdl.handle.net/10803/669728
https://doi.org/10.5821/dissertation-2117-330142

Rights:

Access to the contents of this thesis is subject to acceptance of the terms of use established by the following Creative Commons licence: http://creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/openAccess

Accession Number:

edsbas.3C336297

Database:

BASE

Further Information

Data Science workflows have become essential to progress in many scientific areas, such as the life, health, and earth sciences. In contrast to traditional HPC workflows, they are more heterogeneous, combining binary executions, MPI simulations, multi-threaded applications, custom analysis (possibly written in Java, Python, C/C++, or R), and real-time processing. In the past, field experts were capable of programming and running small simulations themselves. Nowadays, however, simulations requiring hundreds or thousands of cores are widely used, and programming them efficiently has become a challenge even for computer scientists. Thus, programming languages and models make a considerable effort to ease programmability while maintaining acceptable performance.

This thesis contributes to the adaptation of High-Performance frameworks to support the needs and challenges of Data Science workflows by extending COMPSs, a mature, general-purpose, task-based, distributed programming model.

First, we enhance our prototype to orchestrate different frameworks inside a single programming model so that non-expert users can build complex workflows where some steps require highly optimised state-of-the-art frameworks. This extension includes the @binary, @OmpSs, @MPI, @COMPSs, and @MultiNode annotations for both Java and Python workflows.

Second, we integrate container technologies to enable developers to easily port, distribute, and scale their applications to distributed computing platforms. This combination provides a straightforward methodology to parallelise applications from sequential codes, along with efficient image management and application deployment that ease the packaging and distribution of applications. We distinguish between static, HPC, and dynamic container management and provide representative use cases for each scenario using Docker, Singularity, and Mesos.

Third, we design, implement, and integrate AutoParallel, a Python module to automatically find an appropriate task-based parallelisation of affine ...
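
The annotation-based, task-oriented style described in the abstract can be illustrated with a small sketch. It is not the COMPSs or PyCOMPSs API: the `task`, `binary`, and `wait_on` names below are hypothetical stand-ins built only on the Python standard library, assuming that a task call returns immediately with a future and that a binary annotation replaces the function body with an external command.

```python
import subprocess
import sys
from concurrent.futures import Future, ThreadPoolExecutor
from functools import wraps

# Local thread pool standing in for a distributed runtime (assumption).
_executor = ThreadPoolExecutor(max_workers=4)


def _resolve(value):
    """Turn a Future argument into concrete data; pass other values through."""
    return value.result() if isinstance(value, Future) else value


def task(func):
    """Hypothetical stand-in: each call is submitted asynchronously."""
    @wraps(func)
    def submit(*args):
        def run():
            # Waiting on Future arguments here expresses data dependencies.
            return func(*[_resolve(a) for a in args])
        return _executor.submit(run)
    return submit


def binary(command):
    """Hypothetical stand-in: ignore the function body, run `command` instead."""
    def decorator(func):
        @wraps(func)
        def call(*args):
            completed = subprocess.run(
                [*command, *map(str, args)],
                capture_output=True, text=True, check=True,
            )
            return completed.stdout.strip()
        return call
    return decorator


def wait_on(future):
    """Hypothetical stand-in: synchronise and fetch a task result."""
    return future.result()


# An "external binary" step; the Python interpreter is used so the sketch
# stays portable. The function body is never executed.
@task
@binary([sys.executable, "-c",
         "import sys; print(sum(map(int, sys.argv[1:])))"])
def external_sum(*values):
    pass


@task
def square(x):
    return x * x


@task
def add(a, b):
    # The external step returns text, so convert before adding.
    return int(a) + int(b)


def run_workflow():
    partial = external_sum(1, 2, 3)   # runs as an external process
    squared = square(4)               # runs as a plain Python task
    return wait_on(add(partial, squared))
```

Calling `run_workflow()` submits three tasks and blocks only at `wait_on`. In this sketch the dependency of `add` on the other two tasks is resolved inside a local thread pool; a distributed task-based runtime would typically track such dependencies and schedule the tasks across the available resources instead.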
