Result: Scanflow-K8s: agent-based framework for autonomic management and supervision of ML workflows in Kubernetes clusters

Title:

Scanflow-K8s: agent-based framework for autonomic management and supervision of ML workflows in Kubernetes clusters

Authors:

Liu, Peini; Bravo Rocca, Gusseppe; Guitart Fernández, Jordi; Dholakia, Ajay; Ellison, David; Hodak, Miroslav

Contributors:

Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors; Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors; Barcelona Supercomputing Center; Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions

Publisher Information:

Institute of Electrical and Electronics Engineers (IEEE)

Publication Year:

2022

Collection:

Universitat Politècnica de Catalunya, BarcelonaTech: UPCommons - Global access to UPC knowledge

Subject Terms:

Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors::Arquitectures distribuïdes, Àrees temàtiques de la UPC::Informàtica::Intel·ligència artificial::Aprenentatge automàtic, Machine learning, Intelligent agents (Computer software), Workflow, Scanflow, Machine learning workflow, Autonomic, Self-Management, Agent, Kubernetes, MLOps, Aprenentatge automàtic, Agents intel·ligents (Programari), Cicle de treball

Document Type:

Conference object

File Description:

10 p.; application/pdf

Language:

English

Relation:

https://ieeexplore.ieee.org/abstract/document/9826110; info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2019-107255GB-C22/ES/UPC-COMPUTACION DE ALTAS PRESTACIONES VIII/; info:eu-repo/grantAgreement/AGAUR/2017 SGR 1414; http://hdl.handle.net/2117/371292

DOI:

10.1109/CCGrid54584.2022.00047

Availability:

http://hdl.handle.net/2117/371292
https://doi.org/10.1109/CCGrid54584.2022.00047

Rights:

Open Access

Accession Number:

edsbas.E2024430

Database:

BASE

Further information

Machine Learning (ML) projects are currently heavily based on workflows composed of reproducible steps and executed as containerized pipelines to build or deploy ML models efficiently, because of the flexibility, portability, and fast delivery they provide to the ML life-cycle. However, deployed models need to be watched and constantly managed, supervised, and debugged to guarantee their availability, validity, and robustness in unexpected situations. Therefore, containerized ML workflows would benefit from leveraging flexible and diverse autonomic capabilities. This work presents an architecture for autonomic ML workflows with abilities for multi-layered control, based on an agent-based approach that enables autonomic management and supervision of ML workflows at the application layer and the infrastructure layer (by collaborating with the orchestrator). We redesign the Scanflow ML framework to support such a multi-agent approach by using triggers, primitives, and strategies. We also implement a practical platform, called Scanflow-K8s, that enables autonomic ML workflows on Kubernetes clusters based on the Scanflow agents. MNIST image classification and MLPerf ImageNet classification benchmarks are used as case studies to show the capabilities of Scanflow-K8s under different scenarios. The experimental results demonstrate the feasibility and effectiveness of our proposed agent approach and the Scanflow-K8s platform for the autonomic management of ML workflows in Kubernetes clusters at multiple layers.

This work was supported by Lenovo as part of Lenovo-BSC 2020 collaboration agreement, by the Spanish Government under contract PID2019-107255GB-C22, and by the Generalitat de Catalunya under contract 2017-SGR-1414 and under grant 2020 FI-B 00257.

Peer Reviewed

Postprint (author's final draft)
