Result: A control-theory approach for cluster autonomic management: maximizing usage while avoiding overload

Title:
A control-theory approach for cluster autonomic management: maximizing usage while avoiding overload
Contributors:
Control for Autonomic computing systems (CTRL-A), Centre Inria de l'Université Grenoble Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire d'Informatique de Grenoble (LIG), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes 2016-2019 (UGA 2016-2019)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes 2016-2019 (UGA 2016-2019), GIPSA - Systèmes non linéaires et complexité (GIPSA-SYSCO), Département Automatique (GIPSA-DA), Grenoble Images Parole Signal Automatique (GIPSA-lab), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP)-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes 2016-2019 (UGA 2016-2019)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP)-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes 2016-2019 (UGA 2016-2019)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP)-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes 2016-2019 (UGA 2016-2019)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP)-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes 2016-2019 (UGA 2016-2019), Data Aware Large Scale Computing (DATAMOVE), Laboratoire d'Informatique de Grenoble (LIG), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes 2016-2019 (UGA 2016-2019), Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP), Université Grenoble Alpes (UGA)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP), Université Grenoble Alpes (UGA)
Source:
CCTA 2019 - 3rd IEEE Conference on Control Technology and Applications, Aug 2019, Hong Kong, China. pp. 189-195, ⟨10.1109/CCTA.2019.8920473⟩ ; https://hal.science/hal-02294272 ; https://ccta2019.ieeecss.org/
Publisher Information:
CCSD
IEEE
Publication Year:
2019
Collection:
Université Grenoble Alpes: HAL
Document Type:
Conference object
Language:
English
DOI:
10.1109/CCTA.2019.8920473
Rights:
info:eu-repo/semantics/OpenAccess
Accession Number:
edsbas.11B28313
Database:
BASE

Further Information

Cloud and HPC (High-Performance Computing) systems have become increasingly variable in their behavior, particularly in performance and power consumption, and this loss of predictability demands more runtime management. In this work, we describe results on autonomic administration of HPC systems for scientific workflow management through a control-theoretic approach. We propose a model described by parameters related to the key aspects of the infrastructure, thus achieving a deterministic dynamical representation that covers the diverse and time-varying behaviors of the real computing system. We then propose a model-predictive control loop to achieve two objectives: maximize cluster utilization by best-effort jobs and control the file server's load in the presence of external disturbances. The accuracy of the prediction relies on a parameter estimation scheme based on the EKF (Extended Kalman Filter) that adjusts the predictive model to the real system, making the approach adaptive to parametric variations in the infrastructure. The closed-loop strategy shows a performance improvement and consequently a reduction in total computation time. The problem is addressed in a general way, to allow implementation on similar HPC platforms as well as scalability to different infrastructures.
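
The combination of EKF-based parameter estimation and predictive control described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the scalar load model `load' = a*load + b*u`, the numeric values, and the function names `ekf_step`, `choose_jobs`, and `simulate` are all hypothetical. The sketch estimates the unknown decay parameter `a` jointly with the load, and uses a horizon-1 predictive rule with input bounds in place of a full model-predictive controller.

```python
import random


def ekf_step(x, P, u, z, b, q, r):
    """One EKF predict/update for the augmented state x = [load, a].

    Assumed toy model (not the paper's): load' = a*load + b*u, a' = a.
    The product a*load makes the augmented model nonlinear, hence the EKF.
    """
    load, a = x
    # Predict the next state with the current parameter estimate.
    x_pred = [a * load + b * u, a]
    # Jacobian of the transition with respect to [load, a].
    F = [[a, load], [0.0, 1.0]]
    # P_pred = F P F^T + Q, written out for the 2x2 case (Q is diagonal).
    FP = [[F[i][0] * P[0][j] + F[i][1] * P[1][j] for j in range(2)]
          for i in range(2)]
    Pp = [[FP[i][0] * F[j][0] + FP[i][1] * F[j][1] + (q[i] if i == j else 0.0)
           for j in range(2)] for i in range(2)]
    # Update with the measurement z = load + noise, i.e. H = [1, 0].
    s = Pp[0][0] + r
    K = [Pp[0][0] / s, Pp[1][0] / s]
    innov = z - x_pred[0]
    x_new = [x_pred[0] + K[0] * innov, x_pred[1] + K[1] * innov]
    P_new = [[Pp[i][j] - K[i] * Pp[0][j] for j in range(2)] for i in range(2)]
    return x_new, P_new


def choose_jobs(load_hat, a_hat, b, ref, u_max):
    """Horizon-1 predictive rule: pick the number of best-effort jobs u that
    brings the predicted load to ref, clipped to the bounds [0, u_max]."""
    u = (ref - a_hat * load_hat) / b
    return max(0.0, min(u_max, u))


def simulate(steps=200, seed=0):
    """Closed loop on the toy plant; returns the final load and estimate."""
    rng = random.Random(seed)
    # Hypothetical values: true decay, load per job, load reference, job cap.
    a_true, b, ref, u_max = 0.8, 0.05, 4.0, 60.0
    load = 0.0
    x = [0.0, 0.5]                   # initial guess for [load, a]
    P = [[1.0, 0.0], [0.0, 1.0]]     # initial estimate covariance
    for _ in range(steps):
        u = choose_jobs(x[0], x[1], b, ref, u_max)
        # Plant step with an external disturbance on the file-server load.
        load = a_true * load + b * u + rng.gauss(0.0, 0.05)
        z = load + rng.gauss(0.0, 0.05)  # noisy load measurement
        x, P = ekf_step(x, P, u, z, b, q=(1e-2, 1e-5), r=0.05 ** 2)
    return load, x


if __name__ == "__main__":
    final_load, estimate = simulate()
    print("final load:", final_load, "estimated a:", estimate[1])
```

In this sketch the controller only ever sees the filter's estimates, so a wrong initial guess for `a` is corrected online as measurements arrive. A controller closer to the one the abstract describes would optimize over a longer horizon and handle both objectives explicitly.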