Result: GradeML: Towards holistic performance analysis for machine learning workflows

Title:
GradeML: Towards holistic performance analysis for machine learning workflows
Source:
Hegeman, T, Jansen, M, Iosup, A & Trivedi, A 2021, GradeML : Towards holistic performance analysis for machine learning workflows. in ICPE 2021 : Companion of the ACM/SPEC International Conference on Performance Engineering. Association for Computing Machinery, Inc, pp. 57-63, 2021 ACM/SPEC International Conference on Performance Engineering, ICPE 2021, Virtual, Online, France, 19/04/21. https://doi.org/10.1145/3447545.3451185
Publisher Information:
Association for Computing Machinery, Inc
Publication Year:
2021
Document Type:
Academic journal; article in journal/newspaper
Language:
English
ISBN:
978-1-4503-8331-8
1-4503-8331-9
Relation:
info:eu-repo/semantics/altIdentifier/hdl/https://hdl.handle.net/1871.1/e2a2a21e-f458-4f86-b6bd-2ca530784ac5; info:eu-repo/semantics/altIdentifier/isbn/9781450383318; urn:ISBN:9781450383318
DOI:
10.1145/3447545.3451185
Rights:
info:eu-repo/semantics/openAccess
Accession Number:
edsbas.321A4F59
Database:
BASE

Further Information

Today, machine learning (ML) workloads are nearly ubiquitous. Over the past decade, much effort has been put into making ML model training fast and efficient, e.g., by proposing new ML frameworks (such as TensorFlow, PyTorch), leveraging hardware support (TPUs, GPUs, FPGAs), and implementing new execution models (pipelines, distributed training). Matching this trend, considerable effort has also been put into performance analysis tools focusing on ML model training. However, as we identify in this work, ML model training rarely happens in isolation and is instead one step in a larger ML workflow. Therefore, it is surprising that there exists no performance analysis tool that covers the entire life-cycle of ML workflows. Addressing this large conceptual gap, we envision in this work a holistic performance analysis tool for ML workflows. We analyze the state of practice and the state of the art, presenting quantitative evidence about the performance of existing performance tools. We formulate our vision for holistic performance analysis of ML workflows along four design pillars: a unified execution model, lightweight collection of performance data, efficient data aggregation and presentation, and close integration in ML systems. Finally, we propose first steps towards implementing our vision as GradeML, a holistic performance analysis tool for ML workflows. Our preliminary work and experiments are open source at https://github.com/atlarge-research/grademl.