An oracle for guiding large-scale model/hybrid parallel training of convolutional neural networks

Title:
An oracle for guiding large-scale model/hybrid parallel training of convolutional neural networks
Contributors:
Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Publisher Information:
Association for Computing Machinery (ACM)
Publication Year:
2021
Collection:
Universitat Politècnica de Catalunya, BarcelonaTech: UPCommons - Global access to UPC knowledge
Document Type:
Conference object
File Description:
13 p.; application/pdf
Language:
English
Relation:
info:eu-repo/grantAgreement/EC/H2020/800962/EU/Consolidation of European Research Excellence in Exascale HPC Systems/EUROLAB4HPC2; info:eu-repo/grantAgreement/EC/H2020/713673/EU/Innovative doctoral programme for talented early-stage researchers in Spanish host organisations excellent in the areas of Science, Technology, Engineering and Mathematics (STEM)./INPhINIT; http://hdl.handle.net/2117/348972
DOI:
10.1145/3431379.3460644
Rights:
Open Access
Accession Number:
edsbas.4A7CB305
Database:
BASE

Further Information

Deep Neural Network (DNN) frameworks use distributed training to enable faster time to convergence and to alleviate memory capacity limitations when training large models and/or using high-dimensional inputs. With the steady increase in dataset and model sizes, model/hybrid parallelism is deemed to have an important role in the future of distributed training of DNNs. We analyze the compute, communication, and memory requirements of Convolutional Neural Networks (CNNs) to understand the trade-offs between different parallelism approaches on performance and scalability. This model-driven analysis forms the basis of an oracle utility that can help detect the limitations and bottlenecks of different parallelism approaches at scale. We evaluate the oracle on six parallelization strategies, with four CNN models and multiple datasets (2D and 3D), on up to 1024 GPUs. The results demonstrate that the oracle has an average accuracy of about 86.74% when compared to empirical results, and as high as 97.57% for data parallelism.

The project that gave rise to these results received the support of a fellowship from the "la Caixa" Foundation (ID 100010434). The fellowship code is LCF/BQ/DI17/11620059. This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 713673. The Eurolab4HPC project has received funding from the European Union Horizon 2020 Framework Programme (H2020-EU.1.2.2. - FET Proactive) under grant agreement number 800962. This work was supported by JST, ACT-X Grant Number JPMJAX190C, Japan; and by JST, PRESTO Grant Number JPMJPR20MA, Japan.

Peer Reviewed

Postprint (author's final draft)
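To make the abstract's idea of a model-driven oracle concrete, the sketch below estimates per-iteration time for plain data parallelism: compute scales down with the number of GPUs, while the gradient all-reduce does not. This is not the paper's actual model; all function names, the ring all-reduce cost formula choice, and the hardware numbers are illustrative assumptions.

```python
# Hypothetical sketch of a model-driven estimate in the spirit of the
# paper's oracle: per-iteration time under data parallelism, assuming
# gradients are synchronized with a ring all-reduce.
# All names and default numbers here are illustrative, not from the paper.

def ring_allreduce_time(num_params: float, num_gpus: int,
                        bytes_per_param: int = 4,
                        bandwidth_bytes_per_s: float = 100e9) -> float:
    """Standard ring all-reduce cost model: each GPU transfers
    2*(p-1)/p of the total gradient volume."""
    if num_gpus == 1:
        return 0.0  # no communication needed on a single GPU
    volume = num_params * bytes_per_param
    return 2 * (num_gpus - 1) / num_gpus * volume / bandwidth_bytes_per_s


def data_parallel_step_time(total_flops: float, num_gpus: int,
                            gpu_flops_per_s: float,
                            num_params: float) -> float:
    """Compute time shrinks as 1/p; communication time does not,
    which is the kind of scaling bottleneck an oracle can flag."""
    compute = total_flops / (num_gpus * gpu_flops_per_s)
    comm = ring_allreduce_time(num_params, num_gpus)
    return compute + comm


if __name__ == "__main__":
    # Illustrative workload: 1 TFLOP per step, 100M parameters,
    # 10 TFLOP/s per GPU.
    for p in (1, 2, 8, 64):
        t = data_parallel_step_time(1e12, p, 1e13, 1e8)
        print(f"{p:>3} GPUs: {t * 1e3:.2f} ms/step")
```

Running the loop shows step time falling at first and then flattening as the fixed all-reduce cost dominates, the qualitative trade-off the abstract describes between compute and communication requirements.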