A Dynamic Weighting Strategy to Mitigate Worker Node Failure in Distributed Deep Learning

Type: Preprint

Publication Date: 2024-09-13

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2409.09242

Abstract

The increasing complexity of deep learning models and the demand for processing vast amounts of data make large-scale distributed systems essential for efficient training. These systems, however, face significant challenges such as communication overhead, hardware limitations, and node failure. This paper investigates optimization techniques for distributed deep learning, including Elastic Averaging SGD (EASGD) and the second-order method AdaHessian. We propose a dynamic weighting strategy that mitigates the problem of straggler nodes caused by failures, enhancing the performance and efficiency of the overall training process. Experiments with varying numbers of workers and communication periods demonstrate that our strategy improves convergence rates and test performance.
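
The abstract does not spell out the update rules, so the sketch below is only a hypothetical illustration of how an EASGD-style round with a per-worker dynamic weight might be structured; the function name, the weights argument, and the weighted-average normalization are assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def easgd_round(workers, center, grads, lr=0.01, rho=0.9, weights=None):
    """One synchronous EASGD-style communication round (illustrative sketch).

    workers : list of per-worker parameter vectors x_i
    center  : shared center variable x_tilde
    grads   : per-worker stochastic gradients g_i
    weights : hypothetical dynamic weights w_i; a responsive worker keeps
              w_i near 1, while a straggling or failed worker is
              down-weighted toward 0 so it cannot stall the update.
    """
    n = len(workers)
    weights = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    alpha = lr * rho  # elastic coupling step size

    # Local step: plain SGD update plus an elastic pull toward the center.
    new_workers = [x - lr * g - alpha * (x - center)
                   for x, g in zip(workers, grads)]

    # Center step: move toward a weighted average of the workers, so the
    # contribution of unresponsive workers is discounted rather than waited on.
    pull = sum(w * (x - center) for w, x in zip(weights, workers))
    new_center = center + alpha * pull / max(weights.sum(), 1e-12)
    return new_workers, new_center
```

A driver loop would call such a round every tau local steps (the communication period varied in the experiments) and recompute the weights from, e.g., each worker's recent response time or heartbeat, so that stragglers contribute less to the center variable instead of blocking the synchronization.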

Locations

  • arXiv (Cornell University)

Similar Works

  • DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging (2020) - Qinggang Zhou, Yawen Zhang, Pengcheng Li, Xiaoyong Liu, Jun Yang, Runsheng Wang, Ru Huang
  • A Hitchhiker's Guide On Distributed Training of Deep Neural Networks (2018) - Karanbir Chahal, Manraj Singh Grover, Kuntal Dey
  • Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-Update SGD (2018) - Jianyu Wang, Gauri Joshi
  • Taming unbalanced training workloads in deep learning with partial collective operations (2020) - Shigang Li, Tal Ben‐Nun, Salvatore Di Girolamo, Dan Alistarh, Torsten Hoefler
  • Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees (2019) - Shaohuai Shi, Zhenheng Tang, Qiang Wang, Kaiyong Zhao, Xiaowen Chu
  • Revisiting Distributed Synchronous SGD (2017) - Jianmin Chen, Rajat Monga, Samy Bengio, Rafał Józefowicz
  • A Mechanism for Distributed Deep Learning Communication Optimization (2020) - Yemao Xu
  • A Quadratic Synchronization Rule for Distributed Deep Learning (2023) - Xinran Gu, Kaifeng Lyu, Sanjeev Arora, Jingzhao Zhang, Longbo Huang
  • Fast and Straggler-Tolerant Distributed SGD with Reduced Computation Load (2023) - Maximilian Egger, Serge Kas Hanna, Rawad Bitar
  • Gap Aware Mitigation of Gradient Staleness (2019) - Saar Barkai, Ido Hakimi, Assaf Schuster
  • MiCRO: Near-Zero Cost Gradient Sparsification for Scaling and Accelerating Distributed DNN Training (2023) - Daegun Yoon, Sangyoon Oh
  • Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study (2017) - Suyog Gupta, Wei Zhang, Fei Wang
  • Gradient Energy Matching for Distributed Asynchronous Gradient Descent (2018) - Joeri Hermans, Gilles Louppe
  • Elastic Bulk Synchronous Parallel Model for Distributed Deep Learning (2019) - Xing Zhao, Manos Papagelis, Aijun An, Bao Xin Chen, Junfeng Liu, Yonggang Hu

Works That Cite This (0)

Works Cited by This (0)