A Dynamic Weighting Strategy to Mitigate Worker Node Failure in Distributed Deep Learning

Type: Preprint

Publication Date: 2024-09-13

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2409.09242

Abstract

The increasing complexity of deep learning models and the demand for processing vast amounts of data make large-scale distributed systems essential for efficient training. These systems, however, face significant challenges such as communication overhead, hardware limitations, and node failure. This paper investigates optimization techniques for distributed deep learning, including Elastic Averaging SGD (EASGD) and the second-order method AdaHessian. We propose a dynamic weighting strategy that mitigates the problem of straggler nodes caused by failures, enhancing the performance and efficiency of the overall training process. Experiments with varying numbers of workers and communication periods demonstrate that our strategy improves convergence rates and test performance.
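
The abstract does not spell out the update rules, so the sketch below is only a hypothetical illustration of how an EASGD-style round with a per-worker dynamic weight might be structured; the function name, the weights argument, and the weighted-average normalization are assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def easgd_round(workers, center, grads, lr=0.01, rho=0.9, weights=None):
    """One synchronous EASGD-style communication round (illustrative sketch).

    workers : list of per-worker parameter vectors x_i
    center  : shared center variable x_tilde
    grads   : per-worker stochastic gradients g_i
    weights : hypothetical dynamic weights w_i; a responsive worker keeps
              w_i near 1, while a straggling or failed worker is
              down-weighted toward 0 so it cannot stall the update.
    """
    n = len(workers)
    weights = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    alpha = lr * rho  # elastic coupling step size

    # Local step: plain SGD update plus an elastic pull toward the center.
    new_workers = [x - lr * g - alpha * (x - center)
                   for x, g in zip(workers, grads)]

    # Center step: move toward a weighted average of the workers, so the
    # contribution of unresponsive workers is discounted rather than waited on.
    pull = sum(w * (x - center) for w, x in zip(weights, workers))
    new_center = center + alpha * pull / max(weights.sum(), 1e-12)
    return new_workers, new_center
```

A driver loop would call such a round every tau local steps (the communication period varied in the experiments) and recompute the weights from, e.g., each worker's recent response time or heartbeat, so that stragglers contribute less to the center variable instead of blocking the synchronization.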

Locations

  • arXiv (Cornell University)

Similar Works

  • DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging (2020) - Qinggang Zhou, Yawen Zhang, Pengcheng Li, Xiaoyong Liu, Jun Yang, Runsheng Wang, Ru Huang
  • A Hitchhiker's Guide On Distributed Training of Deep Neural Networks (2018) - Karanbir Chahal, Manraj Singh Grover, Kuntal Dey
  • Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-Update SGD (2018) - Jianyu Wang, Gauri Joshi
  • Taming unbalanced training workloads in deep learning with partial collective operations (2020) - Shigang Li, Tal Ben‐Nun, Salvatore Di Girolamo, Dan Alistarh, Torsten Hoefler
  • Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees (2019) - Shaohuai Shi, Zhenheng Tang, Qiang Wang, Kaiyong Zhao, Xiaowen Chu
  • Revisiting Distributed Synchronous SGD (2017) - Jianmin Chen, Rajat Monga, Samy Bengio, Rafał Józefowicz
  • A Mechanism for Distributed Deep Learning Communication Optimization (2020) - Yemao Xu
  • A Quadratic Synchronization Rule for Distributed Deep Learning (2023) - Xinran Gu, Kaifeng Lyu, Sanjeev Arora, Jingzhao Zhang, Longbo Huang
  • Fast and Straggler-Tolerant Distributed SGD with Reduced Computation Load (2023) - Maximilian Egger, Serge Kas Hanna, Rawad Bitar
  • Gap Aware Mitigation of Gradient Staleness (2019) - Saar Barkai, Ido Hakimi, Assaf Schuster
  • MiCRO: Near-Zero Cost Gradient Sparsification for Scaling and Accelerating Distributed DNN Training (2023) - Daegun Yoon, Sangyoon Oh
  • Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study (2017) - Suyog Gupta, Wei Zhang, Fei Wang
  • Gradient Energy Matching for Distributed Asynchronous Gradient Descent (2018) - Joeri Hermans, Gilles Louppe
  • Elastic Bulk Synchronous Parallel Model for Distributed Deep Learning (2019) - Xing Zhao, Manos Papagelis, Aijun An, Bao Xin Chen, Junfeng Liu, Yonggang Hu

Works That Cite This (0)

Works Cited by This (0)