Reinforcement Learning for Datacenter Congestion Control

Chen Tessler, Yuval Shpigelman, Gal Dalal, Amit Mandelbaum, Doron Haritan Kazakov, Benjamin Fuhrer, Gal Chechik, Shie Mannor

Type: Article

Publication Date: 2022-01-17

Citations: 14

DOI: https://doi.org/10.1145/3512798.3512815

Abstract

We approach the task of network congestion control in datacenters using Reinforcement Learning (RL). Successful congestion control algorithms can dramatically improve latency and overall network throughput. To date, no learning-based algorithm has shown practical potential in this domain. Instead, the most popular recent deployments rely on rule-based heuristics that are tested on a predetermined set of benchmarks. Consequently, these heuristics do not generalize well to newly seen scenarios. In contrast, we devise an RL-based algorithm with the aim of generalizing to different configurations of real-world datacenter networks. We overcome challenges such as partial observability, nonstationarity, and multiple objectives. We further propose a policy gradient algorithm that leverages the analytical structure of the reward function to approximate its derivative and improve stability. We show that this scheme outperforms alternative popular RL approaches and generalizes to scenarios that were not seen during training. Our experiments, conducted on a realistic simulator that emulates the behavior of communication networks, show improved performance simultaneously across the multiple metrics considered, compared to the popular algorithms deployed today in real datacenters. Our algorithm is being productized to replace heuristics in some of the largest datacenters in the world.

Locations

ACM SIGMETRICS Performance Evaluation Review
arXiv (Cornell University)

Similar Works

Title	Year	Authors
Reinforcement Learning for Datacenter Congestion Control	2021	Chen Tessler Yuval Shpigelman Gal Dalal Amit Mandelbaum Doron Haritan Kazakov Benjamin Fuhrer Gal Chechik Shie Mannor
Reinforcement Learning for Datacenter Congestion Control	2022	Chen Tessler Yuval Shpigelman Gal Dalal Amit Mandelbaum Doron Haritan Kazakov Benjamin Fuhrer Gal Chechik Shie Mannor
A Deep Reinforcement Learning Framework for Optimizing Congestion Control in Data Centers	2023	Shiva Ketabi Hongkai Chen Haiwei Dong Yashar Ganjali
Iroko: A Framework to Prototype Reinforcement Learning for Data Center Traffic Control	2018	Fabian Ruffy Michael Przystupa Ivan Beschastnikh
GraphCC: A Practical Graph Learning-based Approach to Congestion Control in Datacenters	2023	Guillermo Bernárdez José Suárez‐Varela Xiang Shi Shihan Xiao Xiangle Cheng Pere Barlet‐Ros Albert Cabellos‐Aparicio
Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs	2023	Benjamin Fuhrer Yuval Shpigelman Chen Tessler Shie Mannor Gal Chechik Eitan Zahavi Gal Dalal
Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs	2022	Benjamin Fuhrer Yuval Shpigelman Chen Tessler Shie Mannor Gal Chechik Eitan Zahavi Gal Dalal
Multi-Agent Reinforcement Learning for Network Load Balancing in Data Center	2022	Zhiyuan Yao Zihan Ding Thomas Clausen
Safe Load Balancing in Software-Defined-Networking	2024	Lam Ngoc Dinh Pham Tran Anh Quang Jérémie Leguay
Towards Safe Load Balancing based on Control Barrier Functions and Deep Reinforcement Learning	2024	Lam Ngoc Dinh Pham Tran Anh Quang Jérémie Leguay
Reinforced Workload Distribution Fairness	2021	Zhiyuan Yao Zihan Ding Thomas Clausen
When Machine Learning Meets Congestion Control: A Survey and Comparison	2020	Huiling Jiang Qing Li Yong Jiang Gengbiao Shen Richard Sinnott Chen Tian Mingwei Xu
On the Robustness of Controlled Deep Reinforcement Learning for Slice Placement	2021	José Jurandir Alves Esteves Amina Boubendir Fabrice Guillemin Pierre Sens
Learning to Harness Bandwidth With Multipath Congestion Control and Scheduling	2021	Shiva Raj Pokhrel Anwar Walid
A Reinforcement Learning Approach to Optimize Available Network Bandwidth Utilization	2022	Hasibul Jamil Elvis Rodrigues Jacob Goldverg Tevfik Kosar
Learning to Harness Bandwidth with Multipath Congestion Control and Scheduling	2021	Shiva Raj Pokhrel Anwar Walid
Constrained Reinforcement Learning for Adaptive Controller Synchronization in Distributed SDN	2024	Ioannis Panitsas Akrit Mudvari Leandros Tassiulas
Learning Distributed and Fair Policies for Network Load Balancing as Markov Potential Game	2022	Zhiyuan Yao Zihan Ding

Works That Cite This (3)

Title	Year	Authors
Learning and Information in Stochastic Networks and Queues	2021	Neil Walton Kuang Xu
Adaptive Discretization in Online Reinforcement Learning	2022	Sean R. Sinclair Siddhartha Banerjee Christina Lee Yu
Adaptive Discretization in Online Reinforcement Learning	2021	Sean R. Sinclair Siddhartha Banerjee Christina Lee Yu
