Reinforcement Learning for Datacenter Congestion Control

Type: Article

Publication Date: 2022-01-17

Citations: 14

DOI: https://doi.org/10.1145/3512798.3512815

Abstract

We approach the task of network congestion control in datacenters using Reinforcement Learning (RL). Successful congestion control algorithms can dramatically improve latency and overall network throughput. To date, no such learning-based algorithms have shown practical potential in this domain. Indeed, the most popular recent deployments rely on rule-based heuristics that are tested on a predetermined set of benchmarks. Consequently, these heuristics do not generalize well to newly seen scenarios. In contrast, we devise an RL-based algorithm with the aim of generalizing to different configurations of real-world datacenter networks. We overcome challenges such as partial observability, nonstationarity, and multi-objectiveness. We further propose a policy gradient algorithm that leverages the analytical structure of the reward function to approximate its derivative and improve stability. We show that this scheme outperforms alternative popular RL approaches, and generalizes to scenarios that were not seen during training. Our experiments, conducted on a realistic simulator that emulates communication networks' behavior, exhibit improved performance simultaneously across the multiple metrics considered, compared to the popular algorithms deployed today in real datacenters. Our algorithm is being productized to replace heuristics in some of the largest datacenters in the world.

Locations

  • ACM SIGMETRICS Performance Evaluation Review
  • arXiv (Cornell University)

Similar Works

  • Reinforcement Learning for Datacenter Congestion Control (2021). Chen Tessler, Yuval Shpigelman, Gal Dalal, Amit Mandelbaum, Doron Haritan Kazakov, Benjamin Fuhrer, Gal Chechik, Shie Mannor
  • Reinforcement Learning for Datacenter Congestion Control (2022). Chen Tessler, Yuval Shpigelman, Gal Dalal, Amit Mandelbaum, Doron Haritan Kazakov, Benjamin Fuhrer, Gal Chechik, Shie Mannor
  • A Deep Reinforcement Learning Framework for Optimizing Congestion Control in Data Centers (2023). Shiva Ketabi, Hongkai Chen, Haiwei Dong, Yashar Ganjali
  • Iroko: A Framework to Prototype Reinforcement Learning for Data Center Traffic Control (2018). Fabian Ruffy, Michael Przystupa, Ivan Beschastnikh
  • GraphCC: A Practical Graph Learning-based Approach to Congestion Control in Datacenters (2023). Guillermo Bernárdez, José Suárez‐Varela, Xiang Shi, Shihan Xiao, Xiangle Cheng, Pere Barlet‐Ros, Albert Cabellos‐Aparicio
  • Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs (2023). Benjamin Fuhrer, Yuval Shpigelman, Chen Tessler, Shie Mannor, Gal Chechik, Eitan Zahavi, Gal Dalal
  • Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs (2022). Benjamin Fuhrer, Yuval Shpigelman, Chen Tessler, Shie Mannor, Gal Chechik, Eitan Zahavi, Gal Dalal
  • Multi-Agent Reinforcement Learning for Network Load Balancing in Data Center (2022). Zhiyuan Yao, Zihan Ding, Thomas Clausen
  • Safe Load Balancing in Software-Defined-Networking (2024). Lam Ngoc Dinh, Pham Tran Anh Quang, Jérémie Leguay
  • Towards Safe Load Balancing based on Control Barrier Functions and Deep Reinforcement Learning (2024). Lam Ngoc Dinh, Pham Tran Anh Quang, Jérémie Leguay
  • Reinforced Workload Distribution Fairness (2021). Zhiyuan Yao, Zihan Ding, Thomas Clausen
  • When Machine Learning Meets Congestion Control: A Survey and Comparison (2020). Huiling Jiang, Qing Li, Yong Jiang, Gengbiao Shen, Richard Sinnott, Chen Tian, Mingwei Xu
  • On the Robustness of Controlled Deep Reinforcement Learning for Slice Placement (2021). José Jurandir Alves Esteves, Amina Boubendir, Fabrice Guillemin, Pierre Sens
  • Learning to Harness Bandwidth With Multipath Congestion Control and Scheduling (2021). Shiva Raj Pokhrel, Anwar Walid
  • A Reinforcement Learning Approach to Optimize Available Network Bandwidth Utilization (2022). Hasibul Jamil, Elvis Rodrigues, Jacob Goldverg, Tevfik Kosar
  • Constrained Reinforcement Learning for Adaptive Controller Synchronization in Distributed SDN (2024). Ioannis Panitsas, Akrit Mudvari, Leandros Tassiulas
  • Learning Distributed and Fair Policies for Network Load Balancing as Markov Potential Game (2022). Zhiyuan Yao, Zihan Ding