MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

Type: Preprint

Publication Date: 2024-02-23

Citations: 7

DOI: https://doi.org/10.48550/arxiv.2402.15627

Abstract

We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research.
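
The headline efficiency metric in the abstract is Model FLOPs Utilization (MFU): the fraction of the cluster's aggregate peak FLOPs that the training job actually spends on useful model FLOPs. The Python sketch below illustrates that arithmetic for a dense decoder-only model using the common ~6N FLOPs-per-token approximation; the throughput and per-GPU peak values are assumptions chosen only to make the calculation concrete, not figures taken from the paper (only the 175B model size and 12,288 GPU count come from the abstract).

    # Minimal sketch of a Model FLOPs Utilization (MFU) estimate for a dense
    # decoder-only LLM. All inputs below are illustrative assumptions; only the
    # model size (175B) and GPU count (12,288) appear in the abstract.

    def model_flops_per_token(n_params: float) -> float:
        """Approximate training FLOPs per token for a dense transformer:
        ~6 * N (forward + backward), ignoring the attention-length term."""
        return 6.0 * n_params

    def mfu(tokens_per_second: float, n_params: float,
            num_gpus: int, peak_flops_per_gpu: float) -> float:
        """MFU = achieved model-FLOPs throughput / aggregate peak hardware FLOPs."""
        achieved = tokens_per_second * model_flops_per_token(n_params)
        peak = num_gpus * peak_flops_per_gpu
        return achieved / peak

    if __name__ == "__main__":
        # Hypothetical cluster throughput (tokens/s) and an assumed per-GPU peak
        # of 312 TFLOP/s (A100 BF16), chosen so the result lands near ~55%.
        ratio = mfu(tokens_per_second=2.0e6, n_params=175e9,
                    num_gpus=12288, peak_flops_per_gpu=312e12)
        print(f"Estimated MFU: {ratio:.1%}")  # ~54.8% with these assumptions

With these assumed inputs the script prints roughly 54.8%, close to the 55.2% the abstract reports; the 1.34x claim is then simply the ratio of MegaScale's MFU to the Megatron-LM baseline's MFU on the same workload and hardware.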

Locations

  • arXiv (Cornell University): https://arxiv.org/abs/2402.15627

Similar Works

  • AMSP: Super-Scaling LLM Training via Advanced Model States Partitioning (2023). Qiaoling Chen, Qinghao Hu, Zhisheng Ye, Guoteng Wang, Peng Sun, Yonggang Wen, Tianwei Zhang
  • Characterization of Large Language Model Development in the Datacenter (2024). Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo
  • Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning (2025). Lang Xu, Quentin Anthony, Jacob Hatef, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda
  • Unicron: Economizing Self-Healing LLM Training at Scale (2024). Tao He, Xue Li, Zhibin Wang, Kun Qian, Jing-Bo Xu, Wenyuan Yu, Jingren Zhou
  • Emerging Platforms Meet Emerging LLMs: A Year-Long Journey of Top-Down Development (2024). Siyuan Feng, Jiawei Liu, Ruihang Lai, Charlie F. Ruan, Yu Yong, Lingming Zhang, Tianqi Chen
  • INTELLECT-1 Technical Report (2024). Sami Jaghouar, Jack Min Ong, Manveer Basra, Fares Obeid, Jannik Straube, Michael Keiblinger, Elie Bakouch, Lucas Atkins, Maziyar Panahi, Charles Goddard
  • ProTrain: Efficient LLM Training via Memory-Aware Techniques (2024). Hanmei Yang, Jin Zhou, Yao Fu, X. Wang, Ramine Roane, Hui Guan, Tongping Liu
  • Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment (2023). Fei Yang, Shuang Peng, Ning Sun, Fangyu Wang, Ke Tan, Fu Wu, Jiezhong Qiu, Aimin Pan
  • Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment (2024). Fei Yang, Shuang Peng, Ning Sun, Fangyu Wang, Yuanyuan Wang, Fu Zhong Wu, Jiezhong Qiu, Aimin Pan
  • FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment (2024). Ran Yan, Youhe Jiang, Wangcheng Tao, Xiaonan Nie, Bin Cui, Binhang Yuan
  • Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization (2024). Haoyang Li, Fangcheng Fu, Hao Ge, Sheng Lin, Xuanyu Wang, Jianxing Niu, Yujie Wang, Hailin Zhang, Xiaonan Nie, Bin Cui
  • Improving training time and GPU utilization in geo-distributed language model training (2024). Palak, Rohan Gandhi, Karan Tandon, Debopam Bhattacherjee, Venkata N. Padmanabhan
  • GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments (2024). Yanyu Chen, Guangzheng Huang
  • Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference (2024). Joyjit Kundu, Weijie Guo, Ali BanaGozar, Udari De Alwis, Sourav Sengupta, Puneet Gupta, Arindam Mallik
  • A Hardware Evaluation Framework for Large Language Model Inference (2023). Hengrui Zhang, August Ning, R.V.S.N. Prabhakar, David Wentzlaff
  • Cramming: Training a Language Model on a Single GPU in One Day (2022). Jonas Geiping, Tom Goldstein
  • Training Overhead Ratio: A Practical Reliability Metric for Large Language Model Training Systems (2024). Ning Lu, Qian Xie, Hao Zhang, Wenyi Fang, Yang Zheng, Zheng Hu, Jiantao Ma
  • Efficient Parallelization Layouts for Large-Scale Distributed Model Training (2023). Johannes Hagemann, Samuel Weinbach, Konstantin Dobler, Maximilian Schall, Gerard de Melo
  • Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator (2024). Kazuki Fujii, Kohei Watanabe, Rio Yokota
  • Optimizing Distributed Training on Frontier for Large Language Models (2024). Sajal Dash, Isaac Lyngaas, Junqi Yin, Xiao Wang, Romain Égelé, J. Austin Ellis, Matthias Maiterth, Guojing Cong, Feiyi Wang, Prasanna Balaprakash
