MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

Type: Preprint

Publication Date: 2024-02-23

Citations: 7

DOI: https://doi.org/10.48550/arxiv.2402.15627

Abstract

We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research.
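
The headline efficiency metric in the abstract is Model FLOPs Utilization (MFU): the fraction of the cluster's aggregate peak FLOPs that the training job actually spends on useful model FLOPs. The Python sketch below illustrates that arithmetic for a dense decoder-only model using the common ~6N FLOPs-per-token approximation; the throughput and per-GPU peak values are assumptions chosen only to make the calculation concrete, not figures taken from the paper (only the 175B model size and 12,288 GPU count come from the abstract).

    # Minimal sketch of a Model FLOPs Utilization (MFU) estimate for a dense
    # decoder-only LLM. All inputs below are illustrative assumptions; only the
    # model size (175B) and GPU count (12,288) appear in the abstract.

    def model_flops_per_token(n_params: float) -> float:
        """Approximate training FLOPs per token for a dense transformer:
        ~6 * N (forward + backward), ignoring the attention-length term."""
        return 6.0 * n_params

    def mfu(tokens_per_second: float, n_params: float,
            num_gpus: int, peak_flops_per_gpu: float) -> float:
        """MFU = achieved model-FLOPs throughput / aggregate peak hardware FLOPs."""
        achieved = tokens_per_second * model_flops_per_token(n_params)
        peak = num_gpus * peak_flops_per_gpu
        return achieved / peak

    if __name__ == "__main__":
        # Hypothetical cluster throughput (tokens/s) and an assumed per-GPU peak
        # of 312 TFLOP/s (A100 BF16), chosen so the result lands near ~55%.
        ratio = mfu(tokens_per_second=2.0e6, n_params=175e9,
                    num_gpus=12288, peak_flops_per_gpu=312e12)
        print(f"Estimated MFU: {ratio:.1%}")  # ~54.8% with these assumptions

With these assumed inputs the script prints roughly 54.8%, close to the 55.2% the abstract reports; the 1.34x claim is then simply the ratio of MegaScale's MFU to the Megatron-LM baseline's MFU on the same workload and hardware.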

Locations

  • arXiv (Cornell University): https://arxiv.org/abs/2402.15627

Similar Works

  • AMSP: Super-Scaling LLM Training via Advanced Model States Partitioning (2023). Qiaoling Chen, Qinghao Hu, Zhisheng Ye, Guoteng Wang, Peng Sun, Yonggang Wen, Tianwei Zhang
  • Characterization of Large Language Model Development in the Datacenter (2024). Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo
  • Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning (2025). Lang Xu, Quentin Anthony, Jacob Hatef, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda
  • Unicron: Economizing Self-Healing LLM Training at Scale (2024). Tao He, Xue Li, Zhibin Wang, Kun Qian, Jing-Bo Xu, Wenyuan Yu, Jingren Zhou
  • Emerging Platforms Meet Emerging LLMs: A Year-Long Journey of Top-Down Development (2024). Siyuan Feng, Jiawei Liu, Ruihang Lai, Charlie F. Ruan, Yu Yong, Lingming Zhang, Tianqi Chen
  • INTELLECT-1 Technical Report (2024). Sami Jaghouar, Jack Min Ong, Manveer Basra, Fares Obeid, Jannik Straube, Michael Keiblinger, Elie Bakouch, Lucas Atkins, Maziyar Panahi, Charles Goddard
  • ProTrain: Efficient LLM Training via Memory-Aware Techniques (2024). Hanmei Yang, Jin Zhou, Yao Fu, X. Wang, Ramine Roane, Hui Guan, Tongping Liu
  • Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment (2023). Fei Yang, Shuang Peng, Ning Sun, Fangyu Wang, Ke Tan, Fu Wu, Jiezhong Qiu, Aimin Pan
  • Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment (2024). Fei Yang, Shuang Peng, Ning Sun, Fangyu Wang, Yuanyuan Wang, Fu Zhong Wu, Jiezhong Qiu, Aimin Pan
  • FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment (2024). Ran Yan, Youhe Jiang, Wangcheng Tao, Xiaonan Nie, Bin Cui, Binhang Yuan
  • Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization (2024). Haoyang Li, Fangcheng Fu, Hao Ge, Sheng Lin, Xuanyu Wang, Jianxing Niu, Yujie Wang, Hailin Zhang, Xiaonan Nie, Bin Cui
  • Improving training time and GPU utilization in geo-distributed language model training (2024). Palak, Rohan Gandhi, Karan Tandon, Debopam Bhattacherjee, Venkata N. Padmanabhan
  • GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments (2024). Yanyu Chen, Guangzheng Huang
  • Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference (2024). Joyjit Kundu, Weijie Guo, Ali BanaGozar, Udari De Alwis, Sourav Sengupta, Puneet Gupta, Arindam Mallik
  • A Hardware Evaluation Framework for Large Language Model Inference (2023). Hengrui Zhang, August Ning, R.V.S.N. Prabhakar, David Wentzlaff
  • Cramming: Training a Language Model on a Single GPU in One Day (2022). Jonas Geiping, Tom Goldstein
  • Training Overhead Ratio: A Practical Reliability Metric for Large Language Model Training Systems (2024). Ning Lu, Qian Xie, Hao Zhang, Wenyi Fang, Yang Zheng, Zheng Hu, Jiantao Ma
  • Efficient Parallelization Layouts for Large-Scale Distributed Model Training (2023). Johannes Hagemann, Samuel Weinbach, Konstantin Dobler, Maximilian Schall, Gerard de Melo
  • Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator (2024). Kazuki Fujii, Kohei Watanabe, Rio Yokota
  • Optimizing Distributed Training on Frontier for Large Language Models (2024). Sajal Dash, Isaac Lyngaas, Junqi Yin, Xiao Wang, Romain Égelé, J. Austin Ellis, Matthias Maiterth, Guojing Cong, Feiyi Wang, Prasanna Balaprakash
