MegaScale: Scaling Large Language Model Training to More Than 10,000
GPUs
We present the design, implementation, and engineering experience of building and deploying MegaScale, a production system for training large language models (LLMs) at a scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs …