Deep Optimizer States: Towards Scalable Training of Transformer Models using Interleaved Offloading
Transformers and large language models (LLMs) have seen rapid adoption in all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, the training of transformers is very expensive and often hits a "memory wall", i.e., even when using 3D parallelism (pipeline, tensor, data) and …