Deep Optimizer States: Towards Scalable Training of Transformer Models using Interleaved Offloading

Transformers and large language models (LLMs) have seen rapid adoption in all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, the training of transformers is very expensive and often hits a "memory wall", i.e., even when using 3D parallelism (pipeline, tensor, data) and …