Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with
System Co-Design
The proliferation of large language models (LLMs) has led to the adoption of Mixture-of-Experts (MoE) architectures that dynamically leverage specialized subnetworks for improved efficiency and performance. Despite their benefits, MoE models face significant challenges during inference, including inefficient memory management and suboptimal batching, due to misaligned design choices between the …
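For context on the routing behavior the abstract refers to, the sketch below shows a conventional token-level top-k MoE layer in which the router is attached to each layer and consulted per token at inference time. This is the standard coupled design whose inference-time drawbacks the abstract describes, not the paper's proposed architecture; the class and parameter names (`TopKMoELayer`, `n_experts`, `k`) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoELayer(nn.Module):
    """Minimal token-level MoE layer: a per-layer router picks top-k experts per token."""

    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)  # router coupled to this layer
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); expert choice is made per token at inference time
        gate_probs = F.softmax(self.router(x), dim=-1)            # (tokens, n_experts)
        topk_probs, topk_idx = gate_probs.topk(self.k, dim=-1)    # keep k experts per token
        topk_probs = topk_probs / topk_probs.sum(-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e                     # tokens routed to expert e
                if mask.any():
                    out[mask] += topk_probs[mask, slot:slot + 1] * expert(x[mask])
        return out


# Usage: 8 tokens of width 512 routed across 8 experts, 2 active per token.
tokens = torch.randn(8, 512)
layer = TopKMoELayer(d_model=512, n_experts=8, k=2)
print(layer(tokens).shape)  # torch.Size([8, 512])
```

Because each token can activate a different expert subset, naive serving must keep all experts resident in memory and loses batching efficiency, which is the inference-time mismatch the abstract points to.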