Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models:
Enhancing Performance and Reducing Inference Costs
The rapid advancement of large language models (LLMs) has led to architectures with billions to trillions of parameters, posing significant deployment challenges due to their substantial demands on memory, processing power, and energy consumption. Sparse Mixture-of-Experts (SMoE) architectures have emerged as a solution, activating only a subset of parameters per …