Lory: Fully Differentiable Mixture-of-Experts for Autoregressive
Language Model Pre-training
Mixture-of-experts (MoE) models facilitate efficient scaling; however, training the router network introduces the challenge of optimizing a non-differentiable, discrete objective. Recently, a fully-differentiable MoE architecture, SMEAR, was proposed (Muqeeth et al., 2023), which softly merges experts in the parameter space; nevertheless, its effectiveness was only demonstrated in downstream fine-tuning on …
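
To make the phrase "softly merges experts in the parameter space" concrete, the following is a minimal sketch of the idea: rather than routing each input to a discrete expert, the experts' parameters are averaged with differentiable router weights and a single merged expert is applied. The class name, routing granularity, and layer structure here are simplifying assumptions for illustration only and do not reproduce the SMEAR or Lory implementations.

```python
import torch
import torch.nn as nn


class SoftMergedMoE(nn.Module):
    """Illustrative soft expert merging: average expert parameters with
    differentiable router weights, then run one merged feed-forward expert."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        # Expert feed-forward weights stored as stacked tensors.
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_ff) / d_model**0.5)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_ff, d_model) / d_ff**0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model). For simplicity, route once per sequence
        # using the mean hidden state; the whole path stays differentiable.
        weights = torch.softmax(self.router(x.mean(dim=1)), dim=-1)       # (batch, n_experts)
        # Merge experts in parameter space via a weighted average.
        merged_in = torch.einsum("be,edf->bdf", weights, self.w_in)       # (batch, d_model, d_ff)
        merged_out = torch.einsum("be,efd->bfd", weights, self.w_out)     # (batch, d_ff, d_model)
        h = torch.relu(torch.einsum("bsd,bdf->bsf", x, merged_in))
        return torch.einsum("bsf,bfd->bsd", h, merged_out)
```

Because the router output enters the forward pass only through this weighted average of parameters, gradients flow to the router directly, avoiding the non-differentiable discrete routing objective mentioned above.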