Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training

Mixture-of-experts (MoE) models facilitate efficient scaling; however, training the router network introduces the challenge of optimizing a non-differentiable, discrete objective. Recently, a fully-differentiable MoE architecture, SMEAR, was proposed (Muqeeth et al., 2023), which softly merges experts in the parameter space; nevertheless, its effectiveness was only demonstrated in downstream fine-tuning on …
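
To make the soft-merging idea concrete, below is a minimal, illustrative PyTorch sketch of a fully differentiable MoE layer in the SMEAR spirit: the router's softmax weights are used to merge expert parameters into a single expert, so gradients flow through the routing weights. The class name, tensor shapes, and the mean-pooled (segment-level) routing are assumptions for illustration only, not the authors' implementation of Lory or SMEAR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftMergedMoE(nn.Module):
    """Illustrative fully differentiable MoE layer: experts are merged in
    parameter space using router probabilities, then the input is passed
    through the single merged feed-forward expert."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        # Expert FFN weights stored as stacked tensors: (E, d_model, d_ff) and (E, d_ff, d_model)
        self.w_in = nn.Parameter(torch.randn(num_experts, d_model, d_ff) * 0.02)
        self.w_out = nn.Parameter(torch.randn(num_experts, d_ff, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model). Route on a pooled representation so one
        # merged expert is built per sequence (a stand-in for segment-level routing).
        probs = F.softmax(self.router(x.mean(dim=1)), dim=-1)        # (batch, E)
        # Convex combination of expert parameters -- the differentiable step:
        # gradients flow into `probs` and hence into the router.
        w_in = torch.einsum("be,eij->bij", probs, self.w_in)         # (batch, d_model, d_ff)
        w_out = torch.einsum("be,eij->bij", probs, self.w_out)       # (batch, d_ff, d_model)
        h = F.gelu(torch.einsum("bsd,bdf->bsf", x, w_in))            # merged expert forward pass
        return torch.einsum("bsf,bfd->bsd", h, w_out)


# Example usage (shapes only, illustrative):
layer = SoftMergedMoE(d_model=64, d_ff=256, num_experts=4)
out = layer(torch.randn(2, 16, 64))  # -> (2, 16, 64)
```

Because the merged expert is a smooth function of the router outputs, standard backpropagation trains the router end-to-end, avoiding the discrete, non-differentiable objective of conventional top-k routing.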