Mixture of Attention Heads: Selecting Attention Heads Per Token

Type: Article

Publication Date: 2022-01-01

Citations: 6

DOI: https://doi.org/10.18653/v1/2022.emnlp-main.278

Abstract

Mixture-of-Experts (MoE) networks have been proposed as an efficient way to scale up model capacity and implement conditional computation. However, studies of MoE components have mostly focused on the feedforward layer of the Transformer architecture. This paper proposes the Mixture of Attention Heads (MoA), a new architecture that combines multi-head attention with the MoE mechanism. MoA comprises a set of attention heads, each with its own parameters. Given an input, a router dynamically selects a subset of k attention heads per token. This conditional computation scheme allows MoA to achieve stronger performance than the standard multi-head attention layer. Furthermore, the sparsely gated MoA can easily scale up the number of attention heads and parameters while preserving computational efficiency. Beyond these performance improvements, MoA also automatically differentiates the heads' utilities, offering a new perspective on the model's interpretability. We conducted experiments on several important tasks, including Machine Translation and Masked Language Modeling. The experiments show promising results against strong baselines that include large and very deep models.
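
The routing idea described in the abstract can be made concrete with a short sketch. The PyTorch snippet below is a minimal, illustrative implementation of per-token selection of k expert attention heads; the module name, tensor shapes, gating rule, and hyperparameters are assumptions made for illustration, not the paper's reference code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoASelfAttention(nn.Module):
    """Illustrative sketch of routing each token to k of E expert attention heads
    (hypothetical module, not the authors' implementation)."""

    def __init__(self, d_model: int, n_experts: int = 8, k: int = 2, d_head: int = 64):
        super().__init__()
        self.n_experts, self.k, self.d_head = n_experts, k, d_head
        # One query/key/value/output projection per expert attention head.
        self.q = nn.Parameter(0.02 * torch.randn(n_experts, d_model, d_head))
        self.kp = nn.Parameter(0.02 * torch.randn(n_experts, d_model, d_head))
        self.v = nn.Parameter(0.02 * torch.randn(n_experts, d_model, d_head))
        self.o = nn.Parameter(0.02 * torch.randn(n_experts, d_head, d_model))
        # Router: one logit per expert head for every token.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        b, t, d = x.shape
        gates = F.softmax(self.router(x), dim=-1)            # (b, t, E) routing probabilities
        top_g, top_i = gates.topk(self.k, dim=-1)            # (b, t, k) selected heads per token
        top_g = top_g / top_g.sum(dim=-1, keepdim=True)      # renormalize the k gate weights

        # For clarity (not efficiency), every expert head is computed densely here.
        q = torch.einsum('btd,edh->ebth', x, self.q)
        k_ = torch.einsum('btd,edh->ebth', x, self.kp)
        v = torch.einsum('btd,edh->ebth', x, self.v)
        att = torch.softmax(torch.einsum('ebth,ebsh->ebts', q, k_) / self.d_head ** 0.5, dim=-1)
        out = torch.einsum('ebts,ebsh->ebth', att, v)
        out = torch.einsum('ebth,ehd->ebtd', out, self.o)    # (E, b, t, d_model)

        # Keep only the k selected heads per token and mix them with the gate weights.
        out = out.permute(1, 2, 0, 3)                        # (b, t, E, d_model)
        sel = torch.gather(out, 2, top_i.unsqueeze(-1).expand(-1, -1, -1, d))
        return (top_g.unsqueeze(-1) * sel).sum(dim=2)        # (b, t, d_model)

x = torch.randn(2, 5, 128)
print(MoASelfAttention(128)(x).shape)  # torch.Size([2, 5, 128])

For readability, this sketch computes every expert head densely and then mixes only the k selected outputs per token; an efficient implementation would dispatch each token to its selected heads only, so unselected heads are never computed, which is where the conditional-computation savings described in the abstract come from.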

Locations

  • arXiv (Cornell University)
  • Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Works Cited by This (24)

  • Rethinking the Inception Architecture for Computer Vision (2016). Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, Zbigniew Wojna.
  • Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned (2019). Elena Voita, David Talbot, Fédor Moiseev, Rico Sennrich, Ivan Titov.
  • Pay Less Attention with Lightweight and Dynamic Convolutions (2019). Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, Michael Auli.
  • Neural Machine Translation of Rare Words with Subword Units (2016). Rico Sennrich, Barry Haddow, Alexandra Birch.
  • Mesh-TensorFlow: Deep Learning for Supercomputers (2018). Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young.
  • Scaling Neural Machine Translation (2018). Myle Ott, Sergey Edunov, David Grangier, Michael Auli.
  • RoBERTa: A Robustly Optimized BERT Pretraining Approach (2019). Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
  • Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (2019). Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro.
  • Do Attention Heads in BERT Track Syntactic Dependencies? (2019). Phu Mon Htut, Jason Phang, Shikha Bordia, Samuel R. Bowman.
  • A Mixture of h - 1 Heads is Better than h Heads (2020). Hao Peng, Roy Schwartz, Dianqi Li, Noah A. Smith.