Improving Multilingual Models with Language-Clustered Vocabularies

Type: Article

Publication Date: 2020-01-01

Citations: 43

DOI: https://doi.org/10.18653/v1/2020.emnlp-main.367

Abstract

State-of-the-art multilingual models depend on vocabularies that cover all of the languages the model will expect to see at inference time, but the standard methods for generating those vocabularies are not ideal for massively multilingual applications. In this work, we introduce a novel procedure for multilingual vocabulary generation that combines the separately trained vocabularies of several automatically derived language clusters, thus balancing the trade-off between cross-lingual subword sharing and language-specific vocabularies. Our experiments show improvements across languages on the key multilingual benchmark tasks TyDi QA (+2.9 F1), XNLI (+2.1%), and WikiAnn NER (+2.8 F1), as well as a factor-of-8 reduction in out-of-vocabulary rate, all without increasing the size of the model or the data.
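The abstract's cluster-then-merge procedure can be sketched roughly as follows. This is a toy illustration, not the paper's implementation: it assumes each language comes with a small numeric feature vector (the paper derives clusters from the languages' own vocabulary statistics), substitutes whitespace tokens and a minimal k-means for a real subword model (e.g. WordPiece) and clustering pipeline, and all names and parameters here are illustrative.

```python
from collections import Counter

def kmeans(points, k, iters=20):
    """Minimal k-means over lists of floats (deterministic init from the
    first k points; purely illustrative). Returns a cluster id per point."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(col) for col in zip(*members)]
    return assign

def cluster_vocab(corpora, lang_features, k, per_cluster_size):
    """Cluster languages, build one vocabulary per cluster from the most
    frequent tokens in that cluster's pooled text, then union the results.
    `corpora` maps language -> text; `lang_features` is a list of feature
    vectors aligned with the iteration order of `corpora`."""
    langs = list(corpora)
    assign = kmeans(lang_features, k)
    vocab = set()
    for c in range(k):
        counts = Counter()
        for i, lang in enumerate(langs):
            if assign[i] == c:
                counts.update(corpora[lang].split())
        vocab.update(tok for tok, _ in counts.most_common(per_cluster_size))
    return vocab
```

With hand-crafted features that place English and German near each other and Finnish apart, the two related languages share one cluster vocabulary while Finnish keeps its own, and the final vocabulary is the union of the per-cluster lists; this mirrors the trade-off the abstract describes between cross-lingual sharing and language-specific capacity.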

Locations

  • arXiv (Cornell University)

Similar Works

  • Improving Multilingual Models with Language-Clustered Vocabularies (2020). Hyung Won Chung, Dan Garrette, Kiat Chuan Tan, Jason Riesa
  • Allocating Large Vocabulary Capacity for Cross-Lingual Language Model Pre-Training (2021). Bo Zheng, Li Dong, Shaohan Huang, Saksham Singhal, Wanxiang Che, Ting Liu, Song Xia, Furu Wei
  • XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models (2023). Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, Madian Khabsa
  • An Efficient Multilingual Language Model Compression through Vocabulary Trimming (2023). Asahi Ushio, Yi Zhou, José Camacho-Collados
  • FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models (2023). Konstantin Dobler, Gerard de Melo
  • Inducing Language-Agnostic Multilingual Representations (2020). Wei Zhao, Steffen Eger, Johannes Bjerva, Isabelle Augenstein
  • Unsupervised Cross-lingual Representation Learning at Scale (2019). Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Édouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov
  • Multilingual Large Language Models and Curse of Multilinguality (2024). Daniil Gurgurov, Tanja Bäumel, Tatiana Anikina
  • LOLA -- An Open-Source Massively Multilingual Large Language Model (2024). Nikit Srivastava, Denis Kuchelev, Tatiana Moteu Ngoli, Kshitij Shetty, Michael Röder, Diego Moussallem, Hamada M. Zahera, Axel-Cyrille Ngonga Ngomo
  • Discovering Representation Sprachbund For Multilingual Pre-Training (2021). Yimin Fan, Yaobo Liang, Alexandre Muzio, Hany Hassan, Houqiang Li, Ming Zhou, Nan Duan
  • Improving Pre-Trained Multilingual Models with Vocabulary Expansion (2019). Hai Wang, Dian Yu, Kai Sun, Jianshu Chen, Dong Yu
  • A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers (2024). Kaiyu Huang, Fengran Mo, Hongliang Li, You Li, Yuanchi Zhang, Wei-Jian Yi, Yulong Mao, Jinchen Liu, Yuzhuang Xu, Jinan Xu
  • MonoByte: A Pool of Monolingual Byte-level Language Models (2022). Hugo Abonizio, Leandro Rodrigues de Souza, Roberto Lotufo, Rodrigo Nogueira

Cited by (26)

  • Incorporating Context into Subword Vocabularies (2023). Shaked Yehezkel, Yuval Pinter
  • How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models (2021). Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, Iryna Gurevych
  • Compositional Evaluation on Japanese Textual Entailment and Similarity (2022). Hitomi Yanaka, Koji Mineshima
  • Efficient Multilingual Language Model Compression through Vocabulary Trimming (2023). Asahi Ushio, Yi Zhou, José Camacho-Collados
  • MuRIL: Multilingual Representations for Indian Languages (2021). Simran Khanuja, Diksha Bansal, Sarvesh Mehtani, Savya Khosla, Atreyee Dey, Balaji Gopalan, Dilip Kumar Margam, Pooja Aggarwal, Rajiv Teja Nagipogu, Shachi Dave
  • Feature Aggregation in Zero-Shot Cross-Lingual Transfer Using Multilingual BERT (2022). Beiduo Chen, Wu Guo, Quan Liu, Kun Tao
  • BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding (2021). Abhik Bhattacharjee, Tahmid Hasan, Kazi Samin, Saiful Islam, M. Sohel Rahman, Anindya Iqbal, Rifat Shahriyar
  • A Primer on Pretrained Multilingual Language Models (2021). Sumanth Doddapaneni, G. Ramesh, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
  • Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training (2021). Bo Zheng, Li Dong, Shaohan Huang, Saksham Singhal, Wanxiang Che, Ting Liu, Song Xia, Furu Wei
  • Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages (2023). Tomasz Limisiewicz, Jiří Balhar, David Mareček
  • Multi-view Subword Regularization (2021). Xinyi Wang, Sebastian Ruder, Graham Neubig
  • Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models (2023). Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David R. Mortensen, Noah A. Smith, Yulia Tsvetkov
  • UNKs Everywhere: Adapting Multilingual Language Models to New Scripts (2020). Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, Sebastian Ruder
  • Overlap-based Vocabulary Generation Improves Cross-lingual Transfer Among Related Languages (2022). Vaidehi Patil, Partha Talukdar, Sunita Sarawagi
  • Lifting the Curse of Multilinguality by Pre-training Modular Transformers (2022). Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James H. Cross, Sebastian Riedel, Mikel Artetxe
  • How Suitable Are Subword Segmentation Strategies for Translating Non-Concatenative Morphology? (2021). Chantal Amrhein, Rico Sennrich
  • Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation (2022). Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting
  • Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP (2021). Sabrina J. Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y. Lee, Benoît Sagot
  • MergeDistill: Merging Pre-trained Language Models using Distillation (2021). Simran Khanuja, Melvin Johnson, Partha Talukdar
  • Multilingual Pixel Representations for Translation and Effective Cross-lingual Transfer (2023). Elizabeth Salesky, Neha Verma, Philipp Koehn, Matt Post
  • Plug and Play Autoencoders for Conditional Text Generation (2020). Florian Mai, Nikolaos Pappas, Ivan Montero, Noah A. Smith, James Henderson
  • Specializing Multilingual Language Models: An Empirical Study (2021). Ethan C. Chau, Noah A. Smith

Citing (23)

  • Stochastic Complexity in Statistical Inquiry (1998). J. Rissanen
  • Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (2016). Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey
  • XNLI: Evaluating Cross-lingual Sentence Representations (2018). Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, Veselin Stoyanov
  • Cross-lingual Language Model Pretraining (2019). Guillaume Lample, Alexis Conneau
  • Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges (2019). Naveen Arivazhagan, Ankur Bapna, Orhan Fırat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry
  • Neural Machine Translation of Rare Words with Subword Units (2016). Rico Sennrich, Barry Haddow, Alexandra Birch
  • SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (2018). Taku Kudo, John T. E. Richardson
  • Attention Is All You Need (2017). Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
  • Massively Multilingual Transfer for NER (2019). Afshin Rahimi, Li Yan Yuan, Trevor Cohn
  • Adam: A Method for Stochastic Optimization (2014). Diederik P. Kingma, Jimmy Ba
  • Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT (2019). Shijie Wu, Mark Dredze
  • Cross-Lingual Ability of Multilingual BERT: An Empirical Study (2019). K Karthikeyan, Zihan Wang, Stephen Mayhew, Dan Roth
  • Large Batch Optimization for Deep Learning: Training BERT in 76 minutes (2019). Yang You, Jing Li, Sashank J. Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho‐Jui Hsieh
  • ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (2019). Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut
  • Scaling Laws for Neural Language Models (2020). Jared Kaplan, Sam McCandlish, Tom Henighan, T. B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei
  • XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization (2020). Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Fırat, Melvin Johnson
  • Language Models are Few-Shot Learners (2020). T. B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell
  • Unsupervised Cross-lingual Representation Learning at Scale (2020). Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Édouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov
  • GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (2020). Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Fırat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen
  • TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages (2020). Jonathan H. Clark, Eunsol Choi, Michael J. Collins, Dan Garrette, Tom Kwiatkowski, V. A. Nikolaev, Jennimaria Palomaki