Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on
Developmentally Plausible Corpora
The BabyLM Challenge is a community effort to close the data-efficiency gap between human and computational language learners. Participants compete to optimize language model training on a fixed language data budget of 100 million words or fewer. This year, we released improved text corpora, as well as a vision-and-language corpus …