How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN

Type: Preprint

Publication Date: 2021-11-17

Citations: 0

Abstract

Current language models can generate high-quality text. Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions? To tease apart these possibilities, we introduce RAVEN, a suite of analyses for assessing the novelty of generated text, focusing on sequential structure (n-grams) and syntactic structure. We apply these analyses to four neural language models (an LSTM, a Transformer, Transformer-XL, and GPT-2). For local structure - e.g., individual dependencies - model-generated text is substantially less novel than our baseline of human-generated text from each model's test set. For larger-scale structure - e.g., overall sentence structure - model-generated text is as novel or even more novel than the human-generated baseline, but models still sometimes copy substantially, in some cases duplicating passages over 1,000 words long from the training set. We also perform extensive manual analysis showing that GPT-2's novel text is usually well-formed morphologically and syntactically but has reasonably frequent semantic issues (e.g., being self-contradictory).
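The abstract's core sequential-structure measurement can be illustrated with a minimal sketch: count how many n-grams in generated text never occur in the training corpus. This is not the authors' actual RAVEN implementation; the function names (`ngrams`, `novelty_rate`) and the toy corpora are illustrative assumptions.

```python
def ngrams(tokens, n):
    """Return the list of successive n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novelty_rate(generated_tokens, training_tokens, n):
    """Fraction of n-grams in the generated text that are absent
    from the training corpus (higher = more novel)."""
    train_set = set(ngrams(training_tokens, n))
    gen = ngrams(generated_tokens, n)
    if not gen:
        return 0.0
    novel = sum(1 for g in gen if g not in train_set)
    return novel / len(gen)

# Toy example: only the bigram ("the", "rug") is unseen in training,
# so 1 of 5 generated bigrams is novel.
train = "the cat sat on the mat".split()
gen = "the cat sat on the rug".split()
print(novelty_rate(gen, train, 2))  # -> 0.2
```

Sweeping `n` from small to large mirrors the paper's finding at a glance: small-n (local) overlap with training data is high, while large-n (sentence-scale) sequences are mostly unseen.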

Locations

  • arXiv (Cornell University)

Similar Works

  • How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN (2021). R. Thomas McCoy, Paul Smolensky, Tal Linzen, Jianfeng Gao, Aslı Çelikyılmaz
  • How Much Do Language Models Copy From Their Training Data? Evaluating Linguistic Novelty in Text Generation Using RAVEN (2023). R. Thomas McCoy, Paul Smolensky, Tal Linzen, Jianfeng Gao, Aslı Çelikyılmaz
  • Investigating Wit, Creativity, and Detectability of Large Language Models in Domain-Specific Writing Style Adaptation of Reddit's Showerthoughts (2024). Tolga Buz, Benjamin Frost, Nikola Genchev, M. Schneider, Lucie-Aimée Kaffee, Gerard de Melo
  • Detection and Measurement of Syntactic Templates in Generated Text (2024). Chantal Shaib, Yanai Elazar, Junyi Jessy Li, Byron Wallace
  • Beyond Turing: A Comparative Analysis of Approaches for Detecting Machine-Generated Text (2023). Muhammad Farid Adilazuarda, Nikolaos Nektarios Arkoulis, О. А. Чумаков
  • Sequence-to-Sequence Models for Data-to-Text Natural Language Generation: Word- vs. Character-based Processing and Output Diversity (2018). Glorianna Jagfeld, Sabrina Jenne, Ngoc Thang Vu
  • Syntax-driven Iterative Expansion Language Models for Controllable Text Generation (2020). Noé Casas, José A. R. Fonollosa, Marta R. Costa‐jussà
  • On-the-Fly Attention Modularization for Neural Generation (2021). Yue Dong, Chandra Bhagavatula, Ximing Lu, Jena D. Hwang, Antoine Bosselut, Jackie Chi Kit Cheung, Yejin Choi
  • Unnatural Error Correction: GPT-4 Can Almost Perfectly Handle Unnatural Scrambled Text (2023). Qi Cao, Takeshi Kojima, Yutaka Matsuo, Yusuke Iwasawa
  • The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text (2023). Yanzhu Guo, Guokan Shang, Michalis Vazirgiannis, Chloé Clavel
  • HALoGEN: Fantastic LLM Hallucinations and Where to Find Them (2025). Abhilasha Ravichander, Shrusti Ghela, David Wadden, Yejin Choi
  • Do Massively Pretrained Language Models Make Better Storytellers? (2019). Abigail See, Aneesh Pappu, Rohun Saxena, Akhila Yerukola, Christopher D. Manning
  • CTRL: A Conditional Transformer Language Model for Controllable Generation (2019). Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, Richard Socher
  • TinyStories: How Small Can Language Models Be and Still Speak Coherent English? (2023). Ronen Eldan, Yuanzhi Li

Works That Cite This (0)


Works Cited by This (0)
