PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them

Type: Article

Publication Date: 2021-01-01

Citations: 118

DOI: https://doi.org/10.1162/tacl_a_00415

Abstract

Open-domain Question Answering models that directly leverage question-answer (QA) pairs, such as closed-book QA (CBQA) models and QA-pair retrievers, show promise in terms of speed and memory compared with conventional models which retrieve and read from text corpora. QA-pair retrievers also offer interpretable answers, a high degree of control, and are trivial to update at test time with new knowledge. However, these models fall short of the accuracy of retrieve-and-read systems, as substantially less knowledge is covered by the available QA-pairs relative to text corpora like Wikipedia. To facilitate improved QA-pair models, we introduce Probably Asked Questions (PAQ), a very large resource of 65M automatically generated QA-pairs. We introduce a new QA-pair retriever, RePAQ, to complement PAQ. We find that PAQ preempts and caches test questions, enabling RePAQ to match the accuracy of recent retrieve-and-read models, whilst being significantly faster. Using PAQ, we train CBQA models which outperform comparable baselines by 5%, but trail RePAQ by over 15%, indicating the effectiveness of explicit retrieval. RePAQ can be configured for size (under 500MB) or speed (over 1K questions per second) while retaining high accuracy. Lastly, we demonstrate RePAQ’s strength at selective QA, abstaining from answering when it is likely to be incorrect. This enables RePAQ to “back-off” to a more expensive state-of-the-art model, leading to a combined system which is both more accurate and 2x faster than the state-of-the-art model alone.
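The QA-pair retrieval and selective-QA ideas in the abstract can be illustrated with a toy example: store QA-pairs, answer a new question by returning the cached answer of its most similar stored question, and abstain (or back off to a more expensive model) when the best match is too weak. This is a minimal sketch using a hypothetical bag-of-words similarity and a hand-picked threshold; RePAQ itself uses learned dense question encoders over the 65M PAQ pairs, not this heuristic.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; stands in for a learned dense encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class QAPairRetriever:
    """Answer by returning the cached answer of the most similar stored question."""

    def __init__(self, qa_pairs, threshold=0.5):
        self.qa_pairs = [(q, a, embed(q)) for q, a in qa_pairs]
        self.threshold = threshold  # below this, abstain (selective QA)

    def add(self, question, answer):
        # Updating the model with new knowledge is just appending a pair.
        self.qa_pairs.append((question, answer, embed(question)))

    def answer(self, question, fallback=None):
        q_emb = embed(question)
        best_answer, best_sim = None, -1.0
        for q, a, emb in self.qa_pairs:
            sim = cosine(q_emb, emb)
            if sim > best_sim:
                best_answer, best_sim = a, sim
        if best_sim < self.threshold:
            # "Back off" to a more expensive model when likely to be wrong.
            return fallback(question) if fallback else None
        return best_answer

paq = QAPairRetriever([
    ("who wrote the novel 1984", "George Orwell"),
    ("what is the capital of france", "Paris"),
])
print(paq.answer("who wrote 1984"))  # hits the cached pair
print(paq.answer("how do i bake bread"))  # no close match, abstains
```

Answering is a single nearest-neighbor lookup over cached pairs, which is why such retrievers are fast, interpretable (the matched question is inspectable), and trivially updatable at test time.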

Locations

  • Transactions of the Association for Computational Linguistics
  • UCL Discovery (University College London)
  • arXiv (Cornell University)
  • DOAJ (Directory of Open Access Journals)

Similar Works

  • PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them (2021). Patrick Lewis, Yuxiang Wu, Linqing Liu, Pasquale Minervini, Heinrich Küttler, Aleksandra Piktus, Pontus Stenetorp, Sebastian Riedel
  • Retrieving and Reading: A Comprehensive Survey on Open-domain Question Answering (2021). Fengbin Zhu, Wenqiang Lei, Chao Wang, Jianming Zheng, Soujanya Poria, Tat-Seng Chua
  • End-to-End Training of Neural Retrievers for Open-Domain Question Answering (2021). Devendra Singh Sachan, Mostofa Patwary, Mohammad Shoeybi, Neel Kant, Wei Ping, William L. Hamilton, Bryan Catanzaro
  • GooAQ: Open Question Answering with Diverse Answer Types (2021). Daniel Khashabi, Amos H.C. Ng, Tushar Khot, Ashish Sabharwal, Hannaneh Hajishirzi, Chris Callison-Burch
  • Will this Question be Answered? Question Filtering via Answer Model Distillation for Efficient Question Answering (2021). Siddhant Garg, Alessandro Moschitti
  • UnitedQA: A Hybrid Approach for Open Domain Question Answering (2021). Hao Cheng, Yelong Shen, Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao
  • Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets (2021). Patrick Lewis, Pontus Stenetorp, Sebastian Riedel
  • Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets (2020). Patrick Lewis, Pontus Stenetorp, Sebastian Riedel
  • PrimeQA: The Prime Repository for State-of-the-Art Multilingual Question Answering Research and Development (2023). Avirup Sil, Jaydeep Sen, Bhavani Iyer, Martin Franz, Kshitij P. Fadnis, Mihaela Bornea, Sara Brin Rosenthal, Scott McCarley, Rong Zhang, Vishwajeet Kumar
  • RetrievalQA: Assessing Adaptive Retrieval-Augmented Generation for Short-form Open-Domain Question Answering (2024). Zihan Zhang, Meng Fang, Ling Chen
  • QAMPARI: An Open-domain Question Answering Benchmark for Questions with Many Answers from Multiple Paragraphs (2022). Samuel Joseph Amouyal, Ohad Rubin, Ori Yoran, Tomer Wolfson, Jonathan Herzig, Jonathan Berant
  • You Only Need One Model for Open-domain Question Answering (2022). Haejun Lee, Akhil Kedia, Jongwon Lee, Ashwin Paranjape, Christopher D. Manning, Kyoung-Gu Woo
  • Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models (2024). Akchay Srivastava, Atif M. Memon

Works That Cite This (81)

  • Context Generation Improves Open Domain Question Answering (2023). Dan Su, Mostofa Patwary, Shrimai Prabhumoye, Peng Xu, Ryan Prenger, Mohammad Shoeybi, Pascale Fung, Anima Anandkumar, Bryan Catanzaro
  • MAUPQA: Massive Automatically-created Polish Question Answering Dataset (2023). Piotr Rybak
  • Data Augmentation for Conversational AI (2024). Heydar Soudani, Roxana Petcu, Evangelos Kanoulas, Faegheh Hasibi
  • Mitigating Temporal Misalignment by Discarding Outdated Facts (2023). Michael Zhang, Eunsol Choi
  • Disentangling Reasoning Capabilities from Language Models with Compositional Reasoning Transformers (2023). Wanjun Zhong, Tingting Ma, Jiahai Wang, Jian Yin, Tiejun Zhao, Chin-Yew Lin, Nan Duan
  • An Efficient Memory-Augmented Transformer for Knowledge-Intensive NLP Tasks (2022). Yuxiang Wu, Yu Zhao, Baotian Hu, Pasquale Minervini, Pontus Stenetorp, Sebastian Riedel
  • Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks (2022). Xizhou Zhu, Jinguo Zhu, Hao Li, Xiaoshi Wu, Hongsheng Li, Xiaohua Wang, Jifeng Dai
  • Precise Zero-Shot Dense Retrieval without Relevance Labels (2023). Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan
  • Multilingual Autoregressive Entity Linking (2022). Nicola De Cao, Ledell Wu, Kashyap Popat, Mikel Artetxe, Naman Goyal, Mikhail Plekhanov, Luke Zettlemoyer, Nicola Cancedda, Sebastian Riedel, Fabio Petroni
  • Training Data is More Valuable than You Think: A Simple and Effective Method by Retrieving from Training Data (2022). Shuohang Wang, Yichong Xu, Yuwei Fang, Yang Liu, Siqi Sun, Ruochen Xu, Chenguang Zhu, Michael Zeng