A Comparison of Methods for Evaluating Generative IR

Type: Preprint

Publication Date: 2024-04-05

Citations: 21

DOI: https://doi.org/10.48550/arxiv.2404.04044

Abstract

Information retrieval systems increasingly incorporate generative components. For example, in a retrieval augmented generation (RAG) system, a retrieval component might provide a source of ground truth, while a generative component summarizes and augments its responses. In other systems, a large language model (LLM) might directly generate responses without consulting a retrieval component. While there are multiple definitions of generative information retrieval (Gen-IR) systems, in this paper we focus on those systems where the system's response is not drawn from a fixed collection of documents or passages. The response to a query may be entirely new text, never previously appearing in any existing collection. Since traditional IR evaluation methods break down under this model, we explore various methods that extend traditional offline evaluation approaches to the Gen-IR context. Offline IR evaluation traditionally employs paid human assessors, but LLMs are increasingly replacing human assessment, producing labels of quality comparable or superior to crowdsourced labels. Given that Gen-IR systems do not generate responses from a fixed set, we assume that methods for Gen-IR evaluation must largely depend on LLM-generated labels. Along with methods based on binary and graded relevance, we explore methods based on explicit subtopics, pairwise preferences, and embeddings. We first validate these methods against human assessments on several TREC Deep Learning Track tasks; we then apply them to evaluate the output of several purely generative systems. For each method we consider both its ability to act autonomously, without the need for human labels or other input, and its ability to support human auditing. To trust these methods, we must be assured that their results align with human assessments; to make that possible, evaluation criteria must be transparent, so that outcomes can be audited by human assessors.
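The abstract names several label sources for offline Gen-IR evaluation: binary or graded relevance judgments produced by an LLM, explicit subtopics, pairwise preferences, and embedding-based comparison against known relevant material. The sketch below is a minimal illustration of two of these ideas, not the paper's actual implementation; the prompt wording, the 0-3 grading scale, and the `judge` and `embed` callables are assumptions supplied only for the example. A pairwise-preference judge would be structured the same way, asking the LLM to pick the better of two candidate responses.

```python
# Minimal sketch of two Gen-IR evaluation signals mentioned in the abstract:
# (1) graded relevance labels from an LLM judge, (2) an embedding-based score.
# The `judge` and `embed` callables are placeholders the reader must supply
# (e.g., wrappers around an LLM API and a sentence-embedding model).

import re
from typing import Callable

import numpy as np

GRADED_PROMPT = (
    "Query: {query}\n"
    "Response: {response}\n\n"
    "On a scale of 0 (not relevant) to 3 (perfectly relevant), "
    "how well does the response answer the query? Reply with a single digit."
)


def graded_relevance(query: str, response: str,
                     judge: Callable[[str], str]) -> int:
    """Ask an LLM judge for a 0-3 relevance grade and parse the first digit."""
    reply = judge(GRADED_PROMPT.format(query=query, response=response))
    match = re.search(r"[0-3]", reply)
    return int(match.group()) if match else 0


def embedding_score(response: str, reference: str,
                    embed: Callable[[str], np.ndarray]) -> float:
    """Cosine similarity between a generated response and a known relevant
    reference passage (an embedding-based evaluation signal)."""
    a, b = embed(response), embed(reference)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    # Stand-in callables so the sketch runs without any external service.
    dummy_judge = lambda prompt: "2"
    dummy_embed = lambda text: np.array([len(text), text.count(" "), 1.0])
    print(graded_relevance("what is RAG?",
                           "Retrieval augmented generation combines ...",
                           dummy_judge))
    print(embedding_score("Retrieval augmented generation combines ...",
                          "RAG pairs a retriever with a generator.",
                          dummy_embed))
```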

Locations

  • arXiv (Cornell University)

Similar Works

  • Generative Information Retrieval Evaluation (2024) - Marwah Alaofi, Negar Arabzadeh, Charles L. A. Clarke, Mark Sanderson
  • Evaluating Generative Ad Hoc Information Retrieval (2023) - Lukas Gienapp, Harrisen Scells, Niklas Deckers, Janek Bevendorff, Shuai Wang, Johannes Kiesel, Shahbaz Syed, Maik Fröbe, Guido Zuccon, Benno Stein
  • Gen-IR @ SIGIR 2023: The First Workshop on Generative Information Retrieval (2023) - G. F. Benedict, Ruqing Zhang, Donald Metzler
  • Foundations of GenIR (2025) - Qingyao Ai, Jingtao Zhan, Yiqun Liu
  • A Survey of Generative Information Retrieval (2024) - Tzu-Lin Kuo, T. Chiu, Tzung-Sheng Lin, Shengyang Wu, Chao-Wei Huang, Yun-Nung Chen
  • From Matching to Generation: A Survey on Generative Information Retrieval (2024) - Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yuyao Zhang, Peitian Zhang, Yutao Zhu, Zhicheng Dou
  • DIRAS: Efficient LLM-Assisted Annotation of Document Relevance in Retrieval Augmented Generation (2024) - Jingwei Ni, Tobias Schimanski, Meihong Lin, Mrinmaya Sachan, Elliott Ash, Markus Leippold
  • Initial Nugget Evaluation Results for the TREC 2024 RAG Track with the AutoNuggetizer Framework (2024) - Ronak Pradeep, Nandan Thakur, Shivani Upadhyay, Daniel Campos, Nick Craswell, Jimmy Lin
  • RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation (2024) - Dongyu Ru, Lin Qiu, Xiangkun Hu, Tianhang Zhang, Peng Shi, Shuaichen Chang, Jay J. Cheng, Cunxiang Wang, Shichao Sun, Huanyu Li
  • Generative Retrieval Meets Multi-Graded Relevance (2024) - Yubao Tang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Wei Chen, Xueqi Cheng
  • A Survey on Retrieval-Augmented Text Generation for Large Language Models (2024) - Yizheng Huang, Jimmy Xiangji Huang
  • Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented Generation (2024) - To Eun Kim, Fernando Díaz
  • Generating Diverse Q&A Benchmarks for RAG Evaluation with DataMorgana (2025) - Simone Filice, Guy Horowitz, David Carmel, Zohar Karnin, Liane Lewin-Eytan, Yoelle Maarek
  • Evaluation of Retrieval-Augmented Generation: A Survey (2024) - Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, Zhaofeng Liu
  • Beyond [CLS] through Ranking by Generation (2020) - Cícero Nogueira dos Santos, Xiaofei Ma, Ramesh Nallapati, Zhiheng Huang, Bing Xiang
  • Fréchet Distance for Offline Evaluation of Information Retrieval Systems with Sparse Labels (2024) - Negar Arabzadeh, Charles L. A. Clarke
  • Developing Retrieval Augmented Generation (RAG) based LLM Systems from PDFs: An Experience Report (2024) - Ayub Khan, Md Toufique Hasan, Kai Kristian Kemell, Jussi Rasku, Pekka Abrahamsson
  • BERGEN: A Benchmarking Library for Retrieval-Augmented Generation (2024) - David Rau, Hervé Déjean, Nadezhda Chirkova, Thibault Formal, Shuai Wang, Vassilina Nikoulina, Stéphane Clinchant

Works Cited by This (0)
