STORYSUMM: Evaluating Faithfulness in Story Summarization
Human evaluation has been the gold standard for checking faithfulness in abstractive summarization. However, with a challenging source domain like narrative, multiple annotators can agree that a summary is faithful while missing details that become obvious errors only once they are pointed out. We therefore introduce a new dataset, STORYSUMM, comprising LLM summaries …