Ranking Generated Summaries by Correctness: An Interesting but Challenging Application for Natural Language Inference
Abstract
While recent progress on abstractive summarization has led to remarkably fluent summaries, factual errors in generated summaries still severely limit their use in practice. In this paper, we evaluate summaries produced by state-of-the-art models via crowdsourcing and show that such errors occur frequently, in particular with more abstractive models. We study whether textual entailment predictions can be used to detect such errors and if they can be reduced by reranking alternative predicted summaries. That leads to an interesting downstream application for entailment models. In our experiments, we find that out-of-the-box entailment models trained on NLI datasets do not yet offer the desired performance for the downstream task and we therefore release our annotations as additional test data for future extrinsic evaluations of NLI.
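To make the reranking idea concrete, the following is a minimal sketch, not the authors' implementation, of scoring each candidate summary by the probability that the source document entails it and ordering candidates by that score; the choice of an off-the-shelf MNLI model ("roberta-large-mnli") and the position of its entailment label are assumptions for illustration.

```python
# Minimal sketch: rerank candidate summaries by NLI entailment probability.
# Assumes an off-the-shelf MNLI model; for "roberta-large-mnli" the last
# logit corresponds to the entailment class.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # assumed model choice, not from the paper
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()


def entailment_score(premise: str, hypothesis: str) -> float:
    """Probability that `premise` entails `hypothesis` under the NLI model."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)
    return probs[0, -1].item()  # entailment label assumed to be the last index


def rerank_summaries(document: str, candidates: list[str]) -> list[str]:
    """Order candidate summaries by how strongly the source document entails them."""
    scored = [(entailment_score(document, c), c) for c in candidates]
    return [c for _, c in sorted(scored, key=lambda s: s[0], reverse=True)]


if __name__ == "__main__":
    doc = "The company reported a 10% rise in quarterly profit on Tuesday."
    candidates = [
        "Quarterly profit rose by 10%.",
        "The company reported a 10% drop in quarterly profit.",
    ]
    print(rerank_summaries(doc, candidates))
```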