Abstract
In summarization, automatic evaluation metrics are usually compared based on their ability to correlate with human judgments. Unfortunately, the few existing human judgment datasets have been created as by-products of the manual evaluations performed during the DUC/TAC shared tasks. However, modern systems are typically better than the best systems submitted at the time of these shared tasks. We show that, surprisingly, evaluation metrics which behave similarly on these datasets (the average-scoring range) strongly disagree in the higher-scoring range in which current systems now operate. This is problematic because, when metrics disagree, we cannot decide which one to trust. This is a call for collecting human judgments for high-scoring summaries, as this would resolve the debate over which metrics to trust. It would also be greatly beneficial for further improving summarization systems and metrics alike.