Abstract
Introducing common sense to natural language
understanding systems has received increasing
research attention. However, it remains a fundamental question how to evaluate whether a system has the capability of sense making. Existing benchmarks measure commonsense knowledge indirectly and without explanation. In this paper, we release a benchmark that directly tests
whether a system can differentiate natural language statements that make sense from those
that do not. In addition, a system is
asked to identify the most crucial reason why
a statement does not make sense. We evaluate models trained on large-scale language modeling tasks as well as human performance, showing that these two tasks pose different challenges for system sense making.