Abstract
Evaluating AMR parsing accuracy involves
comparing pairs of AMR graphs. The major
evaluation metric, SMATCH (Cai and Knight,
2013), searches for one-to-one mappings between the nodes of two AMRs with a greedy
hill-climbing algorithm, which leads to search
errors. We propose SEMBLEU, a robust metric that extends BLEU (Papineni et al., 2002)
to AMRs. It does not suffer from search errors and considers non-local correspondences
in addition to local ones. SEMBLEU is fully
content-driven and punishes situations where
a system’s output does not preserve most information from the input. Preliminary experiments on both sentence and corpus levels show
that SEMBLEU has slightly higher consistency
with human judgments than SMATCH. Our
code is available at http://github.com/freesunshine0316/sembleu.
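
To make the contrast with mapping-based search concrete, the following is a minimal sketch of BLEU-style clipped n-gram precision computed over label paths in two small graphs. It is an illustration under stated assumptions, not the released implementation: the graph encoding (a dict of labeled edges plus a dict of node concepts), the function names `path_ngrams` and `clipped_precision`, and the toy graphs are all hypothetical, and the sketch omits the smoothing, n-gram weighting, and brevity penalty that a complete BLEU-like score would need. Because n-grams are compared by their labels, no node-to-node alignment has to be searched for.

```python
from collections import Counter

def path_ngrams(graph, labels, max_n=3):
    """Collect label paths containing 1..max_n nodes, starting from every node.

    graph:  dict mapping a node id to a list of (edge_label, child_id) pairs
    labels: dict mapping a node id to its concept label
    Returns a dict n -> Counter of paths (tuples of node and edge labels).
    """
    grams = {n: Counter() for n in range(1, max_n + 1)}

    def walk(node, path, n_nodes):
        path = path + (labels[node],)
        grams[n_nodes][path] += 1
        if n_nodes == max_n:  # the length bound also guarantees termination
            return
        for edge, child in graph.get(node, []):
            walk(child, path + (edge,), n_nodes + 1)

    for node in labels:
        walk(node, (), 1)
    return grams

def clipped_precision(cand, ref):
    """Fraction of candidate n-grams also found in the reference, with counts clipped."""
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

# Hypothetical toy graphs: the reference says the boy wants to go,
# the candidate says the boy wants to leave.
ref_labels = {"w": "want-01", "b": "boy", "g": "go-01"}
ref_graph = {"w": [("ARG0", "b"), ("ARG1", "g")], "g": [("ARG0", "b")]}
cand_labels = {"w": "want-01", "b": "boy", "l": "leave-01"}
cand_graph = {"w": [("ARG0", "b"), ("ARG1", "l")], "l": [("ARG0", "b")]}

ref_grams = path_ngrams(ref_graph, ref_labels)
cand_grams = path_ngrams(cand_graph, cand_labels)
precisions = {n: clipped_precision(cand_grams[n], ref_grams[n]) for n in (1, 2, 3)}
```

On these toy graphs the single wrong concept costs one of three unigrams, two of three bigrams, and the only trigram, so the precisions are 2/3, 1/3, and 0. Longer paths expose the non-local effect of the error that a purely node-level comparison would miss.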