Abstract
Researchers illustrate improvements in contextual encoding strategies via resultant performance on a battery of shared Natural Language Understanding (NLU) tasks. Many of
these tasks are of a categorical prediction variety: given a conditioning context (e.g., an NLI
premise), provide a label based on an associated prompt (e.g., an NLI hypothesis). The categorical nature of these tasks has led to the common use of a cross-entropy log-loss objective
during training. We suggest this loss is intuitively wrong when applied to plausibility
tasks, where the prompt by design is neither
categorically entailed nor contradictory given
the context. Log-loss naturally drives models
to assign scores near 0.0 or 1.0, in contrast to
our proposed use of a margin-based loss. Following a discussion of our intuition, we describe a confirmation study based on an extreme, synthetically curated task derived from
MultiNLI. We find that a margin-based loss
leads to a more plausible model of plausibility. Finally, we illustrate improvements on the
Choice of Plausible Alternatives (COPA) task
through this change in loss.
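
The contrast between the two objectives can be illustrated with a toy sketch. It is not the implementation used in the study: the single pair of free logits, the sigmoid scoring, the hinge form of the margin loss, and the names `train`, `margin`, `lr`, and `steps` are all assumptions made for illustration. One hypothesis is treated as more plausible than the other, and each loss is minimized by plain gradient descent.

```python
import math


def sigmoid(z):
    """Squash a raw logit into a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train(loss_kind, steps=2000, lr=0.5, margin=0.2):
    """Gradient descent on two free logits (hypothetical toy setup).

    z_pos belongs to the more plausible hypothesis, z_neg to the less
    plausible one. Returns the final (pos_score, neg_score).
    """
    z_pos, z_neg = 0.0, 0.0  # both scores start at 0.5
    for _ in range(steps):
        s_pos, s_neg = sigmoid(z_pos), sigmoid(z_neg)
        if loss_kind == "log":
            # Loss: -log(s_pos) - log(1 - s_neg).
            # The gradient never reaches zero, so the scores keep
            # drifting toward 1.0 and 0.0.
            g_pos, g_neg = s_pos - 1.0, s_neg
        elif loss_kind == "margin":
            # Loss: max(0, margin - (s_pos - s_neg)).
            # The gradient is zero once the gap reaches the margin.
            if margin - (s_pos - s_neg) > 0:
                g_pos = -s_pos * (1.0 - s_pos)
                g_neg = s_neg * (1.0 - s_neg)
            else:
                g_pos, g_neg = 0.0, 0.0
        else:
            raise ValueError(f"unknown loss_kind: {loss_kind!r}")
        z_pos -= lr * g_pos
        z_neg -= lr * g_neg
    return sigmoid(z_pos), sigmoid(z_neg)


if __name__ == "__main__":
    for kind in ("log", "margin"):
        pos, neg = train(kind)
        print(f"{kind:>6}: pos={pos:.3f} neg={neg:.3f}")
```

Under these assumptions, the log-loss run pushes the two scores toward the extremes, while the margin run stops updating as soon as the more plausible hypothesis outranks the other by the chosen margin, leaving both scores in the interior of the interval.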