Abstract
Deep neural networks have achieved impressive performance in handling complicated semantics in natural language, while mostly treated as black boxes. To explain how the model handles compositional semantics of words and phrases, we study the hierarchical explanation problem. We highlight the key challenge is to compute non-additive and context-independent importance for individual words and phrases. We show some prior efforts on hierarchical explanations, e.g. contextual decomposition, do not satisfy the desired properties mathematically, leading to inconsistent explanation quality in different models. In this paper, we propose a formal way to quantify the importance of each word or phrase to generate hierarchical explanations. We modify contextual decomposition algorithms according to our formulation, and propose a model-agnostic explanation algorithm with competitive performance. Human evaluation and automatic metrics evaluation on both LSTM models and fine-tuned BERT Transformer models on multiple datasets show that our algorithms robustly outperform prior works on hierarchical explanations. We show our algorithms help explain compositionality of semantics, extract classification rules, and improve human trust of models 1 .