Textbook Question Answering with Multi-modal Context Graph
Understanding and Self-supervised Open-set Comprehension
Abstract
In this work, we introduce a novel algorithm for solving the textbook question answering (TQA) task, which poses more realistic QA problems than other recent tasks. Based on an analysis of the TQA dataset, we focus on two related issues. First, solving TQA problems requires comprehending multi-modal contexts in complicated input data. To extract knowledge features from long text lessons and merge them with visual features, we build a context graph from texts and images and propose f-GCN, a new module based on graph convolutional networks (GCN). Second, in the TQA dataset, scientific terms are not spread across chapters and subjects are split, so test questions can fall outside the trained domain. To overcome this so-called 'out-of-domain' issue, we introduce a novel self-supervised open-set learning process that requires no annotations and is applied before learning the QA problems. Experimental results show that our model significantly outperforms prior state-of-the-art methods. Moreover, ablation studies validate that both incorporating f-GCN to extract knowledge from multi-modal contexts and our newly proposed self-supervised learning process are effective for TQA problems.
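The abstract does not detail f-GCN itself; as background only, a single standard GCN propagation layer (symmetrically normalized adjacency with self-loops, as in Kipf & Welling) can be sketched as follows. The graph, feature, and weight shapes here are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)                     # node degrees of A_hat
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)    # linear transform + ReLU

# Toy context graph with 3 nodes (hypothetical example data)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.random.randn(3, 4)  # node features
W = np.random.randn(4, 2)  # layer weights
out = gcn_layer(A, H, W)
print(out.shape)  # (3, 2)
```

Stacking such layers lets each node aggregate information from progressively larger neighborhoods of the context graph built from the lesson's text and images.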