Abstract. Referring expression image segmentation aims to segment out the object referred to by a natural language query expression. Without considering the specific properties of visual and textual information, existing works usually address this task by directly feeding a foreground/background classifier with concatenated image and text features, which are extracted from each image region and the whole query, respectively. On the one hand, they ignore that each word in a query expression contributes differently to identifying the desired object, which calls for differential treatment when extracting the text feature. On the other hand, the relationships among different image regions are not considered either, even though they are crucial for eliminating undesired foreground objects according to the specific query. To address the aforementioned issues, we propose a key-word-aware network, which contains a query attention model and a key-word-aware visual context model. When extracting text features, the query attention model assigns higher weights to the words that are more important for identifying the object. Meanwhile, the key-word-aware visual context model captures the relationships among different image regions according to the corresponding query. Our proposed method outperforms state-of-the-art methods on two referring expression image segmentation databases.
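To make the query attention idea concrete, the sketch below shows one plausible form of word-level attention: per-word scores are softmax-normalized into weights and then used to pool the word features into a single query feature, so that more informative words dominate the pooled representation. The scoring vector `w` and the feature shapes are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def query_attention(word_feats, w):
    # word_feats: (T, D) per-word features; w: (D,) scoring vector
    # (both are hypothetical stand-ins for the learned components)
    scores = word_feats @ w          # one relevance score per word, shape (T,)
    alpha = softmax(scores)          # attention weights that sum to 1
    query_feat = alpha @ word_feats  # attention-weighted pooling, shape (D,)
    return alpha, query_feat

# toy usage: a 4-word query with 8-dimensional word features
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
w = rng.normal(size=8)
alpha, q = query_attention(feats, w)
print(alpha)  # a higher weight marks a word as more important to the query
```

In this reading, a word such as "left" in "the dog on the left" would receive a large weight, while function words like "the" would be downweighted, which is the differential treatment of words the abstract argues for.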