Abstract. In this paper, we propose a new memory network structure
for few-shot video classification by making the following contributions.
First, we propose a compound memory network (CMN) structure under the key-value memory network paradigm, in which each key memory
involves multiple constituent keys. These constituent keys work collaboratively during training, which enables the CMN to obtain an optimal video
representation in a larger space. Second, we introduce a multi-saliency
embedding algorithm which encodes a variable-length video sequence
into a fixed-size matrix representation by discovering multiple saliencies
of interest. For example, given a video of a car auction, some people are
interested in the car, while others are interested in the auction activities.
Third, we design an abstract memory on top of the constituent keys. The
abstract memory and constituent keys form a layered structure, which
makes the CMN more efficient and scalable, while also
retaining the representation capability of the multiple keys. We compare
CMN with several state-of-the-art baselines on a new few-shot video
classification dataset and show the effectiveness of our approach.
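
To make the second contribution concrete, the following is a minimal sketch of how a variable-length sequence of frame features could be pooled into a fixed-size matrix with one row per saliency. It is not the implementation used in this paper: the scaled dot-product scoring, the softmax over frames, and the names `multi_saliency_embed`, `descriptors`, and `random_matrix` are assumptions made for illustration, and the descriptors are drawn at random here, whereas in practice they would presumably be learned.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_saliency_embed(frames, descriptors):
    """Pool T frame features (T x d) into an m x d matrix.

    Each of the m saliency descriptors (hypothetical stand-ins for
    learned parameters) attends over all frames and yields one row.
    """
    d = len(frames[0])
    scale = math.sqrt(d)
    embedding = []
    for q in descriptors:
        # Relevance of every frame to this saliency descriptor.
        scores = [sum(qi * xi for qi, xi in zip(q, x)) / scale for x in frames]
        weights = softmax(scores)
        # Attention-weighted average of the frames: one fixed-size row.
        pooled = [sum(w * x[j] for w, x in zip(weights, frames)) for j in range(d)]
        embedding.append(pooled)
    return embedding


def random_matrix(rows, cols, rng):
    """Illustrative random features; real inputs would come from a frame encoder."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
descriptors = random_matrix(3, 8, rng)   # m = 3 saliencies, d = 8
short_video = random_matrix(5, 8, rng)   # 5 frames
long_video = random_matrix(40, 8, rng)   # 40 frames

emb_short = multi_saliency_embed(short_video, descriptors)
emb_long = multi_saliency_embed(long_video, descriptors)
print(len(emb_short), len(emb_short[0]))  # 3 8
print(len(emb_long), len(emb_long[0]))    # 3 8
```

Under these assumptions, both videos map to a 3 x 8 matrix regardless of their length, and each row can emphasize different frames (for instance, the car versus the auction activity in the example above).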