Abstract. This paper considers an architecture for multimodal video
categorization referred to as Pivot Correlational Neural Network (Pivot
CorrNN). The architecture consists of modal-specific streams, each dedicated
exclusively to one modal input, and a modal-agnostic pivot stream that
considers all modal inputs without distinction; it refines the pivot
prediction based on the modal-specific predictions. The Pivot CorrNN consists
of three modules: (1) a maximizing pivot-correlation module that maximizes the correlation between the hidden
states as well as the predictions of the modal-agnostic pivot stream and
modal-specific streams in the network, (2) a contextual Gated Recurrent
Unit (cGRU) module that extends the capability of a generic GRU to
take multimodal inputs when updating the pivot hidden state, and (3) an
adaptive aggregation module that aggregates all modal-specific predictions as
well as the modal-agnostic pivot predictions into one final prediction. We
evaluate the Pivot CorrNN on two publicly available large-scale multimodal
video categorization datasets, FCVID and YouTube-8M. In the experiments,
Pivot CorrNN achieves the best performance on the FCVID dataset and
performance comparable to the state of the art on the YouTube-8M dataset.
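
To make the correlation objective of module (1) concrete, the following minimal sketch computes a loss that decreases as the pivot stream becomes more correlated with each modal-specific stream. It is an illustration under stated assumptions, not the formulation used in this paper: the abstract does not fix the correlation measure, so Pearson correlation over a single feature vector is assumed here, and the names `pearson_correlation` and `pivot_correlation_loss` are hypothetical.

```python
import math


def pearson_correlation(pivot, modal):
    """Pearson correlation between two equal-length sequences of floats."""
    if len(pivot) != len(modal) or len(pivot) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    n = len(pivot)
    mean_p = sum(pivot) / n
    mean_m = sum(modal) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(pivot, modal))
    var_p = sum((p - mean_p) ** 2 for p in pivot)
    var_m = sum((m - mean_m) ** 2 for m in modal)
    denom = math.sqrt(var_p * var_m)
    # A constant vector has zero variance; the correlation is treated as 0.
    return cov / denom if denom > 0 else 0.0


def pivot_correlation_loss(pivot, modal_streams):
    """Negative mean correlation between the pivot and each modal stream.

    Assumption: minimizing this value stands in for "maximizing the
    correlation" between the pivot and the modal-specific streams.
    """
    corrs = [pearson_correlation(pivot, m) for m in modal_streams]
    return -sum(corrs) / len(corrs)
```

Under this assumption, the same function could be applied to hidden states and to prediction vectors alike, since both are treated as plain sequences of floats.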