Recurrent Convolutional Neural Networks
for Continuous Sign Language Recognition by Staged Optimization
Abstract
This work presents a weakly supervised framework with
deep neural networks for vision-based continuous sign language recognition, where each sign sentence video comes with ordered gloss labels but no exact temporal locations,
and the number of labeled sentences for training
is limited. Our approach addresses the mapping of video
segments to glosses by introducing a recurrent convolutional
neural network for spatio-temporal feature extraction and
sequence learning. We design a three-stage optimization
process for our architecture. First, we develop an end-to-end sequence learning scheme and employ connectionist
temporal classification (CTC) as the objective function for
alignment proposal. Second, we take the alignment proposal as stronger supervision to tune our feature extractor. Finally, we optimize the sequence learning model with
the improved feature representations, and design a weakly supervised detection network for regularization. We apply the proposed approach to a real-world continuous sign
language recognition benchmark, and our method, with no
extra supervision, achieves results comparable to the state of the art.
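
To make the notion of an alignment proposal concrete, the following minimal sketch shows how a frame-level CTC labeling relates to an ordered gloss sequence, and how gloss segments with temporal boundaries could be read off such a labeling. It is an illustration under stated assumptions, not the implementation used in this work: the blank symbol "-", the gloss names, and the helper names ctc_collapse and alignment_segments are all hypothetical, and the frame labeling is hand-written rather than predicted by a network.

```python
from itertools import groupby

BLANK = "-"  # hypothetical blank symbol; CTC reserves one label for "no gloss"


def ctc_collapse(frame_labels, blank=BLANK):
    """Map a frame-level labeling to a gloss sequence.

    Standard CTC collapsing: merge consecutive repeats, then drop blanks.
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


def alignment_segments(frame_labels, blank=BLANK):
    """Return (gloss, start, end) for each non-blank run, with end exclusive.

    These runs play the role of an alignment proposal: a guess at where
    each gloss occurs in time, derived only from sentence-level labels.
    """
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        if label != blank:
            segments.append((label, start, start + length))
        start += length
    return segments


# Hand-written example labeling over nine frames (illustrative glosses).
frames = ["-", "HELLO", "HELLO", "-", "-", "WORLD", "WORLD", "WORLD", "-"]

print(ctc_collapse(frames))
print(alignment_segments(frames))
```

In a staged scheme like the one outlined above, segments of this kind could serve as the stronger, segment-level supervision for tuning a feature extractor, since each segment pairs a span of frames with a single gloss label.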