Abstract
This work presents an iterative re-alignment approach applicable to visual sequence labelling tasks such as gesture recognition, activity recognition and continuous sign language recognition. Previous methods dealing with video data usually rely on given frame labels to train their classifiers. In recent data sets, these labels often tend to be noisy, a problem that is commonly overlooked. We propose an algorithm that treats the provided training labels as weak labels and refines the label-to-image alignment on-the-fly in a weakly supervised fashion. Given a series of frames and sequence-level labels, a deep recurrent CNN-BLSTM network is trained end-to-end. Embedded into an HMM, the resulting deep model corrects the frame labels and continuously improves its performance over several re-alignments. We evaluate on two challenging publicly available sign recognition benchmark data sets featuring over 1000 classes. We outperform the state of the art by up to 10% absolute and 30% relative.
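To illustrate what a single re-alignment step might look like, the following is a minimal sketch and not the implementation used in this work. It assumes one HMM state per label, uniform transition scores, and that every label in the sequence covers at least one frame. The function name `forced_align` and the toy posteriors are hypothetical, and the frame-level log-probabilities stand in for the output of the trained network.

```python
import math

NEG_INF = float("-inf")


def forced_align(log_probs, label_seq):
    """Viterbi forced alignment of frames to an ordered label sequence.

    log_probs: list of T rows, each a list of per-class log-probabilities
               (assumed here to come from a frame classifier).
    label_seq: ordered class ids given by the sequence-level annotation.
    Returns a list of T frame labels that follows label_seq monotonically.
    Simplifying assumptions: one state per label, uniform transitions.
    """
    T, S = len(log_probs), len(label_seq)
    if S == 0 or T < S:
        raise ValueError("need at least one frame per label")

    # score[t][s]: best log-score of any path ending in state s at frame t
    score = [[NEG_INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_probs[0][label_seq[0]]

    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1][s]
            advance = score[t - 1][s - 1] if s > 0 else NEG_INF
            if advance > stay:
                best, back[t][s] = advance, s - 1
            else:
                best, back[t][s] = stay, s
            score[t][s] = best + log_probs[t][label_seq[s]]

    # Backtrace from the last state at the last frame.
    s = S - 1
    states = [s]
    for t in range(T - 1, 0, -1):
        s = back[t][s]
        states.append(s)
    states.reverse()
    return [label_seq[s] for s in states]


# Toy example: 5 frames, 3 classes, sequence-level labels [0, 2].
posteriors = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
    [0.1, 0.2, 0.7],
]
toy_log_probs = [[math.log(p) for p in row] for row in posteriors]
noisy_frame_labels = [0, 0, 0, 2, 2]  # hypothetical noisy annotation
refined_frame_labels = forced_align(toy_log_probs, [0, 2])
```

In an iterative scheme of the kind described above, the refined frame labels would replace the noisy ones as training targets, the network would be retrained, and the alignment would be recomputed with the updated model.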