Abstract
Sign Language Recognition (SLR) has been an active
research field for the last two decades. However, most
research to date has considered SLR as a naive gesture
recognition problem. SLR seeks to recognize a sequence of
continuous signs but neglects the underlying rich grammatical and linguistic structures of sign language that differ
from spoken language. In contrast, we introduce the Sign
Language Translation (SLT) problem. Here, the objective
is to generate spoken language translations from sign
language videos, taking into account the different word
orders and grammar.
We formalize SLT in the framework of Neural Machine
Translation (NMT) for both end-to-end and pretrained
settings (using expert knowledge). This allows us to jointly
learn the spatial representations, the underlying language
model, and the mapping between sign and spoken language.
To evaluate the performance of Neural SLT, we collected
the first publicly available Continuous SLT dataset, RWTH-PHOENIX-Weather 2014T. It provides spoken language translations and gloss-level annotations for German Sign Language videos of weather broadcasts. Our dataset contains over 950K frames with >67K signs from a sign vocabulary of >1K and >99K words from a German vocabulary
of >2.8K. We report quantitative and qualitative results for
various SLT setups to underpin future research in this newly
established field. The upper bound for translation performance is calculated at 19.26 BLEU-4, while our end-to-end frame-level and gloss-level tokenization networks were able to achieve 9.58 and 18.13, respectively.