Abstract. This paper introduces a new problem, called Visual Text
Correction (VTC), i.e., finding and replacing an inaccurate word in the
textual description of a video. We propose a deep network that can simultaneously detect an inaccuracy in a sentence, and fix it by replacing the
inaccurate word(s). Our method leverages the semantic interdependence
of videos and words, as well as the short-term and long-term relations of
the words in a sentence. Our proposed formulation can solve the VTC
problem employing an End-to-End network in two steps: (1)Inaccuracy
detection, and (2)correct word prediction. In detection step, each word of
a sentence is reconstructed such that the reconstruction for the inaccurate word is maximized. We exploit both Short Term and Long Term Dependencies employing respectively Convolutional N-Grams and LSTMs
to reconstruct the word vectors. For the correction step, the basic idea
is to simply substitute the word with the maximum reconstruction error
for a better one. The second step is essentially a classification problem
where the classes are the words in the dictionary as replacement options.
Furthermore, to train and evaluate our model, we propose an approach
to automatically construct a large dataset for the VTC problem. Our
experiments and performance analysis demonstrates that the proposed
method provides very good results and also highlights the general challenges in solving the VTC problem. To the best of our knowledge, this
work is the first of its kind for the Visual Text Correction task