Abstract
We consider the scene text recognition problem under
the attention-based encoder-decoder framework, which is
the state of the art. Existing methods usually employ
a frame-wise maximum likelihood loss to optimize the models. During training, misalignment between the
ground-truth strings and the attention’s output sequences of
probability distributions, caused by missing or superfluous characters, can confuse and mislead the training
process, making training costly and degrading recognition accuracy. To handle this problem,
we propose a novel method called edit probability (EP)
for scene text recognition. EP aims to estimate
the probability of generating a string from the output sequence of probability distributions conditioned on the input
image, while accounting for possible missing or superfluous characters. The advantage is that the
training process can focus on the missing, superfluous and
unrecognized characters, so the impact of the misalignment problem can be alleviated or even overcome. We
conduct extensive experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets.
Experimental results show that EP can substantially
boost scene text recognition performance.
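
As a rough illustration of the misalignment problem described above, the following sketch computes a frame-wise negative log-likelihood for a toy example. It is not the EP method itself. The alphabet, the end-of-sequence symbol `$`, the 0.9 peak probability, and the helper names `peaked` and `framewise_nll` are all assumptions made for illustration. The decoder is imagined to skip one "l" in "hello", so every later output step is shifted by one position.

```python
import math

# Toy alphabet; "$" is a hypothetical end-of-sequence symbol (an assumption).
ALPHABET = "helo$"

def peaked(char, p=0.9):
    """Distribution over ALPHABET with mass p on `char`, the rest spread evenly."""
    rest = (1.0 - p) / (len(ALPHABET) - 1)
    return {c: (p if c == char else rest) for c in ALPHABET}

def framewise_nll(target, dists):
    """Per-step losses -log p_t(target[t]), comparing position t with step t only."""
    return [-math.log(d[ch]) for ch, d in zip(target, dists)]

target = "hello"

# Output steps that line up with the ground truth.
aligned = [peaked(c) for c in "hello"]
# Hypothetical output that misses one "l": "o" and "$" arrive one step early.
shifted = [peaked(c) for c in "helo$"]

steps_aligned = framewise_nll(target, aligned)
steps_shifted = framewise_nll(target, shifted)
loss_aligned = sum(steps_aligned)
loss_shifted = sum(steps_shifted)

print("aligned per-step loss:", [round(x, 3) for x in steps_aligned])
print("shifted per-step loss:", [round(x, 3) for x in steps_shifted])
```

In this toy case a single missing character makes the last two steps look wrong to the frame-wise loss, even though the "o" was recognized confidently. This is the kind of penalty that a loss accounting for missing or superfluous characters is meant to avoid.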