Abstract
Incidental scene text spotting is considered one of the
most difficult and valuable challenges in the document analysis community. Most existing methods treat text detection
and recognition as separate tasks. In this work, we propose a unified end-to-end trainable Fast Oriented Text Spotting (FOTS) network for simultaneous detection and recognition, sharing computation and visual information between
the two complementary tasks. Specifically, RoIRotate is introduced to share convolutional features between detection
and recognition. Thanks to this convolution-sharing strategy, FOTS adds little
computational overhead compared to a baseline text detection network, and joint
training lets it outperform two-stage methods. Experiments on ICDAR 2015, ICDAR 2017
MLT, and ICDAR 2013 datasets demonstrate that the proposed method significantly
outperforms state-of-the-art methods. This further allows us to develop the first real-time
oriented text spotting system, which surpasses all previous
state-of-the-art results by more than 5% on the ICDAR 2015
text spotting task while maintaining 22.6 fps.