Abstract. Text extraction from scene videos, whether captured by robots or by users, faces several major challenges, e.g., heterogeneous backgrounds, varied text, nonuniform illumination, arbitrary motion, and poor contrast. Most previous video text detection methods rely on local information, i.e., information within individual frames, which limits their performance. In this paper, we propose a unified tracking-based text detection system that learns both locally and globally and integrates detection, tracking, recognition, and their interactions in a single framework. In this system, scene text is first detected locally in individual frames. Second, an optimal tracking trajectory is learned and linked globally from all detection, recognition, and prediction information by dynamic programming. From this trajectory, the final detection and tracking results are obtained simultaneously and immediately. Moreover, the proposed techniques are extensively evaluated on several public scene video text databases and perform much better than state-of-the-art methods.
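To illustrate the global linking step, the following is a minimal sketch of how a trajectory could be chosen across frames by dynamic programming. It is not the formulation used in this paper: the abstract does not specify the cost function, so the sketch assumes that each frame supplies candidate boxes with a confidence score and that the trajectory score is the sum of those confidences plus the overlap (IoU) between consecutive boxes. The names `iou` and `link_trajectory` are hypothetical, and recognition and prediction cues are omitted.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout
Candidate = Tuple[Box, float]            # (box, detection confidence)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def link_trajectory(frames: List[List[Candidate]]) -> List[int]:
    """Pick one candidate index per frame so that the summed score
    (confidence + IoU with the previous pick) is maximal.

    Assumed scoring only; every frame must have at least one candidate.
    """
    if not frames or any(not f for f in frames):
        raise ValueError("every frame needs at least one candidate")

    # best[i]: best total score of any path ending at candidate i
    best = [score for _, score in frames[0]]
    back: List[List[int]] = []  # back-pointers for frames 1..T-1

    for t in range(1, len(frames)):
        prev = frames[t - 1]
        cur_best, cur_back = [], []
        for box, score in frames[t]:
            # Choose the predecessor that maximizes the accumulated score.
            j = max(range(len(prev)),
                    key=lambda i: best[i] + iou(prev[i][0], box))
            cur_best.append(best[j] + iou(prev[j][0], box) + score)
            cur_back.append(j)
        best = cur_best
        back.append(cur_back)

    # Trace the optimal path backwards from the best final candidate.
    idx = max(range(len(best)), key=lambda i: best[i])
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return path


frames = [
    [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)],
    [((52, 50, 62, 60), 0.7), ((1, 0, 11, 10), 0.6)],
    [((2, 0, 12, 10), 0.9), ((80, 80, 90, 90), 0.95)],
]
path = link_trajectory(frames)
```

In this toy input the last frame contains an isolated box with the highest single-frame confidence, yet the linked path follows the slowly moving box near the origin, because the global score rewards temporal consistency as well as per-frame confidence.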