Abstract We address the problem of detecting scene text of arbitrary shapes, a challenging task due to the high variety and complexity of natural scenes. We treat text detection as instance segmentation and propose a segmentation-based framework, which extracts each text instance as an independent connected component. To distinguish among different text instances, our method maps pixels into an embedding space where pixels belonging to the same text instance are encouraged to lie close to each other, while pixels from different instances are pushed apart. In addition, we introduce a shape-aware loss that allows training to adaptively accommodate the varied aspect ratios of text instances and even the tiny gaps between neighboring instances. A new post-processing pipeline yields precise bounding box predictions. Experimental results on three challenging datasets (ICDAR15 [20], MSRA-TD500 [55] and CTW1500 [32]) demonstrate the effectiveness of our method.
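The pull/push behavior described in the abstract can be sketched with a minimal discriminative embedding loss. This is an illustrative NumPy sketch, not the paper's actual formulation: the function name `embedding_loss` and the margins `delta_v` (pull) and `delta_d` (push) are assumed hyperparameters introduced here for clarity.

```python
import numpy as np

def embedding_loss(embeddings, labels, delta_v=0.5, delta_d=1.5):
    """Illustrative pull/push loss over per-pixel embeddings.

    embeddings: (N, D) array, one D-dim embedding per pixel.
    labels:     (N,) integer instance id per pixel.
    delta_v:    pull margin -- pixels within this distance of their
                instance center incur no penalty (assumed value).
    delta_d:    push margin -- instance centers farther apart than
                this incur no penalty (assumed value).
    """
    pull_terms, centers = [], []
    for k in np.unique(labels):
        e = embeddings[labels == k]
        mu = e.mean(axis=0)          # instance center in embedding space
        centers.append(mu)
        # Pull: penalize pixels farther than delta_v from their center.
        d = np.linalg.norm(e - mu, axis=1)
        pull_terms.append(np.mean(np.maximum(d - delta_v, 0.0) ** 2))
    pull = float(np.mean(pull_terms))
    # Push: penalize pairs of instance centers closer than delta_d.
    push_terms = []
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            d = np.linalg.norm(centers[i] - centers[j])
            push_terms.append(max(delta_d - d, 0.0) ** 2)
    push = float(np.mean(push_terms)) if push_terms else 0.0
    return pull + push
```

With two tight, well-separated pixel clusters the loss is zero (pixels sit near their own center and the centers exceed the push margin); when pixels of different instances share the same embedding, the push term penalizes the collapsed centers.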