Abstract
Capturing document images is a common way to digitize and record physical documents, thanks to the ubiquity of mobile cameras. To make text recognition easier, it is often desirable to digitally flatten a document image when the physical document sheet is folded or curved.
In this paper, we develop the first learning-based method to
achieve this goal. We propose a stacked U-Net [25] with intermediate supervision to directly predict the forward mapping from a distorted image to its rectified version. Because
large-scale real-world data with ground truth deformation
is difficult to obtain, we create a synthetic dataset with approximately 100 thousand images by warping non-distorted
document images. The network is trained on this dataset
with various data augmentations to improve its generalization ability. We further create a comprehensive benchmark
that covers various real-world conditions. We evaluate the proposed model quantitatively and qualitatively on
the proposed benchmark and compare it with previous non-learning-based methods.