Abstract
Convolutional networks (ConvNets) have become the
dominant approach to semantic image segmentation.
Producing accurate, pixel-level labels required for this task is
a tedious and time-consuming process; however, producing
approximate, coarse labels could take only a fraction of
the time and effort. We investigate the relationship between
the quality of labels and the performance of ConvNets for
semantic segmentation. We create a very large synthetic
dataset with perfectly labeled street view scenes. From these
perfect labels, we synthetically coarsen labels with different
qualities and estimate human-hours required for producing
them. We perform a series of experiments by training
ConvNets with a varying number of training images and label
quality. We found that the performance of ConvNets mostly
depends on the time spent creating the training labels. That
is, a larger coarsely-annotated dataset can yield the same
performance as a smaller finely-annotated one.
Furthermore, fine-tuning coarsely pre-trained ConvNets with few
finely-annotated labels can yield comparable or superior
performance to training them with a large amount of
finely-annotated labels alone, at a fraction of the labeling cost. We
demonstrate that our result is also valid for different network
architectures, and various object classes in an urban scene.