Abstract. Test datasets should contain many different challenging aspects so that the robustness and real-world applicability of algorithms
can be assessed. In this work, we present a new test dataset for semantic
and instance segmentation for the automotive domain. We have conducted a thorough risk analysis to identify situations and aspects that can
reduce the output performance for these tasks. Based on this analysis
we have designed our new dataset. Meta-information is supplied to mark
which individual visual hazards are present in each test case. Furthermore, a new benchmark evaluation method is presented that uses the
meta-information to calculate the robustness of a given algorithm with
respect to the individual hazards. We show how this new approach allows for a more expressive characterization of algorithm robustness by
comparing three baseline algorithms.