Abstract
The past few years have seen a dramatic increase in the
performance of recognition systems thanks to the introduction of deep networks for representation learning. However,
the mathematical reasons for this success remain elusive.
A key issue is that the neural network training problem is
nonconvex, hence optimization algorithms may not return a
global minimum. This paper provides sufficient conditions to
guarantee that local minima are globally optimal and that
a local descent strategy can reach a global minimum from
any initialization. Our conditions require both the network
output and the regularization to be positively homogeneous
functions of the network parameters, with the regularization designed to control the network size. Our results apply to networks with one hidden layer, where size is measured by the number of neurons in the hidden layer, as well as to networks consisting of multiple deep subnetworks connected in parallel, where size is measured by the number of subnetworks.
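For concreteness, a minimal sketch of the positive homogeneity condition referred to above (the symbols $\Phi$, $W_1, \dots, W_K$, and $p$ are illustrative notation, not necessarily the paper's): a function of the network parameters is positively homogeneous of degree $p$ if scaling all parameters by $\lambda \ge 0$ scales its value by $\lambda^p$,
\[
  \Phi(\lambda W_1, \lambda W_2, \dots, \lambda W_K)
    = \lambda^{p}\, \Phi(W_1, W_2, \dots, W_K),
  \qquad \forall\, \lambda \ge 0 .
\]
For example, a single-hidden-layer ReLU network $x \mapsto W_2 \max(W_1 x, 0)$ is positively homogeneous of degree $p = 2$ in $(W_1, W_2)$, since $\max(\lambda W_1 x, 0) = \lambda \max(W_1 x, 0)$ for $\lambda \ge 0$.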