Abstract
Model distillation is an effective and widely used technique for transferring knowledge from a teacher network to a student network. It is typically applied to transfer from a powerful large network or ensemble to a small network, in order to meet low-memory or fast-execution requirements. In this paper, we present a deep mutual learning (DML) strategy. In contrast to the one-way transfer from a static, pre-defined teacher to a student in model distillation, in DML an ensemble of students learns collaboratively, with the students teaching each other throughout the training process. Our experiments show that a variety of network architectures benefit from mutual learning and achieve compelling results on both category and instance recognition tasks. Surprisingly, no prior powerful teacher network is necessary: mutual learning among a collection of simple student networks works, and moreover outperforms distillation from a more powerful yet static teacher.
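To make the collaborative training idea concrete, the following is a minimal sketch of mutual learning between two students, assuming each student is trained with its usual supervised cross-entropy loss plus a KL-divergence mimicry term toward the other student's predicted distribution (the exact loss formulation is given in the body of the paper). The network classes, optimizers, and data batches are placeholders, not part of the original abstract.

```python
# Hedged sketch of one deep mutual learning (DML) update for two students.
# Assumption: each student minimizes cross-entropy on the labels plus
# KL(peer's softmax || its own softmax); peers are treated as fixed targets
# within a single step (hence .detach()).
import torch
import torch.nn.functional as F

def dml_step(net1, net2, opt1, opt2, x, y):
    """One mutual-learning update for a pair of student networks."""
    # Student 1: supervised loss + mimicry of student 2.
    logits1, logits2 = net1(x), net2(x)
    loss1 = F.cross_entropy(logits1, y) + F.kl_div(
        F.log_softmax(logits1, dim=1),
        F.softmax(logits2.detach(), dim=1),
        reduction="batchmean",
    )
    opt1.zero_grad()
    loss1.backward()
    opt1.step()

    # Student 2: supervised loss + mimicry of the just-updated student 1.
    logits1, logits2 = net1(x), net2(x)
    loss2 = F.cross_entropy(logits2, y) + F.kl_div(
        F.log_softmax(logits2, dim=1),
        F.softmax(logits1.detach(), dim=1),
        reduction="batchmean",
    )
    opt2.zero_grad()
    loss2.backward()
    opt2.step()
    return loss1.item(), loss2.item()
```

In this sketch neither network is a fixed teacher; both are updated every step, which is the sense in which the students "teach each other" rather than receiving one-way distillation from a static model.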