Learn Smart with Less: Building Better Online
Decision Trees with Fewer Training Examples
Abstract
Online decision tree models are extensively used in many industrial machine learning applications for real-time classification tasks. These models are highly accurate, scalable, and easy to use in practice. The Very Fast Decision Tree (VFDT) is the classic online decision tree induction model and has been widely adopted due to its theoretical guarantees as well as its competitive performance. However, VFDT and its variants rely solely on conservative statistical measures such as the Hoeffding bound to incrementally grow the tree, which makes these models overly cautious and limits how quickly they can learn. In this paper, we efficiently employ statistical resampling techniques to build an online tree faster using fewer examples. We first show theoretically that a naive implementation of resampling techniques such as the non-parametric bootstrap does not scale, due to large memory and computational overheads. We mitigate this by proposing a robust, memory-efficient bootstrap simulation heuristic (Mem-ES) that successfully expedites the learning process. Experimental results on both synthetic data and large-scale real-world datasets demonstrate the efficiency and effectiveness of our proposed technique.
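As a rough illustration of the conservatism the abstract refers to (this is background on VFDT-style learners, not the paper's proposed method), the Hoeffding bound can be sketched as follows. The function names and parameter values here are assumptions chosen for the example:

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound epsilon: with probability at least 1 - delta, the
    observed mean of n i.i.d. samples of a statistic bounded in an interval
    of width `value_range` is within epsilon of its true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def examples_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n such that hoeffding_bound(value_range, delta, n) <= epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))

# A VFDT-style learner splits a leaf only once the gap between the best and
# second-best split's heuristic score exceeds epsilon, so small gaps require
# many examples before the tree is allowed to grow.
n = examples_needed(value_range=1.0, delta=1e-7, epsilon=0.05)
```

With a tight confidence parameter such as delta = 1e-7, several thousand examples per leaf are needed before a split with a modest score gap is accepted, which is the waiting cost that resampling-based approaches aim to reduce.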