Abstract. Subspace clustering methods based on expressing each data
point as a linear combination of a few other data points (e.g., sparse subspace clustering) have become a popular tool for unsupervised learning
due to their empirical success and theoretical guarantees. However, their
performance can be affected by imbalanced data distributions and large-scale datasets. This paper presents an exemplar-based subspace clustering method to tackle the problem of imbalanced and large-scale datasets.
The proposed method searches for a subset of the data that best represents all data points as measured by the ℓ1 norm of the representation
coefficients. To solve our model efficiently, we introduce a farthest first
search algorithm which iteratively selects the least well-represented point
as an exemplar. When the data come from a union of subspaces, we prove
that the computed subset contains enough exemplars from each subspace
for expressing all data points even if the data are imbalanced. Our experiments demonstrate that the proposed method outperforms state-of-the-art subspace clustering methods in two large-scale image datasets that
are imbalanced. We also demonstrate the effectiveness of our method on
unsupervised data subset selection for a face image classification task.
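
To make the selection idea concrete, the following is a minimal sketch of a farthest first search over representation costs; it is an illustration under stated assumptions, not the exact procedure of this paper. It assumes the cost of representing a point x by a candidate exemplar set is min_c ||c||_1 + (lam/2) ||x - A c||^2, where the columns of A are the current exemplars and lam is a hypothetical trade-off parameter. The function names rep_cost and farthest_first, the ISTA solver, and the random choice of the first exemplar are likewise assumptions made for this example.

```python
import numpy as np


def rep_cost(x, A, lam=10.0, n_iter=200):
    """Approximate min_c ||c||_1 + (lam/2) * ||x - A c||^2 with ISTA.

    The cost form and the parameter lam are assumptions of this sketch.
    """
    lipschitz = lam * np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the smooth term
    if lipschitz == 0.0:
        # All exemplars are zero vectors, so c = 0 is optimal.
        return 0.5 * lam * float(x @ x)
    c = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = lam * (A.T @ (A @ c - x))       # gradient of the quadratic term
        z = c - grad / lipschitz               # gradient step
        # Soft-thresholding is the proximal step for the l1 penalty.
        c = np.sign(z) * np.maximum(np.abs(z) - 1.0 / lipschitz, 0.0)
    residual = x - A @ c
    return float(np.abs(c).sum() + 0.5 * lam * (residual @ residual))


def farthest_first(X, k, lam=10.0, seed=0):
    """Greedily pick k column indices of X as exemplars.

    At each step, the point with the largest representation cost under the
    current exemplars (the least well-represented point) is added.
    """
    rng = np.random.default_rng(seed)
    n_points = X.shape[1]
    selected = [int(rng.integers(n_points))]   # assumed: random first exemplar
    while len(selected) < k:
        A = X[:, selected]
        costs = np.array([rep_cost(X[:, j], A, lam) for j in range(n_points)])
        costs[selected] = -np.inf              # never re-select an exemplar
        selected.append(int(np.argmax(costs)))
    return selected


# Imbalanced toy data: six points on one line, two on an orthogonal line.
X = np.array([[1.0, -1.0, 1.0, 1.0, -1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, -1.0]])
exemplars = farthest_first(X, k=2)
```

On this toy input the two selected exemplars come from different lines regardless of which point is drawn first, because any point on the line not yet covered has a much larger cost than points on the covered line. Recomputing every cost at each step, as done here for clarity, would be too slow at scale.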