资源论文Distributed Clustering via LSH Based Data Partitioning

Distributed Clustering via LSH Based Data Partitioning

2020-03-20 | |  67 |   42 |   0

Abstract

Given the importance of clustering in the analysis of large scale data, distributed algorithms for for mulations such as k-means, k-median, etc. have been extensively studied. A successful approach here has been the “reduce and merge” paradigm, in which each machine reduces its input size to 图片.png(k), e and this data reduction continues (possibly iteratively) until all the data fits on one machine at which point the problem is solved locally. This approach has the intrinsic bottleneck that each machine must solve a problem of size 图片.png k, and needs to communicate at least 图片.png(k) points to the other machines. We propose a novel data partitioning idea to overcome this bottleneck, and in effect, have different machines focus on “finding different clusters”. Under the assumption that we know the optimum value of the objective up to a poly(n) factor (arbitrary polynomial), we establish worst-case approximation guarantees for our method. We see that our algorithm results in lower communication as well as a near-optimal number of ‘rounds’ of computation (in the popular MapReduce framework).

上一篇:Decoupled Parallel Backpropagation with Convergence Guarantee

下一篇:Learning Hidden Markov Models from Pairwise Co-occurrences with Application to Topic Modeling

用户评价
全部评价

热门资源

  • The Variational S...

    Unlike traditional images which do not offer in...

  • Learning to Predi...

    Much of model-based reinforcement learning invo...

  • Stratified Strate...

    In this paper we introduce Stratified Strategy ...

  • A Mathematical Mo...

    Direct democracy, where each voter casts one vo...

  • Rating-Boosted La...

    The performance of a recommendation system reli...