VRCA: A Clustering Algorithm for Massive Amount of Texts

2019-11-21
Abstract: Large numbers of texts appear on the web every day, so the volume of web text is growing explosively, and handling large-scale text collections is becoming increasingly important. Clustering is a widely accepted approach to organizing text. Because it is unsupervised, users can easily mine the information they need. However, traditional clustering algorithms can only handle small-scale text collections; as a collection grows, their performance degrades. The main cause is the high-dimensional vectors generated from texts. To cluster large amounts of text, this paper proposes a novel clustering algorithm in which a cluster's vector preserves only the features that can represent the cluster. The clustering process is split into two parts. In one part, feature weights are fine-tuned so that the cluster partition satisfies an optimization function. In the other part, features are reordered and only the useful features that can represent the cluster are kept in the cluster's vector. Experimental results demonstrate that the algorithm achieves high performance on both small-scale and large-scale text collections.
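The abstract does not give the optimization function or the rule used to reorder features, so the following is only a minimal sketch of the general idea of a truncated cluster vector, not the paper's algorithm. It assumes documents are sparse term-weight dictionaries and that "useful" features are simply the `k` heaviest ones in the averaged cluster vector; the function name `truncate_centroid` and the parameter `k` are hypothetical.

```python
from collections import Counter


def truncate_centroid(member_vectors, k):
    """Average sparse term-weight vectors and keep only the k heaviest features.

    Assumption: feature usefulness is approximated by summed weight.
    The paper's actual selection criterion is not stated in the abstract.
    """
    total = Counter()
    for vec in member_vectors:
        total.update(vec)  # adds each term's weight to the running sum
    n = len(member_vectors)
    # Order features by descending weight; break ties by term for determinism.
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term: weight / n for term, weight in ranked[:k]}


# Two toy documents that belong to the same cluster.
docs = [
    {"ball": 2.0, "team": 1.0, "the": 0.5},
    {"ball": 1.0, "goal": 2.0, "the": 0.5},
]
centroid = truncate_centroid(docs, k=2)
print(centroid)
```

Capping the cluster vector at `k` entries keeps each document-to-cluster comparison bounded by `k` instead of by the vocabulary size, which is the kind of saving the abstract attributes to dropping non-representative features.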
