Tencent AI Lab Embedding Corpus for Chinese Words and Phrases

2019-09-09

A corpus on continuous distributed representations of Chinese words and phrases.


This corpus provides 200-dimensional vector representations, a.k.a. embeddings, for over 8 million Chinese words and phrases, pre-trained on large-scale, high-quality data. These vectors capture the semantic meaning of Chinese words and phrases and can be applied in many downstream Chinese language processing tasks (e.g., named entity recognition and text classification) and in further research.

Data Description

Download the corpus from: Tencent_AILab_ChineseEmbedding.tar.gz.

The pre-trained embeddings are in Tencent_AILab_ChineseEmbedding.txt. The first line gives the total number of embeddings and their dimension, separated by a space. Each subsequent line starts with a Chinese word or phrase, followed by a space and its embedding. The values of an embedding are separated by spaces, one value per dimension.
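The file layout described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function name `load_embeddings` and its `limit` parameter are hypothetical, and the sample data is made up with 3 dimensions instead of 200 so that it fits on screen.

```python
import io
from typing import Dict, List, Optional, TextIO, Tuple


def load_embeddings(
    handle: TextIO, limit: Optional[int] = None
) -> Tuple[int, int, Dict[str, List[float]]]:
    """Parse the described text format from an open text handle.

    Returns (declared_count, dimension, {word: vector}).
    """
    # Header line: "<number of embeddings> <dimension>"
    header = handle.readline().split()
    total, dim = int(header[0]), int(header[1])

    vectors: Dict[str, List[float]] = {}
    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        if len(parts) < dim + 1:
            continue  # skip blank or malformed lines
        # Assumption: take the last `dim` fields as the vector, so an
        # entry that itself contains a space is still read as one key.
        word = " ".join(parts[:-dim])
        vectors[word] = [float(value) for value in parts[-dim:]]
    return total, dim, vectors


# Tiny in-memory sample in the same layout (values are invented, dim = 3).
sample = io.StringIO("2 3\n皇帝菜 0.1 0.2 0.3\n打call -0.5 0.0 0.25\n")
total, dim, vectors = load_embeddings(sample)
print(total, dim, sorted(vectors))
```

For the real file, the same function could be given a handle such as `open("Tencent_AILab_ChineseEmbedding.txt", encoding="utf-8")`; the UTF-8 encoding is an assumption here. Since the corpus holds over 8 million entries, passing a `limit` is a cheap way to inspect the first few thousand lines without loading everything into memory.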


Compared with existing embedding corpora for Chinese, the main advantages of our corpus are coverage, freshness, and accuracy.

  • Coverage. Our vocabulary contains a large number of domain-specific words and slang terms, such as “喀拉喀什河”, “皇帝菜”, “不念僧面念佛面”, “冰火两重天”, and “煮酒论英雄”, which most existing embedding corpora do not cover.

  • Freshness. Our corpus contains new words that have appeared or become popular recently, such as “恋与制作人”, “三生三世十里桃花”, “打call”, “十动然拒”, and “因吹斯汀”.

  • Accuracy. Our embeddings better reflect the semantic meaning of Chinese words and phrases, owing to the large-scale training data and the well-designed training algorithm.
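One common way to inspect how well embeddings reflect semantic meaning is to compare vectors by cosine similarity: related words should score higher than unrelated ones. The sketch below is an illustration only and is not described in the source as the authors' evaluation method; the function name `cosine_similarity` and the three toy vectors are hypothetical and are not taken from the corpus.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (invented; real entries have 200 dimensions).
toy = {
    "word_a": [1.0, 0.0, 1.0],
    "word_b": [2.0, 0.0, 2.0],  # same direction as word_a
    "word_c": [0.0, 1.0, 0.0],  # orthogonal to word_a
}

print(cosine_similarity(toy["word_a"], toy["word_b"]))
print(cosine_similarity(toy["word_a"], toy["word_c"]))
```

With real vectors from the corpus, ranking the whole vocabulary by this score against a query word gives a nearest-neighbor list, which is a quick qualitative check on the entries.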


To ensure the coverage, freshness, and accuracy of our corpus, we carefully design our data preparation and training process in terms of the following aspects:

