A corpus of continuous distributed representations of Chinese words and phrases.
Introduction
This corpus provides 200-dimensional vector representations, a.k.a. embeddings, for over 8 million Chinese words and phrases, pre-trained on large-scale, high-quality data. These vectors capture the semantic meanings of Chinese words and phrases and can be widely applied in many downstream Chinese processing tasks (e.g., named entity recognition and text classification) and in further research.
Data Description
Download the corpus from: Tencent_AILab_ChineseEmbedding.tar.gz.
The pre-trained embeddings are in Tencent_AILab_ChineseEmbedding.txt. The first line shows the total number of embeddings and their dimension size, separated by a space. In each line below, the first column indicates a Chinese word or phrase, followed by a space and its embedding. For each embedding, its values in different dimensions are separated by spaces.
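The format described above can be parsed line by line. Below is a minimal Python sketch; the function name and the tiny in-memory sample (including its made-up values) are illustrative, not part of the released corpus:

```python
def load_embeddings(lines):
    """Parse the embedding text format described above.

    The first line holds "<count> <dim>"; every following line holds a
    word or phrase, a space, then <dim> space-separated float values.
    """
    it = iter(lines)
    count, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        parts = line.rstrip("\n").split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        assert len(values) == dim, f"unexpected row length for {word!r}"
        vectors[word] = values
    return count, dim, vectors

# Tiny illustrative sample in the same layout (values are made up;
# the real corpus uses 200 dimensions).
sample = ["2 3", "你好 0.1 0.2 0.3", "打call 0.4 0.5 0.6"]
count, dim, vecs = load_embeddings(sample)
```

In practice the file can also be loaded with an off-the-shelf reader for word2vec-style text files, since the layout is the same; the sketch above only shows what such a reader does internally.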
Highlights
In comparison with existing embedding corpora for Chinese, our corpus stands out mainly in coverage, freshness, and accuracy.
Coverage. Our vocabulary contains a large number of domain-specific words and slang terms, such as “喀拉喀什河”, “皇帝菜”, “不念僧面念佛面”, “冰火两重天”, “煮酒论英雄”, which are not covered by most existing embedding corpora.
Freshness. Our corpus contains newly coined or recently popularized words, such as “恋与制作人”, “三生三世十里桃花”, “打call”, “十动然拒”, “因吹斯汀”, etc.
Accuracy. Our embeddings better reflect the semantic meanings of Chinese words and phrases, owing to the large-scale training data and the well-designed training algorithm.
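Semantic closeness between two embeddings is typically measured with cosine similarity. A minimal sketch follows; the example vectors are made up for illustration (the real corpus uses 200 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors:
    dot(a, b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative vectors: semantically similar words should score
# close to 1.0, unrelated words closer to 0.0.
v1 = [0.1, 0.2, 0.3]
v2 = [0.1, 0.2, 0.25]
sim = cosine_similarity(v1, v2)
```

With vectors loaded from the corpus, the same function can rank candidate words by similarity to a query word, which is one simple way to probe the semantic quality claimed above.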
Training
To ensure the coverage, freshness, and accuracy of our corpus, we carefully design our data preparation and training process in terms of the following aspects: