A corpus of continuous distributed representations of Chinese words and phrases.
Introduction
This corpus provides 200-dimensional vector representations, a.k.a. embeddings, for over 8 million Chinese words and phrases, pre-trained on large-scale, high-quality data. These vectors capture the semantic meanings of Chinese words and phrases and can be widely applied in many downstream Chinese processing tasks (e.g., named entity recognition and text classification) and in further research.
Data Description
Download the corpus from: Tencent_AILab_ChineseEmbedding.tar.gz.
The pre-trained embeddings are in Tencent_AILab_ChineseEmbedding.txt. The first line shows the total number of embeddings and their dimension size, separated by a space. In each line below, the first column indicates a Chinese word or phrase, followed by a space and its embedding. For each embedding, its values in different dimensions are separated by spaces.
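The format described above can be parsed line by line. Below is a minimal Python sketch; the function name and the tiny in-memory sample (including its made-up values) are illustrative, not part of the released corpus:

```python
def load_embeddings(lines):
    """Parse the embedding text format described above.

    The first line holds "<count> <dim>"; every following line holds a
    word or phrase, a space, then <dim> space-separated float values.
    """
    it = iter(lines)
    count, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        parts = line.rstrip("\n").split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        assert len(values) == dim, f"unexpected row length for {word!r}"
        vectors[word] = values
    return count, dim, vectors

# Tiny illustrative sample in the same layout (values are made up;
# the real corpus uses 200 dimensions).
sample = ["2 3", "你好 0.1 0.2 0.3", "打call 0.4 0.5 0.6"]
count, dim, vecs = load_embeddings(sample)
```

In practice the file can also be loaded with an off-the-shelf reader for word2vec-style text files, since the layout is the same; the sketch above only shows what such a reader does internally.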
Highlights
In comparison with existing embedding corpora for Chinese, our corpus stands out mainly in coverage, freshness, and accuracy.
Coverage. Our vocabulary contains a large number of domain-specific words and slang terms, such as “喀拉喀什河”, “皇帝菜”, “不念僧面念佛面”, “冰火两重天”, “煮酒论英雄”, which are not covered by most existing embedding corpora.
Freshness. Our corpus contains newly coined or recently popularized words, such as “恋与制作人”, “三生三世十里桃花”, “打call”, “十动然拒”, “因吹斯汀”, etc.
Accuracy. Our embeddings better reflect the semantic meanings of Chinese words and phrases, owing to the large-scale training data and the well-designed training algorithm.
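Semantic closeness between two embeddings is typically measured with cosine similarity. A minimal sketch follows; the example vectors are made up for illustration (the real corpus uses 200 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors:
    dot(a, b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative vectors: semantically similar words should score
# close to 1.0, unrelated words closer to 0.0.
v1 = [0.1, 0.2, 0.3]
v2 = [0.1, 0.2, 0.25]
sim = cosine_similarity(v1, v2)
```

With vectors loaded from the corpus, the same function can rank candidate words by similarity to a query word, which is one simple way to probe the semantic quality claimed above.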
Training
To ensure the coverage, freshness, and accuracy of our corpus, we carefully design our data preparation and training process in terms of the following aspects: