Neural Pinyin-to-Chinese Character Converter—can you do better than SwiftKey™ Keyboard?
In this project, we examine how well neural networks can convert Pinyin, the official romanization system for Chinese, into Chinese characters.
Requirements
numpy >= 1.11.1
TensorFlow >= 1.2
xpinyin (for Chinese pinyin annotation)
distance (for calculating the similarity score between two strings)
tqdm
Background
Because Chinese characters are not phonetic, various input methods have been proposed for typing them on digital devices. The most popular one uses Pinyin, the official romanization system for Chinese. When people write Chinese on a smartphone, they usually type Pinyin and expect the intended word(s) to appear on the suggestion bar. Accordingly, how accurately an engine can predict the word(s) the user has in mind is crucial for a Chinese keyboard.
Of the several kinds of Chinese keyboard, the two major ones are the Qwerty keyboard and the Nine keyboard. (See the animations on the right, in which someone types “woaini” to write 我爱你, which means “I love you.” Qwerty is on the left, and Nine is on the right.) In Qwerty, each letter has its own key, whereas in Nine each key carries a group of 3-4 letters and the machine is responsible for determining which one the user intended. Not surprisingly, it is more challenging to transliterate in Nine than in Qwerty.
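To make the extra ambiguity of Nine concrete, here is a minimal sketch. It assumes the standard telephone keypad grouping (2 = abc, ..., 9 = wxyz); the text above only says that letters are grouped 3-4 per key, so the exact layout, and the helper names `to_nine_key` and `candidate_strings`, are illustrative assumptions rather than part of this project's code.

```python
from itertools import product

# Assumed layout: the standard telephone keypad letter groups.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
# Reverse lookup: letter -> the key that carries it.
LETTER_TO_KEY = {letter: key for key, letters in KEYPAD.items() for letter in letters}


def to_nine_key(pinyin):
    """Map a lowercase Pinyin string to the key sequence typed on a Nine keyboard."""
    return "".join(LETTER_TO_KEY[letter] for letter in pinyin)


def candidate_strings(keys):
    """Enumerate every letter string that a key sequence could stand for."""
    return ["".join(letters) for letters in product(*(KEYPAD[k] for k in keys))]


keys = to_nine_key("woaini")
print(keys)                          # 962464
print(len(candidate_strings(keys)))  # 972 = 4 * 3**5 possible letter strings
```

In Qwerty the model sees "woaini" directly; in Nine it sees only "962464" and must implicitly pick the right one of 972 letter strings before (or while) choosing the characters.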
Problem Formulation
I frame the problem as a labelling task. In other words, every Pinyin letter in the input is assigned exactly one label: either a Chinese character or _, which stands for a blank. The output sequence therefore has the same length as the input.
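The labelling scheme can be sketched as follows. This is an illustration only: it assumes the Chinese character is placed on the first letter of its syllable and the remaining letters get the blank, which the text above does not specify. The helper name `make_labels` is hypothetical, and the syllable segmentation is passed in by hand rather than produced by xpinyin.

```python
def make_labels(syllables, characters, blank="_"):
    """Build an (input, label) pair of equal length for the labelling task.

    syllables  -- Pinyin syllables, one per Chinese character, e.g. ["wo", "ai", "ni"]
    characters -- the corresponding Chinese characters, e.g. "我爱你"
    Assumption: the character labels the first letter of its syllable.
    """
    if len(syllables) != len(characters):
        raise ValueError("need exactly one syllable per character")
    inputs, labels = [], []
    for syllable, character in zip(syllables, characters):
        inputs.append(syllable)
        # One label per letter: the character first, then blanks.
        labels.append(character + blank * (len(syllable) - 1))
    return "".join(inputs), "".join(labels)


x, y = make_labels(["wo", "ai", "ni"], "我爱你")
print(x)  # woaini
print(y)  # 我_爱_你_
```

Because input and output are aligned position by position, a model can emit one label per input letter, and the final text is recovered by dropping the blanks.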