Abstract
Segmenting a chunk of text into words is usually the first step of processing Chinese text,
but its necessity has rarely been explored.
In this paper, we ask the fundamental question
of whether Chinese word segmentation (CWS)
is necessary for deep learning-based Chinese
Natural Language Processing. We benchmark neural word-based models, which rely on word segmentation, against neural char-based models, which do not, on four end-to-end NLP benchmark tasks:
language modeling, machine translation, sentence matching/paraphrase, and text classification. Through direct comparisons between
these two types of models, we find that char-based models consistently outperform word-based models.
Based on these observations, we conduct comprehensive experiments to study why word-based models underperform char-based models in these deep learning-based NLP tasks.
We show that it is because word-based models
are more vulnerable to data sparsity and the
presence of out-of-vocabulary (OOV) words,
and thus more prone to overfitting. We hope
this paper will encourage researchers in the
community to rethink the necessity of word
segmentation in deep learning-based Chinese
Natural Language Processing.
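To make the two input granularities concrete, the following sketch (not from the paper; the example sentence and its word segmentation are illustrative assumptions) contrasts a char-based input, which needs no segmenter, with a word-based input, which requires a CWS step first:

```python
# Illustrative sketch: char-based vs. word-based input for Chinese NLP.
# Example sentence (an assumption, not from the paper): "我喜欢机器学习"
# ("I like machine learning").
sentence = "我喜欢机器学习"

# Char-based input: every character is a token; no segmentation required.
char_tokens = list(sentence)
print(char_tokens)  # ['我', '喜', '欢', '机', '器', '学', '习']

# Word-based input: a CWS system must first group characters into words.
# The segmentation below is hand-written for illustration; real pipelines
# would obtain it from a segmenter (e.g., a tool like jieba).
word_tokens = ["我", "喜欢", "机器", "学习"]
print(word_tokens)
```

Note that the word-based vocabulary grows with every distinct word the segmenter produces, which is one source of the data sparsity and OOV problems the abstract refers to.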