Abstract
Text classification aims at mapping documents
into a set of predefined categories. Supervised machine learning models have shown
great success in this area but they require a
large number of labeled documents to reach
adequate accuracy. This is particularly true
when the number of target categories is in
the tens or the hundreds. In this work, we
explore an unsupervised approach to classify
documents into categories simply described by
a label. The proposed method is inspired by
the way a human proceeds in this situation: It
draws on textual similarity between the most
relevant words in each document and a dictionary
of keywords for each category reflecting its
semantics and lexical field. The novelty of our
method hinges on the enrichment
of the category labels through a combination
of human expertise and language models, both
generic and domain specific. Our experiments
on five standard corpora show that the proposed
method achieves a higher F1-score than relying solely
on human expertise and can also be on par with
simple supervised approaches. It thus provides
a practical alternative in situations where
low-cost text categorization is needed, as we
illustrate with our application to operational risk
incident classification.
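
To make the core idea concrete, the following is a minimal sketch of matching a document's most frequent words against a per-category keyword dictionary. It is an illustration under stated assumptions, not the method evaluated in this work: the keyword lists, the stopword list, the `classify` helper and its `top_k` parameter are all hypothetical, plain cosine similarity over word counts stands in for the textual similarity measure, and the enrichment of labels through human expertise and language models is left out.

```python
import math
import re
from collections import Counter

# Hypothetical keyword dictionaries, one per category label (illustrative only).
CATEGORY_KEYWORDS = {
    "sports": ["match", "team", "score", "player", "league"],
    "finance": ["bank", "market", "stock", "loan", "investment"],
}

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "its", "for"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(document, category_keywords=CATEGORY_KEYWORDS, top_k=10):
    """Return (best_label, scores) using the document's top_k most frequent words."""
    counts = Counter(tokenize(document))
    # Assumption: raw frequency is used as a proxy for word relevance.
    top_words = Counter(dict(counts.most_common(top_k)))
    scores = {
        label: cosine(top_words, Counter(keywords))
        for label, keywords in category_keywords.items()
    }
    return max(scores, key=scores.get), scores


label, scores = classify("The bank raised its loan rates as the stock market fell.")
print(label, scores)
```

In this sketch no labeled training documents are needed; only the keyword dictionaries change when categories are added. If a document shares no word with any dictionary, all scores are zero and the first category is returned, so a practical variant would likely add a rejection threshold.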