Abstract
The non-indexed parts of the Internet (the
Darknet) have become a haven for both legal and illegal anonymous activity. Given the
magnitude of these networks, scalably monitoring their activity necessarily relies on automated tools, notably NLP tools. However, little is known about the characteristics of
texts communicated through the Darknet,
or about how well off-the-shelf NLP tools perform on
this domain. This paper addresses this gap and
performs an in-depth investigation of the characteristics of legal and illegal text in the Darknet, comparing it to a clear net website with
similar content as a control condition. Taking drug-related websites as a test case, we
find that texts for selling legal and illegal drugs
have several linguistic characteristics that distinguish them from one another, as well as
from the control condition, among them the
distribution of POS tags and the coverage of
their named entities in Wikipedia.