The original Wikilinks dataset contains a single entity per sentence, which is not enough to build a state-of-the-art named entity linking system, since the context of an entity may carry valuable information. Moreover, POS-tagging 40 million sentences takes considerable time, so this information is already included in this extended dataset.
Enrichment
Two main mechanisms are involved in the enrichment of the Wikilinks dataset: a CoreNLP pipeline and a search for additional entities within the dataset itself using the Elasticsearch engine. Let's describe both.
CoreNLP processing
Each text abstract in the original dataset was analysed with the CoreNLP library: the abstract is divided into sentences, and the sentence containing the mention passes through the following pipeline:
tokenize, ssplit, pos, lemma, parse
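For readers who want to reproduce the annotation step, a CoreNLP pipeline is typically configured through a Properties object. The sketch below only builds that configuration; constructing the actual StanfordCoreNLP pipeline requires the CoreNLP jar on the classpath, so that call is shown as a comment:

```java
import java.util.Properties;

public class PipelineConfig {
    // The annotator chain used for enrichment, exactly as listed above.
    public static Properties annotatorProperties() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse");
        return props;
    }

    public static void main(String[] args) {
        // With edu.stanford.nlp on the classpath, the pipeline would be built as:
        // StanfordCoreNLP pipeline = new StanfordCoreNLP(annotatorProperties());
        System.out.println(annotatorProperties().getProperty("annotators"));
    }
}
```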
Since the parse step builds a tree, the result of this step was converted into groups (analogous to chunking). As a result, each token in the enriched dataset stores 5 parameters:
Token
Lemma
POS-tag
Parse tag (the tag assigned by the parser)
Group id
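The five per-token fields can be pictured as a small container. This is only an illustrative sketch; the authoritative field names and types are defined in the sentence.proto file:

```java
public class EnrichedTokenDemo {
    // Hypothetical holder for the 5 per-token parameters listed above;
    // the real message layout is defined in sentence.proto.
    record EnrichedToken(
            String token,    // surface form
            String lemma,    // CoreNLP lemma
            String posTag,   // part-of-speech tag
            String parseTag, // tag assigned by the parser
            int groupId      // chunk-like group derived from the parse tree
    ) {}

    public static void main(String[] args) {
        EnrichedToken t = new EnrichedToken("cities", "city", "NNS", "NP", 3);
        System.out.println(t);
    }
}
```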
The previous and next sentences are also stored in the dataset, but as raw text.
Elasticsearch processing
In order to enrich sentences with additional links, each Noun Phrase (NP) extracted in the CoreNLP step (except for stop words) was searched for among the mentions of the original dataset, with respect to context. A single entity is likely to have similar mentions across texts, but such a search may introduce ambiguity. For this reason, all search results above a minimal score threshold are stored in the extended dataset. Each additional mention carries its hit count and average search score.
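The per-mention summary described above can be sketched as follows. This is not the dataset's build code: the threshold value is an assumption, and the real pipeline queries Elasticsearch rather than a list of scores.

```java
import java.util.List;

public class MentionAggregator {
    // Hypothetical minimal score threshold; the actual value used when
    // building the dataset is not documented here.
    static final double MIN_SCORE = 1.0;

    // Given the scores of all search hits for one noun phrase, keep only hits
    // at or above the threshold and summarize them the way each additional
    // mention is stored: {hit count, average search score}.
    static double[] summarize(List<Double> scores) {
        List<Double> kept = scores.stream().filter(s -> s >= MIN_SCORE).toList();
        if (kept.isEmpty()) return new double[] {0, 0.0};
        double avg = kept.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        return new double[] {kept.size(), avg};
    }

    public static void main(String[] args) {
        double[] s = summarize(List.of(0.4, 1.2, 2.0));
        System.out.printf("hits=%d avgScore=%.2f%n", (int) s[0], s[1]); // hits=2 avgScore=1.60
    }
}
```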
Storage format
The dataset consists of 79 protobuf files of 500,000 sentences each, compressed with tar and gzip. Use the following command to extract the content:
tar -xzvf archive_name.tgz
Each protobuf file consists of protobuf-encoded messages, each preceded by its length encoded as a varint:
[varint(length of msg1)] [msg1] [varint(length of msg2)] [msg2] ...
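The framing above can be read with nothing but the standard library. The sketch below decodes the base-128 varint lengths and splits the stream into raw message payloads; each payload would then be parsed with the classes generated from sentence.proto (this is exactly what the delimited-reading helpers mentioned below do internally):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class FrameReader {
    // Read one base-128 varint: 7 payload bits per byte, least significant
    // group first, high bit set on every byte except the last.
    // Returns -1 at end of stream.
    static int readVarint(InputStream in) throws IOException {
        int result = 0, shift = 0, b;
        while ((b = in.read()) != -1) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
            shift += 7;
        }
        return -1;
    }

    // Split a length-delimited stream into raw message payloads.
    static List<byte[]> readAllFrames(InputStream in) throws IOException {
        List<byte[]> frames = new ArrayList<>();
        int len;
        while ((len = readVarint(in)) != -1) {
            byte[] payload = in.readNBytes(len);
            if (payload.length != len) throw new EOFException("truncated message");
            frames.add(payload);
        }
        return frames;
    }

    public static void main(String[] args) throws IOException {
        // Two frames: "ab" (length 2) and a 300-byte frame (300 encodes as 0xAC 0x02).
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(2); out.write('a'); out.write('b');
        out.write(0xAC); out.write(0x02); out.write(new byte[300]);
        List<byte[]> frames = readAllFrames(new ByteArrayInputStream(out.toByteArray()));
        System.out.println(frames.size() + " frames: "
                + frames.get(0).length + ", " + frames.get(1).length); // 2 frames: 2, 300
    }
}
```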
A description of the protobuf schema can be found in the sentence.proto file.
To read messages from this format, you can use parseDelimitedFrom in Java or streamFromDelimitedInput in Scala. A Ruby example is also available in the reader.rb file.