
Wikilinks Cross-Document Corpus, Extended Version


ExtWikilinks

ExtWikilinks is a dataset derived from the original Wikilinks corpus (http://www.iesl.cs.umass.edu/data/wiki-links), enriched with CoreNLP annotations and Elasticsearch-based entity search.

Download at: http://academictorrents.com/details/80d0a22ed403b65f7cc0d81d51759b62c66b41ce

Purpose

The original Wikilinks dataset contains a single entity per sentence, which is not enough to build a state-of-the-art named entity linking system, since the context of an entity may carry valuable information. Moreover, applying POS tagging to 40 million sentences takes a long time, so that information is already included in this extended dataset.

Enrichment

Two main mechanisms were involved in enriching the Wikilinks dataset: a CoreNLP pipeline and a search for additional entities within the dataset itself using Elasticsearch. Both are described below.

CoreNLP processing

Each text abstract in the original dataset was analysed with the CoreNLP library: the abstract is split into sentences, and the sentence containing the mention passes through the following annotator chain:

tokenize, ssplit, pos, lemma, parse

Since the parse step builds a tree, its result was converted into groups (analogous to chunking). As a result, each token in the enriched dataset stores five parameters:

  • Token

  • Lemma

  • POS-tag

  • Parse tag (tag, assigned by parser)

  • Group id

The previous and next sentences are also stored in the dataset, but as raw text.
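
As a rough illustration of this processing step, the Java sketch below runs the same annotator chain with CoreNLP and prints the token, lemma and POS tag for each token. It is only a sketch of the kind of processing described above, not the exact code used to build the dataset; the parse tag and group id are derived separately from the parse tree.

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.Properties;

public class AnnotateAbstract {
    public static void main(String[] args) {
        // Same annotator chain as listed above.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,parse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation(
                "Barack Obama was born in Hawaii. He was elected president in 2008.");
        pipeline.annotate(doc);

        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // Token, lemma and POS tag are three of the five per-token
                // fields stored in the enriched dataset.
                System.out.printf("%s\t%s\t%s%n",
                        token.word(), token.lemma(), token.tag());
            }
        }
    }
}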

Elasticsearch processing

To enrich sentences with additional links, each noun phrase (NP) extracted in the CoreNLP step (excluding stop words) was searched against the mentions of the original dataset, taking context into account. A single entity is likely to have similar mentions across texts, but this search can introduce ambiguity. For this reason, all search results above a minimal threshold are stored in the extended dataset, and each additional mention carries its hit count and average search score.
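
Purely as an illustration of such a lookup (the index name "mentions" and the field "mention_text" below are assumptions, not part of the released dataset), a noun phrase could be matched against an Elasticsearch index of mentions with a plain HTTP query; the per-hit scores in the JSON response are what the hit count and average search score fields summarise.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class MentionSearch {
    public static void main(String[] args) throws Exception {
        String np = "european union";  // example noun phrase from the CoreNLP step
        // Hypothetical index and field names; adjust to your own mention index.
        String query = "{ \"size\": 10, \"query\": { \"match\": { \"mention_text\": \""
                + np + "\" } } }";

        URL url = new URL("http://localhost:9200/mentions/_search");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(query.getBytes(StandardCharsets.UTF_8));
        }
        // The response contains the number of hits and a _score per hit,
        // from which a hit count and average score can be computed.
        try (Scanner sc = new Scanner(conn.getInputStream(), "UTF-8")) {
            while (sc.hasNextLine()) System.out.println(sc.nextLine());
        }
    }
}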

Storage format

The dataset consists of 79 protobuf files with 500,000 sentences each, compressed as tar.gz archives. Use the following command to extract the content:

tar -xzvf archive_name.tgz

Each protobuf file consists of protobuf-encoded messages, each preceded by its length encoded as a varbyte:

[varbyte(length of msg1)] [msg1] [varbyte(length of msg2)] [msg2] ...

The protobuf schema used is described in the sentence.proto file.

To read messages from these files, you can use parseDelimitedFrom in Java or streamFromDelimitedInput in Scala. A Ruby example is also available in the reader.rb file.
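
For instance, a minimal Java reader might look like the sketch below. It assumes the class generated from sentence.proto is named Sentence; the actual name in the schema may differ.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

public class DumpSentences {
    public static void main(String[] args) throws Exception {
        // args[0]: path to one extracted protobuf file.
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            Sentence msg;  // assumed name of the class generated from sentence.proto
            // parseDelimitedFrom reads a length prefix followed by the
            // message body, and returns null at end of stream.
            while ((msg = Sentence.parseDelimitedFrom(in)) != null) {
                System.out.println(msg);
            }
        }
    }
}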

Basic statistics

Number of sentences: 40 million

Number of unique entities: 2 million

