English Language Model Word Prediction Competition Data [Kaggle Competition]

2019-12-25

Description:


This competition uses the billion-word benchmark corpus provided by Chelba et al. for language modeling. Rather than ask participants to create a classic language model and evaluate sentence probabilities -- a task which is difficult to faithfully score in Kaggle's supervised ML setting -- we have introduced a variation on the language modeling task.

For each sentence in the test set, we have removed exactly one word. Participants must create a model capable of inserting back the correct missing word at the correct location in the sentence. Submissions are scored using an edit distance to allow for partial credit.

We extend our thanks to the authors who created this corpus and shared it with the research community. Please cite this paper if you use this dataset in your research: Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Philipp Koehn: One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling, CoRR, 2013.

Note: the train/test split used in this competition is different than the published version used for language modeling. If you are creating full language models and scoring perplexity, you should download the official version of the corpus from the authors' website.


Evaluation:


Submissions are evaluated on the mean Levenshtein distance between the sentences you submit and the original sentences in the test set.
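
The metric above can be sketched in a few lines of Python for local validation. This is a minimal sketch, not the official scorer: it assumes the distance is computed over characters (the description does not say whether characters or words are the unit), and the function names `levenshtein` and `mean_levenshtein` are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete and substitute each cost 1.

    Assumption: the unit of comparison is a single character.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def mean_levenshtein(predicted, original):
    """Average distance over paired lists of sentences."""
    if not predicted or len(predicted) != len(original):
        raise ValueError("need two non-empty lists of equal length")
    total = sum(levenshtein(p, o) for p, o in zip(predicted, original))
    return total / len(predicted)
```

Under the character-level assumption, submitting a test sentence unchanged costs the length of the removed word plus one for its separating space, so an inserted guess helps only when it brings the sentence closer to the original than that baseline.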

Note: due to the size and computations necessary to score submissions for this competition, scoring may take 5-10 minutes, and possibly longer if there are other submissions in front of yours. Please be patient!

Submission File

Your submission file should contain the sentence id and a predicted sentence. To prevent parsing issues, enclose the sentence text in double quotes and write two double quotes ("") for each double quote within a sentence. Note that test.csv is a valid submission file itself.

The file should contain a header and have the following format:

id,"sentence"
1,"Former Dodgers manager , the team 's undisputed top ambassador , is going strong at 83 and serving up one great story after another ."
2,"8 parliamentary elections meant to restore democracy in this nuclear armed nation , a key ally against Islamic ."
3,"Sales of drink are growing 37 per cent month-on-month from a small base ."
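
The quoting rules above can be sketched as a small writer. This is a minimal sketch under the assumption that rows are `(id, sentence)` pairs held in memory; `format_row` and `write_submission` are illustrative names, and rows are formatted by hand so that the id stays unquoted, as in the sample.

```python
import io


def format_row(sentence_id: int, sentence: str) -> str:
    """Quote the sentence and double any embedded double quotes."""
    escaped = sentence.replace('"', '""')
    return f'{sentence_id},"{escaped}"'


def write_submission(rows, out) -> None:
    """Write the header shown above, then one line per (id, sentence) pair."""
    out.write('id,"sentence"\n')
    for sentence_id, sentence in rows:
        out.write(format_row(sentence_id, sentence) + "\n")


# Usage: write to an in-memory buffer; a real file object works the same way.
buffer = io.StringIO()
write_submission([(1, 'He said " hello " .')], buffer)
print(buffer.getvalue())
```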

Data Description:


The data for this competition is a large corpus of English language sentences. You should use only the sentences in the training set to build your model.

We have removed one word from each sentence in the test set. The location of the removed word was chosen uniformly at random and is never the first or last word of the sentence (in this dataset, the last word is always a period). You must attempt to submit the sentences in the test set with the missing word restored at the correct location.
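
The removal procedure described above can be mimicked on training sentences to build local validation data, and its constraints also limit where a guessed word may be inserted. The following is a minimal sketch, assuming sentences are tokenized by whitespace (as the sample rows suggest); `remove_random_word` and `candidate_insertions` are hypothetical helpers, not part of the competition materials.

```python
import random


def remove_random_word(sentence: str, rng: random.Random):
    """Drop one interior token, chosen uniformly; never the first or last."""
    tokens = sentence.split()
    if len(tokens) < 3:
        raise ValueError("need at least three tokens to remove an interior one")
    index = rng.randrange(1, len(tokens) - 1)  # interior positions only
    removed = tokens.pop(index)
    return " ".join(tokens), removed, index


def candidate_insertions(sentence: str, word: str):
    """Every sentence obtained by inserting `word` at an allowed gap.

    The gaps before the first token and after the final period are skipped,
    because the removed word was never the first or last word.
    """
    tokens = sentence.split()
    return [" ".join(tokens[:i] + [word] + tokens[i:])
            for i in range(1, len(tokens))]


# Usage: the original sentence is always among the candidates.
rng = random.Random(0)
original = "The cat sat on the mat ."
shortened, removed, index = remove_random_word(original, rng)
assert original in candidate_insertions(shortened, removed)
```

A model then has to choose both the word and the gap; scoring each candidate with a language model trained on the training set is one possible approach, not the one prescribed by the competition.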

File descriptions

  • train.txt - the training set, contains a large collection of English language sentences

  • test.txt - the test set, contains a large number of sentences where one word has been removed


