
European Parliament Proceedings Parallel Corpus: machine translation data

2019-12-18

The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romanic (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.

Size of the Corpus

Sizes for single-language data after removing XML.


Language      Sentences    Words
Bulgarian       411,636    -
Czech           668,595    13,195,311
Danish        2,323,099    47,761,381
German        2,176,537    47,236,849
Greek         1,517,141    -
English       2,218,201    53,974,751
Spanish       2,123,835    54,806,927
Estonian        692,210    11,358,009
Finnish       2,119,515    33,708,706
French        2,190,579    54,202,850
Hungarian       658,824    12,606,986
Italian       2,081,669    50,259,169
Lithuanian      678,665    11,512,131
Latvian         666,026    12,085,228
Dutch         2,333,816    53,487,257
Polish          387,490    7,087,016
Portuguese    2,121,889    52,300,149
Romanian        402,904    9,663,544
Slovak          674,359    13,116,301
Slovene         634,488    12,665,974
Swedish       2,241,386    45,665,947
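As a rough illustration of how sentence and word counts like those above could be computed "after removing XML", here is a minimal Python sketch. It rests on assumptions that are not stated on this page: that the text has one sentence per line, that markup sits on lines of its own (for example `<SPEAKER ID=1>`), and that words are split on whitespace. The function name `corpus_size` and the sample lines are hypothetical, and the result need not reproduce the published figures.

```python
import re

# Assumption: a markup line consists of a single tag and nothing else.
TAG_LINE = re.compile(r"^\s*<[^>]*>\s*$")

def corpus_size(lines):
    """Return (sentences, words) for an iterable of text lines.

    Markup-only lines and empty lines are skipped; every remaining line
    is counted as one sentence, and words are whitespace-separated tokens.
    """
    sentences = 0
    words = 0
    for line in lines:
        if TAG_LINE.match(line) or not line.strip():
            continue  # skip markup-only and empty lines
        sentences += 1
        words += len(line.split())
    return sentences, words

# Hypothetical sample resembling a proceedings file with markup lines.
sample = [
    "<CHAPTER ID=1>",
    "Resumption of the session",
    '<SPEAKER ID=1 NAME="President">',
    "I declare resumed the session of the European Parliament.",
    "",
]
print(corpus_size(sample))
```

To run this on a real file, an open file handle can be passed directly, since the function only iterates over lines.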


Sizes for parallel corpora after sentence aligning and removing XML.


Parallel Corpus (L1-L2)    Sentences    L1 Words      English Words
Bulgarian-English            406,934    -             9,886,291
Czech-English                646,605    12,999,455    15,625,264
Danish-English             1,968,800    44,654,417    48,574,988
German-English             1,920,209    44,548,491    47,818,827
Greek-English              1,235,976    -             31,929,703
Spanish-English            1,965,734    51,575,748    49,093,806
Estonian-English             651,746    11,214,221    15,685,733
Finnish-English            1,924,942    32,266,343    47,460,063
French-English             2,007,723    51,388,643    50,196,035
Hungarian-English            624,934    12,420,276    15,096,358
Italian-English            1,909,115    47,402,927    49,666,692
Lithuanian-English           635,146    11,294,690    15,341,983
Latvian-English              637,599    11,928,716    15,411,980
Dutch-English              1,997,775    50,602,994    49,469,373
Polish-English               632,565    12,815,544    15,268,824
Portuguese-English         1,960,407    49,147,826    49,216,896
Romanian-English             399,375    9,628,010     9,710,331
Slovak-English               640,715    12,942,434    15,442,233
Slovene-English              623,490    12,525,644    15,021,497
Swedish-English            1,862,234    41,508,712    45,703,795
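For readers who want to check figures like these on a sentence-aligned pair, the following Python sketch counts aligned sentence pairs and the words on each side. It assumes, without confirmation from this page, that the two sides are line-aligned (line i on the L1 side corresponds to line i on the English side), that pairs with an empty side are dropped, and that words are whitespace-separated. The name `parallel_size` and the sample sentences are hypothetical.

```python
def parallel_size(l1_lines, en_lines):
    """Return (sentence_pairs, l1_words, english_words).

    Both inputs must contain the same number of lines. A pair is skipped
    when either side is empty after stripping whitespace.
    """
    l1_lines = list(l1_lines)
    en_lines = list(en_lines)
    if len(l1_lines) != len(en_lines):
        raise ValueError("the two sides are not line-aligned")

    pairs = 0
    l1_words = 0
    en_words = 0
    for l1, en in zip(l1_lines, en_lines):
        if not l1.strip() or not en.strip():
            continue  # drop pairs that are missing one side
        pairs += 1
        l1_words += len(l1.split())
        en_words += len(en.split())
    return pairs, l1_words, en_words

# Hypothetical French-English sample; the second pair has an empty L1 side.
fr = ["Reprise de la session", "", "Je déclare reprise la session."]
en = ["Resumption of the session", "Please rise.", "I declare the session resumed."]
print(parallel_size(fr, en))
```

Whitespace splitting is only a stand-in for whatever tokenization was used to produce the table, so counts obtained this way may differ from the published ones.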



