Multi-Domain Sentiment Dataset V2.0

2019-11-08

The Multi-Domain Sentiment Dataset contains product reviews taken from Amazon.com across many product types (domains). Some domains (books and DVDs) have hundreds of thousands of reviews. Others (musical instruments) have only a few hundred. Each review carries a star rating (1 to 5 stars) that can be converted into a binary label if needed. This page briefly describes the data. If you have questions, please email Mark Dredze or John Blitzer.
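The rating conversion mentioned above could be sketched as follows. The threshold is an assumption, not something this page specifies: ratings above 3 stars are treated as positive, ratings below 3 as negative, and 3-star reviews are left unlabeled. The function name `stars_to_binary` is hypothetical.

```python
from typing import Optional


def stars_to_binary(stars: float) -> Optional[str]:
    """Map a 1-5 star rating to a binary sentiment label.

    Assumed convention (not stated on this page): more than 3 stars is
    positive, fewer than 3 is negative, exactly 3 is ambiguous and
    returns None.
    """
    if not 1 <= stars <= 5:
        raise ValueError(f"star rating out of range: {stars}")
    if stars > 3:
        return "positive"
    if stars < 3:
        return "negative"
    return None


if __name__ == "__main__":
    for rating in (1.0, 3.0, 5.0):
        print(rating, stars_to_binary(rating))
```

If a different cut-off suits your experiment, only the two comparisons need to change.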

A few notes regarding the data sets.

1) unprocessed.tar.gz contains the original data.
2) processed.acl.tar.gz contains the data pre-processed and balanced, in the format used by Blitzer et al. (ACL 2007).
3) processed.realvalued.tar.gz contains the data pre-processed and balanced, but labeled with the number of stars rather than just positive or negative, in the format used by Mansour et al. (NIPS 2009).

The preprocessed data is one line per document, with each line in the format:

feature:<count> .... feature:<count> #label#:<label>

The label is always at the end of each line.
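A minimal sketch of reading one line in this format is shown below. It assumes tokens are separated by whitespace and that the text after the last colon of each token is the count (or the label); the function name `parse_processed_line` is hypothetical. The label is kept as a string so the same code handles both the positive/negative files and the star-valued files.

```python
from typing import Dict, Tuple


def parse_processed_line(line: str) -> Tuple[Dict[str, int], str]:
    """Split one pre-processed document line into (feature counts, label).

    Assumes whitespace-separated tokens of the form feature:count, plus
    one token of the form #label#:<label>.
    """
    features: Dict[str, int] = {}
    label = None
    for token in line.split():
        # Split on the LAST colon so feature names containing ':' survive.
        name, sep, value = token.rpartition(":")
        if not sep:
            continue  # token without a colon: not part of the format
        if name == "#label#":
            label = value
        else:
            features[name] = features.get(name, 0) + int(value)
    if label is None:
        raise ValueError("line has no #label# token")
    return features, label


if __name__ == "__main__":
    sample = "great:2 great_book:1 waste:1 #label#:positive"
    print(parse_processed_line(sample))
```

For the real-valued files, the returned label string can be converted with `float(label)` by the caller.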

4) Each directory corresponds to a single domain. Each directory contains several files, which we briefly describe:
all.review -- All reviews for this domain, in their original format
positive.review -- Positive reviews
negative.review -- Negative reviews
unlabeled.review -- Unlabeled reviews
processed.review -- Preprocessed reviews (in the one-line-per-document format described above)
processed.review.balanced -- Preprocessed reviews, equally balanced between positive and negative.

5) While the positive and negative files contain positive and negative reviews, these aren't necessarily the splits used in any of the cited papers. They are simply there as possible initial splits.

6) Each (unprocessed) file uses a pseudo-XML scheme to encode the reviews. Most of the fields are self-explanatory. The unique ID field of a review is not guaranteed to be unique. If a review has two unique ID fields, ignore the one containing only a number.
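The ID rule in note 6 could be applied along these lines. This is a sketch under assumptions: the tag name `unique_id` and the sample review are illustrative and not confirmed by this page, and the fields are pulled out with a regular expression because the files are only pseudo-XML and may not load in a strict XML parser. Both function names are hypothetical.

```python
import re
from typing import List, Optional


def extract_field(review_block: str, tag: str) -> List[str]:
    """Return the stripped contents of every <tag>...</tag> pair in a block."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    return [m.strip() for m in re.findall(pattern, review_block, flags=re.DOTALL)]


def pick_unique_id(candidates: List[str]) -> Optional[str]:
    """Prefer the ID that is not purely numeric, as note 6 advises."""
    cleaned = [c.strip() for c in candidates if c.strip()]
    non_numeric = [c for c in cleaned if not c.isdigit()]
    if non_numeric:
        return non_numeric[0]
    # Fall back to whatever is available when every ID is numeric.
    return cleaned[0] if cleaned else None


if __name__ == "__main__":
    # Made-up review block; the tag name "unique_id" is an assumption.
    block = (
        "<review>\n"
        "<unique_id>\nB0007:great_value:jane\n</unique_id>\n"
        "<unique_id>\n12345\n</unique_id>\n"
        "</review>\n"
    )
    print(pick_unique_id(extract_field(block, "unique_id")))
```

Because the remaining ID may still repeat across reviews, deduplicating on it alone can drop distinct reviews.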
