The Multi-Domain Sentiment Dataset contains product reviews from Amazon.com spanning many product types (domains). Some domains (books and DVDs) have hundreds of thousands of reviews; others (musical instruments) have only a few hundred. Each review carries a star rating (1 to 5 stars) that can be converted into a binary label if needed. This page describes the data. If you have questions, please email Mark Dredze or John Blitzer.
A few notes regarding the data sets.
1) unprocessed.tar.gz contains the original data.
2) processed.acl.tar.gz contains the data pre-processed and balanced, in the format used by Blitzer et al. (ACL 2007).
3) processed.realvalued.tar.gz contains the data pre-processed and balanced, but labeled with the number of stars rather than just positive or negative. This is the format used by Mansour et al. (NIPS 2009).
The preprocessed data is one line per document, with each line in the format:
feature:<count> .... feature:<count> #label#:<label>
The label is always at the end of each line.
4) Each directory corresponds to a single domain. Each directory contains several files, which we briefly describe:
all.review -- All reviews for this domain, in their original format
positive.review -- Positive reviews
negative.review -- Negative reviews
unlabeled.review -- Unlabeled reviews
processed.review -- Preprocessed reviews (in the one-line-per-document format described above)
processed.review.balanced -- Preprocessed reviews, equally balanced between positive and negative.
5) While the positive and negative files contain positive and negative reviews, these aren't necessarily the splits used in any of the cited papers. They are simply there as possible initial splits.
6) Each (unprocessed) file encodes its reviews in a pseudo-XML scheme. Most of the fields are self-explanatory. Each review has a unique ID field, but its value is not guaranteed to be unique. If a review has two unique ID fields, ignore the one containing only a number.
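
As an illustration of two of the notes above, here is a minimal Python sketch, not an official loader. It assumes that tokens in a preprocessed line are separated by whitespace and that counts are numeric. The function names (parse_processed_line, pick_review_id) and the sample values are hypothetical and do not come from the dataset.

```python
def parse_processed_line(line):
    """Split one preprocessed line into (feature_counts, label).

    Assumption: tokens are whitespace-separated and look like
    feature:<count>, with a final #label#:<label> token.
    """
    counts = {}
    label = None
    for token in line.split():
        # Split on the LAST colon so a feature name may itself contain colons.
        name, sep, value = token.rpartition(":")
        if not sep:
            continue  # token without a colon: not in the described format, skip it
        if name == "#label#":
            label = value  # kept as a string (e.g. a class name or a star value)
        else:
            counts[name] = float(value)  # assumption: counts are numeric
    return counts, label


def pick_review_id(candidate_ids):
    """Return the first ID that is not purely numeric, or None.

    Follows note 6: when a review has two unique ID fields,
    the one containing only a number is ignored.
    """
    for candidate in candidate_ids:
        candidate = candidate.strip()
        if candidate and not candidate.isdigit():
            return candidate
    return None


# Made-up sample values, for illustration only.
sample_counts, sample_label = parse_processed_line(
    "great:2 great_book:1 #label#:positive"
)
sample_id = pick_review_id(["12345", "example_product:example_title:example_user"])
```

Because even the non-numeric ID is not guaranteed to be unique, it may be worth checking for repeated IDs before using them as dictionary keys.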