Abstract
When collecting subjective human ratings of items, it can be difficult to measure and enforce data quality due to task subjectivity and lack of insight into how judges arrive at each rating decision. To address this, we propose requiring judges to provide a specific type of rationale underlying each rating decision. We evaluate this approach in the domain of Information Retrieval, where human judges rate the relevance of Webpages. Cost-benefit analysis over 10,000 judgments collected on Mechanical Turk suggests a win-win: experienced crowd workers provide rationales with no increase in task completion time, and the rationales yield further benefits, including more reliable judgments and greater transparency.