Impact of Similarity Ratings on Human Judgement
Authors: Ian A. McCulloh, Madelaine Zinser, Jesse Patsolic, Michael Ramos
Abstract:
Recommender systems are a common artificial intelligence (AI) application. For a given input, such a system returns a rank-ordered list of similar items. As users review the returned items, they must decide when to halt the search and either revise their search terms or conclude that their requirement is novel, with no similar items in the database. We present a statistically designed experiment that investigates the impact of similarity ratings on the human judgement that a search item is novel and the decision to halt the search. In the study, 450 participants were recruited from Amazon Mechanical Turk to render judgement across 12 decision tasks. We find that the inclusion of ratings increases human perception that items are novel. Percent-similarity ratings improve novelty discernment compared with star-rated similarity or the absence of a rating. Ratings also reduce decision time and improve decision confidence. This suggests that the inclusion of similarity ratings can aid human decision-makers in knowledge search tasks.
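To make the two rating formats compared in the study concrete, the following is a minimal, hypothetical sketch of how a retrieval interface might render the same underlying similarity score either as a percent-similarity label or as a coarser star rating. The score range, rounding scheme, and function names are illustrative assumptions, not the authors' actual experimental interface.

```python
# Hypothetical sketch: two presentations of one similarity score.
# Assumes similarity is normalized to the range [0, 1]; the rounding
# and half-star granularity are illustrative choices, not from the paper.

def as_percent(similarity: float) -> str:
    """Render a 0-1 similarity score as a percent-similarity label."""
    return f"{round(similarity * 100)}% similar"

def as_stars(similarity: float, max_stars: int = 5) -> str:
    """Render the same score as a star rating, rounded to the nearest half star."""
    stars = round(similarity * max_stars * 2) / 2
    return f"{stars:g} / {max_stars} stars"

if __name__ == "__main__":
    for score in (0.93, 0.61, 0.18):
        print(as_percent(score), "|", as_stars(score))
```

The sketch only illustrates the design contrast: a percent label preserves fine-grained differences between candidate items, whereas a star scale compresses them into a few levels, which is the kind of difference the experiment probes.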
Keywords: Ratings, rankings, crowdsourcing, empirical studies, user studies, similarity measures, human-centered computing, novelty in information retrieval.