S.M. McNee, J. Riedl, and J.A. Konstan. "Being Accurate is Not Enough: How Accuracy Metrics have hurt Recommender Systems". In the Extended Abstracts of the 2006 ACM Conference on Human Factors in Computing Systems (CHI 2006) [to appear], Montreal, Canada, April 2006
The paper starts by informally arguing that "the recommender community should move beyond the conventional accuracy metrics and their associated experiment methodologies. We propose new user-centric directions for evaluating recommender systems".
The paper states that current accuracy metrics, such as MAE (Herlocker 1999), measure recommender algorithm performance by comparing the algorithm's predictions against a user's ratings of items. The authors continue that this means, in essence, that a recommender which recommends places the user has already visited would be rewarded over one that recommends new places that might be of interest. Clearly, if that is the case, there is something rotten.
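For concreteness, this is roughly what an MAE-style evaluation looks like; a minimal sketch of my own, assuming held-out (user, item, rating) triples and a predict_rating method, neither of which comes from the paper:

```python
# Minimal sketch of MAE-style accuracy evaluation: withhold some known
# ratings and compare them against the recommender's predictions.
# Names (predict_rating, held_out) are illustrative, not from the paper.

def mean_absolute_error(predictions, actuals):
    """MAE = average of |predicted - actual| over the held-out ratings."""
    errors = [abs(p - a) for p, a in zip(predictions, actuals)]
    return sum(errors) / len(errors)

def evaluate(recommender, held_out):
    # held_out: list of (user, item, actual_rating) triples withheld from training
    predicted = [recommender.predict_rating(u, i) for u, i, _ in held_out]
    actual = [r for _, _, r in held_out]
    return mean_absolute_error(predicted, actual)
```

The catch, as the authors describe it, is that such a metric only ever scores items the user has already rated (the places already visited), so an algorithm gets no credit for surfacing genuinely new items.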
The paper proposes three aspects: similarity, recommendation serendipity, and the importance of user needs and expectations in a recommender, and suggests how each could be better handled.
A) Similarity
- The item-item collaborative filtering algorithm can trap users in a "similarity hole", giving only overly similar recommendations. This becomes more problematic when there is little data, for example for a new user in the system.
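To see why such a hole appears, here is a rough sketch of standard item-item scoring (my illustration, not code from the paper; item_sim, score_candidates and the rest are names I made up):

```python
# Rough sketch of item-item collaborative filtering scoring.
# item_sim[a][b] holds a precomputed item-item similarity,
# e.g. adjusted cosine over user rating vectors.

def score_candidates(user_ratings, item_sim, candidates, k=20):
    """Score each unseen candidate by the similarity-weighted ratings of the
    user's already-rated items; the top-k most similar rated items vote."""
    scores = {}
    for cand in candidates:
        neighbours = sorted(
            ((item_sim[cand].get(rated, 0.0), rating)
             for rated, rating in user_ratings.items()),
            reverse=True)[:k]
        sim_sum = sum(s for s, _ in neighbours)
        if sim_sum > 0:
            scores[cand] = sum(s * r for s, r in neighbours) / sim_sum
    # With only a handful of ratings, every candidate that scores at all is
    # close to those few items -- hence the "similarity hole".
    return sorted(scores, key=scores.get, reverse=True)
```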
The authors go on to discuss how the accuracy metrics don't recognise this problem, because they are designed to judge the accuracy of individual items and not the list of items as a whole. However, the authors argue, "the recommendation list should be judged for its usefulness as a complete entity, not just as a collection of individual items." There was evidence in user testing that lists which had performed badly on conventional accuracy measures were the ones preferred by users. These lists had been built using the Intra-List Similarity metric and the process of Topic Diversification for recommendation lists (Ziegler 2005).
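Since intra-list similarity and topic diversification come up here and again later, below is my rough sketch of both ideas, assuming some pairwise similarity(a, b) function over items; the real metric and the greedy merge in Ziegler 2005 differ in their details:

```python
from itertools import combinations

# My sketch of list-level measurement and diversification, loosely after
# Ziegler 2005; similarity(a, b) is an assumed pairwise item-similarity
# function (e.g. content- or topic-based), not something defined in the paper.

def intra_list_similarity(items, similarity):
    """Sum of pairwise similarities over the list: higher = more homogeneous."""
    return sum(similarity(a, b) for a, b in combinations(items, 2))

def diversify(ranked_candidates, similarity, list_size, weight=0.5):
    """Greedy re-ranking: trade off original rank against dissimilarity to
    the items already chosen (a simplified take on topic diversification)."""
    result = [ranked_candidates[0]]
    while len(result) < list_size:
        best, best_score = None, None
        for pos, cand in enumerate(ranked_candidates):
            if cand in result:
                continue
            rank_score = 1.0 - pos / len(ranked_candidates)      # accuracy side
            dissim = 1.0 - max(similarity(cand, r) for r in result)
            score = (1 - weight) * rank_score + weight * dissim
            if best_score is None or score > best_score:
                best, best_score = cand, score
        result.append(best)
    return result
```

The point, as I read it, is that a list re-ranked this way can look worse on item-by-item accuracy yet read better as a whole.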
The authors go on to say that, depending on the user's intentions, the makeup of items appearing on the list affected the user's satisfaction with the recommender. Here, in my opinion, it is important to remember the user intentions as given by Swearingen & Sinha (2001):
- Reminder recommendations, mostly from within genre (“I was planning to read this anyway, it’s my typical kind of item”)
- “More like this” recommendations, from within genre, similar to a particular item (“I am in the mood for a movie similar to GoodFellas”)
- New items, within a particular genre, just released, that they / their friends do not know about
- “Broaden my horizon” recommendations (might be from other genres)
B) Serendipity
This is how unexpected and novel a recommendation is for the user, and it is hard to measure. The authors approach the issue through its opposite: the ratability of received recommendations, which, they say, is easy to measure using the "leave-n-out" approach. However, the assumption that users are interested in the most ratable items is not always true for recommenders. They give the example of recommending the Beatles' White Album to users of a music recommender as a bad idea, since it adds almost no value.
I remember the same example from somewhere else, about recommending that people buy bananas when they go shopping; apparently people almost always buy bananas anyway, so there is no commercial value there. It could, though, have some value in building people's trust in a recommender.
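For reference, the "leave-n-out" style of measuring ratability works roughly like this; a hedged sketch with my own names for the recommender API (fit_user, top_n), since the paper only names the approach:

```python
import random

# Hedged sketch of a leave-n-out ratability check (my reading of the approach).
# Withhold n of a user's known ratings, train on the rest, and count how many
# of the withheld items the recommender manages to rank highly -- i.e. how
# "ratable" / predictable they were.

def leave_n_out_hit_rate(recommender, user_ratings, n=5, top_k=10):
    items = list(user_ratings)
    held_out = set(random.sample(items, n))
    training = {i: r for i, r in user_ratings.items() if i not in held_out}

    recommender.fit_user(training)     # assumed API, not from the paper
    top = recommender.top_n(top_k)     # assumed API, not from the paper

    hits = sum(1 for i in held_out if i in top)
    return hits / n
```

By such a measure the White Album scores splendidly, which is exactly the authors' complaint: the most ratable item is often the least valuable recommendation.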
Moreover, the authors point out that different algorithms give different recommendations, and that people preferred one over another depending on their current task (think again about Swearingen/Sinha). To conclude on serendipity, the authors say that other metrics could be needed to judge a variety of algorithm aspects - no direction given on this one, though.
C) User experiences and expectations
New users have different needs than experienced users. Rashid (2001) has shown that the choice of algorithm for a new user greatly affects the experience (really?!), and apparently recommendations in the user's native language are also greatly preferred (Torres 2004); I wonder what kind of language groups were in question there..
- Moving forward
The authors don't suggest that the old-school metrics should be thrown away, but that they should not be used alone; we need to think of the users, who want meaningful recommendations (!!).
Firstly, it is recommended that instead of looking at each item on the recommendation list individually, one should pay more attention to the integrity of the list as a whole, using metrics like the Intra-List Similarity metric and others of that kind.
We should test more what kinds of algorithms users like and give them those.
Users have a purpose when they come for a recommendation, so we need to understand better what their actual needs are at that moment (Zaslow 2002).
Well, well, if this is where we are at with recommender usability studies, it is not much. However, it is great that important people such as the GroupLens researchers are telling us this, so maybe it makes the general audience more receptive to new things to come.
References:
Swearingen, K. and Sinha, R. Beyond Algorithms: An HCI Perspective on Recommender Systems. In ACM SIGIR 2001 Workshop on Recommender Systems (2001).
Torres, R., McNee, S.M., Abel, M., Konstan, J.A., and Riedl, J. Enhancing digital libraries with TechLens+. In Proc. of ACM/IEEE JCDL 2004, ACM Press (2004), 228-236.
Ziegler, C.N., McNee, S.M., Konstan, J.A., and Lausen, G., Improving Recommendation Lists through Topic Diversification. In Proc. of WWW 2005, ACM Press (2005), 22-32.
Zaslow, J. If TiVo Thinks You Are Gay, Here's How To Set It Straight --- Amazon.com Knows You, Too, Based on What You Buy; Why All the Cartoons? The Wall Street Journal, sect. A, p. 1, November 26, 2002.