Wednesday, September 20, 2006

Is rating broken?

Yahoo!, YouTube and Netflix all use ratings on their services to better gear towards users' needs. We are talking about huge numbers here: Yahoo! gets some 5 million ratings a day for artists, albums, songs and videos; Netflix currently gets 2 million ratings per day from its 5M customers; and within a day after the launch of lonelygirl15's My First Kiss clip on YouTube, she had got over 5000 ratings. Clearly, rating is not something to neglect: it is an easy way for users to input their opinion, as well as a rather straightforward way to compute affinities in relation to other types of data such as search history, demographics, etc.

However, it seems like there is more to rating than meets the eye, and it is becoming increasingly complicated for services to make the best use of it. At the Recommenders06 conference, issues with ratings were mentioned but not discussed thoroughly. To me this seems a highly important issue, as current services use rating as a primary input for their recommenders and many of the algorithms work based on ratings.
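To make concrete how central ratings are as input, here is a tiny Python sketch of classic item-based collaborative filtering that predicts an unseen item's rating purely from other ratings. The users, items and numbers are all made up by me, and real services obviously do something far more elaborate:

    import math

    # user -> {item: rating on a 1-5 scale} (invented data)
    ratings = {
        "alice": {"movie_a": 5, "movie_b": 4, "movie_c": 1},
        "bob":   {"movie_a": 4, "movie_b": 5},
        "carol": {"movie_b": 2, "movie_c": 5},
    }

    def cosine_sim(item_x, item_y):
        # similarity between two items over the users who rated both
        common = [u for u in ratings if item_x in ratings[u] and item_y in ratings[u]]
        if not common:
            return 0.0
        dot = sum(ratings[u][item_x] * ratings[u][item_y] for u in common)
        norm_x = math.sqrt(sum(ratings[u][item_x] ** 2 for u in common))
        norm_y = math.sqrt(sum(ratings[u][item_y] ** 2 for u in common))
        return dot / (norm_x * norm_y)

    def predict(user, item):
        # similarity-weighted average of the user's own ratings
        num = den = 0.0
        for other_item, r in ratings[user].items():
            s = cosine_sim(item, other_item)
            num += s * r
            den += abs(s)
        return num / den if den else None

    print(round(predict("bob", "movie_c"), 2))  # bob never rated movie_c; prints 4.38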


The following issues came up with ratings (unordered list):
  • Semantics of rating are pretty unclear; what does a user actually mean by 3.5?
  • Meaning of ratings is very subjective; does my 2.5 mean the same as your 2.5? (One common fix, normalising against each user's own average, is sketched right after this list.)
  • Ratings are outright unclear; on a scale of 1 to 5, does one (1) mean that I really don't want to ever see it again, or does it mean that I just don't quite like it?
  • What does a single attribute actually mean when rating, for example, a movie; is it about the plot, the actors, the soundtrack? What if I like the plot but hate the main actor; how do I express that?
  • Love/hate ratings: many services are getting more and more ratings only at the far ends of the scale; the distribution of rating values is strongly polarised.
  • Binaries like thumbs up and thumbs down have issues too; how do I interpret something that has 10 thumbs up and 10 thumbs down? Am I going to take the risk of either really liking it or really hating it?
  • Rating variance; how does a 10-up, 10-down rating affect people's choices? Do people go for the middle way? Apparently not, see Jolie's presentation.
  • Only a few have rated many, and many have rated only a few – the distribution of ratings is very sparse. It is hard to recommend something to those many with few ratings.
  • Rating distribution between genres: some genres are more predictable than others from user ratings. How to recommend the ones not so often rated? In Netflix, comedies and drama are more predictable than musicals, for example. The percentage of 4-5 star movies rented has increased as prediction accuracy has improved.
  • Do users understand what ratings are for? Do users really understand what ratings can do for them when using Yahoo! Music or Netflix? It's a trade-off between the control users gain over the service and the convenience they give up when taking time to rate.
  • The feedback loop between ratings and recommendations can become self-promoting. If I rate something highly, the recommender keeps recommending that or similar items to me (also known as the similarity trap). There is also the popularity bias: in the end, everything is related to Britney Spears.
  • Knowing the users' intentions: do they want to buy or to listen?
  • Ratings depend on when the item was rated (Netflix found out that ratings given immediately after watching the movie differ from the ones made at a later stage!).
  • Ratings are vulnerable to shilling: intentionally bad ratings, rating some music highly to lift it up the list, etc. (Influencing the vote is relatively easy with some algorithms, whereas hybrids might be more robust against manipulation; see Mr Mobasher's presentation.)
  • Does the user feel at home with the other raters? Am I sure that I belong to this group of users and tastes? For example, Last.fm started as a rather geeky service, so lots of users have rated items that match a geeky music taste! This becomes really important when we think about the internationalisation of recommender services: can my taste match a white-male-middle-class American taste?
  • Computing affinities with user profiles, editorial rankings, etc. can take a long time; for example, some Yahoo! services are only updated weekly since it is so computationally intensive.
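On the "does my 2.5 mean the same as your 2.5" point, one standard trick is to normalise ratings against each user's own average, so a score expresses "above or below my personal norm" rather than an absolute value. A minimal Python sketch with invented raters:

    # two raters with very different personal scales (invented data)
    ratings = {
        "generous_rater": {"song_a": 5, "song_b": 4, "song_c": 4},
        "harsh_rater":    {"song_a": 3, "song_b": 1, "song_c": 2},
    }

    def mean_centered(user_ratings):
        # express each rating relative to the user's own average
        mean = sum(user_ratings.values()) / len(user_ratings)
        return {item: round(r - mean, 2) for item, r in user_ratings.items()}

    for user, user_ratings in ratings.items():
        print(user, mean_centered(user_ratings))
    # both raters end up preferring song_a, even though one of them
    # never gives anything above a 3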

Ok, I think it's broken, but the question is: can it be fixed? Well, that looks like a long list of issues to deal with, but I'm sure nothing is beyond repair.

However, there are many remaining questions: How to help people who don't rate? How to better understand users' behaviour, what they like and what they don't, and get that information in a more implicit way? The following remedies were mentioned at the conference:
  • Going beyond ratings as data input for recommenders by monitoring the play events in an online radio (a toy sketch of this idea follows the list below).
  • Playlists uploaded by users can yield important information about how music is sequenced, the moods it is played in, etc.
  • Netflix talked about encoding traits of movies that predict emotional responses, for example. Maria, one of the students, talked about combining personality traits and mood settings to further personalise and contextualise recommendations.
  • Prof. Riedl talked about letting users know the value of their rating to the community, e.g. how important rating a given item is for making better recommendations for a given group. It seems like people care about others; they are willing to make ratings to help other people similar to them find better items.
  • Using social networks to better make and find recommendations.
  • Improving ROI for users; with fewer inputs, get more valuable outputs like playlists, concerts, videos, music news, etc.
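As a toy illustration of the "monitor play events" idea, here is a small Python sketch that turns logged radio events into an implicit preference score. The event names and weights are pure assumptions on my part, not anything presented at the conference:

    from collections import defaultdict

    # (user, track, event) tuples as an online radio might log them (invented)
    events = [
        ("me", "track_1", "played_full"),
        ("me", "track_1", "played_full"),
        ("me", "track_2", "skipped"),
        ("me", "track_3", "played_full"),
        ("me", "track_3", "skipped"),
    ]

    # assumed weights: a full listen counts for a track, a skip counts against it
    EVENT_WEIGHTS = {"played_full": 1.0, "skipped": -1.0}

    def implicit_scores(events):
        # sum event weights per (user, track) into an implicit preference
        scores = defaultdict(float)
        for user, track, event in events:
            scores[(user, track)] += EVENT_WEIGHTS.get(event, 0.0)
        return dict(scores)

    print(implicit_scores(events))
    # {('me', 'track_1'): 2.0, ('me', 'track_2'): -1.0, ('me', 'track_3'): 0.0}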

Some more ways that I could think of:
  • If binary types of ratings are something that people do, let's just use thumbs up and thumbs down.
  • If scales are used, be explicit about them. No one really knows what the stars mean on the iPod! Say clearly that the lowest rating means “never play again”.
  • Multi-attribute ratings: if you allow people to rate, also give them options to be more specific about it; I think the plot is good, but the acting sucks. There are people who love to do ratings and evaluations (just look at Amazon.com with their reviewer lists!) and many times they are good at doing it.
  • Leverage re-using ratings from other services: Netflix, Yahoo!, MovieLens and God knows how many other services rate the same movies. Think about web services or harvesting those ratings to get rid of the sparsity problem! There should be some interoperability of user ratings and other evaluations between services (a rough sketch of what merging profiles could look like follows this list).
  • I want a meta-recommender! It would be good to know whether my music taste matches other people's taste in a given service, or whether I should hang out somewhere else to get favourable recommendations.
  • Anyway, those services are too focused on getting people to use only that one and only service; pooling profiles and letting users take advantage of their profile from place A in place B would be convenient for me! Maybe attention metadata could come to help here; see Attention XML and “Attention Metadata: Collecting, Managing and Exploiting Rich Usage Information”, presented at an international ACM workshop.
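And to show what re-using ratings across services could look like, here is a speculative Python sketch that rescales one user's ratings from several (entirely hypothetical) services to a common scale and merges them into a single, denser profile. The service names, scales and averaging rule are all my own assumptions:

    # service -> (scale maximum, {movie: rating}) for the same user (invented)
    profiles_elsewhere = {
        "netflix_like":   (5,  {"movie_a": 4, "movie_b": 2}),
        "yahoo_like":     (10, {"movie_a": 9, "movie_c": 6}),
        "movielens_like": (5,  {"movie_c": 3, "movie_d": 5}),
    }

    def merged_profile(profiles, target_max=5.0):
        # rescale every source to target_max and average overlapping ratings
        totals, counts = {}, {}
        for scale_max, movie_ratings in profiles.values():
            for movie, rating in movie_ratings.items():
                rescaled = rating / scale_max * target_max
                totals[movie] = totals.get(movie, 0.0) + rescaled
                counts[movie] = counts.get(movie, 0) + 1
        return {movie: round(totals[movie] / counts[movie], 2) for movie in totals}

    print(merged_profile(profiles_elsewhere))
    # {'movie_a': 4.25, 'movie_b': 2.0, 'movie_c': 3.0, 'movie_d': 5.0}
    # the merged profile covers more movies than any single service knew about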
