Wednesday, August 23, 2006

Why Google is not a content-based recommender

Yesterday in the HMDB-bookclub that we run in my unit, we read and discussed a paper on recommender systems (Adomavicius & Tuzhilin, 2005, Toward the Next Generation of Recommender Systems). As this is my research topic, I was very eager to hear how my study-buddies perceived the issue and what they had to say.

The discussion centred on understanding the two main approaches to producing recommendations: content-based (CB) and collaborative. There were questions and attempts to answer them, which left me unsatisfied after the session. Mainly, we left with the impression that Google, or any information retrieval system, would in the end be just a content-based recommender. I was somewhat troubled by this, though, and set myself the task of understanding better what there is to discover.

Let’s start with a definition. Konstan et al. (2005) say:
Unlike ordinary keyword search systems, recommenders attempt to find items that match user's tastes and the user’s sense of quality, as well as syntactic matches on topic or keyword. For example, a music recommender will use an individual’s prior taste in music to identify additional songs or albums that may be of interest.

When Adomavicius & Tuzhilin (2005) talk about the CB approach, they state that it has its roots in information retrieval and information filtering research, but
the improvement over the traditional information retrieval approaches comes from the use of user profiles that contain information about users’ tastes, preferences, and needs. The profiling information can be elicited from users explicitly, e.g., through questionnaires, or implicitly, learned from their transactional behavior over time.

Regular Google search involves no account, whereas producing either content-based (CB) or collaborative filtering (CF) recommendations requires an account that we can tie to the user. An individual user profile is built on top of this.

In CB recommendation, a user is recommended items similar to the ones they preferred in the past. This means that we need a search history, i.e. a user profile, from which we can identify what the user has preferred in the past.
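To make the idea concrete, here is a minimal sketch of a content-based recommender. Everything in it is an assumption for illustration: the toy catalogue `ITEMS`, the keyword descriptions, and the choice of cosine similarity between keyword counts. It is not how Google or any system from the papers works; it only shows that the recommendation is driven by the user's own history.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy catalogue: each item is described by a few keywords.
ITEMS = {
    "jazz_history": ["music", "jazz", "history"],
    "bebop_guide": ["music", "jazz", "bebop"],
    "python_intro": ["programming", "python"],
    "opera_basics": ["music", "opera"],
}


def cosine(a, b):
    """Cosine similarity between two keyword-count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profile(liked):
    """The user profile: keyword counts over the items this user liked."""
    profile = Counter()
    for name in liked:
        profile.update(ITEMS[name])
    return profile


def recommend_cb(liked, top_n=2):
    """Rank unseen items by similarity to the user's OWN history only."""
    profile = build_profile(liked)
    scores = {
        name: cosine(profile, Counter(keywords))
        for name, keywords in ITEMS.items()
        if name not in liked
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend_cb(["jazz_history"]))
```

Note that no other user appears anywhere in the sketch: without a history for this particular user there is no profile, and so nothing to recommend.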

Thus, to generate a reasonably complete user profile that can be used to find similarities between items (not people!), implicit inputs such as the history of viewed pages, bookmarked pages, the purchase history, the “wish list”, heuristic text analysis, etc. are important. Conventionally, especially in the first generation of recommenders, explicit ratings were the top-notch input:

...mid-1990s when researchers started focusing on recommendation problems that explicitly rely on the ratings structure. In its most common formulation, the recommendation problem is reduced to the problem of estimating ratings for the items that have not been seen by a user...Once we can estimate ratings for the yet unrated items, we can recommend to the user the item(s) with the highest estimated rating(s) (Adomavicius & Tuzhilin, 2005)

Additionally, CB systems often use further information as input, such as demographics, specific interests, location, etc., taken from the profile that users fill in themselves. Maybe in the future this type of information could be extracted from other sources, such as blog postings, as was suggested during the session.

So, to get closer to answering the question of whether Google is just a content-based recommender: if it is used anonymously, it is not, although many of the techniques are probably the same. However, Google Personalized Search (beta) certainly does qualify as one.

The second somewhat baffling issue was the name “collaborative filtering”: as it turns out, the users do not actually collaborate to produce the recommendations. In CF recommendation, the user is recommended items that people with similar tastes and preferences liked in the past. This means that we need a history for this person, too, in order to find similarities in tastes and past experiences.

The strength of the CF approach at this stage is that even if you personally haven’t seen a link, product or whatever object we are talking about, or told the system what you thought of it, there is most likely someone in your nearest neighbourhood who has. Thus, in CF the values used to compute the recommendations are inferred from similarities between profiles, and you don’t necessarily need to have rated the item yourself. So, here lies the main divider between CB and CF as far as the input for the recommender goes: CB only uses YOUR history, whereas CF also uses other users’ histories to better understand, or guesstimate, yours.
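The neighbourhood idea can be sketched in a few lines as well. This is a minimal user-based CF example under my own assumptions: the `RATINGS` table and the user names are made up, and the similarity measure (one over one plus the mean absolute rating difference on co-rated items) is just a simple stand-in. Real systems typically use something like Pearson correlation or cosine similarity.

```python
# Hypothetical explicit ratings on a 1-5 scale: user -> {item: rating}.
RATINGS = {
    "you": {"A": 5, "B": 4},
    "anna": {"A": 5, "B": 4, "C": 5, "D": 2},
    "bob": {"A": 1, "B": 2, "C": 1, "D": 5},
}


def similarity(u, v):
    """Similarity in (0, 1] based on how closely two users rated shared items."""
    shared = RATINGS[u].keys() & RATINGS[v].keys()
    if not shared:
        return 0.0
    mean_diff = sum(abs(RATINGS[u][i] - RATINGS[v][i]) for i in shared) / len(shared)
    return 1.0 / (1.0 + mean_diff)


def predict_cf(user):
    """Estimate ratings for unseen items as a similarity-weighted average
    of the ratings given by OTHER users."""
    weighted, weights = {}, {}
    for other in RATINGS:
        if other == user:
            continue
        sim = similarity(user, other)
        if sim == 0.0:
            continue
        for item, rating in RATINGS[other].items():
            if item in RATINGS[user]:
                continue  # only predict items the user has not rated
            weighted[item] = weighted.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    return {item: weighted[item] / weights[item] for item in weighted}


def recommend_cf(user, top_n=1):
    """Recommend the unseen items with the highest estimated ratings."""
    predictions = predict_cf(user)
    return sorted(predictions, key=predictions.get, reverse=True)[:top_n]


print(recommend_cf("you"))
```

Here "you" never rated C or D; the estimate for C comes out higher because "anna", whose past ratings match yours, liked it, while "bob", who disagrees with you on A and B, counts for much less.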

Well, this is the short version; more and better arguments can be found in the papers and in my links at:

Adomavicius & Tuzhilin, 2005, Toward the Next Generation of Recommender Systems

J.A. Konstan, N. Kapoor, S.M. McNee, and J.T. Butler. "TechLens: Exploring the Use of Recommenders to Support Users of Digital Libraries". A Project Briefing at the Coalition for Networked Information Fall 2005 Task Force Meeting, Phoenix, AZ, December 2005.
