Friday, September 12, 2008

Cross-boundary ranking of learning resources

Based on the idea of Interest Indicators, such as social bookmarks and ratings, I've looked at the data to see how we can make cross-boundary resources more visible on the MELT portal.
The aim is that, based on previous users' behaviour, we can:

a) make separate "travel well" lists of resources that have good potential to cross borders,

b) use this information to rank resources better in the normal search result list,

c) allow users to search for resources that have a good "travel well" value (e.g. give me resources in math that can cross borders)

This is the data that I'm using (table below), and this is how I've defined cross-boundary (i.e. cross-country and cross-language) learning resources before. Using that definition, I have manually verified the number of cross-country resources: in the dataset, about 82% of the resources were cross-country.



Now, we have a problem, though: on our MELT portal we do not have information about the country where the resource originates. Dah!

This is a big blunder (in my opinion) in our Application Profile: we have not defined the country where the resource originates. We do define the provider, and the country could often be inferred from the provider, but that does not always work.

For example, one of our providers frequently has metadata about resources that do not originate from its own country!

I've experimented with the data using the information that we do have on the portal, which is the LOM metadata about the resource, including the language of the resource. As we also know the mother tongue of the registered users, this gives us something to work with.

In the table below we can see the coverage of cross-boundary actions on resources that we can get without any manual labour or verification of the country or language. As a baseline, with manual verification I found that 82% of the actions concerned cross-border rating or bookmarking of a resource.



The first row represents the cross-language resources (i.e. the user's mother tongue differs from the resource language). Just using this information, we get about 65% of resources right, compared to 82% with manual checking. I think that's pretty good, I'd settle for that! (although I still have to look at what kind of material was left out).
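To make the check concrete, here is a minimal sketch of how an action could be flagged as cross-language, assuming each action record carries the user's mother tongue and the resource language (the field names are my own illustration, not the actual MELT/LOM element names):

```python
# Minimal sketch: flag cross-language actions and compute coverage.
# The dictionary keys are illustrative, not the real MELT/LOM fields.

def is_cross_language(action):
    """An action is cross-language when the user's mother tongue
    differs from the language of the resource."""
    return action["user_mother_tongue"] != action["resource_language"]

def cross_language_share(actions):
    """Share of all actions that the language heuristic flags."""
    flagged = sum(1 for a in actions if is_cross_language(a))
    return flagged / len(actions)

# Toy data; the manually verified baseline in my data set was 82%.
actions = [
    {"user_mother_tongue": "fi", "resource_language": "en"},
    {"user_mother_tongue": "nl", "resource_language": "nl"},
    {"user_mother_tongue": "hu", "resource_language": "en"},
]
print(cross_language_share(actions))  # 0.666...
```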

The two other comparisons in the table are based only on information about users' previous behaviour. These would be:
  • rating > 2
  • bookmark
Only using information about bookmarked and rated resources results in a lousy coverage of around 20%. The problem is that only 25% of the bookmarks and ratings are on resources that have more than one such action; the data is still very sparse.

Anyway, I want to use that information to "cross-boundary rank" the resources. As we do not know the country where the resource comes from, my work-around is based on the countries where these users come from.

Here is a visualisation of resources that have been bookmarked or rated by users (see also the ManyEyes link below). We can see the orange node in the middle, a learning resource called "Five Days in New York..". Three edges lead out to Finland, Belgium and Hungary, which means that at least one user from each of these countries has bookmarked the resource!

So, even if we do not know the origin of the resource, we know that it has users from 3 different countries. I can infer that it is a cross-boundary resource.
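As a sketch of that inference (again with illustrative field names, not the actual portal schema), grouping the actions by resource gives the set of distinct user countries each resource has reached:

```python
# Sketch: how many distinct user countries has each resource reached?
from collections import defaultdict

def countries_per_resource(actions):
    seen = defaultdict(set)
    for action in actions:
        seen[action["resource_id"]].add(action["user_country"])
    return seen

actions = [
    {"resource_id": "five-days-in-new-york", "user_country": "FI"},
    {"resource_id": "five-days-in-new-york", "user_country": "BE"},
    {"resource_id": "five-days-in-new-york", "user_country": "HU"},
]
# Three distinct countries -> I treat it as a cross-boundary resource.
print({r: len(c) for r, c in countries_per_resource(actions).items()})
```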

As most likely one of these 3 users comes from the same country as the resource, I subtract one country from the total: (number of countries - 1).

My cross-boundary rank will be the following:
  • Count the number of ratings greater than 2 and/or bookmarks for a resource (actions). Give each action one point
  • Count the number of these users and give each user one point
  • Count the number of user countries of origin. Give each country one point and then subtract 1
  • Compare the mother tongue of each of these users to the language of the resource. If they differ, give one point per mismatch.
Then, compute the following:
(number of users + number of actions + number of cross-language mismatches) x (number of countries - 1)
Let's take the above resource "Five Days in New York.." as an example:
  • Count the number of ratings greater than 2 (3) and/or bookmarks for the resource (5). Give each action one point (8).
  • Count the number of these users and give each user one point (5).
  • Count the number of user countries of origin (Hungary, Finland, Belgium). Give each country one point and then subtract 1 (3 - 1 = 2).
  • Count the distinct user mother tongues (hu, nl, fi). Compare each of them to the language of the resource (en). If they differ, give one point per mismatch (3).

  • (number of users (5) + number of actions (8) + number of cross-language mismatches (3)) x (number of countries - 1) (2) = (5 + 8 + 3) x 2 = 32 travel well value
This way you can compute a "travel well" value for each resource that users have previously interacted with on the portal. The value will always be an integer, which is important from the technical implementation point of view (in the Lucene index it apparently needs to be an integer).
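Here is a small sketch of the whole calculation in code, reproducing the example above. The action records and their field names are illustrative, and the cross-language mismatches are counted over distinct mother tongues, as in the worked example:

```python
# Sketch of the travel well value for one resource; illustrative data.

def travel_well(actions, resource_language):
    """actions: bookmarks plus ratings greater than 2 on one resource."""
    users = {a["user_id"] for a in actions}
    countries = {a["user_country"] for a in actions}
    # Distinct mother tongues that differ from the resource language,
    # as in the worked example (hu, nl, fi vs. en -> 3).
    tongues = {a["user_mother_tongue"] for a in actions}
    cross_language = sum(1 for t in tongues if t != resource_language)
    # Integer by construction, which suits the Lucene index.
    return (len(users) + len(actions) + cross_language) * (len(countries) - 1)

def action(user_id, country, tongue):
    return {"user_id": user_id, "user_country": country,
            "user_mother_tongue": tongue}

# "Five Days in New York..": 8 actions by 5 users from 3 countries.
five_days = [
    action("u1", "FI", "fi"), action("u1", "FI", "fi"),
    action("u2", "FI", "fi"),
    action("u3", "BE", "nl"), action("u3", "BE", "nl"),
    action("u4", "HU", "hu"), action("u4", "HU", "hu"),
    action("u5", "HU", "hu"),
]
print(travel_well(five_days, "en"))  # (5 + 8 + 3) * (3 - 1) = 32
```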

The downside is that we'll have a huge cold-start problem. As I said, our data is very sparse. To seed the system, I actually still manually check the new resources that users have interacted with and make a fake bookmark on them, so that each one looks like it has at least two users from 2 different countries. This way the resource gets a "travel well" value, appears on the "travel well" list, is better ranked, etc.
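The arithmetic behind the seeding is simple: with users from only one country, the (number of countries - 1) factor is zero, so the value stays at 0 until a (real or fake) action arrives from a second country. A toy illustration, not real portal data:

```python
# Why seeding helps: one country means (countries - 1) = 0, so the
# travel well value is 0. A fake bookmark from a second country with
# a different mother tongue pushes the value above zero.
users, actions, cross_language, countries = 1, 1, 1, 1
print((users + actions + cross_language) * (countries - 1))  # 0

# After the fake bookmark from a second country:
users, actions, cross_language, countries = 2, 2, 2, 2
print((users + actions + cross_language) * (countries - 1))  # 6
```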

Of course, in the end I will evaluate how this treatment affects users: do we, for example, see a large number of bookmarks on the resources for which I have been able to compute a travel well value?

You can see a visualisation here. It is based on the users' country of origin.
