Sunday, August 24, 2008

How do tags connect to the Thesaurus terms?

Our social tagging system in MELT is special in two ways:
  • one, we support multi-linguality
  • two, we have not only tags: every resource that users tag is also indexed with Thesaurus terms
In that context, it becomes very interesting to know how tags relate to the Thesaurus terms that were used to index the resources that users tag.

I took a sample of tagged resources (n=185) that have 1013 tags associated with them. Of those tags, 595 are distinct. They come from 44 users.

I made a network diagram visualisation that displays the Thesaurus terms as nodes connected by edges to tags. You'll find it here, if you want to play around with it. Unfortunately, I found out that 24 resources did not have any Thesaurus terms related to them (that's about 13%, hmmm), hence the big plump node in the middle without a Thesaurus term.
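
For anyone who wants to build a similar diagram, here is a minimal sketch of how the term-to-tag edges could be derived. The resource layout (a dict with "tags" and "terms"), the function name and the placeholder label are all assumptions for illustration, and the sample data is a toy, not MELT data.

```python
from collections import defaultdict

# Hypothetical label for resources that carry no Thesaurus term.
NO_TERM = "(no Thesaurus term)"

def build_tag_term_edges(resources):
    """Return {(term, tag): weight} for a list of resource dicts.

    Each resource is assumed to look like {"tags": [...], "terms": [...]}.
    The weight counts how many resources link that term to that tag.
    """
    edges = defaultdict(int)
    for resource in resources:
        # Resources without terms all end up on one shared placeholder node.
        terms = resource["terms"] or [NO_TERM]
        for term in terms:
            for tag in resource["tags"]:
                edges[(term, tag)] += 1
    return dict(edges)

# Toy data, not taken from MELT.
sample = [
    {"tags": ["english", "exercise"], "terms": ["language learning"]},
    {"tags": ["english"], "terms": ["language learning", "grammar"]},
    {"tags": ["interactive"], "terms": []},
]
edges = build_tag_term_edges(sample)
```

Every resource without terms feeds the same placeholder node, which is why that node grows so large in a diagram like this one.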

There is another visualisation here; it's more exploratory.

It's rather interesting that 595 distinct tags from users can be condensed into 34 Thesaurus terms. That is 17.5 tags per Thesaurus term on average. Of course, the tags are not spread that evenly; it's more of a rich-get-richer story. In the visualisation above you can see that most tags are related to language learning, for example.
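
The average is simple arithmetic, and the gap between an average and a skewed distribution is easy to show with a small sketch. The two totals below are the figures from this post; the (term, tag) pairs and the helper name are made up for illustration and are not MELT data.

```python
from collections import Counter

# Figures reported in the post.
distinct_tags = 595
thesaurus_terms = 34
print(distinct_tags / thesaurus_terms)  # 17.5

def tags_per_term(pairs):
    """Count the distinct tags attached to each term, given (term, tag) pairs."""
    return Counter(term for term, _tag in set(pairs))

# Toy pairs, not MELT data: one term attracts most of the tags.
toy_pairs = [
    ("language learning", "english"),
    ("language learning", "french"),
    ("language learning", "vocabulary"),
    ("language learning", "exercise"),
    ("mathematics", "algebra"),
    ("history", "rome"),
]
counts = tags_per_term(toy_pairs)
mean = sum(counts.values()) / len(counts)
```

In the toy data the mean is 2 tags per term, yet one term holds 4 tags and the other two hold 1 each, which is the same kind of skew the average hides in the real sample.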

If you look at the distribution of tags, you'll find that many of the top tags are also about languages. Interestingly, many of them repeat the topic of the resource, but some of them (clearly fewer) state something about the nature of the resource (e.g. interactive) or its type (e.g. exercise).

The problem with creating this kind of visualisation of tags on the system level will be that the resources seem to have too many Thesaurus terms. If there are 5 or so indexing terms per resource, everything becomes related to everything else. It might be interesting either to limit the Thesaurus terms to three (as should be the case anyway) or to ask the indexer to give one term priority over the others.
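
A rough sketch of why the number of terms matters: each resource contributes (number of tags) × (number of terms) edges, so capping the terms thins out the graph quickly. The sketch assumes the terms of a resource are listed in priority order, which is an assumption for illustration and not how MELT stores them; the data and function names are invented too.

```python
def count_edges(resources, max_terms=None):
    """Count distinct (term, tag) edges, keeping only the first max_terms terms.

    With max_terms=None all terms of every resource are used.
    """
    edges = set()
    for resource in resources:
        for term in resource["terms"][:max_terms]:
            for tag in resource["tags"]:
                edges.add((term, tag))
    return len(edges)

def shared_terms(a, b, max_terms=None):
    """Return the terms two resources have in common after capping."""
    return set(a["terms"][:max_terms]) & set(b["terms"][:max_terms])

# Toy data, not MELT data; terms are assumed to be in priority order.
resources = [
    {"tags": ["english", "exercise"],
     "terms": ["language learning", "grammar", "primary education", "ICT", "games"]},
    {"tags": ["algebra"],
     "terms": ["mathematics", "ICT", "games", "primary education", "problem solving"]},
]

all_terms = count_edges(resources)        # every term kept
capped = count_edges(resources, 3)        # at most three terms per resource
priority_only = count_edges(resources, 1) # only the priority term
```

With all five terms the two toy resources share three terms, although one is about English and the other about algebra. Capped at three terms they share none, and the edge count drops from 15 to 9, or to 3 with a single priority term.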

The same also goes for content-based recommendations, btw. If there are too many terms, you recommend everything for everyone.
