Monday, November 05, 2007

Notes on "Aspects on Broad Folksonomies"

Aspects on Broad Folksonomies by M. Lux and M. Granitzer (2007)

This paper continues the trend of analysing the underlying statistical properties of broad folksonomies, with the aim of identifying laws and characteristics from which those properties can be inferred. Below are a few notes on what I found interesting, mostly related to the emerging notion of tag quality, something I've also given some thought to.

First, though, some other issues. The paper discusses the emergence of power law distributions in folksonomies. The authors describe the approach they took to fit the sample to a power law, the how-part of which I've sometimes wondered about. The paper aims to analyse whether folksonomies show term distributions similar to those in classical term retrieval (e.g. Zipf's law, with an exponent between 1 and 2). The dataset is from delicious (uh, about 800 000 bookmarks and about 27 000 users; I've got a way to go with my MELT bookmarks).
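These notes don't record which fitting procedure the paper used, so the following is only a rough sketch of one common approach: sort the tag frequencies by rank and take the least-squares slope of log(frequency) against log(rank). The function name `fit_power_law_exponent` and the toy data are made up for illustration. A log-log regression like this is known to give biased estimates on real data, and maximum-likelihood fitting is generally preferred.

```python
import math

def fit_power_law_exponent(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 corresponds to a Zipf-like distribution.
    """
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Toy data (an assumption, not the delicious sample): frequency = 1000 / rank.
toy_frequencies = [1000.0 / rank for rank in range(1, 51)]
slope = fit_power_law_exponent(toy_frequencies)
```

On the toy data the slope comes out at -1 up to floating-point error, since the data were generated from an exact 1/rank law.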

Tag co-occurrence
They are able to show that "for around 80% of the tags of a folksonomy the co-occurring tags follow a power law distribution, which approves Cattuto's assumption. We found that for about 90% of the estimated power law exponent β ∈ [-1.5, -0.5], which shows that for most tags co-occurrence follows a model with similar parameters."
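To make "co-occurring tags" concrete, here is a minimal sketch of how the per-tag co-occurrence counts could be collected before any distribution is fitted. It assumes each bookmark is simply a list of tags; the function name `cooccurrence_counts` and the toy bookmarks are hypothetical, and the paper may well count co-occurrence differently.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(bookmarks):
    """Map each tag to a Counter of the tags it shares a bookmark with."""
    co = defaultdict(Counter)
    for tags in bookmarks:
        unique = set(tags)  # ignore a tag repeated within one bookmark
        for tag in unique:
            for other in unique:
                if other != tag:
                    co[tag][other] += 1
    return co

# Toy bookmarks (an assumption for illustration only).
toy_bookmarks = [
    ["python", "programming", "tutorial"],
    ["python", "programming"],
    ["python", "web"],
    ["design", "web"],
]
co = cooccurrence_counts(toy_bookmarks)
ranked = co["python"].most_common()  # co-occurring tags, most frequent first
```

The ranked counts for a single tag are what one would then test against a power law, one tag at a time.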
Resource and user based tagging characteristics
Secondly, they looked into frequently used tags (used by more than 30 users).
  • For resource statistics (frequency of users tagging the resource with a tag), they found that around 18.4% of resources followed a power law distribution.
    • such tags are assigned by a lot of users to a few resources (head) and by a few users to a lot of different resources (tail)
  • For user statistics (frequency of resources tagged with a tag), around 13% follow a power law.
    • a few users tag a lot, whereas a lot of users tag only a few resources
  • i.e. the characteristics of the user statistics are similar to those of the resource statistics.
  • They argue that those tags which follow a power law w.r.t. users and resources are high-quality tags (i.e. tags describing resources with high accuracy [no misspellings and meaningful tags]) for most of the users involved in the investigated social bookmarking system.
  • A small fraction of tags have overlapping user groups, which points towards sub-communities (user groups sharing the same link selection and tagging behaviour) in the tail of the power law distribution.
    • this was found by splitting resources into 3 groups (high-, mid- and low-rank resources)
They also looked at the big chunk of tags that were not following the power law.
  • Unique assignments. More than half (57%) of the less frequently used tags are used only once. They think these can be seen as "shortcuts" from a user to a resource, or as misspellings. They argue that these tags are useless from a retrieval point of view (hmm..).
  • Personal vocabulary. Especially among the less frequently used tags, some (19%) were used by only one user but assigned to many resources. They are useful for personal retrieval but useless for the rest of the community.
  • Unpopular vocabularies. Between 1/5 and 2/5 of tags are assigned to different resources by different users only once. These are unpopular vocabularies used by a small fraction of users.
  • They conclude that from a retrieval point of view (e.g. inverted indices, TF*IDF) a large fraction of tags are good only for single users or sub-communities, and only the power-law-distributed tags are good for community-wide retrieval.
    • They don't say anything about how to include the large fraction of tags not distributed by a power law into IR methods.
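The first two categories above can be sketched as a simple pass over (user, resource, tag) triples. This is only an illustration under stated assumptions: the triple format, the function name `classify_tags` and the label strings are made up, and "unpopular vocabularies" are left out because the description in these notes is too loose to pin down as a rule.

```python
from collections import defaultdict

def classify_tags(assignments):
    """Label each tag as 'unique', 'personal' or 'shared'.

    unique:   the tag appears in exactly one assignment
    personal: one user applied it to more than one resource
    shared:   more than one user applied it
    """
    users = defaultdict(set)
    counts = defaultdict(int)
    for user, resource, tag in set(assignments):  # drop exact duplicates
        users[tag].add(user)
        counts[tag] += 1
    labels = {}
    for tag, count in counts.items():
        if count == 1:
            labels[tag] = "unique"
        elif len(users[tag]) == 1:
            labels[tag] = "personal"
        else:
            labels[tag] = "shared"
    return labels

# Toy assignments (an assumption for illustration only).
toy_assignments = [
    ("ann", "r1", "toread"), ("ann", "r2", "toread"), ("ann", "r3", "toread"),
    ("bob", "r1", "pyhton"),
    ("ann", "r1", "python"), ("bob", "r1", "python"), ("cat", "r2", "python"),
]
labels = classify_tags(toy_assignments)
```

Here "toread" comes out as personal vocabulary, the misspelled "pyhton" as a unique assignment, and "python" as shared.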
Retrieval Aspects
Q: Do tags add information beyond the description and title for retrieval purposes? This is very much along the lines of what I am interested in, although I will look more into the networks of users. They say that for retrieval, tags can be seen as an additional resource. Moreover, about 50% of the available descriptions contain information similar to that described by the tags, whereas the remaining 50% can be seen as orthogonal information.
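These notes don't say how the paper measured the overlap between tags and descriptions. As a very crude stand-in, one could check which of a bookmark's tags already appear as words in its title and description. The function name `tag_novelty` and the example strings are hypothetical, and exact word matching will miss synonyms and inflected forms.

```python
import re

def tag_novelty(tags, text):
    """Fraction of tags that do not occur as words in the given text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    tags = {tag.lower() for tag in tags}
    if not tags:
        return 0.0
    new_tags = {tag for tag in tags if tag not in words}
    return len(new_tags) / len(tags)

# Toy example (an assumption): two of the four tags repeat the title.
novelty = tag_novelty(
    ["python", "tutorial", "toread", "webdev"],
    "A short Python tutorial for beginners",
)
```

A value near 0 would mean the tags mostly repeat the description, while a value near 1 would mean they add mostly new terms.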

Comment. All of this treats tags only as additional keywords that can be useful for conventional retrieval purposes. I think the tag-resource-user connection is more interesting. Whether a tag is misspelled or belongs to a small user community matters less to me, because the very fact that the resource was tagged shows that the user has an interest in it; thus it is a vote. This aspect has immense potential for retrieval (from a recommender point of view), but is seldom considered in papers that take a very conventional retrieval approach.
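A minimal sketch of the "tagging as a vote" idea, assuming the same (user, resource, tag) triples as raw data: the tag text is ignored entirely, and a resource is scored by how strongly its bookmarkers overlap with the target user. The function name `recommend` and the scoring rule are assumptions for illustration, not something taken from the paper.

```python
from collections import Counter, defaultdict

def recommend(assignments, target_user, top_n=3):
    """Rank unseen resources by votes from users with overlapping bookmarks."""
    bookmarked = defaultdict(set)
    for user, resource, _tag in assignments:  # the tag itself is ignored
        bookmarked[user].add(resource)
    mine = bookmarked.get(target_user, set())
    scores = Counter()
    for user, theirs in bookmarked.items():
        if user == target_user:
            continue
        overlap = len(mine & theirs)
        if overlap == 0:
            continue  # no shared interest, so this user's votes are skipped
        for resource in theirs - mine:
            scores[resource] += overlap
    return [resource for resource, _score in scores.most_common(top_n)]

# Toy data (an assumption); note the misspelled tag still counts as a vote.
toy_votes = [
    ("ann", "r1", "pyhton"), ("ann", "r2", "web"),
    ("bob", "r1", "python"), ("bob", "r2", "web"), ("bob", "r3", "python"),
    ("cat", "r1", "code"), ("cat", "r4", "misc"),
    ("dan", "r5", "cooking"),
]
suggestions = recommend(toy_votes, "ann")
```

For "ann" this ranks "r3" ahead of "r4", because "bob" shares two bookmarks with her and "cat" only one, while "dan" shares none and contributes nothing.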
