Monday, May 04, 2009

Link structure and anchor text

I read that Brin & Page (1998) paper again. A few guidelines to keep in mind:
..our notion of "relevant" to only include the very best documents since there may be tens of thousands of slightly relevant documents. This very high precision is important even at the expense of recall (the total number of relevant documents the system is able to return).
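
To keep the two terms straight, here is a minimal sketch of precision and recall on made-up data. It uses the usual fractional definitions; the document ids, the "spam" entry and the function name are all hypothetical, not from the paper.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for the documents a search returned."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    # Precision: what fraction of the returned documents are relevant.
    precision = len(hits) / len(returned) if returned else 0.0
    # Recall: what fraction of all relevant documents were returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data (assumed): 10 relevant documents exist, the engine returns 4 results.
relevant = {f"doc{i}" for i in range(10)}
returned = ["doc0", "doc1", "doc2", "spam"]
p, r = precision_recall(returned, relevant)
```

In this toy case the engine finds only 3 of the 10 relevant documents (low recall), but 3 of its 4 results are good (high precision), which is the trade-off the quote argues for.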


Two features that produce high-precision results:
  • Link structure is used to create an objective measure of a page’s citation importance that corresponds well with people’s subjective idea of importance. Well, it's that simple..

  • Anchor text:
    ..anchors often provide more accurate descriptions of web pages than the pages themselves. Second, anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs,..
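
The link-structure bullet can be made concrete with a small sketch. The paper gives PageRank as PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1..Tn are the pages linking to A, C(T) is the number of links going out of T, and d is a damping factor (the paper suggests 0.85). The code below just iterates that formula on a toy graph. The graph, the function name and the fixed iteration count are my own assumptions, and this is nowhere near how a real engine computes it at scale.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate the PageRank formula; links maps a page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the (1 - d) base term.
        new_rank = {p: 1.0 - d for p in pages}
        for page, targets in links.items():
            if not targets:
                continue  # assumption: a page with no out-links passes nothing on
            share = rank[page] / len(targets)  # PR(T) / C(T)
            for target in targets:
                new_rank[target] += d * share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is cited by three pages, "d" by none.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

On this toy web, "c" ends up with the highest rank because three pages cite it, and "d" is left with only the base (1 - d) because nothing links to it.
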
The point about anchor text is so interesting. I wonder how well it applies to tags? I bet really well..
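
A minimal sketch of the anchor-text idea, assuming links arrive as (source, target, anchor text) triples: the words of the anchor are indexed under the page being pointed at, not the page that contains the link. The URLs and the function name are made up for illustration.

```python
from collections import defaultdict


def build_anchor_index(links):
    """Map each anchor word to the set of pages that anchors point at."""
    index = defaultdict(set)
    for _source, target, anchor_text in links:
        for word in anchor_text.lower().split():
            index[word].add(target)  # credit the word to the link's target
    return index


# Hypothetical links; note that neither target contains indexable text itself.
links = [
    ("http://example.com/blog", "http://example.com/cat.jpg", "cute cat photo"),
    ("http://example.org/pets", "http://example.com/cat.jpg", "cat picture"),
    ("http://example.org/pets", "http://example.com/tool.exe", "image resizer program"),
]
index = build_anchor_index(links)
```

Here the image becomes findable under "cat" even though it has no text of its own. If the tag idea holds up, a (user, url, tag) triple would fit the same function unchanged.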

I also found this interesting: "it has location information for all hits and so it makes extensive use of proximity in search"
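
To see why keeping hit locations matters, here is a rough sketch of a positional index and a proximity check. It assumes whitespace tokenization and uses the smallest gap between two query words as the proximity signal. The documents and function names are hypothetical, and the paper's actual hit lists and ranking are more involved than this.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map word -> doc id -> list of word positions (the 'hits')."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index


def min_distance(index, word_a, word_b, doc_id):
    """Smallest gap in positions between two words in one document, or None."""
    hits_a = index.get(word_a, {}).get(doc_id, [])
    hits_b = index.get(word_b, {}).get(doc_id, [])
    if not hits_a or not hits_b:
        return None
    return min(abs(a - b) for a in hits_a for b in hits_b)


# Toy documents (assumed): both contain "web" and "search".
docs = {
    "near": "web search engine design",
    "far": "web pages are crawled before any search happens",
}
index = build_positional_index(docs)
near_gap = min_distance(index, "web", "search", "near")
far_gap = min_distance(index, "web", "search", "far")
```

Both documents match the query "web search", but only the positions tell you that the words sit next to each other in one and far apart in the other.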

Differences Between the Web and Well Controlled Collections
  • extreme variation internal to the documents: documents differ internally in their language (both human and programming), vocabulary (email addresses, links, zip codes, phone numbers, product numbers), type or format (text, HTML, PDF, images, sounds), and may even be machine generated (log files or output from a database).
  • external meta information: information that can be inferred about a document, but is not contained within it. Examples of external meta information include things like reputation of the source, update frequency, quality, popularity or usage, and citations. Not only are the possible sources of external meta information varied, but the things that are being measured vary many orders of magnitude as well.


http://www.scribd.com/doc/3208417/The-Anatomy-of-a-LargeScale-Hypertextual-Web-Search-Engine
