Create an Account
username: password:
 
  MemeStreams Logo

Do Not Crawl in the DUST: Different URLs with Similar Text

search

Worthersee
Picture of Worthersee
My Blog
My Profile
My Audience
My Sources
Send Me a Message

sponsored links

Worthersee's topics
Arts
Business
Games
Health and Wellness
Home and Garden
Miscellaneous
Current Events
Recreation
Local Information
Science
Society
Sports
Technology

support us

Get MemeStreams Stuff!


 
Do Not Crawl in the DUST: Different URLs with Similar Text
Topic: Miscellaneous 4:16 pm EST, Feb 20, 2009

We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, and dynamically generates the same page from various different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or web server logs, without examining page contents. Verifying these rules via sampling requires fetching few actual web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.

Do Not Crawl in the DUST: Different URLs with Similar Text



 
 
Powered By Industrial Memetics
RSS2.0