This document provides technical information, along with some background and insight, into what search engine indexing robots should expect to encounter. Technically, the problems arise from misunderstandings and exploitation of anomalies by HTML creators (direct tagging, WYSIWYG editors, and automated systems), and from the tendency of browsers to be very forgiving in how they interpret pages and links. It is therefore not enough to simply read the HTML and HTTP specifications and follow the rules there -- the real world is much messier than that.
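To make the point concrete, here is a minimal sketch (not from the original text) of the kind of tolerance a crawler needs. It uses Python's standard-library html.parser, which is deliberately lenient, much like a browser, and still recovers links from malformed markup. The sample HTML and the LinkExtractor class name are illustrative only.

    # Sketch: extracting links from spec-violating HTML with a lenient parser.
    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        """Collects href values from <a> tags, even in broken markup."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    # Unquoted attribute, unclosed tags, a stray angle bracket: browsers (and
    # this parser) shrug it all off, so a crawler has to cope with it too.
    messy_html = '<p>See <a href=/docs>docs<a href="http://example.com/a b">space</p>>'

    parser = LinkExtractor()
    parser.feed(messy_html)
    print(parser.links)  # ['/docs', 'http://example.com/a b']

A strict, spec-only parser would reject or mangle input like this; a crawler that does the same will miss a large fraction of the real web.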