This document provides technical information, along with some background and insight, into what search engine indexing robots should expect to encounter. Technically, the problems arise from misunderstandings and exploitation of anomalies by HTML creators (direct tagging, WYSIWYG editors, and automated systems), and from the tendency of browsers to be very forgiving in how they interpret pages and links. It is therefore not enough to simply read the HTML and HTTP specifications and follow the rules there -- the real world is much messier than that.
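To make the point concrete, here is a minimal sketch (not from the original text) of the kind of tolerance a crawler needs. It uses Python's standard-library html.parser, which is deliberately lenient, much like a browser, and still recovers links from malformed markup. The sample HTML and the LinkExtractor class name are illustrative only.

    # Sketch: extracting links from spec-violating HTML with a lenient parser.
    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        """Collects href values from <a> tags, even in broken markup."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    # Unquoted attribute, unclosed tags, a stray angle bracket: browsers (and
    # this parser) shrug it all off, so a crawler has to cope with it too.
    messy_html = '<p>See <a href=/docs>docs<a href="http://example.com/a b">space</p>>'

    parser = LinkExtractor()
    parser.feed(messy_html)
    print(parser.links)  # ['/docs', 'http://example.com/a b']

A strict, spec-only parser would reject or mangle input like this; a crawler that does the same will miss a large fraction of the real web.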