] Heritrix is the Internet Archive's open-source,
] extensible, web-scale, archival-quality web crawler
] project.
]
] Heritrix (sometimes spelled heretrix, or misspelled or
] missaid as heratrix/heritix/heretix/heratix) is an
] archaic word for inheritess. Since our crawler seeks to
] collect the digital artifacts of our culture for the
] benefit of future researchers and generations, this name
] seemed apt.

The odds just went up greatly that MemeStreams will keep a cache and revision record of every page that gets meme'd. (Add that to the list of everything else we have promised.) I have not had a chance to look at this in depth yet; it just hit my radar. (via BoingBoing)

Open-sourcing this was a great move. I was thinking about trying to get a part-time job working down at the Internet Archive. I'm a big supporter of everything they are doing over there.