Unbelievable. AOL released a file containing the search engine queries of over 500,000 users during a three-month period. It's being mirrored all over. Here is a screenshot of the download page before it was taken down, complete with a spelling error: "ananomized".

Update: I've imported the data into an SQL database so I can do some data mining. It's about 3.5 GB worth of SQL, so building indexes and running any useful queries is really slow going. Sometime in the next 24 hours, I should be posting some statistics. I have to think about it some more first... From what I've gathered so far, there is no liability in doing so. AOL fucked up. This data is in the hands of many, many, many people. That being the case, I want to see how the data frames the issues we all have with this kind of data being available to law enforcement, marketers, and others. Anyone who has ideas about what questions we should be asking, reply to this with your thoughts.

Since the hot-button issue most directly connected with this is child porn, I've been doing some research focusing on that. The Justice Department wanted Google and other search engines to hand over exactly this information so they could build a profile of what people are searching for when they search for child porn. I've been attempting to do the same thing. Thus far, I've built a pretty expansive table of users (over 300) who have been blatantly searching for child porn. I've done a fair amount of work eliminating false positives, such as people searching for information about how to protect their kids, researching court cases, or looking up information about specific offenses. I've tried to limit the list to people blatantly and repeatedly searching for illegal pictures of pre-teens and whatnot. I'm working on constructing a list of "what people who search for kiddie porn search for."
I also have some indexes building that will let me mine general statistical data, such as what the top queries are. Since I'm working with a laptop that has only a gig of RAM and not the speediest hard drive, it's going to take a while. I expect my machine to be churning for the next few hours.

Update: I don't have powerful enough hardware to mine this. I'm waiting on more resources to become available later tonight.
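
For anyone wondering what the "top queries" pass might look like, here is a minimal sketch using Python's built-in sqlite3 module. It is not the setup described in this post: the table name, the column layout (user id, query text, timestamp), and the sample rows are all hypothetical, made up for illustration. It only produces aggregate counts per query.

```python
import sqlite3

# Hypothetical sample rows. The column layout (user id, query text, timestamp)
# is an assumption for illustration, not taken from the released file.
SAMPLE_ROWS = [
    (1, "weather", "2006-03-01 10:00:00"),
    (1, "weather", "2006-03-02 09:30:00"),
    (2, "weather", "2006-03-02 11:15:00"),
    (2, "maps", "2006-03-03 08:00:00"),
    (3, "maps", "2006-03-04 12:45:00"),
    (3, "recipes", "2006-03-05 18:20:00"),
]


def build_db(rows):
    """Load rows into an in-memory SQLite table, then index the query column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE searches (user_id INTEGER, query TEXT, query_time TEXT)"
    )
    conn.executemany("INSERT INTO searches VALUES (?, ?, ?)", rows)
    # Create the index after the bulk load; maintaining it during the
    # inserts is generally slower on a large import.
    conn.execute("CREATE INDEX idx_searches_query ON searches (query)")
    conn.commit()
    return conn


def top_queries(conn, limit=10):
    """Return (query, total hits, distinct users) for the most frequent queries."""
    cur = conn.execute(
        "SELECT query, COUNT(*) AS hits, COUNT(DISTINCT user_id) AS users "
        "FROM searches GROUP BY query "
        "ORDER BY hits DESC, query ASC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = build_db(SAMPLE_ROWS)
    for query, hits, users in top_queries(conn, 3):
        print(query, hits, users)
```

Counting distinct users alongside raw hits helps separate queries that many people run from queries one person runs over and over, which matters when a handful of heavy users can skew a simple frequency list.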