Create an Account
username: password:
 
  MemeStreams Logo

Predicting Web Spam with HTTP Session Information

search

possibly noteworthy
Picture of possibly noteworthy
My Blog
My Profile
My Audience
My Sources
Send Me a Message

sponsored links

possibly noteworthy's topics
Arts
Business
Games
Health and Wellness
Home and Garden
Miscellaneous
  Humor
Current Events
  War on Terrorism
Recreation
Local Information
  Food
Science
Society
  International Relations
  Politics and Law
   Intellectual Property
  Military
Sports
Technology
  Military Technology
  High Tech Developments

support us

Get MemeStreams Stuff!


 
Predicting Web Spam with HTTP Session Information
Topic: High Tech Developments 7:56 am EST, Nov 20, 2008

Web spam is a widely-recognized threat to the quality and security of the Web. Web spam pages pollute search engine indexes, burden Web crawlers and Web mining services, and expose users to dangerous Web-borne malware. To defend against Web spam, most previous research analyzes the contents of Web pages and the link structure of the Web graph. Unfortunately, these heavyweight approaches require full downloads of both legitimate and spam pages to be effective, making real-time deployment of these techniques infeasible for Web browsers, high-performance Web crawlers, and real-time Web applications. In this paper, we present a lightweight, predictive approach to Web spam classification that relies exclusively on HTTP session information (i.e., hosting IP addresses and HTTP session headers). Concretely, we built an HTTP session classifier based on our predictive technique, and by incorporating this classifier into HTTP retrieval operations, we are able to detect Web spam pages before the actual content transfer. As a result, our approach protects Web users from Webpropagated malware, and it generates significant bandwidth and storage savings. By applying our predictive technique to a corpus of almost 350,000 Web spam instances and almost 400,000 legitimate instances, we were able to successfully detect 88.2% of the Web spam pages with a false positive rate of only 0.4%. These classification results are superior to previous evaluation results obtained with traditional linkbased and content-based techniques. Additionally, our experiments show that our approach saves an average of 15.4 KB of bandwidth and storage resources for every successfully identified Web spam page, while only adding an average of 101 microseconds to each HTTP retrieval operation. Therefore, our predictive technique can be successfully deployed in applications that demand real-time spam detection.

Predicting Web Spam with HTTP Session Information



 
 
Powered By Industrial Memetics
RSS2.0