The techniques used by unloveable rogues who automate search engine manipulation attacks themed around breaking news to sling scareware have been unpicked by new research from Sophos. A research paper published on Wednesday by Sophos researchers Fraser Howard and Onur Komili lifts the lid on the search engine optimisation …

COMMENTS

Wednesday 31st March 2010 14:13 GMT BristolBachelor

The solution

The solution is for the search engines to harvest from IP addresses that cannot be associated with the search engines. That way, the content cannot be customised depending on whether it is Google or some poor user fetching it.

Would also solve some of the problems with search engine poisoning for other things, like Google searches for products.

I'm sure that someone like Google could easily arrange to "borrow" IP addresses from large ISPs on a random basis.
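
As a rough illustration of the idea above, the sketch below fetches the same page from a known crawler address and from a "borrowed" address and checks whether the content differs. Nothing here touches the network: `cloaking_site`, `honest_site`, `looks_cloaked`, `CRAWLER_IPS` and the addresses are all hypothetical stand-ins. A real comparison would also need to tolerate legitimate differences between fetches (ads, timestamps), which this sketch ignores.

```python
# Hypothetical simulation only: no real network traffic, all names are illustrative.
CRAWLER_IPS = {"192.0.2.10"}  # documentation-range address standing in for a known crawler pool


def cloaking_site(ip: str) -> str:
    """Stand-in for a poisoned page that customises content by visitor IP address."""
    if ip in CRAWLER_IPS:
        return "Breaking news: keyword-stuffed article text"
    return "<script>location='http://scareware.example/scan'</script>"


def honest_site(ip: str) -> str:
    """Stand-in for a page that serves every visitor the same content."""
    return "Same article for everyone"


def looks_cloaked(fetch, crawler_ip: str, borrowed_ip: str) -> bool:
    """Fetch the page from both addresses and report whether the content differs."""
    return fetch(crawler_ip) != fetch(borrowed_ip)


print(looks_cloaked(cloaking_site, "192.0.2.10", "198.51.100.7"))  # True
print(looks_cloaked(honest_site, "192.0.2.10", "198.51.100.7"))    # False
```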

1. Wednesday 31st March 2010 15:22 GMT A J Stiles
  
  ..... not quite.
  
  Search engines don't just use a predictable pool of IP addresses; they also use a predictable user-agent string.
  
2. Wednesday 31st March 2010 15:23 GMT Daniel 1
  
  Do you really think it could be that easy?
  
  Hell, if that was the solution, I'd no doubt be using the 'BristolBachelorBot' search engine, today, wouldn't I? The reason why there's only one serious contender, and one wannabe, in this market, is because it's hard.
  
  Even if the Googlebot did not explicitly identify itself as such, the spider can easily be recognised simply by the patterns of its behaviour. For instance, unlike regular web-scrapers, search engine spiders tend to space their requests at regular intervals over a given period of time, and will avoid requesting certain content (like, for instance, JavaScript files whose functionality is not in some way triggered by the page request), to avoid consuming a site's bandwidth: a visit from the Googlebot can easily take half a day, if you have a lot of content. The regularity and nature of the requests can act as a signature.
  
  Even if those factors didn't alert you that a search engine was visiting, the very fact that it reads your robots.txt file is a bit of a giveaway. I'm sure you wouldn't advocate that search engines stop reading robots.txt?
  
  Google regularly and deliberately vary the behaviour of their spiders to throw these people off, but it's a constantly moving battle. I really don't think people outside of search realise the enormity of the problem of automatically gathering realistic data on the Web these days. We only notice it when it fails.
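
The behavioural signature described in this reply (regularly spaced requests, plus a fetch of robots.txt) can be sketched roughly as follows. This is only an illustration under stated assumptions: `looks_like_spider`, the log format and the `max_jitter` threshold are all hypothetical, and nothing here reflects how any real site or search engine actually does it.

```python
from statistics import mean, pstdev


def looks_like_spider(requests, max_jitter=0.1):
    """Guess whether one client's request log looks like a search engine spider.

    requests: list of (timestamp_seconds, path) tuples in time order.
    max_jitter: assumed threshold on how irregular the gaps may be
                (standard deviation of the gaps divided by their mean).
    """
    paths = [path for _, path in requests]
    # Reading robots.txt is treated as an immediate giveaway.
    if "/robots.txt" in paths:
        return True

    times = [t for t, _ in requests]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if len(gaps) < 3:
        return False  # too little data to judge regularity

    average_gap = mean(gaps)
    # Very evenly spaced requests suggest an automated, polite crawler.
    return average_gap > 0 and pstdev(gaps) / average_gap <= max_jitter


polite_bot = [(0, "/a"), (60, "/b"), (120, "/c"), (180, "/d")]
human = [(0, "/a"), (2, "/a.js"), (3, "/b"), (95, "/c")]
print(looks_like_spider(polite_bot))  # True
print(looks_like_spider(human))       # False
```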
  
Wednesday 31st March 2010 22:42 GMT Anonymous Coward

Here's one to add to the IP/hosts blocklist

Here's the IP of one nasty malware "virus checker" (worm) that seems to crash my browsers every time it gets referred to by Google.

IP to block 89.248.174.23

PeerBlock FTW

Wednesday 31st March 2010 22:43 GMT Anonymous Coward

the real answer

Is to spoof your user agent as Googlebot and hide your referer.

Or just use trusted news sites for your breaking news stories, maybe.
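
For what it's worth, the user-agent half of that suggestion might look something like the sketch below, which builds (but never sends) a request using Python's standard `urllib.request`. The URL and the `build_request` helper are made up for illustration. As noted earlier in the thread, search engines also come from a predictable pool of IP addresses, so a site that checks the address would not be fooled by the header alone.

```python
from urllib.request import Request

# The user-agent string Googlebot publicly identifies itself with.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_request(url: str) -> Request:
    """Build, without sending, a request that claims to be Googlebot.

    No Referer header is set, so the site cannot tell that the visitor
    arrived from a search results page.
    """
    return Request(url, headers={"User-Agent": GOOGLEBOT_UA})


# Hypothetical URL, used only to show the headers that would be sent.
req = build_request("http://news.example/breaking-story")
print(req.get_header("User-agent"))  # urllib stores header names capitalised this way
print(req.has_header("Referer"))     # False
```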
