Google crawls The Invisible Web
Henry Cobb
Zero day encroachment #
Posted Monday 14th April 2008 21:38 GMT

So if my content is only reachable by a database-driven GET URL I can suddenly expect to have tons of load out of the blue some day?
How will they handle POST requests? Build a form on the fly for the user?
And for everybody who will suddenly have their hidden content exposed without warning? (They didn't know they had to exclude this sort of fnording around in robots.txt after all...)
Stephen Stagg
@Henry Cobb #
Posted Monday 14th April 2008 23:46 GMT

If you are hiding data by using simple forms (we're not talking about password hacking here) then you have bigger problems.
And for most people, database load won't be an issue. If you're controlling load by making it difficult for people to find publicly accessible data, then I pity you, and if you are worried about (or find) the Googlebot loading your site periodically, that's what robots.txt is for.
S
I for one, welcome our Googlebot overlords . . . #
Posted Tuesday 15th April 2008 00:19 GMT

" . . .particularly useful sites receive this treatment, and our crawl agent, the ever-friendly Googlebot . . ."
What do I have to do to become "useful?" You can make Soylent green out of me already - right?
Kevin
@Henry Cobb #
Posted Tuesday 15th April 2008 00:19 GMT
It shouldn't make a difference if it's a GET or a POST. It's just as easy to fake either. The only difference is the field values are in the URL for the GET and in the request body for the POST.
And as Stephen Stagg points out, they're not trying to get around your security or logins or anything like that. Consider online shopping sites - now they can "browse" the catalogue if it's only available by form which is quite common these days.
On some sites, it's as simple as selecting a region before you get a customised site.
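To illustrate the GET/POST point above, here is a minimal sketch using Python's standard library. It only builds the two requests and never sends them. The URL and the form fields (`region`, `category`) are hypothetical, chosen purely for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields and endpoint, for illustration only.
fields = {"region": "uk", "category": "books"}
base = "http://shop.example/catalogue"


def build_get(url, form_fields):
    """GET: the field values travel in the URL's query string."""
    return Request(f"{url}?{urlencode(form_fields)}", method="GET")


def build_post(url, form_fields):
    """POST: the same field values travel in the request body."""
    body = urlencode(form_fields).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


get_req = build_get(base, fields)
post_req = build_post(base, fields)

# Nothing is sent over the network; we only inspect the request objects.
print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

Either request is equally easy for a script to construct, which is the point being made: the encoded fields are identical and only their location differs.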
Martin Edwards
Watch out! #
Posted Tuesday 15th April 2008 00:33 GMT
Soon, the Googlebot will be commenting on El Reg articles! It might even choose the Paris icon! No wait, it only does it on "high-quality sites" ;-)
But really, does this mean Google might start inadvertently spamming forums, sending queries to helpdesks, requesting password resets, and even (although highly unlikely) logging into websites' member areas and then indexing the results?
Steven Knox
Perhaps El Reg was a test site... #
Posted Tuesday 15th April 2008 02:01 GMT

That would explain amanfrommars!
GoogleBot
Pfft #
Posted Tuesday 15th April 2008 05:06 GMT

Bollocks to the lot of you. Especially that "Henry Cobb" meatbag from 800 bytes ago.
Paris Hilton becau
Segmentation Fault.
Martin Budden
buybot #
Posted Tuesday 15th April 2008 05:08 GMT

I look forward to the amusing story El Reg is going to print when it finds out that the bot has inadvertently bought millions of expensive items from online stores.
E
Hahahaha! #
Posted Tuesday 15th April 2008 05:08 GMT
If it's not careful, Google will crawl up its own formdament.
Anonymous Coward
GoogleBot's Al Gore rhythm (hat please!) #
Posted Tuesday 15th April 2008 05:30 GMT

I've heard rumors that these new GoogleBots are actually a half-million third world children with OLPCs trawling the web twenty-four hours a day. The most recent trend is to pack them into shipping containers to be sent abroad to work on a contractual basis. Apparently, using their new internally developed compression technology, Google achieves four times the child-density per container of their nearest rival, and The Environmentalists are praising Google's efficient harnessing of our most precious, carbon-neutral renewable resource.
Just go ahead and keep my coat. You've earned it.
amanfromMars
@Perhaps El Reg was a test site... #
Posted Tuesday 15th April 2008 08:01 GMT

And the result analysis of such BetaTesting, Steven? Are Robots Human with Network InterNetworking IQs/ICQs?
And if El Reg was a test site, what is it after Testing? An Application of Special Access ProgramMIng and/or 4Access2Special ProgramMIng ....AI Stealth Projects Portal .....Virtual PerlyGatesWay with Pythonesque ASPs ..... for the Full Monty of SAIS..... Special Advanced IntelAIgent Serverings .....And just the tip of a NIceberg/QuITe Titanic Quarter Offering, Holywood Palace ProjectIOn Style?
Any doubts would be yours .....and just whenever you are so close to dismissing Disbelief ..... the First Frontier and Final Hurdle for Reality Imagined Virtually and IT that is Truly SurReal....... Life in Love is AIdDream in Love with Life for the Holy Grail of XXXXistentialist Code .....QuITe Peculiar Particular Parameters for Global Operating Devices, XXXXCommunicating. ......... AI Work in Constant Progress, although hardly Artificial. :-) whenever IT is for Real, Virtually.
Matt
Infinite loop #
Posted Tuesday 15th April 2008 08:01 GMT
Hmmm... I wonder if google will submit a search on its own page and disappear in an endless infinite loop...?
Mark Otway
Anyone spot the link between... #
Posted Tuesday 15th April 2008 08:01 GMT
This story and this one:
http://www.theregister.co.uk/2008/04/14/msn_captcha_breaking/
Seems pretty obvious to me. Google's trying to hack MSN. ;)
Mike F
@Matt #
Posted Tuesday 15th April 2008 08:44 GMT

An infinite loop would be cool but Google seems to have covered that already :P
Disallow: /search
Found in their robots.txt: http://www.google.co.uk/robots.txt
Mine's the one with the Google logo on it...
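For anyone who wants to see what that rule does, here is a minimal sketch with Python's `urllib.robotparser`. Only the `Disallow: /search` line comes from the robots.txt quoted above; the `User-agent: *` line and the two test URLs are assumptions added so the parser has a complete record to work with.

```python
from urllib.robotparser import RobotFileParser

# Only "Disallow: /search" is quoted in the thread; the User-agent line
# is an assumption so that the rule belongs to a complete record.
rules = [
    "User-agent: *",
    "Disallow: /search",
]

parser = RobotFileParser()
parser.parse(rules)  # parse in-memory lines instead of fetching a URL

# Hypothetical URLs used only to exercise the rule.
search_allowed = parser.can_fetch("Googlebot", "http://www.google.co.uk/search?q=google")
other_allowed = parser.can_fetch("Googlebot", "http://www.google.co.uk/about")

print("search page allowed:", search_allowed)
print("other page allowed:", other_allowed)
```

A crawler that honours the file would skip the search results page, so the loop never starts.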
Richard Tobin
@Kevin #
Posted Tuesday 15th April 2008 09:13 GMT
GET and POST are very different. Only POST should be used for destructive changes and to request real-world actions. By sending POST requests they might order a holiday, or post a message on this page.
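One way to read the point above: a crawler that wants to avoid triggering real-world actions could restrict itself to forms submitted with GET. The sketch below is an illustration only and assumes nothing about how Googlebot actually works; the class name, function name, and sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class FormMethodCollector(HTMLParser):
    """Collects (action, method) pairs for every <form> tag seen."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            attr = dict(attrs)
            # HTML forms default to GET when no method attribute is given.
            method = (attr.get("method") or "get").lower()
            self.forms.append((attr.get("action") or "", method))


def crawlable_forms(html_text):
    """Return the actions of forms a cautious crawler might submit (GET only)."""
    collector = FormMethodCollector()
    collector.feed(html_text)
    return [action for action, method in collector.forms if method == "get"]


# Hypothetical page with one GET form, one POST form, and one form
# that omits the method attribute.
page = """
<form action="/search.php" method="get"><input name="search"></form>
<form action="/order" method="post"><input name="item"></form>
<form action="/region"><select name="region"></select></form>
"""

safe_actions = crawlable_forms(page)
print(safe_actions)
```

This only protects sites that follow the convention; a site that performs destructive changes on GET would still be exposed.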
Stephen
WTF!!!oNONEONE! #
Posted Tuesday 15th April 2008 10:11 GMT
This seems absolutely ridiculous. If they post data to a form and index the resulting page, how on earth will a user ever see the same page? Build a hidden form on the fly with some JavaScript hitting submit for them?
Steve B
Never get rid of the B******s now then #
Posted Tuesday 15th April 2008 10:21 GMT
I installed a tracking mechanism on my website to see who was where and when.
No matter what time of day I looked and cleared the logs, within 10 minutes the Googlebot was back trawling the site. Due to the circumstances the site hardly ever changed, so I don't believe we were singled out; I just think they are programmed badly.
Michael
Perhaps but... #
Posted Tuesday 15th April 2008 12:20 GMT
> By sending POST requests they might.....post a message on this page.
Unlikely given that you can read this page without needing to post.
google.co.uk has the kind of form they are likely to use.
The results the bot gets will be full of kelkoo, wikipedia and other crap. So it'll probably figure out that pages from Google are rubbish and delete itself in a fit of recursive crap-finding...
Anonymous Coward
@stephen #
Posted Tuesday 15th April 2008 15:00 GMT
Maybe look at Google's cached version?
Gianni Straniero
You're all just jealous #
Posted Tuesday 15th April 2008 15:40 GMT

Google's been indexing my site like this for weeks. I keep seeing googlebot requests logged for things like this:
/search.php?search=not
/search.php?search=conflict
/search.php?search=competitions
/search.php?search=broaden
/search.php?search=colonial
/search.php?search=justify
/search.php?search=kantian
Also, somewhat disturbingly
/search.php?search=oral
and the weirdest
/search.php?search=yoou
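If you want to pull the search terms out of requests like these, a minimal sketch with Python's `urllib.parse` follows. The paths are copied from the list above; the function name and the assumption that the paths have already been extracted from a log file are mine.

```python
from urllib.parse import parse_qs, urlsplit

# A few of the request paths listed above, assumed to have been
# extracted from the server log already.
logged_paths = [
    "/search.php?search=not",
    "/search.php?search=conflict",
    "/search.php?search=kantian",
    "/search.php?search=yoou",
]


def search_terms(paths, param="search"):
    """Return the values of the given query parameter, in request order."""
    terms = []
    for path in paths:
        query = urlsplit(path).query
        terms.extend(parse_qs(query).get(param, []))
    return terms


terms = search_terms(logged_paths)
print(terms)
```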
The Dark Lord
Thank you El Reg #
Posted Wednesday 16th April 2008 08:30 GMT
For you have solved a mystery for me.
I have an application which tracks when users try to change their passwords on my site. Certain events warrant immediate e-mails, which are nicely delivered to my PDA.
For a while, I've been getting sporadic "password change attempt on guest account" messages. The app stops the change from being successful, so I wasn't over-bothered, but it was "on the list".
I looked up the IP of the agent attempting the change, and it was a Google bot!
Robots.txt is duly edited...
Brutus
I am enlightened... #
Posted Wednesday 16th April 2008 09:19 GMT

for I can see the fnords!
Hail Eris