Facebook has updated its robots.txt file so that the site can only be crawled by a short list of search engines, including Google, Microsoft's Bing, China's Baidu, Russia's Yandex, and a few others. Previously, Facebook's robots.txt allowed anyone to crawl the site, although the company had threatened to sue at least one …
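For readers unfamiliar with how such an allow-list works, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt text below is a made-up illustration of the pattern described (named crawlers allowed, everyone else disallowed), not Facebook's actual file, and the helper names and the "SomeRandomBot" agent are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical allow-list: two named crawlers may fetch anything,
# every other user agent is shut out. NOT Facebook's real robots.txt.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def may_crawl(parser, user_agent, url):
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print(may_crawl(parser, "Googlebot", "/profile"))
print(may_crawl(parser, "SomeRandomBot", "/profile"))
```

A well-behaved crawler would run a check like may_crawl() before every request. Nothing in the file enforces the policy; it only states it.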
Any search engine that excludes Facebook pages from its search results, intentionally or not, gets the thumbs up from me.
Facebook pages are complete drivel, second only to MySpace pages.
Could it not be argued that a search company should make a conscious decision before crawling certain sites rather than rely on the presence of a robots.txt file?
especially one that has the wherewithal to hire lawyers (lots of lawyers).
respect robots.txt or welcome to my infinite tarpit
It's entirely possible to tarpit a crawler you really don't like by generating an infinite number of random pages and links for it extremely slowly, tying up its resources for months on end. The robots.txt protocol is so well known that there is no excuse for a site owner not to use it to express policy or for a crawler operator not to respect it.
A crawler that ignores a site owner's wishes can be tarpitted until it gives up. So there is no point suing a crawler that respects robots.txt, and there is a better punishment than a lawsuit available for one that doesn't. After all, a crawler is only making automated use of information you have chosen to publish.
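A rough sketch of the tarpit idea, for the curious. Every name here (tarpit_links, tarpit_page, the /maze/ prefix) is made up for illustration. One swap from the description above: instead of truly random links, this derives them from a hash of the requested path, so the maze is effectively endless but the server stores nothing. The delay is the "extremely slowly" part; a real server would apply it per request only to clients that ignored robots.txt.

```python
import hashlib
import time


def tarpit_links(path, n_links=5):
    """Derive n_links child paths from the requested path.

    Hashing path plus an index gives links that look random to a crawler
    but are reproducible, so no state has to be kept per page.
    """
    links = []
    for i in range(n_links):
        digest = hashlib.sha256(f"{path}#{i}".encode("utf-8")).hexdigest()
        links.append(f"/maze/{digest[:16]}")  # hypothetical URL prefix
    return links


def tarpit_page(path, n_links=5, delay_seconds=0.0):
    """Build one maze page, stalling for delay_seconds before answering."""
    time.sleep(delay_seconds)  # waste the crawler's time, not much of ours
    anchors = "\n".join(
        f'<a href="{href}">{href}</a>' for href in tarpit_links(path, n_links)
    )
    return f"<html><body>\n{anchors}\n</body></html>"


page = tarpit_page("/maze/start", n_links=3)
print(page)
```

Each generated link leads to another page built the same way, so a crawler that follows links blindly never runs out of pages to fetch.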
Populating Google Me?
Wouldn't it be ironic if Facebook's irresponsible approach to user privacy allowed Google to auto-populate their rumoured rival service with user accounts?
Expect an email from Google soon, starting: "If you, like many others, are unhappy with Facebook, then we're pleased to tell you that we've already prepared you an account at Google Me, with the same login details, same friends and groups lists ..."
Google already have...
their own social networking site, it's called Orkut, so it *could* happen
Don't blame them
I've run a long-tail site, and when I wasn't watching very hard it was brought to a grinding halt by crawler traffic placing 5x the load on the server that the actual user traffic was generating. I ended up blocking the random crawlers as well - half of them were effectively just stealing the content, and most of the rest of them were sending next to no traffic to us anyway.
It's a start
Now, can I specify a custom robots.txt so that NO search engines can crawl my Facebook stuff?
I suspect I'm OK already since I go in every few weeks and make sure that nobody I haven't accepted as a friend can even tell that I exist on FB, let alone see any of my stuff.
Yes, some of these crawlers are really disreputable. If Facebook doesn't do the same under the counter, I'm a goldfish, even before you consider what they've got acting as CEO...
Fixed the quote...
"Some sleazy crawlers simply aggregate user data en masse and then sell it, which we view as a threat to our own sleazy business model"