Google open sources standardized code in bid to become Mr Robots.txt

Google on Monday released its robots.txt parsing and matching library as open source, in the hope that its now-public code will encourage web developers to agree on a standard way to spell out the proper etiquette for web crawlers. The C++ library powers Googlebot, the company's crawler for indexing websites in accordance …

  1. RyokuMas Silver badge
    FAIL

    Never mind...

    " Developers have become worried... The company [Google] has so much sway over the web, they fret, that it doesn't have to consult with the web community."

Oh, stop worrying and let's all get back to griping about evil Microsoft like we were some twenty-or-so years ago!

    /sarcasm

    1. Claverhouse Silver badge

      Re: Never mind...

      The trouble is, Evil once done, hangs around forever, and gives newer people the perpetual license to do evil.

    2. teknopaul Silver badge

      Re: Never mind...

I've got an idea; I wonder if they will listen

      Google no

      Duckduckgo yes

      /sargasm

  2. Pascal Monett Silver badge
    Windows

    That's your problem right there

    "it includes code to accept five different misspellings of the "disallow" directive in robots.txt"

Technical solution: code in all the different forms of a reserved word, in order to ensure that every fat-fingered idiot who can't spell will still get his directive working.

Real-life solution: learn how to spell 'disallow'.

    No wonder code gets sloppy. Back in my day you barely had enough bytes to check one version of a word.

    Kids are spoiled these days.
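The article's "five different misspellings" line describes a lenient key-matching step. A minimal Python sketch of that idea follows; the actual Google library is C++, and the exact variant spellings below are a guess, not the library's real list.

```python
# Lenient robots.txt directive matching, in the spirit of Googlebot's
# parser. These variant spellings are illustrative assumptions only.
DISALLOW_VARIANTS = {"disallow", "dissallow", "dissalow",
                     "disalow", "diasllow", "disallaw"}

def parse_line(line: str):
    """Return (directive, value), normalising misspelled 'disallow' keys."""
    key, _, value = line.partition(":")
    key = key.strip().lower()
    if key in DISALLOW_VARIANTS:
        key = "disallow"
    return key, value.strip()

print(parse_line("Dissalow: /private/"))   # → ('disallow', '/private/')
```

The trade-off the thread is arguing about is visible here: the lookup set makes broken files "work", at the cost of blessing the misspellings forever.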

    1. Alan J. Wylie

      Re: That's your problem right there

      learn how to spell 'disallow'

      In the same way as RFC 1945's spelling of "referer"?

      And always remember Postel's Law

      1. Claptrap314 Silver badge

        Re: That's your problem right there

        Postel's Law turned out to be a bad thing. Generosity == Ambiguity. Worse, it encourages sloppiness.

        1. Michael Wojcik Silver badge

          Re: That's your problem right there

          Postel's Law turned out to be a bad thing.

          The canonical name is "Postel Interoperability Principle", but I'll accept "Postel's Law". ("Interoperability Principal" was disabled in 2012 to force upgrades to the correct homophone.)

          The PIP certainly can lead to some unfortunate security vulnerabilities. This was widely discussed on, I think, BUGTRAQ some years back regarding non-canonical UTF-8 encodings. Filters would fail to recognize non-canonical sequences for code points that subsequent parsers would decode and interpret as special characters.
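The UTF-8 issue described above can be shown in a few lines: a byte-level filter looks for a dangerous character, but a later, over-generous decoder accepts an overlong encoding of that same character. The toy decoder below is deliberately wrong in exactly the Postel-ish way; a strict UTF-8 decoder rejects `0xC0 0xAF` outright.

```python
def lenient_decode(data: bytes) -> str:
    """Toy decoder that, unlike a strict UTF-8 decoder, accepts
    overlong two-byte sequences (the 'generous' mistake)."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b)); i += 1
        elif 0xC0 <= b < 0xE0 and i + 1 < len(data):
            # a strict decoder would reject lead bytes 0xC0/0xC1 as overlong
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            out.append("\ufffd"); i += 1
    return "".join(out)

payload = b"..\xc0\xaf.."                 # overlong encoding of '/'
assert b"/" not in payload                # byte-level filter sees nothing
assert lenient_decode(payload) == "../.." # later parser sees a traversal
```

Generosity == Ambiguity, as the parent post says: the filter and the parser no longer agree on what the bytes mean.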

      2. MOH

        Re: That's your problem right there

        Well, Google solved that problem for us too

    2. Martin an gof Silver badge

      Re: That's your problem right there

      barely had enough bytes

Suspect back then it would have been shortened to 'dis' or 'no' or something equally near-cryptic. I quite like the nostalgia hit of the Linux command line with cd, mv, rm, ls, dd and suchlike. Sort of reminds me of the classic Acorn BBC OS "star commands", where the command for listing a directory (folder) was *cat but could be shortened to just *.

M.

      1. swm Bronze badge

        Re: That's your problem right there

        Wow! That brings back memories! The "cat" (for catalogue) command was on the Dartmouth Time Sharing System in 1964.

      2. Anonymous Coward
        Anonymous Coward

        Re: That's your problem right there

        Re: cd, mv, rm, ls, dd

        I always assumed that the command names were short because of the natural sysadmin characteristics of both efficiency and laziness: why tire your fingers typing longer command names (or parameters «glowers at GNU») than you need to?

        (Just to be contrary: "more" is a whole wasteful 4 characters, though. Bah.)

      3. Martin J Hooper
        Happy

        Re: That's your problem right there

        Then there is the Econet Login:

        *i am <user>

    3. Claptrap314 Silver badge

      Re: That's your problem right there

      One of the benefits of having an actual standard is that you can write a validator against it. And against parsers.
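To illustrate the point: once the directive set is pinned down by a standard, a validator is a few lines of work. This sketch assumes a directive list based on the REP draft's core fields plus the common `sitemap` extension; the real standard's grammar is richer.

```python
# Minimal robots.txt validator sketch. KNOWN is an assumption based on
# the REP draft's core directives, not an exhaustive list.
KNOWN = {"user-agent", "allow", "disallow", "sitemap"}

def validate(text: str):
    problems = []
    for n, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()   # drop comments and padding
        if not line:
            continue
        key, sep, _ = line.partition(":")
        if not sep:
            problems.append(f"line {n}: no ':' separator")
        elif key.strip().lower() not in KNOWN:
            problems.append(f"line {n}: unknown directive {key.strip()!r}")
    return problems

print(validate("User-agent: *\nDissalow: /tmp/\n"))
# → ["line 2: unknown directive 'Dissalow'"]
```

Without an agreed KNOWN set, that second branch is unwritable, which is the commenter's point.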

    4. IGotOut
      Trollface

      Re: That's your problem right there

      You have to allow for alternative spellings, to cater for the American mangling of the English language.

  3. chubby_moth

    Gripes (plural)

I think my gripes with Google have only increased with their success in dominating the web, especially now with Android and the inevitability of the app store for nearly anything. Yeah, F-Droid is an alternative, as is anything else just as slurpy but smaller. All in all it is a rather sick ecosystem by now, just as the Microsoft-dominated one was. Still sicker are the "social" media sites, the software that relates to them, and their scope on our society.

When it comes to open standards, Google has long since lost its innocence, so some healthy scepticism is required. On the other hand, they could have gone unilateral and forced the issue; I guess that is often how software gets open sourced. In the case of the REP, I think it could use some form of standardisation, but I wouldn't be surprised to hear Google say, "Thanks for the input, we've incorporated some of it in our new proprietary product that we will be pushing up your (*) .. All you people are belong to us!"

  4. andy 103
    Boffin

    No feedback

    The problem with robots.txt is that as a developer or website operator it provides zero feedback in the event of a fuck up.

For example, if you deploy a site from a development environment and forget to change "Disallow: /" in production, the "feedback" is that your site might drop out of Google's index. If you're making money through, for example, an ecommerce website, that could be a huge problem. At which point your only option is to wait for it to be reindexed after rectifying the problem.

    Equally if you do it the other way round you can end up with a development site getting indexed and then have to deal with getting it removed from their index, which is a manual and time-delayed process.

    There are third party tools including Google's Analytics and WMT that will notify you about problems like this, or the absence of that file within a webspace. But the default scenario is one where you might not know anything about what's happened until it's too late.
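One way to get that missing feedback is to check for the footgun yourself before deploying. A sketch of such a pre-deploy check follows; the function name is mine, and the group-matching here is simplified (real REP groups can list several user-agents), so treat it as a starting point, not a complete parser.

```python
def blocks_everything(robots_txt: str) -> bool:
    """True if the '*' group disallows the whole site ('Disallow: /').
    Simplified: assumes one User-agent line per group."""
    in_star_group = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        key, _, value = (p.strip() for p in line.partition(":"))
        key = key.lower()
        if key == "user-agent":
            in_star_group = (value == "*")
        elif key == "disallow" and in_star_group and value == "/":
            return True
    return False

# e.g. fail the deploy pipeline before the crawlers ever notice:
assert not blocks_everything("User-agent: *\nDisallow: /admin/\n")
assert blocks_everything("User-agent: *\nDisallow: /\n")
```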

    1. DougS Silver badge

      Re: No feedback

Why is it Google's problem if you don't configure your server correctly to allow indexing? How are they supposed to tell the difference between a server where "Disallow: /" was supposed to have been removed and one where it is very much desired to be left in place? That is 100% a server-side issue; you should talk to Red Hat about a possible solution there, not Google (i.e. when you first change the default contents of the WWW tree but leave robots.txt unchanged, send an email to root reminding them it needs to be changed if they want the site indexed).

Agreed on removing stuff from the index that doesn't belong: maybe there should be a "purge" directive that tells crawlers to immediately remove anything covered by it, rather than observing whatever policy they may have for caching/retaining it. Then the site owner is in charge of triggering it.
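To be clear, no such directive exists in robots.txt today; the sketch below only shows how a crawler might recognise the hypothetical "Purge:" field proposed above. Directive name and semantics are entirely invented.

```python
# Purely hypothetical: 'Purge:' is NOT a real robots.txt directive.
# This sketches how a crawler could collect the proposed purge paths.
def purge_paths(robots_txt: str):
    paths = []
    for raw in robots_txt.splitlines():
        key, _, value = raw.partition(":")
        if key.strip().lower() == "purge" and value.strip():
            paths.append(value.strip())
    return paths

print(purge_paths("User-agent: *\nPurge: /old-dev-site/\n"))
# → ['/old-dev-site/']
```

A cooperating crawler would then drop matching URLs from its index immediately instead of waiting for its normal recrawl/expiry cycle, which is the site-owner-in-charge behaviour the comment asks for.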

    2. Claptrap314 Silver badge

      Re: No feedback

      That's what checkers & validators are for. Except that they are impossible to write without a standard. I'm surprised it has taken this long.

    3. Anonymous Coward
      Anonymous Coward

      Re: No feedback

      But surely your development site should only be accessible from the local network or a VPN for testers/clients, and not the whole interweb, anyway?

      And as robots.txt has always made clear: it's a polite request, not an order. I don't doubt for a moment that there are black-hat spiders out there which will specifically try to access and index URIs that they are "asked" not to.
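Exactly: honouring the file is a voluntary step the crawler has to take itself, for instance with Python's standard-library robots.txt parser. A black-hat spider simply skips this check.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# parse() takes an iterable of lines; set_url()/read() would fetch a
# live robots.txt from a site instead (example.com is a placeholder).
rp.parse("User-agent: *\nDisallow: /private/\n".splitlines())

print(rp.can_fetch("*", "https://example.com/private/x"))  # → False
print(rp.can_fetch("*", "https://example.com/public/y"))   # → True
```

Nothing stops a crawler from ignoring the `False` and fetching anyway; the file is a request, not an access control.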
