Robots.txt tells hackers the places you don't want them to look

Melbourne penetration tester Thiebaud Weksteen is warning system administrators that robots.txt files can give attackers valuable information about potential targets by pointing to the directories their owners are trying to protect. Robots.txt files tell search engines which directories on a web server they can and cannot …

  1. Destroy All Monsters Silver badge
    Holmes

    It's nice to have regular recalls, but...

    Didn't we have that discussion back before Y2K?

    1. Alister

      Re: It's nice to have regular recalls, but...

      I was just going to post the same thing, this is hardly a new revelation.

      Any sysadmin who manages web servers has to balance the plaintive cries of SEO consultants (who want robots.txt) with those of Security consultants (who don't)...

    2. ElsmarMarc

      Re: It's nice to have regular recalls, but...

      Yup - I have always seen robots.txt as a courtesy to "good" bots to say "crawling here is a waste of time". I have never seen it in any way related to security.

      1. This post has been deleted by its author

  2. Dave Wray

    I was checking robots.txt before it was cool...

    Perhaps that's why tools such as Nikto have been reporting on robots.txt for about fifteen years.

  3. Lee D Silver badge

    This is old news.

    If you want to protect something from prying eyes, put it behind HTTP authentication or secured scripts. Google can't magically guess your passwords and index password protected areas.

    But listing something in robots.txt that you don't want indexed? That's like looking for the evil bit on an Internet packet. If you don't want random people indexing content, don't make that content available to them. Even the "Ah, but I block the GoogleBot" junk is useless - do you have any idea how many other bots are out there just indexing sites at random?

    If your robots.txt is used for anything other than "that's a large image folder and I'd rather it wasn't uploaded to Google over time for bandwidth reasons, but there's nothing sensitive in there", then you're giving yourself a false impression of safety.

    It's like leaving your file server open to the world but putting the "hidden" bit on the files...

    1. LaeMing
      Facepalm

      Yes.

      I always assumed robots.txt was just to flag parts of the tree that were not worth the robot's trouble. As both a courtesy to the search engines and to reduce bandwidth use on your web server. Nothing to do with security.

      Using it for security is like leaving your front door open and a note on the table saying "please don't nick anything in the bedroom".

      1. Anonymous Blowhard

        Re: Yes.

        Using it for security is like leaving your front door open and a note on the table saying "please don't nick anything in the bedroom".

        Or even ""please don't nick anything from the second drawer down, hidden under the socks, in the chest of drawers in the bedroom at the front of the house".

        1. Christoph

          Re: Yes.

          ""please don't nick anything from the second drawer down, hidden under the socks, in the chest of drawers in the bedroom at the front of the house".

          Where I have placed a mousetrap primed to go off when you stick your hand in there.

          As the article says, temporarily block any IP that tries to access that area.

          1. Anonymous Coward
            Anonymous Coward

            Re: Yes.

            Where I have placed a mousetrap primed to go off when you stick your hand in there.

            Exactly. Come into my parlour said the spider to the fly, just the thing to trap the hackers.

            1. LaeMing
              Alert

              Re: Yes.

              http://choppingblock.keenspot.com/d/20020718.html

      2. Graham 32

        Re: Yes.

        > Using it for security is like leaving your front door open and a note on the table saying "please don't nick anything in the bedroom".

        I used to work for a company that insisted all confidential information be stored in locked cabinets with a label on the cabinet saying "Contains confidential information."

        It was probably meant as a reminder for staff to lock the cabinet but obviously helped any would-be industrial spy.

        1. JeffUK

          Re: Yes.

          These days it's all in the big blue bin marked 'confidential shredding'... which someone comes every day to wheel out, without anyone checking their ID.....

        2. Anonymous C0ward

          But did the door say

          Beware of the Leopard?

      3. Daggerchild Silver badge
        Flame

        Re: Yes.

        I always assumed robots.txt was just to flag parts of the tree that were not worth the robot's trouble
        Or "Please don't spider this area as the backend app/database is so bad it will cripple the whole site"

  4. Alan Sharkey

    From robots.org - page created in 2007:

    There are two important considerations when using /robots.txt:

    • robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.

    • the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use.

    So don't try to use /robots.txt to hide information.

  5. Anonymous Coward
    Anonymous Coward

    I hope those that are using a honeypot driven ban hammer do reset their list occasionally. Wouldn't do to ban forever what may be a dynamic address. Other than that, always reflect on the possible misuses of any disclosure.

  6. Richard Lloyd

    Welcome to 'olds'

    I'm surprised this article hasn't died of old age considering its information has been known for 21 years (i.e. since the robots.txt standard was created in 1994). Yes, it will flag up some sensitive areas, but that's what IP/username/password/2-pass-auth (and so on) restrictions are for. Also note that hackers know where all the common CMSes have their admin interfaces (most installs don't change that), so they don't need robots.txt to find them.

    Although robots.txt can be ignored by "bad" spiders, it's often useful to stop spiders that do read it from battering your site (e.g. constantly hitting a script with varying parameters - a classic one is a calendar that hyperlinks every day even if there are no events, plus has nav going backwards and forwards for centuries in either direction :-) ).

  7. Anonymous Coward
    Anonymous Coward

    This was news back in 1996

    OMG really? No way!!!! If you're pentesting this is one of the first things you look at - sounds like someone just took their certified ethical hacker exam and had an epiphany.

    1. Keith Langmead

      Re: This was news back in 1996

      Perhaps this has lapsed into the realm of "well duh, everyone knows this, if you don't then what are you even doing here" and become something they don't even bother to teach any more. Thiebaud may currently be coming to the embarrassing realisation that he's effectively announced that water is wet! :-)

  8. Anonymous Coward
    Anonymous Coward

    Checksums?

    Perhaps robots.txt should include crc32 checksums of directories that should be excluded, so that there is no reference to the actual directory.
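
    A rough sketch of what that might mean in practice (purely hypothetical; no robots.txt standard supports hashed paths, and the directory names below are made up):

    import zlib

    def path_hash(path):
        """Return the CRC32 of a path as eight hex digits."""
        return format(zlib.crc32(path.encode("utf-8")) & 0xFFFFFFFF, "08x")

    # A crawler would have to hash every path it discovers and compare it
    # against the published digests, rather than reading the names directly.
    print(path_hash("/secret-admin/"))
    print(path_hash("/large-images/"))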

    1. phuzz Silver badge
      FAIL

      Re: Checksums?

      So every single web spider has to crawl your entire site, before dropping the directories that match the hash?

      It doesn't solve the problem that robots.txt exists for (telling a search engine which bits aren't worth indexing), and it doesn't solve the second problem either (it still flags which bits of your site you don't want indexed).

  9. ratfox
    Angel

    Google found the solution

    They have a http://www.google.com/humans.txt file.

    1. Anonymous Coward
      Anonymous Coward

      Re: Google found the solution

      I always found it insanely amusing that Google has a robots.txt file to stop robots from indexing Google...

  10. Crisp

    Robots.txt should not be a security measure.

    And to be honest, I'd be astounded if anyone thought it was.

  11. swisstoni

    It has its uses

    I've had pen testers fail one of the sites I look after just because it had a robots.txt file. They didn't look at its contents, just noted that it was there. Robots.txt is a useful tool for controlling the behaviour of legitimate crawlers, and it also makes it easy to identify those that ignore it and take remedial action.

    Needless to say, it added an extra sense of perspective to some of their other suggestions, most of which I considered bullshit too.

    1. JeffUK

      Re: It has its uses

      Those pen testers were clearly idiots then.

      Besides, since when have penetration tests been 'pass or fail'? Normally they return a long list of recommendations of varying degrees of severity. Do you mean they raised the existence of robots.txt as an issue? If so, what severity was it? If it was 'informational' I can see their point...

      1. SImon Hobson Bronze badge

        Re: It has its uses

        > Normally they return a long list of recommendations of varying degrees of severity

        And then the PHB sees "anything" flagged up on a report and demands that it be fixed - without consulting those whose area it impacts. I've been on the receiving end of this at a previous job ...

        In that case, it was assessors for our parent company's insurers. One thing they flagged up was that they expected to see the account and the terminal blocked after 3 failed logins. They didn't ask us; we weren't even aware they'd been in until "management" came along with a list of things we *must* fix.

        Had we been asked at the time, we'd have been able to point out that the OS didn't in fact have a means of locking an account like that (it was a looong time ago), and locking the "terminal" really really was a bad idea and was guaranteed to cause problems without adding any security. But we were instructed that we must do it, so we complied and waited.

        Sure enough, it wasn't long before the random "I can't log in" calls came in - from all over the company. You see, most users were on dynamic terminals (TCP sessions), one virtual line was blocked, and of course, once all the lower numbered lines were in use, that was the one that people hit when trying to log in. The only exception was if two people were logging in at once - when that locked line would be temporarily in use for a short time and allow others to log in on other lines.

        Sure enough, we were allowed to turn off that feature !

  12. wolfetone Silver badge

    I've always thought that Google and the other search engines should require you to submit the pages instead of crawling them. It'd be more secure, and it'd give them a better way of filtering the spammy "Buy a Cam Girl for one hour and receive 2 Viagra pills free".

    1. Anonymous Coward
      WTF?

      You've clearly never run a website of more than a few pages.

      Could you imagine how many people would need to be employed for the likes of the BBC or CNN, where pages are added every few minutes?

      BTW many search engines do allow you to add pages manually to make sure they get picked up.

      1. wolfetone Silver badge

        "You've clearly never run a website of more than a few pages."

        I have, several, and can you imagine how many pages I've had to prevent from being indexed by Google?

        For every page that's added to a website, it wouldn't take much to add that page to Google if the process was simple. Copy/paste URL to Google, click submit. Job done.

        And yes I know this, but it's not required any more, whereas before it was a benefit to your website if you did so. Google could take up to two weeks to index any change in your website, so it was better to inform them of a change instead of waiting for them to notice one.

        1. Indolent Wretch

          >> I have, several, and can you imagine how many pages I've had to prevent from being indexed by Google?

          Luckily for you there's a solution to this, just organize the site so that the pages you don't want indexed have one or a few common roots and then there's a special file you can put in your root.

          Can't remember what it's called though. Damn!

          1. wolfetone Silver badge

            "Luckily for you there's a solution to this, just organize the site so that the pages you don't want indexed have one or a few common roots and then there's a special file you can put in your root."

            Would love to say I hadn't thought of this, but it's not possible with my current employers. So it's easier to just leave the job than do that - which I'm doing.

    2. Robert Carnegie Silver badge

      "I've always thought that Google and the other search engines should require you to submit the pages instead of crawling them."

      I think you sort of can do that if you want to. To get up-to-date stuff indexed.

      robots.txt is more about keeping your dull stuff out of the index. Or stuff that Google may suspect of being illegitimate SEO work.

      https://support.google.com/webmasters/answer/35291?hl=en appears to be Google's standard advice on the question of "Do you need an SEO?"

  13. Paul Woodhouse

    or you can use it to ban the bad guys' IP addresses from your server, by including a file in it which, when accessed, writes an insulting message in an infinite loop and adds the IP address that requested it to your web server's blocked list.
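
    A minimal sketch of that sort of trap, assuming a small Flask app sitting behind the web server and a blocklist file that the firewall or server config actually acts on (the path names below are made up for illustration):

    # Hypothetical honeypot: robots.txt lists /no-entry/ as Disallowed and
    # nothing legitimate links to it, so anything fetching it is suspect.
    import time
    from flask import Flask, Response, request

    app = Flask(__name__)
    BLOCKLIST = "/var/tmp/blocked_ips.txt"  # assumed path, consumed elsewhere

    def taunt():
        # Stream an endless, slow "insulting message" to tie the client up.
        while True:
            yield "Go away, naughty robot.\n"
            time.sleep(10)

    @app.route("/no-entry/")
    @app.route("/no-entry/<path:anything>")
    def trap(anything=""):
        with open(BLOCKLIST, "a") as f:
            f.write(request.remote_addr + "\n")  # record the offender
        return Response(taunt(), mimetype="text/plain")

    if __name__ == "__main__":
        app.run()

    The blocklist file itself does nothing; something else (a firewall script, fail2ban, or the web server's own deny rules) would still have to read it and do the actual blocking.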

    1. Charlie Clark Silver badge

      A good honeypot should do little or no work itself but give the potential miscreant something they think is useful but is actually worthless and log any relevant information. In days gone by redirecting to a tarpit might have been an idea but now script-kiddies have almost limitless resources so it doesn't make sense any more.

      Don't think robots.txt is as good for this as some of the other files that are regularly looked for.

  14. Anonymous Coward
    Anonymous Coward

    Seems the next generation of computer scientists

    are doing their A levels ... it's 18 years since this was news to me, which suggests it's exam time.

  15. vagabondo
    Facepalm

    I don't understand

    Why would a robots.txt list the files/directories that spiders should avoid? Surely it would list the places that spiders are welcome to visit and use wildcard(s) to disallow everything else?

    Or have the other commentards here just got a better sense of irony than me?
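
    For what it's worth, an allow-list style is possible with the Allow directive (widely honoured by the big crawlers, though not part of the original 1994 convention); the directory names here are just examples:

    User-agent: *

    Allow: /public/

    Allow: /blog/

    Disallow: /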

    1. phuzz Silver badge
      Headmaster

      Re: I don't understand

      The short version is that when it was first invented it was easier to list what you didn't want indexed, rather than what you did.

      For the long version I'd start at the wikipedia entry.

      (From which I have just learnt that Charlie Stross claims to have caused the development of robots.txt by accidentally DOSing Martijn Koster's site with a home built web crawler).

    2. Tom 7

      Re: I don't understand

      robots.txt says which places you shouldn't visit but should also contain decoy areas that just fuck about with any robot that goes there. If you know how to monitor load from a page you can really piss them about when you have some spare grunt to keep them amused.

  16. Graham Marsden
    Holmes

    See icon...

    for details...!

  17. Anonymous Coward
    Anonymous Coward

    Usability != Security

    And robots.txt is about the former.

  18. amanfromMars 1 Silver badge

    Caveat Emptor ...... You are hereby forewarned and forearmed.

    The Internet/World Wide Webs are a free open space place, so don’t venture into/onto it with anything you want to hide or expect to remain secret.

    Claims to provide such effective security as renders information secret in those virtual space places, rather than searchable and liable to exposure are therefore to be assumed fraudulent and even criminal?

  19. JeffUK

    Lots of people are saying 'old news'; of course it is, but that raises the question: how do so many sites still get it so badly wrong?

    One genuine failing in the article is that it doesn't even mention the right solution, which is to use a Robots meta tag within resources you want to hide like

    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

    So the page will be ignored IF a spider finds it, but won't be advertised via the robots.txt file. This only works for HTML resources, though.

    The bigger 'take away', of course, is the fact that you should never rely on obscurity as your only security; if you don't want files to be accessible, block/control access on the server side.

  20. John Sanders

    My robots file ever since:

    # go away

    User-agent: *

    Disallow: /

    # That's all folks.

  21. Keith Langmead

    anti-sysadmin

    "Weksteen, a Securus Global hacker, thinks they offer clues about where system administrators store sensitive assets because the mention of a directory in a robots.txt file screams out that the owner has something they want to hide."

    Change "system administrators" to "developers" or "Wanabies" then perhaps you have a point. A SysAdmin by definition has access to the entire system, so has no need to store sensitive stuff within the web root! Your normal peon that's limited to an FTP login however doesn't have a choice. Us SysAdmins get enough crap already without people trying to blame us for dev faults.

  22. Spaceman Spiff

    I think this falls into the "no good deed goes unpunished" category! If there is stuff on your internet-facing systems you don't want "discovered", then DON'T PUT IT ON YOUR INTERNET-FACING SYSTEMS!

  23. GrumpenKraut

    Creative robots.txt

    # (the directory /stupid-bot/ does not exist):

    User-agent: *

    Disallow: /stupid-bot/

    This makes the bad crawlers pop up in your logs.

    A comment such as <!-- log stupid crawlers: <a href="STUPID-BOT">STUPID-BOT-TXT</a> --> in an HTML page (one they are supposed to see) shows which ones try harder than others to find things.
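
    A rough sketch of the log-watching side, assuming a combined-format access log where the client IP is the first field (the log path and decoy directory are just examples):

    # List client IPs that requested the decoy path advertised in robots.txt.
    LOG = "/var/log/nginx/access.log"
    DECOY = "/stupid-bot/"

    offenders = set()
    with open(LOG) as f:
        for line in f:
            if DECOY in line:
                offenders.add(line.split()[0])  # first field is the client IP

    for ip in sorted(offenders):
        print(ip)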

  24. Anonymous C0ward

    If you understand robots.txt

    you should understand .htaccess.
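
    For anyone making that jump, a minimal .htaccess sketch that puts a directory behind basic authentication (assuming Apache with mod_auth_basic enabled and a password file created with htpasswd) looks roughly like this:

    AuthType Basic

    AuthName "Restricted area"

    AuthUserFile /etc/apache2/.htpasswd

    Require valid-user

    Unlike robots.txt, that actually refuses the request rather than politely asking crawlers to stay away.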

  25. David 14

    This is news.... why?

    This is like saying.... "Security researcher finds that longer passwords work better than short ones".

    This is not new, nor should it be surprising to anyone in the IT field with even a passing interest in security or web servers.

    My $0.02
