Robots.txt tells hackers the places you don't want them to look

Melbourne penetration tester Thiebaud Weksteen is warning system administrators that robots.txt files can give attackers valuable information on potential targets by giving them clues about directories their owners are trying to protect. Robots.txt files tell search engines which directories on a web server they can and cannot …


It's nice to have regular recalls, but...

Didn't we have that discussion back before Y2K?


Re: It's nice to have regular recalls, but...

I was just going to post the same thing; this is hardly a new revelation.

Any sysadmin who manages web servers has to balance the plaintive cries of SEO consultants (who want robots.txt) with those of Security consultants (who don't)...


Re: It's nice to have regular recalls, but...

Yup - I have always seen robots.txt as a courtesy to "good" bots to say "crawling here is a waste of time". I have never seen it in any way related to security.


This post has been deleted by its author

I was checking robots.txt before it was cool...

Perhaps that's why tools such as Nikto have been reporting on robots.txt for about fifteen years.


This is old news.

If you want to protect something from prying eyes, put it behind HTTP authentication or secured scripts. Google can't magically guess your passwords and index password-protected areas.

But listing something in robots.txt that you don't want indexed? That's like looking for the evil bit on an Internet packet. If you don't want random people indexing content, don't make that content available to them. Even the "Ah, but I block the GoogleBot" junk is useless - do you have any idea how many other bots are out there just indexing sites at random?

If your robots.txt is used for anything other than "that's a large image folder and I'd rather it wasn't uploaded to Google over time for bandwidth reasons, but there's nothing sensitive in there", then you're giving yourself a false impression of safety.

It's like leaving your file server open to the world but putting the "hidden" bit on the files...
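For reference, a minimal .htaccess sketch of that HTTP authentication approach (Apache assumed; the password-file path is illustrative, and accounts are created with the htpasswd utility):

AuthType Basic
AuthName "Members only"
AuthUserFile /etc/apache2/.htpasswd
Require valid-user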


Yes.

I always assumed robots.txt was just to flag parts of the tree that were not worth the robot's trouble, both as a courtesy to the search engines and to reduce the bandwidth hit on your web server. Nothing to do with security.

Using it for security is like leaving your front door open and a note on the table saying "please don't nick anything in the bedroom".


Re: Yes.

> Using it for security is like leaving your front door open and a note on the table saying "please don't nick anything in the bedroom".

Or even "please don't nick anything from the second drawer down, hidden under the socks, in the chest of drawers in the bedroom at the front of the house".


Re: Yes.

""please don't nick anything from the second drawer down, hidden under the socks, in the chest of drawers in the bedroom at the front of the house".

Where I have placed a mousetrap primed to go off when you stick your hand in there.

As the article says, temporarily block any IP that tries to access that area.
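A sketch of how that temporary block might be wired up with fail2ban (the trap path, filter name and log path are all assumptions):

# /etc/fail2ban/filter.d/robots-trap.conf
[Definition]
failregex = ^<HOST> .* "GET /secret-stuff/

# /etc/fail2ban/jail.local
[robots-trap]
enabled  = true
port     = http,https
filter   = robots-trap
logpath  = /var/log/apache2/access.log
maxretry = 1
bantime  = 3600

One hit on the decoy path earns an hour in the sin bin, and because the ban expires on its own it also answers the dynamic-address concern raised further down.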


Re: Yes.

> Where I have placed a mousetrap primed to go off when you stick your hand in there.

Exactly. "Come into my parlour," said the spider to the fly; just the thing to trap the hackers.


Re: Yes.

> Using it for security is like leaving your front door open and a note on the table saying "please don't nick anything in the bedroom".

I used to work for a company that insisted all confidential information be stored in locked cabinets with a label on the cabinet saying "Contains confidential information."

It was probably meant as a reminder for staff to lock the cabinet, but it obviously helped any would-be industrial spy.


Re: Yes.

These days it's all in the big blue bin marked 'confidential shredding'... which someone comes to wheel out every day, without anyone checking their ID.


Re: Yes.

> I always assumed robots.txt was just to flag parts of the tree that were not worth the robot's trouble

Or "Please don't spider this area as the backend app/database is so bad it will cripple the whole site".


But did the door say

Beware of the Leopard?


Re: Yes.

http://choppingblock.keenspot.com/d/20020718.html


From robotstxt.org (page created in 2007):

There are two important considerations when using /robots.txt:

• robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.

• the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use.

So don't try to use /robots.txt to hide information.


I hope those that are using a honeypot-driven ban hammer do reset their list occasionally. It wouldn't do to ban forever what may be a dynamic address. Other than that, always reflect on the possible misuses of any disclosure.


Welcome to 'olds'

I'm surprised this article hasn't died of old age, considering its information has been known for 21 years (i.e. since the robots.txt standard was created in 1994). Yes, it will flag up some sensitive areas, but that's what IP, username/password, two-factor auth (and so on) restrictions are for. Also note that hackers know where all the common CMSes have their admin interfaces (most installs don't change that), so they don't need robots.txt to find them.

Although robots.txt can be ignored by "bad" spiders, it's often useful for stopping spiders that do read it from battering your site (e.g. constantly hitting a script with varying parameters; a classic one is a calendar that hyperlinks every day even if there are no events, plus has nav going backwards and forwards for centuries in either direction :-) ).
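A snippet as small as this spares the well-behaved spiders that particular tar pit (the path is just an example):

User-agent: *
Disallow: /calendar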

Anonymous Coward

This was news back in 1996

OMG really? No way!!!! If you're pentesting, this is one of the first things you look at. Sounds like someone just took their Certified Ethical Hacker exam and had an epiphany.


Re: This was news back in 1996

Perhaps this has lapsed into the realm of "well duh, everyone knows this, if you don't then what are you even doing here" and become something they don't even bother to teach any more. Thiebaud may currently be coming to the embarrassing realisation that he's effectively announced that water is wet! :-)

Anonymous Coward

Checksums?

Perhaps robots.txt should include CRC32 checksums of directories that should be excluded, so that there is no reference to the actual directory.
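As a sketch of what that proposal might look like on the publishing side (purely hypothetical; no crawler actually supports hashed entries, as the reply below points out):

import zlib

def path_checksum(path):
    # CRC32 of the excluded path, rendered as an 8-digit hex string.
    return format(zlib.crc32(path.encode("utf-8")) & 0xffffffff, "08x")

print(path_checksum("/secret-admin/"))  # publish this instead of the path itself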


Re: Checksums?

So every single web spider has to crawl your entire site, before dropping the directories that match the hash?

It doesn't solve the problem that robots.txt exists for (telling a search engine which bits aren't worth indexing), and it doesn't really solve the second problem either (you're still flagging that there are parts of your site you don't want indexed).


Google found the solution

They have an http://www.google.com/humans.txt file.

Anonymous Coward

Re: Google found the solution

I always found it insanely amusing that Google has a robots.txt file to stop robots from indexing Google...


Robots.txt should not be a security measure.

And to be honest, I'd be astounded if anyone thought it was.


It has its uses

I've had pen testers fail one of the sites I look after, just because it had a robots.txt. They didn't look at the contents of it, just that it was there. Robots.txt is a useful tool for controlling the behaviour of legitimate crawlers; it also makes it easy to identify those that ignore it and take remedial action.

Needless to say, it added an extra sense of perspective to some of their other suggestions, most of which I considered bullshit too.
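As a rough sketch of the "identify those that ignore it" part, assuming a common-log-format access log (the disallowed prefix and log path are examples):

import re
from collections import defaultdict

DISALLOWED = ("/private/",)  # prefix your robots.txt disallows (example)
LOGLINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "GET (\S+)')

fetched_robots, offenders = set(), defaultdict(list)
with open("access.log") as log:
    for line in log:
        m = LOGLINE.match(line)
        if not m:
            continue
        ip, path = m.groups()
        if path == "/robots.txt":
            fetched_robots.add(ip)
        elif ip in fetched_robots and path.startswith(DISALLOWED):
            offenders[ip].append(path)  # read the rules, then ignored them

for ip, paths in offenders.items():
    print(ip, "made", len(paths), "disallowed requests")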


Re: It has its uses

Those pen testers were clearly idiots then.

Besides, since when have penetration tests been 'pass or fail'? Normally they return a long list of recommendations of varying degrees of severity. Do you mean they raised the existence of robots.txt as an issue? If so, what severity was it? If it was 'informational' I can see their point...


Re: It has its uses

> Normally they return a long list of recommendations of varying degrees of severity

And then the PHB sees "anything" flagged up on a report and demands that it be fixed, without consulting those whose area it impacts. I've been on the receiving end of this at a previous job...

In that case, it was assessors for our parent company's insurers. One thing they flagged up was that they expected to see both the account and the terminal blocked after 3 failed logins. They didn't ask us; we weren't even aware they'd been in until "management" came along with a list of things we *must* fix.

Had we been asked at the time, we'd have been able to point out that the OS didn't in fact have a means of locking an account like that (it was a looong time ago), and locking the "terminal" really really was a bad idea and was guaranteed to cause problems without adding any security. But we were instructed that we must do it, so we complied and waited.

Sure enough, it wasn't long before the random "I can't log in" calls came in, from all over the company. You see, most users were on dynamic terminals (TCP sessions); one virtual line was blocked, and of course, once all the lower-numbered lines were in use, that was the one people hit when trying to log in. The only exception was when two people were logging in at once, so the locked line was temporarily in use and others could log in on other lines.

Sure enough, we were allowed to turn off that feature!


I've always thought that Google and the other search engines should require you to submit the pages instead of crawling them. It'd be more secure, and it'd give them a better way of filtering the spammy "Buy a Cam Girl for one hour and receive 2 Viagra pills free".


You've clearly never run a website of more than a few pages.

Could you imagine how many people would need to be employed for the likes of the BBC or CNN, where pages are added every few minutes?

BTW many search engines do allow you to add pages manually to make sure they get picked up.
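One way they support that today is the sitemap protocol, which you can even advertise from robots.txt itself (the URL is an example):

Sitemap: https://www.example.com/sitemap.xml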


"You've clearly never run a website of more than a few pages."

I have, several, and can you imagine how many pages I've had to prevent from being indexed by Google?

For every page that's added to a website, it wouldn't take much to add that page to Google if the process was simple. Copy/paste URL to Google, click submit. Job done.

And yes, I know this, but it's not required any more, whereas before it was a benefit to your website if you did so. Google could take up to two weeks to index any change to your website, so it was better to inform them of a change instead of them waiting to see if one happened.


>> I have, several, and can you imagine how many pages I've had to prevent from being indexed by Google?

Luckily for you there's a solution to this, just organize the site so that the pages you don't want indexed have one or a few common roots and then there's a special file you can put in your root.

Can't remember what it's called though. Damn!


"Luckily for you there's a solution to this, just organize the site so that the pages you don't want indexed have one or a few common roots and then there's a special file you can put in your root."

Would love to say I hadn't thought of this, but it's not possible with my current employers. So it's easier to just leave the job than do that - which I'm doing.


"I've always thought that Google and the other search engines should require you to submit the pages instead of crawling them."

I think you sort of can do that if you want to, to get up-to-date stuff indexed.

robots.txt is more about keeping your dull stuff out of the index. Or stuff that Google may suspect of being illegitimate SEO work.

https://support.google.com/webmasters/answer/35291?hl=en appears to be Google's standard advice on the question of "Do you need an SEO?"


Or you can use it to ban the bad guys' IP addresses from your server, by including a path in it which, when accessed, writes an insulting message in an infinite loop and adds the IP address that requested it to your web server's block list.
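A minimal sketch of the banning half of that idea (Flask assumed; the trap path and blocklist file are made up, and the insulting infinite loop is left as an exercise):

from flask import Flask, request

app = Flask(__name__)
BLOCKLIST = "/var/run/banned-ips.txt"  # hypothetical file your firewall consumes

@app.route("/stupid-bot/")
@app.route("/stupid-bot/<path:rest>")
def trap(rest=""):
    # Anything requesting a path advertised only in robots.txt gets recorded.
    with open(BLOCKLIST, "a") as f:
        f.write(request.remote_addr + "\n")
    return "Go away.", 403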


A good honeypot should do little or no work itself, but give the potential miscreant something they think is useful but is actually worthless, and log any relevant information. In days gone by, redirecting to a tarpit might have been an idea, but now script kiddies have almost limitless resources, so it doesn't make sense any more.

I don't think robots.txt is as good for this as some of the other files that are regularly looked for.

Anonymous Coward

Seems the next generation of computer scientists

are doing their A levels... it's been 18 years since this was news to me, which seems to suggest it's exam time.


I don't understand

Why would a robots.txt list the files/directories that spiders should avoid? Surely it would list the places that spiders are welcome to visit and use wildcard(s) to disallow everything else?

Or have the other commentards here just got a better sense of irony than me?
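For what it's worth, you can get close to an allow-list with the Allow directive, which is a later extension honoured by the major crawlers rather than part of the original 1994 convention; a sketch:

User-agent: *
Allow: /public/
Disallow: /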


Re: I don't understand

The short version is that when it was first invented it was easier to list what you didn't want indexed, rather than what you did.

For the long version I'd start at the Wikipedia entry.

(From which I have just learnt that Charlie Stross claims to have caused the development of robots.txt by accidentally DoSing Martijn Koster's site with a home-built web crawler.)


Re: I don't understand

robots.txt says which places you shouldn't visit, but it should also contain decoy areas that just fuck about with any robot that goes there. If you know how to monitor load from a page, you can really piss them about when you have some spare grunt to keep them amused.

Holmes

See icon...

for details...!

Anonymous Coward

Usability != Security

And robots.txt is about the former.


Caveat Emptor ...... You are hereby forewarned and forearmed.

The Internet/World Wide Webs are a free open space place, so don’t venture into/onto it with anything you want to hide or expect to remain secret.

Claims to provide such effective security as renders information secret in those virtual space places, rather than searchable and liable to exposure are therefore to be assumed fraudulent and even criminal?


Lots of people are saying "old news"; of course it is, but that raises the question of how so many sites still get it so badly wrong.

One genuine failing in the article is that it doesn't even mention the right solution, which is to use a robots meta tag within resources you want to hide, like:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

So the page will be ignored IF a spider finds it, but won't be advertised via the robots.txt file. This only works for HTML resources, though.

The bigger 'take away', of course, is that you should never rely on obscurity as your only security: if you don't want files to be accessible, block/control access on the server side.
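For non-HTML resources, the same noindex hint can be sent as an X-Robots-Tag response header instead; a sketch for Apache with mod_headers (the file pattern is an example):

<FilesMatch "\.(pdf|docx?)$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>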


My robots file ever since:

# go away
User-agent: *
Disallow: /
# That's all folks.


anti-sysadmin

"Weksteen, a Securus Global hacker, thinks they offer clues about where system administrators store sensitive assets because the mention of a directory in a robots.txt file screams out that the owner has something they want to hide."

Change "system administrators" to "developers" or "Wanabies" then perhaps you have a point. A SysAdmin by definition has access to the entire system, so has no need to store sensitive stuff within the web root! Your normal peon that's limited to an FTP login however doesn't have a choice. Us SysAdmins get enough crap already without people trying to blame us for dev faults.


I think this falls into the "no good deed goes unpunished" category! If there is stuff on your internet-facing systems you don't want "discovered", then DON'T PUT IT ON YOUR INTERNET-FACING SYSTEMS!


Creative robots.txt

# (the directory /stupid-bot/ does not exist):
User-agent: *
Disallow: /stupid-bot/

This makes the bad crawlers pop up in your logs.

A comment such as <!-- log stupid crawlers: <a href="STUPID-BOT">STUPID-BOT-TXT</a> --> in an HTML page (one they are supposed to see) shows which ones try harder than others to find things.


If you understand robots.txt

you should understand .htaccess.
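For instance, the .htaccess that actually keeps a directory private, rather than merely unadvertised (Apache 2.4 syntax assumed; the address range is an example):

# Deny everyone...
Require all denied
# ...or restrict to a trusted range instead:
# Require ip 192.0.2.0/24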


This is news.... why?

This is like saying.... "Security researcher finds that longer passwords work better than short ones".

This is not new, nor should it be surprising to anyone in the IT field with even a passing interest in security or web servers.

My $0.02

