A security researcher has devised a successful attack on a Google-owned system for blocking malicious scripts on web-based email services and other types of sites. The attack, described in a paper released Saturday, uses a combination of OCR, or optical character recognition, techniques and other methods to break reCAPTCHA, a …
The problem is....
"[it] was designed by researchers from Carnegie Mellon University as a way to solve two problems at once"
Now if they'd stuck to one or the other, they might not be in this position.
While the security should be absolute, you can have a degree of fuzziness in the other, especially if you are going to get multiple interpretations.
I saw a presentation by the lead professor from Carnegie Mellon at an event in Amsterdam. I found it really compelling. Unless I heard wrong, the description in the article doesn't tell the whole story. One of the two words is known good and one is unknown; the known-good word is distorted to match the distortion of the unknown word.
When the human attempts to find the two words the server only needs to match the one known word to make an approximation that the finder is human. The unknown word is read by the user and is checked against a probable list and then against re-use of that word. The important thing is that they aren't trying to solve the problem of human authentication, they are just replacing crude CAPTCHA word distortion with something that is useful.
Traditional CAPTCHA just distorts text badly and then hopes you'll identify the text and attempts to stop simple OCR. ReCAPTCHA is achieving the same thing but also using that labour to identify words that can't be matched by OCR. They are matching text from public documents that need transcribing, they have been working on early copies of the New York Times which couldn't be recognised completely.
They are doing something that is simple, they aren't trying to be "the answer". In addition their use of statistics means they can quickly identify people "playing the system".
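The check described in this comment can be sketched roughly as follows. This is a minimal illustration of the commenter's description, not reCAPTCHA's actual code: the function names, the in-memory `votes` store and the `min_votes` threshold are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory store of guesses for each not-yet-transcribed word image.
votes = defaultdict(Counter)


def check_response(control_word, control_answer, unknown_id, unknown_answer):
    """Pass or fail on the known (control) word only; record the other word as a vote."""
    passed = control_answer.strip().lower() == control_word.lower()
    if passed:
        # Only count transcriptions from users who got the control word right.
        votes[unknown_id][unknown_answer.strip().lower()] += 1
    return passed


def consensus(unknown_id, min_votes=3):
    """Return the most common transcription once it has at least min_votes votes."""
    if not votes[unknown_id]:
        return None
    word, count = votes[unknown_id].most_common(1)[0]
    return word if count >= min_votes else None
```

Because the unknown word is only accepted as a transcription once several independent users agree, a single user typing nonsense for it has little effect in this sketch.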
reCaptcha's not the only thing broken...
If I'm reading the article correctly they got a total of 22.5 right answers out of a possible 200. 10 solved, plus half of the 25 where one word was right and would also be considered solved (12.5). That's a total success rate of 12.5% isn't it - 22.5 out of 200?
No, it isn't.
11.25 is half of 22.5, not 12.5. Since you ask.
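For anyone who wants to check the sums, here is the arithmetic using only the figures quoted in the comments above (200 challenges, 10 fully solved, 25 with one word right). The variable names are made up for the example, and the "half of the one-word cases count" step is the assumption being debated, not an established fact.

```python
# Figures as quoted in the comments above, not taken from the paper itself.
total_challenges = 200
both_words_right = 10
one_word_right = 25

# Assumption under debate: half of the one-word cases had the known word right.
credited = both_words_right + one_word_right / 2   # 10 + 12.5 = 22.5
optimistic_rate = credited / total_challenges      # 22.5 / 200 = 0.1125, i.e. 11.25%

# Counting only the challenges where both words were right.
strict_rate = both_words_right / total_challenges  # 10 / 200 = 0.05, i.e. 5%
```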
I'm sorry, but you can't simply split your statistics by stating: "If we presume that in half of the cases the failed word would be the unknown word for reCAPTCHA..." and expect double success rate.
1) The "unknown" word made the "unknown" list because it was mangled (see definition: foobar) in the first place, which makes it less likely that a hack attempt at OCR will do much better.
2) The original text was more-or-less easily OCRed, and thus removing add-on distortion would grant a better success rate.
Either way, I saw no reference to state whether the "correct" responses were to the "unknown" or the "known" text, which throws off the possibility further. One would then have to assume all "correct" answers were to the "known" word, giving a 9% (5%+1in25[4%]) worst-case scenario in the tested data-set.
Which is all a moot point considering the dataset is two years out of date...
Just stick with a grid of 9 images and have the users click on the "puppy" (with minor distortions to throw off checksum or color/location sampling hacks).
Re: Missing something
The problem with just 9 images is that a bot then has a 1 in 9 chance of guessing correctly, and that's without any image analysis. That's an 11% success rate, which would allow a botnet to create millions of addresses per day.
Image captchas work best when a single image is shown and the user is asked to identify it by typing in what it is. The program has to allow for possible variations; e.g. for a picture of a car, the program would accept car, vehicle, sedan, coupe, ute, pickup etc. A student at my college developed a system like this a few years ago, although I've only seen it used in a few places. It had a database of around 20,000 pictures of easily identifiable objects (dogs, cats, cars, fruits, trees, boats and such) and the user was asked to identify 3 pictures in a row. I thought at the time that it also made a semi-decent intelligence filter, since if a user couldn't spell words like "tree" or "apple" they shouldn't be registering with the site anyway!
I started work on a similar idea a while ago - I collected a set of easily recognisable animal pictures (cow, sheep, duck etc.), presented one at random and had the user type in its name before allowing form submission. As you say, unless you provide a set of prepopulated options in a drop-down list it's best to build in fuzzy matching so that minor spelling mistakes, homonyms etc. are handled. It also occurred to me that, while obvious to an Eskimo, an aboriginal Australian might never have heard of a penguin or polar bear before.
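A rough sketch of that kind of fuzzy matching, using Python's standard `difflib` module. The picture names, the accepted-label lists and the 0.8 cutoff are assumptions for illustration only; neither system described above is known to work this way.

```python
import difflib

# Hypothetical answer key: each picture maps to the labels that are accepted.
ACCEPTED = {
    "car.jpg": ["car", "vehicle", "sedan", "coupe", "ute", "pickup"],
    "penguin.jpg": ["penguin"],
}


def is_match(picture, typed, cutoff=0.8):
    """Accept the typed answer if it is close enough to any accepted label."""
    guess = typed.strip().lower()
    close = difflib.get_close_matches(guess, ACCEPTED[picture], n=1, cutoff=cutoff)
    return bool(close)
```

With these settings `is_match("penguin.jpg", "penguine")` is accepted, while `is_match("car.jpg", "cat")` is rejected. Synonyms are handled by the label lists; homonyms would need their own entries, since string similarity alone does not catch them.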
I installed reCaptcha on a system a while back but found -
1. A high percentage of the words presented were incomprehensible and from the end user point of view downright confusing.
2. On a secure private system I didn't like the idea of notifying a remote untrusted API every time someone logged in.
Actually, Australia and New Zealand are home to a major chunk of the World's penguin population!
check out research.microsoft.com/en-us/um/redmond/projects/asirra for a solution that already does this pretty well...
[Paris: coz she can pet my puppy!]
Yes, fuzzy matching typed response is certainly a lot better than having a drop-down list, because this again reduces the number of options a bot has to choose from, thereby increasing the chance that the bot will guess correctly. As is pointed out in the article, even a 1% chance gives bots an unacceptable level of success when you're talking about a botnet of several thousand machines making thousands of attempts every second. To conquer this, you really need to set the captcha to give at least a 1 in several million chance of success by brute forcing.
Consider this: if you have a botnet of just 5000 machines each making one attempt per second, that's 5000 attempts per second, which is 18 million attempts per hour. That means that even if your captcha creates a 1 in 18 million chance of success by brute forcing, that's statistically 1 success per hour. You can probably fight this if you IP block any machine that transmits more than say 3 unsuccessful attempts, or more than 3 attempts in less than a minute, for 24 hours.
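The botnet arithmetic above is easy to reproduce. This is just the multiplication spelled out; the function name is made up for the example, and it assumes every attempt is an independent guess with the same chance of success.

```python
def expected_successes_per_hour(machines, attempts_per_second, success_probability):
    """Expected number of correct guesses per hour across the whole botnet."""
    attempts_per_hour = machines * attempts_per_second * 3600
    return attempts_per_hour * success_probability


# 5000 machines at one attempt per second is 18 million attempts per hour.
one_in_18_million = expected_successes_per_hour(5000, 1, 1 / 18_000_000)  # about 1 per hour
one_in_nine = expected_successes_per_hour(5000, 1, 1 / 9)                 # about 2 million per hour
```

The second figure shows why a nine-image grid with a 1 in 9 guess rate does so badly against the same botnet unless attempts are also rate limited.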
Keeping the pictures easily recognisable across a wide range of cultures was one of the challenges my student faced. Things like polar bears and penguins would perhaps have been a bit esoteric for some cultures, but commonplace ones like cats, dogs, trees and cars are pretty much recognisable by anyone with access to the Internet.
Finally, I also agree with your points 1 and 2; I also don't like invoking third-party systems on my clients' websites, which is why we wrote all our own tracking, statistics and captcha systems. Not only is the site's behaviour more controllable, it's more secure, it's faster for the visitor and doesn't clutter up the menus of protection addons like NoScript with a slew of domain names.
The Lone Tone
"Also diluting its effectiveness, the system accepts 'off-by-one' errors such as 'lone' instead of 'tone.'"
If only my teachers 35 years ago would have been so forgiving.... Heck, even if my spill chucker were that forgiving.
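For illustration, here is one way an 'off-by-one' tolerance like the one quoted could be implemented. It assumes "off by one" means a single wrong character in a word of the right length; the article does not say exactly how reCAPTCHA defines it, and the function name is hypothetical.

```python
def within_one_substitution(answer, expected):
    """True if the answer has the expected length and at most one wrong character."""
    a, e = answer.strip().lower(), expected.lower()
    if len(a) != len(e):
        return False
    mismatches = sum(1 for x, y in zip(a, e) if x != y)
    return mismatches <= 1
```

Under this rule "lone" passes for "tone", but so does any other one-letter slip, which is why the tolerance makes life easier for an OCR-based attack as well as for humans.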
Character recognition is too easy.
Computer attacks are not the only worry about CAPTCHAs; they are also attacked by the "million monkeys" method, i.e., having a large amount of cheap labour guessing CAPTCHAs. The point being that humans are quite good at recognising distorted letters, even if they are Chinese with no knowledge of the English language -- after all, there are only fifty-odd letters even if you include both capitals and lower case, and you can see them all on the keyboard (if you add the lower-case variants to the keys).
So you need a test that requires skill similar to that required to actually use the web site. If the website is in English, a basic knowledge of English could be required, but you would still want to avoid machine attacks.
So my suggestion is to use word-to-picture matching. Set up a cluttered picture like those from the "I Spy" books and require users to click on two or three named objects. To set up the test, you need a human to name and point out about a dozen objects on each picture, so you can pick random subsets of these for each user. With just a few dozen pictures, the number of combinations is enormous, and it requires both language knowledge and visual recognition (of a kind that is much harder than OCR) to pass the test.
In Asia, people with enough knowledge of English to pass the test would cost a lot more to hire than the unskilled workers currently used to break CAPTCHAs.
Doesn't this mean....
If the boffin has created a way of 'cracking' reCAPTCHAs, surely this means that he has actually come up with a better method of OCR which can understand all those words that current OCR techniques cannot.
Great! So, Google just need to use the algorithm when scanning books, and come up with a new method of distinguishing between humans and computers.
A simple attack
Just guess every time that the two words are "korean mxyzptlk". One time in a hundred thousand or a million, the first word actually will be "korean", and they don't know what the other word really is anyway so they'll accept "mxyzptlk". Then you're in.
It may be slightly more complicated in reality. I haven't seen a Google recaptcha, the last captcha that I did see of theirs was using pronounceable non-words like "drooble", possibly chosen from pieces of real words, and tying them in elastic knots: I think my own solve rate was 80 per cent or less. A similar approach solves the Korean problem, or of course they can use two or more known words and one or more wanted misprint word. They also can disallow the unknown word or words if OCR or other users don't match your version, i.e. they can disable a GMail account or whatever if they think it was originally set up with invalid input.
Practice your reCAPTCHA skills here:
Call me a weirdo, but in filling in these reCAPTCHAs, I've discovered more than a few that would make excellent band names. Here are some of my favourites:
And my personal favourite:
Any budding musicians stuck for a snappy band name would be better off trying reCAPTCHA rather than one of the other silly band name generators on the net!
Time to devise different tests- you're starting to see captchas getting so ludicrous that humans can't read them. I started worrying that it was my tiny brain- but I saved some troublesome ones and inflicted them on other people (aren't I a blast at parties?), and they agreed that they were da poop.
Thus, yes, fluffy kitten tests or whatever else please, it's getting stupid. The balance is shifting too far from usability. At least one site's captcha was so irritating that I just didn't bother using it- and others must be as puny as me, in that respect.
Why can't they do something along the same lines as Dr Kawashima's Brain training on the Nintendo DS? You see lots of different coloured numbers on the screen (some moving, some rotating, etc) and have to say how many of a certain type you can see.
This could be simplified to a random number of coloured shapes/blobs, but the idea would still be the same - "how many red blobs can you see?". The important thing also would be to only allow one attempt before changing the image and starting over - and finally, it's three strikes and you're out - you can't apply again from that Computer/IP address for 1 hour.
Would that work?
There is no charge for my consultancy - just to see the spammers thwarted is reward enough!
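The "three strikes and you're out for an hour" rule suggested above could look something like this. It is a minimal in-memory sketch: the class and method names are invented for the example, and a real site would need shared storage and would have to cope with many users sitting behind one IP address.

```python
import time

MAX_STRIKES = 3          # failed attempts allowed before the lockout
LOCKOUT_SECONDS = 3600   # one hour, as suggested in the comment


class StrikeTracker:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._strikes = {}       # ip -> number of failed attempts
        self._locked_until = {}  # ip -> clock value at which the lockout ends

    def is_locked(self, ip):
        until = self._locked_until.get(ip)
        if until is None:
            return False
        if self._clock() >= until:
            # Lockout has expired: forget the address and start afresh.
            del self._locked_until[ip]
            self._strikes.pop(ip, None)
            return False
        return True

    def record_failure(self, ip):
        count = self._strikes.get(ip, 0) + 1
        self._strikes[ip] = count
        if count >= MAX_STRIKES:
            self._locked_until[ip] = self._clock() + LOCKOUT_SECONDS

    def record_success(self, ip):
        self._strikes.pop(ip, None)
```

The caller would check `is_locked(ip)` before serving a new image and call `record_failure(ip)` after each wrong answer, issuing a fresh image every time so that no image can be attempted twice.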
David, thank you for excluding the colour blind.
Thank you for devising a new system that doesn't allow colour blind people onto the Internet. They're twats. My boss is colour blind for a start. And the other day I saw a little lad looking at a display of second hand mobile phones and asking a friend what colour each one was - he didn't want to get a pink one for instance and look gay as well, at least I assume so. You can understand that but why are these people allowed out in the first place?
Bad enough that many Captcha tools have a little wheelchair icon to click for special treatment. Fortunately people who are completely blind can't see it anyway, so they don't get to make a nuisance of themselves.
reCAPTCHA is NOT Broken!
Read the article. It says the images in this "new" attack were collected in 2008. It took this attacker more than one year to come up with a way to read the images, and in the meantime reCAPTCHA has changed their distortions multiple times. I read the attacker's paper, and the reCAPTCHA images look completely different nowadays (for example, there is no line through the words). Stop spreading misinformation.
my success rate is only marginally better
They keep making these tests harder, so hard that I now have problems solving them, and last time I checked I was a human. It seems that recently I have only managed a 50% success rate.
identify the picture techniques
How long would it take the bots to download the 20,000 images, hand them off to 100 Chinese (or porn downloaders to identify for you), and 4 hours later your system is totally broken by a DB of matches made by humans? Now introduce some random transforms/noise/artifacts to the images and the bots can't do perfect matches, they're back to educated guesses. Maybe make it a series of pictures to increase your odds... Paris, Coat, Joke Alert.
PS, does this mean I can screw up old Times articles by always putting "jackass" as the second word?
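The lookup-table attack and the noise countermeasure from the comment above can be shown in a few lines. This sketch assumes the attacker matches images by an exact checksum (SHA-256 here) and uses made-up bytes in place of a real image; an attacker using similarity-based matching would need stronger distortions than a single flipped bit.

```python
import hashlib


def fingerprint(image_bytes):
    """Exact-match fingerprint of the kind a simple lookup table might use."""
    return hashlib.sha256(image_bytes).hexdigest()


# Stand-in for real image data; the attacker's table is built from the original bytes.
original = bytes(range(256)) * 4
lookup = {fingerprint(original): "puppy"}


def add_noise(image_bytes, position=10):
    """Flip one bit of one byte, standing in for a small random transform."""
    noisy = bytearray(image_bytes)
    noisy[position] ^= 0x01
    return bytes(noisy)


served = add_noise(original)
hit = lookup.get(fingerprint(original))  # the unmodified image is found in the table
miss = lookup.get(fingerprint(served))   # the altered image no longer matches
```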
Learning about CAPTCHA technology.
I've only begun my study of CAPTCHA technology, but I notice some comments here are contrary to what I've read so far. I'm collecting references from people who have been at this longer than I have, so I would be grateful for any help provided here. Feel free to provide this offline on my website at http://Asirra-Plus.com. Thank you.
...every single time I'm confronted by use of reCAPTCHA, I can quickly identify the word the computer didn't catch. And instead of putting the proper answer, I've always ended up putting in something like "buttpirate," "kiddydiddler," or "smegma." I always get a good chuckle when I imagine that my contributions will someday go into a good ol' fashioned cook book.
Buttpirate Kiddydiddler cookies:
1 tablespoon smegma oil
1 (18.25 ounce) package anal-lube
1/2 cup semisweet Deadbaby
Mine's the one with a book that was checked by a real editor...