Two of the programs submitted in the first round of a competition to find the next cryptographic hash standard contain buffer overflow errors that could make them prone to crashes and security problems. The discovery, by researchers at Fortify Software, just goes to show that even the world's foremost security geeks can …
But the Corvair was really a good automobile!
G(enerous) M(otors)'s Corvair (my aunt had one in the 60's) was really a good vehicle IF (big if!) you actually knew how to drive one. Almost as good as an early Porsche, which one can very easily slide around corners (yes, I've done it, and even got a ticket for doing it). Just back off on the gas and the rear end lets loose. Accelerate and all is well with the world.
The problem with the Corvair (and C for that matter) is that the problems are usually with the "driver". Please note that the problems the presidential candidate found were fixed (anti-sway bar) before the book was published. Oh, you can roll just about any vehicle (I should know, I've done it twice!).
C = Worst programming language ever created.
This is hardly surprising. C is one of the worst computer languages ever created. Conceived by morons, implemented by incompetents and loved by sycophants.
It is beyond me why, a gazillion years after this DOS of programming languages was vomited upon the world, none of the fools who use it have had the brain power to implement smart pointers - first in software - and then in hardware - in order to prevent buffer overflows.
Smart pointers would hold both the starting position of a table, along with the ending pointer, and any reference outside of that range by the pointer would result in an array bounds exception.
Once implemented in hardware, there would be <ZERO> speed penalty as the comparison can be done in parallel with the pointer computation.
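Software versions of this idea do exist (fat pointers, and more recently hardware capability research such as CHERI). A minimal C sketch of what the poster describes, with the struct and function names invented for this example:

```c
#include <stddef.h>

/* Hypothetical bounds-carrying "smart pointer", as described above:
 * it holds the start and one-past-the-end of the array it may touch.
 * The names bounded_ptr and bp_at are made up for this sketch. */
typedef struct {
    int *base;  /* first valid element */
    int *end;   /* one past the last valid element */
} bounded_ptr;

/* Checked element access: returns NULL instead of letting the
 * reference stray outside [base, end). In hardware, this compare
 * could indeed run in parallel with the address computation. */
static int *bp_at(bounded_ptr bp, size_t idx) {
    if (bp.base + idx >= bp.end)
        return NULL;  /* out of bounds: the "array bounds exception" */
    return bp.base + idx;
}
```

The cost in software is one compare per access; whether hardware makes it truly free is the poster's claim, not something this sketch demonstrates.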
Is GETS still part of the C standard library?
Ahahahahahahahaahahh. It speaks volumes to the quality of thought that went into the language when its own standard library function GETS overflows if the input data stream is too long.
Standard mental C midget solution? Make the buffer bigger.
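To answer the question: gets() was deprecated in C99 and finally removed in C11, precisely because it cannot be told the buffer size. The usual fix is fgets(), not a bigger buffer; a minimal sketch (the helper name read_line is invented here):

```c
#include <stdio.h>
#include <string.h>

/* Read one line into buf (at most size-1 chars), stripping the
 * trailing newline. Unlike gets(), fgets() is told the buffer size,
 * so an over-long input line is truncated instead of overflowing.
 * Returns the length read, or -1 on EOF/error. */
static int read_line(char *buf, size_t size, FILE *in) {
    if (fgets(buf, (int)size, in) == NULL)
        return -1;                      /* EOF or error */
    buf[strcspn(buf, "\n")] = '\0';     /* drop the newline, if any */
    return (int)strlen(buf);
}
```

Truncation may still be the wrong behaviour for an application, but at least it is defined behaviour.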
"The buffer overflow in Blender was the result of a typo. A developer mistyped 3 instead of 2 for an array access."
So these so-called experts are using magic numbers to access arrays? And, of course nobody reviewed these so-called experts' code, because, well...they're so-called experts!
"But it's all C's fault, mommy!"
"This is hardly surprising. C is one of the worst computer languages ever created. Conceived by morons, implemented by incompetents and loved by sycophants."
This is a really bad attempt at trolling, and/or the author is absolutely clueless.
Hang on, the name rings a bell ... ::hits metacrawler:: ... Oh. It's you.
Hi, Scott! Bye, Scott. :-)
buffer overflows - why?
I'm not sure whether these guys were working under time pressure (we can all make mistakes when pushed for time). We should, however, remember that their primary expertise is mathematical and logical analysis. Programming is probably secondary to what they really do. It would be unfair to blame an architect for slightly wonky bricklaying. However, I do believe we no longer have much real control over the code of our programs. Ultimately, a program is a set of machine code instructions that the machine follows blindly in some sequence, but we have largely lost this connection with the machine.
When I started microsystems development we used raw machine code, assembler and possibly C, multi-threading was an arcane concept and the machine instruction execution sequence was almost entirely under the control of the programmer. Now we mostly link together often massively complex black box library modules written by vendors, run hugely parallel (even if merely time sliced) multi-thread processes and execute on CPUs capable of predictive out-of-order machine instruction execution. The result is almost no control over the exact run-time process.
Buffer overflows can be written in by inattentive programmers even when coding at low level. They can also occur due to [a] calling a proprietary library module that contains a potential buffer overflow the programmer does not know about; [b] race conditions between threads or even possibly triggered by CPU execution order decisions. So testing becomes more important, but the very complexity of code bloated by massively redundant "do everything" library methods makes exhaustive testing impossible, so we're back to square one. And worse, as programmers are now almost exclusively trained to use solely highly abstracted languages, it's likely that buffer overflows and their causes will become less and less well understood. Expect buffer overflows (and integer overflows, and divide by zeros, and typing mismatches and null pointers, ...). They may be 35 years old, but they're still hale and hearty.
I always use these as a starting point. They aren't meant to be optimized, they aren't meant to be examples of good programming practice (magic numbers, et al). To quote from memory, they are "coded for clarity and not necessarily for efficiency. They will compile and run correctly on both big-endian and little-endian machines". They are *only* meant to produce correct outputs.
When you do your own optimized implementation, possibly in a different language, you test it against the reference implementation. The reference implementations tend to use C because it is both ubiquitous and powerful, especially when it comes to bit manipulation and memory management.
C was conceived by very brilliant people, implementation quality is generally very high (when was the last time *you* wrote a compiler?) and it is loved by anybody who has the need (and ability) to write fast, accurate code.
It is 40 years old and the earliest implementations were macro passes in assemblers. Back then you may have had only 2 choices, C or assembler. Many programs were written in a mixture of both. C is not designed to be used by children, the careless or the hard of thinking.
Do people make mistakes? Of course they do. That’s why a sensible manager insists on doing Design/Code Reviews. That’s what Fortify have done and they've caught a couple of mistakes. Well done them.
"submitted in the first round"
that hardly makes them "scrutinized" or even "used", and it is hardly shocking to see memory management issues in programs written in C. The discovery was obviously made in the Department of the Bleeding Obvious.
Actually, I dunno what is worse ...
What is worse? Me responding to a known troll, or so-called "security" researchers not asking for help when programming in a low-level language they aren't genetically familiar with?
Mea culpa :-)
"current hash standards such as MD4, MD5, and SHA-0"
MD4 and MD5 haven't been "current" for some time (though MD5 keeps popping up like a bad penny) and SHA-0 was replaced by SHA-1 in 1995.
What does reference implementation imply to you?
My impression was:
Well commented code to make it clear how it does what it does
Well structured code so you can profile it and change sections if they don't meet your needs
Well tested code that does not crack under failure modes common to the language.
Magic numbers fully documented or in a header file, with an explanation
Does the job correctly
Efficiency is a low priority. The right answer can always be made faster.
Can be dropped into a prototype virtually as is.
Correct. Clear. Fast. In that order.
Then I read this
"The implementation submitted by Rivest's team suffered from a hashval field with a buffer that is read or written using different bounds"
If an ANSI standard compiler would have caught this, what does that say about the team?
The code looked good enough to submit without compiling?
They don't use a compiler that catches stuff like this.
They ignore warnings when it does advise them their code is questionable.
OTOH Rivest is an MIT Professor. This sounds like the implementation was written by one of his students. It may not be right but it was written quick and that shifted boundary buffer access is so coool!
A reference implementation written by a PFY.
Does this inspire trust, bearing in mind the subject of this algorithm?
"So these so-called experts are using magic numbers to access arrays?"
I'd hardly call 2 and 3 "magic". I'm no number theorist, but it is likely that these constants are part of the algorithm, not an arbitrary implementation limit. If I'm writing a bisection routine, I don't use a named constant for the division by two, because the algorithm just doesn't work if you use any other value.
"And, of course nobody reviewed these so-called experts' code, because, well...they're so-called experts!"
Ermm, you are commenting on a news story, the point of which is that someone *has* reviewed the code.
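As an aside, the bisection point above is easy to make concrete. A minimal sketch (function names invented here), where the literal 2 isn't "magic" because the algorithm simply stops being bisection with any other value:

```c
#include <math.h>

/* Bisection root finder on [lo, hi], assuming f(lo) and f(hi) have
 * opposite signs. The division by 2 is the algorithm itself, not a
 * tunable constant, which is the point being made above. */
static double bisect(double (*f)(double), double lo, double hi, double tol) {
    while (hi - lo > tol) {
        double mid = (lo + hi) / 2;      /* the non-"magic" 2 */
        if ((f(lo) < 0) == (f(mid) < 0))
            lo = mid;                    /* root lies in [mid, hi] */
        else
            hi = mid;                    /* root lies in [lo, mid] */
    }
    return (lo + hi) / 2;
}

/* Example target: the positive root of x^2 - 2, i.e. sqrt(2). */
static double x2_minus_2(double x) { return x * x - 2.0; }
```

Naming that 2 `SUBDIVISIONS` would document nothing; the same arguably goes for constants that are part of a hash algorithm's definition.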
Hmm... Programming errors blamed upon C
From the look of the errors they could happen in *ANY* procedural programming language (and probably some OOP ones as well).
Let's see, using different lengths for an array between procedures... Hmm... where do I see that happen most often?
C? Yes, but not that often, unless it's using static arrays.
FORTRAN? Oh yes, indeedy! Very often.
C programmers have to think about the memory allocation and hence are generally more careful (but not always). With FORTRAN(77) it's all done for you and the programmers don't worry (until their programs SEGV).
As for the second, that can happen in *ANY* language, including all those nice OOP ones which are the darlings of the safe-programming cults.
Digital signature collisions
"Over the past few years, a growing body of research has revealed those algorithms can be vulnerable to so-called "collisions," in which two separate data files generate the same digital signature."
If I understand the nature of Digital Signatures correctly, a signature is designed to be smaller in size than the data it represents. In fact, I would speculate that one would have little use if it was of the same size as the original file. Therefore, it seems there is a problem with the statement above. As the total number of permutations of file input is file size^2, the total number of possible signatures is only signature size^2, and input file size > signature size, there will ALWAYS be several possible inputs that can result in the same signature output. The algorithm in use is irrelevant to whether collisions can occur, it is simply a case of the number of permutations possible in the data size used to represent a signature.
Have I missed something or does the article seem to suggest that NIST are looking for a solution which cannot exist?
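Ed's counting (pigeonhole) argument is easy to demonstrate with a toy hash. The function below is invented for illustration; the point is that with only 256 possible outputs, any 257 distinct inputs must collide, whatever the algorithm. NIST isn't looking for a collision-free function, only one whose collisions are computationally infeasible to find:

```c
#include <stdint.h>
#include <string.h>

/* A toy 8-bit mixing "hash" (made up for this example). */
static uint8_t toy_hash(uint32_t x) {
    x ^= x >> 16;
    x *= 0x45d9f3bu;
    x ^= x >> 13;
    return (uint8_t)x;
}

/* Hash the inputs 0..n-1 and report whether any two collided.
 * With a 1-byte output there are only 256 possible values, so by
 * the pigeonhole principle, n = 257 distinct inputs must collide
 * no matter how the hash is defined. */
static int has_collision(uint32_t n) {
    uint8_t seen[256] = {0};
    for (uint32_t i = 0; i < n; i++) {
        uint8_t h = toy_hash(i);
        if (seen[h])
            return 1;   /* two inputs, same "signature" */
        seen[h] = 1;
    }
    return 0;
}
```

Scale the output up to 256 or 512 bits and the collisions still exist; they just become infeasible to find on purpose, which is the actual design goal.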
Crikey! Over at http://blog.fortify.com/blog/fortify/2009/02/20/SHA-3-Round-1 I found details of the two problems and the second (at least) is hardly subtle...
DataLength sourceDataLength2; // high order parts of data length
if (ss.sourceDataLength < (bcount | databitlen)) // overflow
if (++ss.sourceDataLength2 == 0) // increment higher order count
if (++ss.sourceDataLength2 == 0) // and the next higher order
++ss.sourceDataLength2; // and the next one, etc.
Far from requiring very large inputs (at run-time) this problem is trivially detected at compile-time by any compiler that bothers to check local array bounds. Sadly, not all compilers do so. Googling about suggests that GCC and Intel do, but Microsoft don't.
The first would require more lint-like analysis, but magic numbers aren't the problem. They had defined constants. The bug arose from the arithmetic being performed on those constants.
The problem with C and C++
Is first and foremost the clueless programmers.
The second problem is that it's mostly pointers, or pointers to other pointers...
(There's a reason I don't program in those languages... Bleagh!)
It is a firing offense to use a GOTO in a structured language.
And if you create a constructor WITHOUT a destructor, you'll be tied upside down and THEN shot!
Building a program with ARRAYS or 'buffers' and NOT implementing boundary checking is like pissing in your living room. Easy to do, feels great when you do it, but the stench will never completely disappear, and the neighbours look at you strangely...
Don't expect an 'out of bounds' exception to handle your mistakes.
mathematical soundness != implementation flaws...
.. and C is fine if properly tested / debugged.
"This just emphasizes what we already knew about C," Fortify researchers wrote. "Even the most careful, security conscious developer messes up memory management."
Like hell it does. Out of 45 submissions they found errors (or more likely potential errors) in about 5. Yes, C is awful and not a single living human can write C code with no security holes...
Re: C = Worst programming language...bla bla
C is beautiful. You just don't understand it.
You probably use something like Java or (heaven forbid) Perl - you know - runs like a dog, is massively bloated, needs huge gobs of memory to run, etc etc.
If it wasn't for C, you wouldn't have most of the electronic stuff you take for granted - from your washing machine and heating timer up to the PC you sent your post from. You have NO idea just how important C is.
Another vote for the maligned Corvair
Biggest problem with the Corvair was the driver/owner. No one was used to checking oil levels in air cooled engines, and no one paid attention to fan belts. You could get away with this in a 1964 Buick, but a 1964 Corvair needed more attention.
Driven by anyone with any automotive sense at all, the Corvair was actually a pretty nice car for the time. Economical, easy to steer, decent ride, surprisingly good handling. Corvair even had some good results in racing - look up "Yenko Stinger" on Google.
What killed the car was a combination of a certain ambitious liberal lawyer (who boasted that he didn't even have a driver's license!) and GM's arrogance and poor damage control after said lawyer made his name by attacking one of the very few almost-decent GM cars of the day.
I have personally met Ralph Nader - he's an asshole of the first order and a total publicity whore. We ought to wrap him up (duct tape would be good) and give him to North Korea or Iran as a Christmas present, even out of season if necessary. Drop him from 40,000 feet - he won't even need a parachute!
@Ed: Digital signature collisions
Try the interweb, say <http://en.wikipedia.org/wiki/Digital_signature>. Look for 'collision'.
Unless you're trolling too.
C Does Suck
The problem is C generally requires you to reinvent the wheel for all sorts of basic plumbing. Memory management, for one. It's like "to create a shower, first dig up some copper ore..."
I programmed in C for years, and I tolerated it for what it is, a high level assembler.
C gives you a lot of power, yes. It's compact, yes. But it has virtually no checking, no basic services, and most programmers disable the few things C compilers do have, like boundary checks, in the name of the gods Speed and Compactness.
When nearly every variable assignment you do has the potential to overflow a buffer, when every memory allocation has the potential to create a memory leak, when every pointer has the ability to point *anywhere*, then you have a mess.
C requires a great deal of care to get the mind-numbingly simple things right. Over and over and over. It isn't that C itself is bad, it's that C does nothing for the programmer, the programmer has to do *everything*. The slightest flaw and you have a compromised system. This is a good thing?
C is still necessary in limited areas where the power truly is needed. Low level OS functions, for example. Stuff that A) isn't very large, B) needs to be absolutely as fast as possible.
C is not suited for large complex projects precisely because of the exponential chance of making a basic mistake. (And I know lots of large programs are in C--which is why buffer overflows are so damned common).
How about using a language that does memory management, with an interpreter or compiler that knows how to optimize to the bare metal? It's a lot easier for a specialist to create a solution everyone can use, right? That's the whole *point* of code reuse...
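The "programmer has to do everything" complaint is easiest to see with string copying. A sketch of the usual defensive pattern (the helper name copy_bounded is invented here), relying on C99 snprintf semantics:

```c
#include <stdio.h>
#include <string.h>

/* Bounded string copy: truncates rather than overflows, and always
 * NUL-terminates (snprintf guarantees this when dstsz > 0). In C
 * the caller must remember to do this every single time; nothing
 * in the language does it for you. */
static void copy_bounded(char *dst, size_t dstsz, const char *src) {
    if (dstsz == 0)
        return;                       /* no room even for the NUL */
    snprintf(dst, dstsz, "%s", src);  /* writes at most dstsz-1 chars */
}
```

Forget the bound once (strcpy, sprintf, gets) and you have exactly the class of bug this article is about.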
@BlueGreen: Digital signature collisions
Ok, I looked at the article you referenced. I think the part you are referring to is where it says that one of the purposes of a cryptographic hash function is to make the CALCULATION of a different message that produces the same hash or signature computationally infeasible.
However, the point still remains that for any given signature there are a number of possible inputs that will produce it. I refer to my earlier mathematical proof. Of course, for a good cryptographic algorithm, it will be hard to work out the corresponding inputs for any given output. That is the reason they are also known as one-way functions. Nonetheless, those multiple input values do exist, and no algorithm that can be devised can change this fact.
It reminds me a bit of the much-touted compression algorithms of the 80s and 90s that claimed to be able to compress any possible input file to a smaller size. This is, of course, also mathematically impossible.
It's true that it's mathematically impossible to have a hash function that can map a string m to a hash value n where m can be longer than n. Naturally, for the same reason, a compression algorithm cannot compress every possible file. If an algorithm can compress an arbitrary file, it is possible to construct a file that would be made bigger by such compression.
"However, the point still remains that for any given signature there are a number of possible inputs that will produce it. I refer to my earlier mathematical proof. Of course, for a good cryptographic algorithm, it will be hard to work out the corresponding inputs for any given output. That is the reason they are also known as one-way functions. Nonetheless, those multiple input values do exist, and no algorithm that can be devised can change this fact."
Nothing you said is false. But I can just as easily say that for any given AES key, it's possible for someone to randomly guess the value and decrypt your encrypted files. The statement holds just as much meaning. The idea is to make decryption (or in the case of SHA-3, being able to spoof a message with a valid hash for the message being replaced) computationally infeasible. The weakest SHA-3 variant, assuming there are no design/implementation weaknesses (I know, I know, but bear with me) should have a collision attack complexity of 2^128 (that being accomplished through a birthday attack, which does very little in actually helping you spoof a given document). The others should have complexities of 2^256 and 2^512.
tl;dr the concept of a collision-free general purpose hash is flawed to begin with, but that's no different than the concept of a "secret" key.
Sorry mate, got to disagree there.
Other languages just cannot optimise to the bare metal the way C does. Or maybe "do not" is better than "cannot".
I know, I know, java has intelligent runtime optimisation. It still seems to take forever to start and need a larger memory footprint than a decent C app though.
Where you absolutely, positively have to have control over what memory is read when and in which order, when you want to squeeze the best you possibly can out of a system, use C. And this applies to giant database systems as much as it does tiny embedded boxes.
Oh, and don't forget systems programming, drivers, kernels etc.
Plus you have the cost of distributing and installing java VMs everywhere.
If you can code C competently then there's no need to switch. I'll agree that it seems that many people can't.
@Steven Usher - C language
I would suggest to you that you might not have done that much programming in a variety of languages.
I've been programming since 1982. The C language is particularly dangerous and certainly more dangerous than most other programming languages out there.
Its use of pointers is the problem.
Bounds checking of arrays is not carried out. And when it comes to using pointers to access a block of memory, then there can't be any checking done.
Most other languages bounds check arrays.
In fact, take a look at the number of books on the market that talk about "Safer C programming" and show me another language where such a book market exists.
It's even possible to go on training courses to learn how to defensively program in C to avoid the blunders.
@AC Hash Algorithms
"It's true that it's mathematically impossible to have a hash function that can map a string m to a hash value n where m can be longer than n. "
I'm not sure if that is really what you intended but it is wrong. It's the other way around.
If the input to the hash function is m and the output is n, then the output n must be equal to or less than the size of m (in length of characters).
The whole point of the hash function is to create a 'signature' value which is much smaller in size than the input.
Digital Signature collisions
"Over the past few years, a growing body of research has revealed those algorithms can be vulnerable to so-called "collisions," in which two separate data files generate the same digital signature."
What is this, a joke? Back in 1987 I was writing software for my A-level computer science project that made use of hash functions (though they weren't even called that in those days), and we were aware even then of the issue of multiple input values mapping to a single output value; there were techniques even in those days that had been devised to deal with it.
It's something that's been known about for more than 20 years.
Now this guy Wolf is absolutely right. It's quite evident he has programmed in C for a good number of years ( as have I). In my case, military applications running on micro-controllers with 1536 bytes of RAM and 32B of ROM space (among other apps).
GOTO is mandatory in some places.
And I'm not joking. My first carpeting in my first job was with a company in the early 90s who wanted backwards compatibility of new code in a new version of their favourite language, in case they had to revert to a second-hand machine if there was trouble. My training was only in the newer version, which introduced While, Repeat and If/then. Their data dictionary was quite nice.
My methodology training was to read the office copy of the English edition of a Swedish book on JSP. It made more sense than Michael Jackson's book. But not a lot.
The company, hardware and language will remain unnamed. They might be hiring.
I've never needed a reference implementation but I've long suspected that despite warnings (It is meant to guide you in writing your own not replace having to) that's exactly how they are used.
Which does raise some questions.
Could you write a tool to pick the ref. imps. out of production apps? How many would you find, and how much bloat could be squeezed out with a re-implementation based on real user feedback reports?
@By Ken Hagan
"this problem is trivially detected at compile-time by any compiler that bothers to check local array bounds. Sadly, not all compilers do so. Googling about suggests that GCC and Intel do, but Microsoft don't"
Another good reason not to use MS?
> How about using a language that does memory management, with an interpreter or compiler that knows how to optimize to the bare metal? It's a lot easier for a specialist to create a solution everyone can use, right? That's the whole *point* of code reuse..
We've been using the same home grown C utility libraries for more than 10 years. They are thoroughly debugged, run on every O/S and hardware combination we've tried them on (even Windows) with little or no modification and are very *very* fast. Certainly faster than those in the standard libraries. [At least with C (and C++) you can override the built-ins and use your own.]
That *is* the point of code reuse. If you aren't still using code you wrote 10 years ago then you need better code and better local procedures.
Ed - Hash Algorithms
"The algorithm in use is irrelevant to whether collisions can occur, it is simply a case of the number of permutations possible in the data size used to represent a signature."
Of course the particular hash algorithm is important, otherwise what would be the point of devising so many different ones, other than for intellectual curiosity.
It's to do with the distribution of those numbers in the output set. If the size of your output space is 2048 numbers, you don't want bunching up of the numbers, you don't want 80% of the collisions from the input numbers to be mapped to 10% of the output space.
" I refer to my earlier mathematical proof."
Can we please ditch the use of this phrase, which was used by AC and subsequently by others. There was no mathematical proof provided; what there was, was a simple expression, which was actually incorrect anyway.
Feel free to tell me how I was wrong. I said that for two entities m and n, if Length(m) > Length(n) then it is impossible to map m to n without collisions for all possible values of m. I await your disproof of that statement with bated breath. This is basic stuff. Just because you thought I was saying something I was not does not mean I was incorrect.
"Of course the particular hash algorithm is important, otherwise what would be the point of devising so many different ones, other than for intellectual curiosity.
It's to do with the distribution of those numbers in the output set. If the size of your output space is 2048 numbers, you don't want bunching up of the numbers, you don't want 80% of the collisions from the input numbers to be mapped to 10% of the output space."
All that, of course, I agree with. In fact, one of the more important output characteristics of a cryptographic hash algorithm is that very similar input strings (e.g. "abcde" and "abcdd") have radically different output values.
I never said I presented a mathematical proof. I did state that it's mathematically impossible to map a set onto a smaller set without collisions, but forgive me if I assumed a reader was at least capable of a 2 minute Google University intro if they lacked the basic knowledge (or hell, even intuition) to see the truth of that statement. I suppose it's easier to just argue by assertion.
> I did state that it's mathematically impossible to map a set onto a smaller set without collisions
Depends entirely on the entropy of the larger set.
But then we all know what we're talking about, not really disagreeing on anything but quibbling over trivia then taking it personally. We wouldn't converse like this if we were face-to-face in a pub, so why here? (assuming any of us would talk tech in a pub - I really hope not)
Hashes in general collide. We all know it and the consequences/resolution. End of story.
"Depends entirely on the entropy of the larger set."
Let's try a fun experiment. Grab 4 balls. Get 3 boxes. Now place every ball in a box, without putting more than 1 ball in any given box.
Now explain how the generic set of m items mapped onto a set of n items is any different.
But you're right, this is really just quibbling and not germane to the article (any fule kno cryptographic hashes collide, the game is more complicated than that). It's just irritating being told you're wrong by someone who is actually wrong, with them not even bothering to try and justify their statement.