back to article Crypto collision used to hijack Windows Update goes mainstream

The cryptographic hash collision attack used by cyberspies to subvert Microsoft's Windows Update has gone mainstream, revealing that MD5 is hopelessly broken. Security researcher Nat McHugh created two images of different rock 'n' roll icons - James Brown and Barry White - with the same MD5 hash. "The images were just two I …

  1. DropBear

    While it's obviously always a good idea to use a better alternative (and MD5 is indeed near-useless for authentication by now), I'd say it's not useless for corruption-checking or indexing as long as the nature of the application implies attacks with crafted data are not to be expected.

    1. Lee D Silver badge

      Hashes have many uses, some of which have no security impact at all.

      Consider data integrity hashes on your own stored data. If a malicious agent could get access to your backups and their hashes, that's game over anyway. But if a hash differs on one of your backups to the others, you know there must be data corruption or loss somewhere.

      It doesn't render MD5 useless, just insecure. There's a difference.

      I put in the MD5 routines into the game OpenTTD, for instance (the code has long since changed, I believe). It checks that you have a copy of the original GRF (graphics data) files from the original game and whether you have the demo or full-version, and DOS or Windows palettes for them based on the hash. Unknown hashes flag up a warning. Someone who WANTS to feed in a fake GRF would be pointless.

      But we found a lot of people who had corrupt copies of the original GRF's from their old backups that were generating support tickets that nobody could fathom. For most, it meant that they then knew they'd got dodgy backups and they just replaced it with the originals. For others, it meant they'd been modifying the GRF's and so generating tickets because of their own mistakes. Despite their arguments, when they are the only ones on the planet with the GRF files corresponding to the hashes they posted, you know immediately they are either using a corrupt or edited GRF rather than the supported original GRF's.

      Obviously, if they wanted to fake a support ticket, they could just say that the program never warned them, in the same way someone could edit a kernel log to remove references to the taint flags. It's not "secure". But it is useful.

    2. JeffyPoooh
      Pint

      Is this news?

      A long time ago, when I learned about such things, it was immediately obvious to me that one could create 'N' different files with the same hash. It's kinda duh-obvious based on the number of bits (input file, output hash), trivial. It never occurred to me that anyone else wouldn't immediately think of the same thing.

      Instead of appending additional bits (lazy!), the hash adjustments should be imperceptible watermark-style adjustments within the existing image file (not easily applicable to other file types). And it should be an efficient algorithm, not a brute force search.

      1. kphair

        Re: Is this news?

        It WAS news, back in 2004. That's when I read about it via Dan Kaminsky's blog. Anyone using MD5 in a security setting after that point might as well have been using CRC-32.

        1. Michael Wojcik Silver badge

          Re: Is this news?

          Anyone using MD5 in a security setting after that point might as well have been using CRC-32.

          And Yet Another Reg commentator demonstrates that reading security blogs is no guarantee that you'll learn what IT security means.

          "In a security setting" is essentially meaningless. There are threat models, and there are technologies and protocols that address those threat models.

          MD5 is still entirely suitable for many applications under many threat models. It's a perfectly good way to compute salted password hashes for a great many applications, for example; forcing colliding preimage pairs doesn't do an attacker a damn bit of good there. And while the existing MD5 attacks might someday be extended to generating arbitrary collisions, the typical password authentication system doesn't let an attacker pass enough data to make use of such an attack (even if it were in the scope of the threat model, which for most applications would be rather excessive - there are almost certainly more pressing vulnerabilities).

      2. as2003

        Re: Is this news?

        Not only is this old news; as far as I can tell, this security researcher is just running 'hashclash' to look for the collisions. A program that has existed since 2009. Not sure why this story is making so many headlines.

      3. Michael Wojcik Silver badge

        Re: Is this news?

        A long time ago, when I learned about such things, it was immediately obvious to me that one could create 'N' different files with the same hash.

        Yes, that's immediately obvious to most people who have a rudimentary understanding of coding theory. It's the Pigeonhole Principle.

        The PP of course applies to any information-reducing operation, which of course includes all cryptographic hashes.

        The point of a cryptographic hash is to maximize the work factor for:

        1. Given a hash ("image"), finding an input ("preimage") that generates that value.

        2. Finding two preimages that generate the same image.

        while minimizing the work factor for computing the image from the preimage.

        When a cryptographic hash of length L has no known attack better than brute force, the work factor for #1 is 2L-1, and the work factor for #2 is 2L/2 (by the Birthday Paradox). Thus MD5, as a 128-bit digest, has a brute-force work factor for finding two preimages with matching hash value of 264. That's a fair number of hashing operations to carry out.

        That also means the probability of accidentally finding a couple of files that have the same MD5 hash isn't worth worrying about. Check a million files a second, and you should be good for more than half a million years.

        Once you understand that, perhaps you'll understand why the MD5 attacks are important, and why your rediscovery of the Pigeonhole Principle is irrelevant to the question of MD5's suitability.

        Everything old is new again.

  2. David Nash Silver badge

    it ought to be impossible to recover the original information.

    It still is impossible, and this exercise demonstrates that. Given the hash in question, there's no way to know whether it came from Barry White or James Brown.

    1. Destroy All Monsters Silver badge
      Holmes

      Re: it ought to be impossible to recover the original information.

      It is also very hard indeed to recover the image of James Brown out of 64 byte.

      1. Anonymous Bullard

        Re: it ought to be impossible to recover the original information.

        If that was the case then it wouldn't be classed as "bad crypto", more like "fucking good compression".

      2. Michael Wojcik Silver badge

        Re: it ought to be impossible to recover the original information.

        It is also very hard indeed to recover the image of James Brown out of 64 byte.

        You just need a specialized compression algorithm. I have one that can reconstruct an image of James Brown from a single bit.

        Here's the pseudocode for the expander:

        if first bit is 0

        return hard-coded James Brown image

        else

        remove initial bit and pass remaining data to gzip

        Extending this to support an image of Barry White is left as an exercise for the reader.

  3. This post has been deleted by its author

  4. Alt0n

    Please delete

    I assume that's meant to read "a rogue [not rough] CA certificate"

  5. Billa Bong

    Something is broken... the method

    Yes, the finding is significant, but:

    "What McHugh was able to do was to add binary data to the end of two different JPEG images such that the two modified files gave the same hash value"

    Surely the point is to take two different files, a target like a security update file, and a payload like a malware package, and only modify the payload until the MD5 is the same as the target - if you have access to the original on the source server to make changes to it, you could completely bypass the need of MD5 breaking in the first place.

    Maybe I missed it in the article, but has *that* been done yet?

    1. Destroy All Monsters Silver badge

      Re: Something is broken... the method

      It's about as hard. Adding random bytes to your malware should do the trick. But you do need at least a man-in-the-middle attack situation to pull of the subsitution. OTOH, if you are forging material that has been digitally signed and is sitting on the notary's disk, you may well have all the time in the world.

      1. JeffyPoooh
        Pint

        Re: Something is broken... the method

        If one's malware is smaller than the target, it provides a very large REM space to hold the hash-tweaked noise.

        Prediction: in the near future, compilers will offer an option for 'Desired Hash', and the compilers would tweak the Object Code until it matches whatever hash value you want; while maintaining the same function. Obviously a block of arbitrary bits in the middle that the intended code skips over. Maybe some alternate compilation equivalents to get closer.

      2. Michael Wojcik Silver badge

        Re: Something is broken... the method

        It's about as hard.

        Um, no, it's not. The Birthday Paradox applies - finding two colliding preimages has a work factor that's approximately the square root of the work factor for finding a preimage for a given hash. That's true even with the MD5 chosen-prefix attack. Being able to vary both preimages is an exponential improvement over being able to vary only one.

    2. Anonymous Coward
      Anonymous Coward

      Re: Something is broken... the method

      The answer to your question is "no", I think. As far as I know, no one has publicly demonstrated how to construct a file with a given MD5 digest or to find a different file with the same MD5 digest as a given file ("preimage attack"). It is therefore an exaggeration to claim that MD5 is "well and truly broken". In many applications it is, perhaps, still secure. However, in the circumstances, it would be foolish to continue using it. A successful preimage attack might be announced at any moment, and perhaps someone already has one in secret.

      If something is really important, use several different hashes.

  6. Crazy Operations Guy

    "[I]t does however have a much lower complexity than a complete brute force attack."

    Umm, no. Brute-force is the least complicated attack possible (Since its just n = n +1;). It may be much quicker, but can't possibly be less complex.

  7. Crazy Operations Guy

    Why not implement a dual-hash (or multiple-hash) system?

    Why is it that certificates can only have a single hash? Having more than one would increase the difficulty of finding a collision that satisfies both hashes. Theoretically this would allow certificates to remain safe long enough after the hash was found to be easily exploitable to let it just expire and re-issue a new one without the weak algorithm. Plus it would have the advantage where old clients that support only a very weak hash can still verify with a single hash and ignore the others (Or ones that don't need the extra security and have very low compute power as it is); allowing for certificates, clients, and servers to use $standard +/- 1 without compatibility issues.

    I use a similar thing when verifying OS install packages, I run an MD5 and SHA-256 hash on them when I first get them and when install on my beefy systems, but only do an MD5 on my low-powered single core boxes.

    1. the spectacularly refined chap

      Re: Why not implement a dual-hash (or multiple-hash) system?

      Already been done. For example from NetBSD's pkgsrc system for third party software:

      SHA1 (postgresql-9.1.11.tar.bz2) = b9ee975498705647aae77dae91869bfc6bb079ff

      RMD160 (postgresql-9.1.11.tar.bz2) = 2a3849a3e8b8f8fbbe2a35ee422a0be4a67e6dae

      Size (postgresql-9.1.11.tar.bz2) = 15861805 bytes

      (No MD5 there, but some of the packages use it). You've got three different metrics all of which have to match for the file to be validated. Even now MD5 + file size is still effectively secure since you'll need to add a few kilobytes but can't say in advance precisely how much. Two secure checksums and the size should be secure for the next few years at least.

    2. JeffyPoooh
      Pint

      Re: Why not implement a dual-hash (or multiple-hash) system?

      Meta hash. Instead of a single hash, define a meta group of hash functions. More bits in the output. Plus added complexity.

      Eventually, when the number of bits in the hash outputs equals the number of bits in the input file, then it would (correction: could) be as secure as a One Time Pad.

      1. Michael Wojcik Silver badge

        Re: Why not implement a dual-hash (or multiple-hash) system?

        Meta hash. Instead of a single hash, define a meta group of hash functions. More bits in the output. Plus added complexity.

        There's been plenty of research into combining multiple hash functions in various ways. It's generally believed (by people who actually study these things, as opposed to speculating about them in the Reg forums) that it's a less-efficient and riskier way to improve hash security than simply developing longer versions of existing, well-studied hash algorithms, as has been done with SHA.

        Eventually, when the number of bits in the hash outputs equals the number of bits in the input file, then it would (correction: could) be as secure as a One Time Pad.

        That's not a meaningful comparison, since (1) OTP is reversible, and (2) OTP's security proof rests on the equal probability of all plaintexts (of given length). When attacking a cryptographic hash, generally you don't care about finding a preimage, or you don't care about finding the preimage used to generate the hash in the first place. So "secure as a One Time Pad" doesn't mean anything in relation to a hash.

        I suspect, based on your earlier comment, that you mean something like "the hash could be a perfect hash" (i.e., with only one possible preimage for any given output). That's not a useful quality for a cryptographic hash; the whole point of a cryptographic hash is to produce a fixed-size output that's short enough to serve as input for other cryptographic primitives, such as MACs and asymmetric encryption algorithms.

        If you want an irreversible same-size transformation of arbitrary input, just use a cryptographically-strong PRNG as a stream cipher. That's using an OTP and throwing away the key.

  8. Pierson

    Need multiple checks, always

    For applications such as verifying updates, etc, good practice should include secondary checks to make attacks against weak hashes harder.

    In this instance the following would all have helped: verify data size as well as checksum (not always feasible); use more than one checksum, e.g. compute checksum for the data and also for the data plus a salt; use more than one checksum algorithm.

    All of the above should significantly increase the workload for a potential attacker, hopefully making the attack unfeasible.

    Oh, and of course, drop old and weak algorithms like a hot potato...

  9. harmjschoonhoven
    Coffee/keyboard

    For corruption checking

    a 32-bit CRC is sufficient. You can even compute it with a look-up table.

    Security is indeed an other matter.

    1. Lee D Silver badge

      Re: For corruption checking

      CRC checks are not as infallible as you might think.

      When I studied Coding Theory, it was explained that there are a number of errors which you need to be able to compensate for and how you do that depends on what you design it to compensate for.

      As such, ISBN's (the final digit of which is a checksum, and X = 10) are actually designed to withstand the swapping of any two numbers within them as well as detecting single-digit errors. That's because you expect a human typing in the code to make a mistake.

      Similar properties exist that you can design for that take account of other errors - until you get up to the standards of the Voyager spacecraft which can compensate for something like 999 errors in every 1000 bits, or something ridiculous.

      CRC isn't *as* suitable for spotting recurring data corruption, bit-flipping, etc. MD5 may not have been designed with such things in mind but certainly fits the criteria due to the extreme unlikeliness of such errors resulting in the same MD5 hash. You *REALLY* have to play with an MD5 to get it to stay the same, and even changing all the bytes of the code may not be sufficient by accident.

      CRC, though useful - as evidenced by its use in ZIP and thus PNG, DOCX, etc. file formats - is not necessarily the best use for data integrity checks. It was back when computing power was very limited and that's where its strength lies. It's very easy to make with basic boolean operations.

      However, given modern power, I'd trust an MD5 much further than CRC. And neither should be used for security purposes.

      1. JeffyPoooh
        Pint

        Re: For corruption checking

        Optimum CRC (or hash) depends on assumptions about the characteristics of noise.

        Has anyone brought on board the "recent" news that noise bursts can be fractal in nature? Bursts of burst of bursts of noise bursts, etc. Fractal all the way down.

        I expect not.

      2. stanimir

        Re: For corruption checking

        CRC is a faster, though - basically no inner loops, just table look-up 4096bytes for crc32 - so during the calculation the table is to be hot in L1 cache.

    2. Lifelong Learner

      Re: For corruption checking

      " For corruption checking a 32-bit CRC is sufficient."

      That depends on how many files you have to check. I once got 195 false positives for 1.32E6 files with CRC32 (same CRC32, but the different filesize flagged them).

      IOW 1.32E6 apparently is already a significant fraction of 2^32 to have collisions by pure chance. If somebody could show the math that this wasn't a freak accident but about to be expected I would be grateful.

      I repeated the calculation with MD5 (which took only 5 min longer for a total runtime of about 3 hours) and I got no false positives this time. So I learned that using CRC32 for speed reasons was a false assumption in my case.

      I agree that CRC32 + filesize might be sufficient, if no malicious cause can be assumed.

      BTW: My task was to find duplicate document scans resulting from a user error.

  10. Elmer Phud

    Something missing?

    I'm down at the bottom of the thread (as it stands) and not one post slagging off Microsoft.

    What is going on?

    1. Alistair
      Linux

      Re: Something missing?

      ....

      Rabies shots perhaps?

    2. Michael Wojcik Silver badge

      Re: Something missing?

      Why, this reminds me of Microsoft's braindead LanMan hash. Come sit by me, kids, and listen while Uncle Mike tells the story of a bold entropy-discarding transformation and the beautiful but weak bifurcated DES encryption he loved!

      Better?

POST COMMENT House rules

Not a member of The Register? Create a new account here.

  • Enter your comment

  • Add an icon

Anonymous cowards cannot choose their icon