Antivirus firms are concerned about the emergence of techniques that could render meaningless the use of checksums to mark applications as safe. The issue concerns hash functions - one way mathematical functions that produce a small fixed length checksum or message digest from a much longer batch of code or email message. When …
just whitelist based on 2 (or more) different hashes - it is probably not feasible to add bits correctly in order to generate 2 files that collide in 2 different hash spaces at the same time.
IANAC by the way
(I am not a cryptographer)
isnt the filesize a useful check in addition to the checksum? I'd be inclined to check both
Most popular P2P apps use MD5 to hash the files for downloading.
If you're a nefarious spammer script kiddie than you could potentially inject legitimate files on index sites with nefarious code or if you're a low life soul selling scum bucket working for a large media organisation you could corrupt illegal shares of you lord and masters copyrighted products.
Could cause a number of headaches that could.
Or Arbitrarily Slice
Find the length of the file and then divide it into two sections. Calculate the hash of each half. That should work with just the one hash function.
Chess may give hints
Chess programs use hash functions to check for collisions in a search tree - significantly increasing search depth. But false matches can lead to bad moves. In fact this is likely because the opposing side will tend to follow moves which are mis-judged, so avoiding collisions is important.
One way this is done is to use multiple "independent" hash functions. You only have to check the second if the first gives a match, so it isn't significantly slower than using a single hash function.
As the first poster suggests, this works.
Surely this is a touch overblown
I mean, for this kind of attack to be successful a malware author would need access to the source code of a safe application (to do the required additions to it) as well as this evil application.
If the bad guys have the ability to add code to your trusted apps then you're screwed anyway.
EVERY hash function produces collisions
"SHA-1 is also known to produce collisions . . ."
Since the output of the hash function is a fixed number of bits, whilst the possible file inputs are nearly infinite, every hash function will produce collisions.
The weaknesses stem from tricks to FIND collisions. If I compute the hash for word.exe, AND I can produce a malware file then add some bytes so that it produces the same hash, I then have a file that can bypass hash-based whitelisting.
Having said that, I would agree with those above who suggest using 2 different hash functions on the same file. I'm not a cryptographer either, but I can't see why that wouldn't work.
In fact, I think this should be done anyway. Just because we think one method is secure now, there's nothing to say that it won't be circumvented tomorrow, either by ingenuity or increase in raw computing power. If a hash can be quickly calculated, do 3 different hashes anyway.
If PK encryption can be done with 1000-bit keys in a "reasonable" time now, then do so - never mind that we don't think even 128-bit keys can yet be easily broken. (Those are "for example" figures - I don't know how the state of the art is nowadays!). After all, how long will it be before botnets are decrypting confidential information between spam floods?
128 bits is safe....
>"we don't think even 128-bit keys can yet be easily broken"
It's not down to the number of bits. Brute force search of a 128 bit keyspace cannot be done, not now, not ever - the number is simply too big. This isn't a case of having a big enough bot net, it's a case of needing all the energy in a couple of suns to do it - not going to happen. Ever.
Nope... to break a code you need to find a flaw in the algorithm which makes brute force unnecessary. This is what happened to MD5.
Using two different checksum methods is probably a good idea.
MD5 hash AND Cyclic Redundancy Check... is it too obvious???
Using multiple hash functions turns out not to improve things as much as one might expect. IIRC the paper showing this is by one Antoine Joux.
From the get-go, Tripwire (you know, the filesystem integrity checker) was set up to use > 1 hash for this reason. IANAC either but would expect the probability of compromising 2 (or more) hashes to be extremely small -- at this point anyway. Ah, good ol' defense in depth...
@Daniel du Preez
With a bit of tlc, you can add non-executing bytes to an executable without needing the source.
The idea as I understand it is that the malware author would write BOTH executables and have the clean sent to be whitelisted. Once it is then its hash will still match the malware code and the distribution can begin.
"whitelisted by submitting it to a classification server"
Sorry, did I miss something? Either the "classification server" is run by someone *you* trust to show basic competence, or you are letting an untrusted agency run your whitelist and you are a cretin.
re: Or Arbitrarily Slice
Problem with dividing the file into 2 - it wouldn't take a lot of effort to make two sections with that match the two original file halves hashes and join them - after all the second half could just be junk just to get right hash (and pretty much all combinations of slicing and dicing I suspect with have same issues).
> the possible file inputs are nearly infinite
<chortle> Nearly infinite, hey? Well how big is that, since any quantity is by definition, er, infinitely smaller than infinite...
Don't they hire real programmers?
So Symantec is looking for excuses of why its technology won't work. Don't they have smart people working for them? Maybe they should just close up shop.
It is funny how AV companies try to cover their #@$. Hash...hmmm...well I guess if every whitelisting company did it that way. Do any of you know how this whitelisting technolgy became popular? It all began about 7 years ago when a small company came to the Army Research Lab and then to the NSA SPOCK program. Everyone tried to figure out how this company did it...They tried using hash...Now everyone thinks that is how it is done. Why not find the company that started all this? www.seventhknight.com. They don't use hash and they surely are not going to tell you how they do it.
One thing I would like to remind everyone. Anything is better than AV technology. If it catches anything....guess what?....you are already infected....what a scam!!!!!!
AV technolgy is a KNOWN failure.
"The idea as I understand it is that the malware author would write BOTH executables and have the clean sent to be whitelisted."
Well, yes, that would work but why would you want to run even "clean" software from a dodgy vendor?
The point of whitelisting is that you or your IT department figure out the programs you need, find a reputable ISV and obtain the software from that ISV via a trusted path. Then everything else is sandboxed (or blocked). Where's the risk?
Well your reputable ISV could have been taken over by malware types as you suggest. No new risk there, though. Given complete freedom to author software, I'm sure a competent programmer can insert malicious code that a whitelisting inspection fails to spot.
Alternatively, the whitelisting might fail to reverse-engineer the entire app and prove that parameters to security-sensitive APIs can never take undesirable values. Actually that's inevitable and demonstrates why a "whitelisting service" is such a bad idea. It's likely that they'll just approve anything plausible from reputable ISVs. Rejection might lead to a lawsuit and they wouldn't want that! You, on the other hand, have no such fears, so you might actually be better able to whitelist than a commercial service.
The other approach
Would an alternative not be to find some existing whitelisted app, and add appropiate extra bytes to some malware so that it matches? Wouldn't require access to anything but the malware.
AC writes: "Anything is better than AV technology. If it catches anything....guess what?....you are already infected....what a scam"
What are you smoking? AV tools prevent infection by detecting viruses (usually at time of download or time of reading/writing from/to disk). Typically, you only get infected when your AV software FAILS to detect the virus.
Any AV software that detects a virus AND lets it infect your system is broken.
But what are the chance that you could successfully fake multiple different hashing algorithms at the same time?
Gentoo linux at least, maintains 4 different types of hash for each file it downloads.
@Ken Hagan: @AC: @Daniel
I think you're missing the point. It's not that the malware author would get you to run both executables. The author produces both files, and then submits the clean file to the AV companies for them to check and add to their whitelist. Then if you get the bad file on your machine your AV software won't pick it up since it will register it as being the good one.
Here's a solution:
Salt your hashes.
If each user prepended their own unique salt to an executable before hashing, a single maliciously crafted exe couldn't infect more than one person.
I identified this sort of issue an while ago. Any simple hash function must produce the same value for countless combinations. Though most of those combinations might have limited/no practical use to an hacker, some intelligence can narrow down to an useful design.
So I devised an comprehensive idea (unimplemented) of multiple hash values that nullifies this potential threat. In reality, there maybe no file system supporting this level of comprehensiveness, so separate hash values would have to be maintained, and when an file change happens an comparison made (i.e. time consuming) until OS companies can incorporate such hash systems. Until then, the present hash systems can be incorporate with file change details, and file lengths. As an file change indicates something has happened (though the date might be got around, so OS companies will have to enforce, or at least enforce an flag that an change has happened). File length, because it is easy to make an hack that changes the file size without compromising file functionality, but hard to make one the same size, unless you refer to another file, in which case the files that the file passes execution to need to be pre-defined to easily detect unusual behaviour.
@ Stephen B Streater
"One way this is done is to use multiple "independent" hash functions. You only have to check the second if the first gives a match, so it isn't significantly slower than using a single hash function."
Initially this seems like a good idea, unfortunately the idea represents at least two fatally flawed assumptions.
1) Significant time savings requires the file to not match the first hash result. This indicates the first hash function is sufficient. If the first is compromised then you have to run both to detect a bad match and you do have significantly slower verification. Since you expect files to verify most of the time, you are significantly slowing the verification process most of the time.
2) Reading the file takes little time compared with calculating the result. Checking the access times for registers, cache, main memory and hard drives should show the flaw. If that is not enough then try this little experiment. Get a copy of md5sum, sha1sum, and whatever other hash functions you want. Create a simple program that reads a file 4KB at a time doing no processing on the data. Run these programs timing how long it takes to process the same really large file, try a DVD ISO or something. You will find it takes nearly the same amount of time to just read the file as it does to calculate the hash results.
The correct solution would be to always calculate two or more independent hash function results using a single program that only reads the file once. That way reading the file, the most time consuming part of the verification, only happens once. You get the benefit of multiple independent checks without significant increase in verification time.
AV Technology....A known Failure
When AV Technology detects a virus, unless detected in an email, you are already infected. It scans. Therefore, you can be infected until the scanner detects the virus. Have you ever noticed how a machine runs slow after being "cleaned". Have you ever had a virus that the scanner could not remove. AV technology is not proactive. It is reactive clean up. The damage is done. You could lose productivty, proprietary data, etc. Not to mention that is slows down your machine and constantly has to be updated. Oh, and where did all those updated come from. You guessed it, infected pcs. Now that is a joke.
@ - AV Technology....A known Failure
AV software is Reactive, you are right. As you can't predict an infection that's not been downloaded.
Just like a Doctor can't treat an illness that you have never caught. Innoculations, sure. An innoculation works like a simple AV App's updates. Alert the system ( Your immune system / Anti-Virus application ) to a variation in the norm that's malicious, Antibodies ( Active/passive scanning of downloaded files ) are alerted in the system. Then the system ( Anti-Virus application / Immune System ) Jumps all over it once it penetrates the system and is detected ( Downloaded application / Caught an illness ). Only PC's are better since they scan EVERYTHING that comes in, a human's immune system is more of a dragnet.
But by the fact that you are downloading the file, the point of failure is the Luser, not the AV Application.
If you can create an Anti-Virus application that's Pro-active in it's workings, ( Heuristics? - HAHAHAHA *Thud* AAAHAHAHAHAHA! ) and comes up with less than say.. 2% ( Even this level is unaccepable in a medium sized organisation ) False-Posivive, and 0% False-Negative ( If you can't get this, then it's not Pro-Active, and things are defeating your hallowed app ) then by all means. Go for it.
Otherwise, I'd suggest hitting the books and maybe Re-Education of yourself on a few matters.
The Damn Hash calc would use several hashing algorithms (plus variations) in a single file read.
Of course you had to truse the author of the package in the first place
It's fairly easy in modern executables to hack a file without changing the file length, simply due to the number of 'caves' (caves being the term used by reverse engineers for blank space or redundant code). With current bloatware theres more than enough space to hide a whole operating system in there.
Well, you're right, there can only be an infinite number of file inputs if infinite file length is possible which, of course, it isn't. The point is that the number of FEASIBLY possible file inputs might as well be infinite.
If, for example, you consider ONLY files of 4MB in size. There are 2^25 possible such files. A 128-bit hash has 2^7 possible values. Therefore, on average, each hash value can be derived from 2^18 different files.
As John Stag kindly pointed out, we can already provide security that is theoretically impossible to crack by brute-force methods. The problems arise when weaknesses in the security algorithms make it possible to bypass these theoretical limits and find a solution more directly.
Winword.exe on my system is over 12MB in size. I'm pretty sure a decent (?) piece of malware could be written that would be a fraction of that size. If I then had some way to populate the other 11Mb with junk such that the MD5 (or whatever) hash of the whole file matched the original Winword.exe hash, it would pass a hash-based check.
Without actually being able to prove it, I am pretty well positive that there would be many different versions of my file that could be made to match the original MD5 hash. Whether it could AT THE SAME TIME match an SHA-1 hash, and/or an SHA-2 hash etc, would be a question for a REAL cryptographer!
Surely there was never a better argument for open source than this
I'm really not trying to start a flame war, but consider how hard it would be to infect an application that is recompiled on the target computer as part of the installation.
Some form of compromise/infection surely can be done, but wouldn't it be harder to do, and much easier to fix if you could simply recompile?
Just a note about key sizes (and heuristics)
Not all crypto is equal: the "safe" key sizes for symmetric algorithms (AES et co) and asymmetric ie public-key (RSA et co) are vastly different simply because they're fundamentally different principles. Nevertheless the problem usually isn't the key space but some implementation/design mistake on a detail that allows the attacker to shortcut (or a major breakthrough in mathematics that allows "easy" solutions to problems previously thought to be "hard").
As for heuristics the engines have been pretty good for a long time and false positives can be managed with a signature list, nowadays with increasing computing power more thorough behaviour analysis is possible and at least F-Secure is doing reasonable job at it (http://www.heise-security.co.uk/news/100900)
re: AV Technology....A known Failure
Eh? Maybe true(ish) for an active scanner that just looks for virus activity on a running PC but anyone who only uses that level of anti-malware protection on a Windows PC deserves viruses.
Minimum basic precautions are..
Properly configure (and use) your PC in as secure a manner as possible.
Scan all dangerous files at download time with a good, up-to-date virus scanner, preferably on a mail/proxy server and certainly before installing anything.
Run an active scanner in case the other precautions failed.
Back-up everything regularly in case you have to flatten your PC and start-over.
"If, for example, you consider ONLY files of 4MB in size. There are 2^25 possible such files. A 128-bit hash has 2^7 possible values. Therefore, on average, each hash value can be derived from 2^18 different files."
Really? I don't think so. 2^7 possible values - that's only 128 values. A 128 bit hash has 2^128 possible values. That's over 3.4 X 10^38. As the sun has a mass of approx 1.9 X 10^30. That's 17 million times the mass of the sun in kilos. Or, in file sizes, a single file of 3.4 X 10^14 terrabytes. One file of this size would be needed to generate all the possible combinations.
That's (for current computers) not a space that can brute-forced.
There are 10 types of people in this world:
Those who understand binary,
Those who don't and
Those who start counting at 0
- Geek's Guide to Britain INSIDE GCHQ: Welcome to Cheltenham's cottage industry
- 'Catastrophic failure' of 3D-printed gun in Oz Police test
- Game Theory Is the next-gen console war already One?
- BBC suspends CTO after it wastes £100m on doomed IT system
- Peak Facebook: British users lose their Liking for Zuck's ad empire