But can you trust a canary
not to tweet?
Belgian boffins have proposed adding what they call “Canary Numbers” to random number generators (RNGs), in the hope and expectation they will fall off the twig if something goes wrong. In this International Association for Cryptologic Research (IACR) paper, Vladimir Rozic, Bohan Yang, Nele Mentens and Ingrid Verbauwhede write …
I don't think the idea was to expose these numbers anywhere near end-user code; rather, you'd have two modules - one HWRNG and one verifier. That way you can have an independently designed RNG and a separate module that performs the runtime checking.
This is a rather good idea, as we know how to design whitening functions that pass all statistical checks on the output while being fed no entropy at all. In other words, an RNG passing statistical tests doesn't mean it's a good RNG, it just means it's not horribly broken.
There are two sets of random numbers generated, one with high entropy and one with low. The scheme checks the statistical distribution of the low-entropy set (the canary numbers). They should be evenly distributed, like you'd expect repeatedly flipping a coin to come out roughly 50/50. If they're not, then something's up. In theory the canary numbers will be affected first if there's something wrong with the RNG, because their entropy is lower.
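The coin-flip comparison above can be sketched as a simple bit-balance check. This is just an illustration of the idea, not the paper's actual test; the `tolerance` value is an assumption picked for demonstration.

```python
import os

def monobit_check(data: bytes, tolerance: float = 0.01) -> bool:
    """Return True if the fraction of 1 bits is within `tolerance` of 0.5,
    i.e. the stream looks like a fair coin at this coarse level."""
    total_bits = len(data) * 8
    ones = sum(bin(b).count("1") for b in data)
    return abs(ones / total_bits - 0.5) <= tolerance

print(monobit_check(os.urandom(100_000)))  # healthy source: True
print(monobit_check(bytes(100_000)))       # stuck-at-zero source: False
```

Of course, passing this check proves very little (a counter passes it too); failing it is what matters, which is exactly the canary's job.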
"There are two sets of random numbers generated, one with high entropy and one with low."
I get that. But why should the latter tell you anything about the former? Or, given that they're talking about TRNGs based on noise sources, how do you get two sets of numbers out of them other than by diverting every other number (or some other percentage) from one set to the other? And in that case, how do you end up with different entropies for the two sets?
In short, this seems like a remarkably low content article.
Probably by way of skimming only a low bit count out of the generator to get your low-entropy set (say your standard set uses 32 or 64 bits, but your canary set only uses 8 bits). A 32-bit number has a numerical range of over 4 billion, while an 8-bit one only has 256. Analyzing a stream of 8-bit numbers should be much quicker than scanning a stream of 32-bit ones. You use the canary set as an early warning system.
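That skimming idea might look like the sketch below - splitting each 32-bit sample into the full word for the main set and its low byte for the canary set. This is my reading of the comment, not anything from the paper itself.

```python
import os
import struct

def split_samples(raw: bytes):
    """Split a stream of little-endian 32-bit samples into a main set
    (full words) and a canary set (low 8 bits of every word)."""
    n_words = len(raw) // 4
    words = struct.unpack(f"<{n_words}I", raw[: n_words * 4])
    main = list(words)
    canary = [w & 0xFF for w in words]  # skim only the low byte
    return main, canary

main, canary = split_samples(os.urandom(40))
print(len(main), len(canary))  # 10 10
```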
"Probably by way of skimming only a low bit count out of the generator"
You still have the problem of ensuring that the low-bit-count numbers reflect any patterns in the high bits. To take an extreme example: you take the low 8 bits and they look random, but the top 8 bits are cycling through a short repeated sequence.
@Dan 55: Ermm, I'm pretty sure this isn't what they're doing. (See my post below.)
The second (low-entropy) set of numbers is the raw thermal noise data, or the timing of radioactive decay events, or the timing of key presses/mouse movements for a desktop computer. That data is semi-random, but will have patterns in it, including (possibly) not having bits set 50/50, or correlations between bits. This raw, low-entropy data is "stirred" (I think the Linux kernel used to do it with a primitive polynomial) and hashed to create a smaller amount of high-entropy, "really random" output.
The idea these folks are proposing is that you send, say, 90% of your raw data to be stirred and hashed, but send the remaining 10% off still in raw form. That way, if your radioactive event detector is broken and sending all zeroes, you'll see it. (Stirring and hashing can make even a stream of zeroes look random. Keep in mind that the stirring will include the previous state.)
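The 90/10 routing described above can be sketched as follows. The ratio and the use of SHA-256 as the conditioning step are illustrative assumptions; a real pool would also mix in previous state, as noted above.

```python
import hashlib

def route(raw_blocks):
    """Send 9 of every 10 raw noise blocks to the conditioning hash;
    keep every tenth block raw so its health can be inspected directly."""
    h = hashlib.sha256()
    canary = []
    for i, block in enumerate(raw_blocks):
        if i % 10 == 9:
            canary.append(block)  # kept raw, for statistical checks
        else:
            h.update(block)       # conditioned into the output pool
    return h.digest(), canary

digest, canary = route([bytes([i]) * 4 for i in range(20)])
print(len(digest), len(canary))  # 32 2
```

The point of the split is that a dead detector sending all zeroes is obvious in the raw canary blocks, but invisible in the 32-byte digest.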
Your average TRNG may work by taking, say, images of a lava lamp (see http://lavarnd.org/index.html; note that they now take images of the noise from the camera). This provides correlated bits that are not entirely random. But if you take, say, a 10 KByte sample and hash it, 100 bytes taken from it will be "random" (in the sense that an outsider, without access to the camera, can have no way to say that a given 100 bytes is more likely than any other 100 bytes.)
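That 10 KByte-to-100-bytes conditioning step can be sketched with an extendable-output hash. SHAKE-256 here is my choice of extractor, not necessarily what lavarnd uses, and the usual caveat applies: hashing can't create entropy, so the output is only as good as the min-entropy actually present in the sample.

```python
import hashlib
import os

# Stand-in for a 10 KByte noisy camera sample (assumption: the real
# sample carries well over 800 bits of min-entropy).
sample = os.urandom(10 * 1024)

# Condition it down to 100 output bytes with an extendable-output hash.
out = hashlib.shake_256(sample).digest(100)
print(len(out))  # 100
```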
The problem is that if the camera died and started providing all zeroes, or zeroes alternating with ones, or something like that, the hashed output still would look pretty darn random. A while back, I made a TRNG by taking a small motor, attaching an (empty) peanut butter jar to the end of it, and starting it up with some dice in it. I pointed a cheap Webcam at it, and set it up to take the stream of image data and scramble/hash it in the above manner. (Note that I did this out of curiosity. I'm not currently doing anything for which I'd need that level of security anyway.)
However, I had it send nine out of ten images to the scrambling code, and the tenth to a plain old image file. As long as that image was of a peanut butter jar tumbling dice, I figured things were probably okay. If it was blank or maybe the motor had died and the dice weren't tumbling, I'd know I was getting non-random numbers. I won't swear to it that this is what the authors are contemplating, but it sounds like it.
In theory, you may be revealing some data if people can interpolate from images 1 and 11 to say something about images 2 through 10, but your sampling rate really ought to be low enough that consecutive images won't be correlated anyway.
I once tried a video entropy daemon. Thing was, it's hard to get the noise right even with whitening (try running your stuff through a chi-squared analysis, for example). I also have to wonder how fast such a rig can provide truly random data, which is important for an enterprise setting, especially one dealing with a lot of encrypted data or connections, which is why many have to rely on expensive dedicated devices. Thing is, such devices are difficult to get into to see if they're not malfunctioning, so the only thing you have to work with is the data stream. Doing a controlled sampling seems to be the best solution in that circumstance.
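The chi-squared analysis mentioned above can be run with nothing but the standard library - a byte-value histogram compared against a uniform distribution. This is a generic sketch of the test, not the commenter's actual setup; for 255 degrees of freedom a healthy stream should produce a statistic near 255, and values far above the ~311 critical point (1% level) suggest the stream isn't uniform.

```python
import os

def chi_squared_bytes(data: bytes) -> float:
    """Chi-squared statistic of the byte-value histogram against a
    uniform distribution (255 degrees of freedom)."""
    expected = len(data) / 256
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    return sum((c - expected) ** 2 / expected for c in counts)

print(chi_squared_bytes(os.urandom(1 << 16)))  # near 255 for a good stream
print(chi_squared_bytes(bytes(10_000)))        # enormous for a stuck source
```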