Nice one again
Had not heard of this one. Very interesting read. Looking forward to the rest of the series
You've employed Benford's Law to out fraudsters hidden in seemingly random numbers. Now what do you do if you need answers but some of your data is missing? Welcome to the German tank problem, the second in The Reg's guide to crafty techniques from the world of mathematics that can help you quickly solve niggling data problems. …
You may also enjoy "Cryptonomicon" by Neal Stephenson. Part of the plot follows a slightly fictionalised/alternate-history version of the allied code breakers (Bletchley Park, etc.) as they deal with exactly this kind of problem.
And also worry about the reverse issue of trying to counteract their own possible revelations of information back the other way. e.g. having worked out how many tanks per week the enemy is building, what if you inadvertently change your behavior in such a way that the enemy can work out what you have done? Then they might start deliberately messing with the serial numbers to spoil your analysis whilst suddenly increasing production...
If you liked this article, I can give you the web address of a website full of these things. This very article is an almost verbatim recreation of that site. Its name? Wikipedia. Check out the article on "German Tank Problem" over there and revel in awe at the similarities!
Information is everywhere but you won't see it if you don't know where to look or why you are looking. Wiki is full of information on almost anything... so having the information does not make anyone wiser. What matters is the pooling of such information in the relevant places for the right people.
Obviously the reg could just post a link, but we came to the reg to read the reg, which would lead me to assume people find reading the articles on here easy while the same is not necessarily true for the layout of wiki.
I first came across the serial number hack in Dr R. V. Jones' book 'Most Secret War' - a personal account of Britain's scientific and technological warfare during WWII and a worthwhile read.
This is (one of many reasons) why I read El Reg.
It's for this reason that every client gets their own invoice number when I (and many others I know) invoice them: AB001 where their company initials are AB. That way they don't know if I'm a) depending on them for an income and thus they can take the piss or b) doing lots of work on the side when they think they're getting all my time. Of course you can estimate the answers to both questions other ways (how much you see me for a start!) but depending on who you are and what I'm doing it can help a little, now that I'm back off PAYE and back on a random selection of projects (+ a PhD).
>It's for this reason that every client gets their own invoice number
Unfortunately, if you are VAT registered, HMRC like to see a nice sequence of invoice numbers without any gaps.
Because many IT billing systems can only handle simple sequential numbering, I have used the logic of the German Tank problem to estimate the number of customers and the net fluctuation various mobile phone operators have.
"Unfortunately, if you are VAT registered, HMRC like to see a nice sequence of invoice numbers without any gaps."
That's not entirely correct. Nice sequence, yes, but no problem having multiple sequences. What they don't like is gaps in any sequence.
Per project sequences for example are not at all uncommon. Per customer is nothing else than that.
Had a VAT audit not too long ago, and they didn't complain at all about the fact that I had several sequences (by project/customer type).
Apart from Invoice numbers, I have done this with Serial Numbers, without ever having seen a German tank. When we started serial numbering a product I chose 1234 as the start of the sequence. Useful article and good example of the value of The Reg.
>Had a VAT audit not too long ago, and they didn't complain at all about the fact that I had several sequences
Thanks for the prod to revisit the VAT invoice rules! Yes the requirements are for no gaps and no duplicates.
I take it you maintain a master register/ledger, so all invoices effectively have an internal unique sequence number and a published (unique) sequence number. This certainly would help when you need to double check a VAT return.
Is this not a bit like the 'remaining stock' on an Ebay item or an online shop?
When I ran an online shop site we only used to set this value to 10 or 20 then reset it once it hit zero.
As much to give the customers the feeling that stock of their item was not infinite, as to fool the competitors about our turnover!
So if I was looking for 25 items, or concerned about future availability, I would probably order from your competitor who was showing 300 available for immediate despatch. I would probably be prepared to pay a small premium for the convenience of a single order.
I'm pretty sure the tank used for the illustration is Russian...
Looks to me like an ISU-122. Not technically a tank, but a self-propelled gun ... but certainly Russian.
(This is not the first time that I've lamented that El Reg doesn't provide larger versions of the thumbnail images used in its lists of articles.)
However, like the statistical analysis in the article, it did play a part in reducing German tank numbers, so it isn't entirely irrelevant to the subject at hand.
oops. should have read the comments.
I was thinking at first it was an ISU-152, but you're right, the cannon doesn't seem to be that big, must be an ISU-122
I was wondering if I was going to be the first to notice and post :-) Not by a long shot.
Concur - ISU-122 SP assault gun
That's why I read the Reg comments..
You are all wrong. The vehicle shown is in fact a Bolo Mk XXIV Continental Siege Engine.
Understandable mistake, there being no figure in shot to give it scale (those roadwheels are, in fact, over thirty feet or about twelve metres in diameter), but if you know where to look you can make out the unit's name (Restartus) on the glacis underneath and to the right of the ball mantlet of the rail gun.
> This is not the first time that I've lamented that El Reg doesn't provide larger versions of the thumbnail images used in its lists of articles.
Yes, the ISU 122 built on a KV1 chassis. Strictly a tank destroyer. But found a niece use in the Battle for Berlin. The infantry would ask them to blow a hole through a row of houses so they could avoid the dangerous street.
I always knew I should be respectful to my siblings' daughters...
The Mk XXIV does have four tracks not two. So this must be a lower Mark...
Your competitors' websites can be a valuable hunting ground.
Yes and no. Say your competitor has accidentally leaked 0.1% of their records on their homepage, and you notice that by clever manipulation of the URL you can make it also reveal the other 99.9% (0.1% at a time), should you then go on to extract their entire database?
Common sense says that they have published this data, but the law commonly comes down on those who extract databases in this way - just ask weev.
That's not the same thing at all. URL manipulation is a form of hacking. It may only work in the presence of appallingly lax security (like accessing a system with a default user name and password), but it's still an attempt to access data you're not supposed to have access to. And the main point of law here that the 'free data' crowd tend to gloss over is that 'exposing' is not the same as 'publishing'.
With this method, one can demonstrate to any enquiry that all the source data was openly available (assuming you haven't been skimming numbers off other people's orders).
When you take data that is not available and make it available to people, it is called publishing.
If you accidentally publish and distribute 10,000 incorrect leaflets, it does not stop being publishing because it was a mistake.
I knew you wouldn't get it - in fact, I said as much.
When you break a law you don't understand you are still a criminal.
I get it perfectly well - the law says that when you accidentally give access to information to someone not authorized, you're not publishing the data, and when the unauthorized person accesses that data it is unauthorized access to a computer.
The law is a fucking ass. Putting something online is publishing, allowing someone access to data is authorizing them to access it. The law says that these things are not publishing nor authorization, and so the law is - obviously - wrong.
It does not matter that you did it accidentally - don't have bad processes.
It does not matter that the "someone" is an unidentified anonymous internet user - that is who you authorized to access it.
Businesses and courts don't like this because it made their lives difficult, so instead they made the law difficult. Much better to redefine what "published" and "authorized" mean in newspeak than to properly secure your data.
Anyway, the whole point of this was not about the vagaries of URL manipulation - TFA suggests you can infer information from your competitors, and indeed you very often can.
Just be wary when you realise you can extract a great deal of information from them and think about the legal implications before you fire up a script to capture all that lovely information - it might be illegal to retrieve the information they have "published" and "authorized" you to access, for the reasons listed above.
@Tom38: totally agreed with your reasoning. If as a company you don't want people to know things, don't put them unprotected on the internet. IMO, the members of the legal profession who ignore this (and interpret the law as done by your esteemed partner in this discussion) just don't get it or, as you indicate, probably don't want to get it.
Unfortunately, they get to make the rules...
So what you're saying is, I was absolutely right, you are in breach of the law when uncovering data that you were not intended to access; that inadvertently making data available does not constitute publishing it, but because you don't happen to like the law as it stands, you're still going to downvote me and try and claim some sort of moral superiority?
Re: "the whole point of this was not about the vagaries of URL manipulation" - err - yes it is, because that's exactly and entirely what your original post, and my rebuttal, was based on. We do in fact agree that the law, as it stands, does not support what you call 'common sense', but then it's not 'common sense' just because you say it is.
The real matter here is that inferring information from published data is perfectly legal. Extracting data by unauthorised means is not.
> I get it perfectly well
For values of 'well' very close to 'not at all'.
>I was absolutely right, you are in breach of the law when uncovering data that you were not intended to access
It depends on the meaning of "were not intended to access". I find it interesting just what you can get Google to uncover through a well constructed set of search terms: in my regular web searches on various aspects of IT, I keep encountering hoards of information which on investigation seem to be totally inaccessible via the normal html homepage. Given that Google uses 'signposts' erected by the website owner to uncover content, it puts a different spin on what is meant by "intent".
The crux of the article is that you can often extrapolate, fill-in-the-gaps, connect-the-dots, the data you actually want/need from other data someone has intentionally made public, or has exposed as part of a business flow (like the engine serial numbers). It is not about extracting that information straight from the source; URL manipulation would be akin to sending in a spy to take a peek at the weekly internal production reports from the tank factories. Sure, you can do that, but you risk your spy getting shot or your data sniffing being caught. Using maths and statistics on legally available incomplete data doesn't carry the risk of being hauled before the beak.
For this same argument applied to the human race, see the Carter catastrophe here.
Ah yes, the so called doomsday argument. I've read that almost everyone who comes across it sees a flaw in it. Apparently if you actually get down to actually putting serious study into it you end up changing your mind about what the flaws are. In essence there's a consensus that it's wrong but no one can agree on WHY it's wrong.
Which, frankly, is a bad sign for the future of the human race. Better have a beer now.
> I've read that almost everyone who comes across it sees a flaw in it. Apparently if you actually get down to actually putting serious study into it you end up changing your mind about what the flaws are.
Perhaps you read that in Randall Munroe's What-If? It's a nice discussion of the Doomsday Argument[1], and his phrasing is similar to yours.
> Which, frankly, is a bad sign for the future of the human race.
Maybe so (though I personally find myself unable to care about hypothetical long-term survival of the species), but it's a good sign for each of us as individuals, since by the same argument we're most likely not living in the End Times. So that's one fewer thing to worry about.
[1] In reference to Twitter and hypothetical web-page height, naturally.
> Perhaps you read that in Randall Munroe's What-If?
I couldn't remember where I'd read it, but if it's been in What-If then that's most likely it. I make pretty regular visits there. Math, physics and logic applied to answering silly questions, with Randall's brand of humor... what more could a nerd want?
The pic you use with this article... it appears to be an assault gun, not a tank. (No turret...) Further, it seems to be a _Soviet_ assault gun, possibly an ISU-122, the pic's not real clear and is from a funny angle.
Could y'all at least use a nice Jagdpanther?
On the far end, for big values of E it makes sense: you're increasing your estimate by a small amount, and the bigger the sample, the smaller the correction you make.
But for small samples you're increasing your estimate by some large factor (100% with one sample, 50% when E=2, about 33% when E=3) that does not seem very reasonable. Is there some lower bound for the number of samples?
You have a very large uncertainty when dealing with small sample sizes, so I find this entirely reasonable. Basically, if you have one sample, you just assume the one number you found is somewhere in the middle of the range. (The chances of finding one in the first or last quarter of the range is much smaller than finding one in the middle somewhere). Thus you double the number and call it a day. You'll only get any decent sort of estimate with larger sample sizes. I'd say at least E=5 is a lower limit for any sort of "accuracy", but that doesn't make guesstimates from smaller samples any less relevant, nor does it mean that estimates from larger samples are very accurate.
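For anyone who wants to play with the numbers, here is a minimal sketch in Python of the usual textbook estimator, N ≈ m + m/k - 1, where m is the largest serial seen and k is the number of samples (what the comments above call E). I'm assuming that is the formula the article uses, and that serials start at 1 and are sampled without replacement; the function name is my own.

```python
def estimate_total(serials):
    """Estimate the population size N from a sample of observed serial numbers.

    Assumes serials run 1..N and each item is equally likely to be observed.
    """
    k = len(serials)          # sample size (E in the comments above)
    if k == 0:
        raise ValueError("need at least one serial number")
    m = max(serials)          # largest serial observed
    # The m/k term is the correction: 100% of m for k=1, 50% for k=2, ~33% for k=3.
    return m + m / k - 1


print(estimate_total([19, 40, 42, 60]))  # 74.0
print(estimate_total([60]))              # 119.0
```

With a single serial the estimate is roughly double the number seen, and the correction shrinks as 1/k from there.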
There's no "lower bound" as such, but you want to have enough samples to be confident in your estimation of N.
You're looking for information on "Confidence Interval". Check out the best answer from the below page. It details how to find the confidence interval of the maximum likelihood estimators for "a" and "b", where a and b are the lower and upper bound of the distribution.
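A rough sketch of what such an interval looks like for the serial-number case, rather than the continuous a/b case. The assumptions are mine: serials run 1..N and are sampled without replacement, so the chance that the largest of k serials is no bigger than m is C(m, k) / C(N, k). The upper bound below is the largest N for which a maximum as small as m still has probability at least alpha; the function name and the default alpha are hypothetical.

```python
from math import comb


def upper_bound(m, k, alpha=0.05):
    """Largest N that is still plausible at level alpha, given max serial m from k samples."""
    if not 1 <= k <= m:
        raise ValueError("need 1 <= k <= m")
    n = m  # N can never be smaller than the largest serial seen
    # Keep widening while P(max <= m | N = n + 1) is still at least alpha.
    while comb(m, k) / comb(n + 1, k) >= alpha:
        n += 1
    return n


for k in (1, 2, 4, 10):
    print(k, upper_bound(60, k))
```

The interval is then m to upper_bound(m, k). It narrows quickly as k grows, which is the sense in which there is no hard lower bound on the sample size, only less and less confidence.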
"The chances of finding one in the first or last quarter of the range is much smaller than finding one in the middle somewhere"
If we're talking about stumbling upon a piece of data, why is it more likely for it to be from the centre rather than the tail, i.e. what, other than it occurring more frequently in nature, makes us assume a normal rather than uniform distribution? Would the tanks not be a uniform distribution?
Old tanks are more likely to have been converted to scrap already. New tanks are more likely to still be at the factory, en route or deployed to particularly strategic locations. Which means the general population is more likely to come from the middle segment. (Of course, if you start looking at tanks in those particular locations that just received a shipment of new tanks this might skew the data.) Overall it's just a decent assumption to take the number you have and double it if you have just a single sample. Once you get 2, you get slightly more confidence, etc.
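For what it's worth, the doubling rule does not seem to need the serial to be more likely to come from the middle. A quick check in Python, assuming a single serial drawn uniformly from 1..N (my assumption, not something the article states): averaged over every possible serial, 2m - 1 comes out at exactly N, even though only half the serials sit in the middle two quarters. The function names are my own.

```python
from fractions import Fraction


def mean_doubling_estimate(n):
    """Average of the one-sample estimate 2*m - 1 over serials m = 1..n, all equally likely."""
    return Fraction(sum(2 * m - 1 for m in range(1, n + 1)), n)


def share_in_middle_half(n):
    """Fraction of serials 1..n that fall in the middle two quarters of the range."""
    lo, hi = n / 4, 3 * n / 4
    inside = sum(1 for m in range(1, n + 1) if lo < m <= hi)
    return Fraction(inside, n)


print(mean_doubling_estimate(1000))  # 1000
print(share_in_middle_half(1000))    # 1/2
```

So under a plain uniform assumption, doubling is right on average; any single guess can still be a long way off.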
1. Many companies' products' serial numbers incorporate production year and/or month. [Not necessarily known by the researcher] so 1204**** may, just may, relate to April (20)12 production.
2. Honda 450: 1965 - 1968 serial numbers began CB450-1000001; but 1968 - 1969 serial numbers began CB450-3000001. Squaddies searching blown-up bits might find 1000321 & 3000198 but not know the year. If they were tank numbers, bulk orders for white flags ensues.
In the first example, if you get a decent sample of serial numbers production date incrementation becomes quite obvious.
In the second example, if you have a sampling from the CB450-1xxxxxx range and a sampling from the 3xxxxxx range, it'll become obvious they come from 2 different series. If you have no data at all from an intervening 2xxxxxx range, you can assume that range doesn't exist.
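A rough sketch of that second case in Python: group the serials by their leading block, then estimate each series on its own with the usual m + m/k - 1 formula. The block size of 1,000,000 is taken from the Honda numbering above; treating each block as counting up from 1, and the helper name, are my assumptions.

```python
from collections import defaultdict


def estimate_per_series(serials, block=1_000_000):
    """Split serials into series by leading block and estimate each series size."""
    groups = defaultdict(list)
    for s in serials:
        # 1000321 -> series 1, position 321 within that series
        groups[s // block].append(s % block)
    estimates = {}
    for prefix, offsets in sorted(groups.items()):
        m, k = max(offsets), len(offsets)
        estimates[prefix] = m + m / k - 1
    return estimates


print(estimate_per_series([1000321, 3000198]))  # {1: 641.0, 3: 395.0}
```

Lumping both serials into one series would instead suggest millions of units, which is where the bulk order for white flags comes from.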
Maybe the 2xxxxxx range is reserved for the stealth motorbikes.
1930s Aston Martin chassis numbers started with a letter for the month production started (A for Jan, B for Feb etc.), then a single digit for the year of the decade (2 for 1932, 3 for 1933 etc.), followed by the actual chassis number for the model and a suffix letter for type. So, e.g., C2/201/S is short chassis 201, built in March 1932. G7/722/L would be long chassis 722, laid down in July 1937. However, they don't seem to have built the chassis in order - so we have H7/717, C7/719 (5 months earlier), B40/720 (3 years later), F9/721 (back a year), A9/722, A7/730, B7/736 (finally moving forward again) and so on. Even if you spot the lack of repetition of 3 digit numbers, you'd still be caught out as there were frequent jumps to the next hundred when a new model came out/new owner bought the company. (The earlier cars up to number 74 were much easier; S for sports and T for Touring with no indication of date. Apart from MS1, a polished chassis built specially for motor show display. Subsequently though the number 273 appeared on at least 3 sets of records, but does not appear to have ever actually left the factory.)
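If you wanted to untangle a pile of these, a small parser helps. This is only a sketch in Python of the scheme as described above; the regular expression, the function name, and reading a two-digit year such as "40" as 1940 are my assumptions.

```python
import re

# month letter, 1-2 year digits, "/", chassis number, optional "/" + type suffix
_CHASSIS = re.compile(r"^([A-L])(\d{1,2})/(\d+)(?:/([A-Z]))?$")


def parse_chassis(code):
    """Return (year, month, chassis_number, suffix) for a code like 'C2/201/S'."""
    match = _CHASSIS.match(code)
    if match is None:
        raise ValueError(f"unrecognised chassis code: {code!r}")
    month_letter, year_digits, number, suffix = match.groups()
    month = ord(month_letter) - ord("A") + 1      # A = January ... L = December
    if len(year_digits) == 1:
        year = 1930 + int(year_digits)            # single digit: year of the 1930s
    else:
        year = 1900 + int(year_digits)            # assumption: "40" means 1940
    return year, month, int(number), suffix


print(parse_chassis("C2/201/S"))  # (1932, 3, 201, 'S')
print(parse_chassis("B40/720"))   # (1940, 2, 720, None)
```

Sorting the parsed codes by chassis number and then looking at the dates makes the out-of-order building easy to spot.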
Of course similar maths is used in biology too - estimating species etc.
But an interesting twist is the knock-on effect. The component supply chain (in biology, the food chain) is also part of the analysis.
You should try it on the Tube some day, spotting the "missing link" as we call it....
Reminds me of France declaring Mendeleev (the periodic table guy) persona non grata 100+ years ago. He guessed correctly that their super-secret advanced smokeless gunpowder was indeed trinitrocellulose by counting the railway wagons with cotton, sulphuric acid and potash going into the plants.
He also quite correctly predicted that it would all end in tears (due to degradation of highly nitrated and non-inhibited cellulose over time). And indeed it did: http://en.wikipedia.org/wiki/French_battleship_Libert%C3%A9