Achtung! Use maths to smash the German tank problem – and your rival

You've employed Benford's Law to out fraudsters hidden in seemingly random numbers. Now what do you do if you need answers but some of your data is missing? Welcome to the German tank problem, the second in The Reg's guide to crafty techniques from the world of mathematics that can help you quickly solve niggling data problems …

COMMENTS

This topic is closed for new posts.
  1. Michael H.F. Wilkinson Silver badge
    Thumb Up

    Nice one again

    Had not heard of this one. Very interesting read. Looking forward to the rest of the series

    1. AdamT

      Re: Nice one again

      You may also enjoy "Cryptonomicon" by Neal Stephenson. Part of the plot follows a slightly fictionalised/alternate-history version of the allied code breakers (Bletchley Park, etc.) as they deal with exactly this kind of problem.

      They also worry about the reverse problem: not revealing information back the other way. For example, having worked out how many tanks per week the enemy is building, what if you inadvertently change your behavior in such a way that the enemy can work out what you have done? Then they might start deliberately messing with the serial numbers to spoil your analysis whilst suddenly increasing production...

    2. DanDanDan

      Re: Nice one again

      If you liked this article, I can give you the web address of a website full of these things. This very article is an almost verbatim recreation of that site. Its name? Wikipedia. Check out the article on "German Tank Problem" over there and revel in awe at the similarities!

      1. codejunky Silver badge

        Re: Nice one again

        @ DanDanDan

        Information is everywhere but you won't see it if you don't know where to look or why you are looking. Wiki is full of information on almost anything... so having the information does not make anyone wiser. It is the pooling of such information to the relevant places for the right people.

        Obviously the reg could just post a link, but we came to the reg to read the reg which would lead me to assume people find reading the articles on here is easy while the same is not necessarily true for the layout of wiki.

        1. Fink-Nottle

          Re: Nice one again

          I first came across the serial number hack in Dr R. V. Jones' book 'Most Secret War' - a personal account of Britain's scientific and technological warfare during WWII and a worthwhile read.

  2. Zog_but_not_the_first
    Thumb Up

    Brilliant!

    This is (one of many reasons) why I read El Reg.

  3. theModge

    Invoice numbers

    It's for this reason that every client gets their own invoice number when I (and many others I know) invoice them: AB001 where their company initials are AB. That way they don't know if I'm a) depending on them for an income and thus they can take the piss or b) doing lots of work on the side when they think they're getting all my time. Of course you can estimate the answers to both questions other ways (how much you see me for a start!) but depending on who you are and what I'm doing it can help a little, now I'm back off PAYE and back on a random selection of projects (+ a PhD)

    1. Roland6 Silver badge

      Re: Invoice numbers

      >It's for this reason that every client gets their own invoice number

      Unfortunately, if you are VAT registered, HMRC like to see a nice sequence of invoice numbers without any gaps.

      Because many IT billing systems can only handle simple sequential numbering, I have used the logic of the German Tank problem to estimate the number of customers and the net fluctuation various mobile phone operators have.
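
As a rough illustration of the approach described here, the sketch below applies the usual German-tank estimate to sequential account numbers spotted in two periods. The account numbers, the quarter labels and the function name are all invented for the example, and the result only approximates how many numbers were issued; it says nothing about customers who left, so net churn would need other evidence.

```python
def estimate_highest_issued(observed_numbers):
    """German-tank-style estimate of the highest number issued so far.

    Uses largest * (1 + 1/count) - 1, assuming numbers start at 1,
    are issued sequentially, and were observed more or less at random.
    """
    largest = max(observed_numbers)
    count = len(observed_numbers)
    return largest * (1 + 1 / count) - 1


# Hypothetical account numbers spotted in two different quarters.
seen_in_q1 = [10412, 38177, 51960, 77205]
seen_in_q2 = [12044, 45310, 69921, 88450, 93117]

q1_total = estimate_highest_issued(seen_in_q1)
q2_total = estimate_highest_issued(seen_in_q2)

print(f"Q1 estimate of numbers issued: {q1_total:.0f}")
print(f"Q2 estimate of numbers issued: {q2_total:.0f}")
print(f"Estimated numbers issued between the two quarters: {q2_total - q1_total:.0f}")
```

With only four or five sightings per period each estimate carries a wide margin of error, so the difference between them is best read as an order-of-magnitude figure.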

      1. Anonymous Coward

        Re: Invoice numbers

        "Unfortunately, if you are VAT registered, HMRC like to see a nice sequence of invoice numbers without any gaps."

        That's not entirely correct. Nice sequence, yes, but no problem having multiple sequences. What they don't like is gaps in any sequence.

        Per project sequences for example are not at all uncommon. Per customer is nothing else than that.

        Had a VAT audit not too long ago, and they didn't complain at all about the fact that I had several sequences (by project/customer type).

        1. Roland6 Silver badge

          Re: Invoice numbers

          >Had a VAT audit not too long ago, and they didn't complain at all about the fact that I had several sequences

          Thanks for the prod to revisit the VAT invoice rules! Yes the requirements are for no gaps and no duplicates.

          I take it you maintain a master register/ledger, so all invoices effectively have an internal unique sequence number and a published (unique) sequence number. This certainly would help when you need to double check a VAT return.

    2. oldhand

      Re: Invoice numbers

      Apart from Invoice numbers, I have done this with Serial Numbers, without ever having seen a German tank. When we started serial numbering a product I chose 1234 as the start of the sequence. Useful article and good example of the value of The Reg.

  4. Nifty Silver badge

    Is this not a bit like the 'remaining stock' on an Ebay item or an online shop?

    When I ran an online shop site we only used to set this value to 10 or 20 then reset it once it hit zero.

    As much to give the customers the feeling that stock of their item was not infinite, as to fool the competitors about our turnover!

    1. vagabondo

      stock level

      So if I was looking for 25 items, or concerned about future availability, I would probably order from your competitor who was showing 300 available for immediate despatch. I would probably be prepared to pay a small premium for the convenience of a single order.

  5. This post has been deleted by its author

  6. joeldillon

    I'm pretty sure the tank used for the illustration is Russian...

    1. dajames
      Holmes

      The tank used for the illustration ...

      Looks to me like an ISU-122. Not technically a tank, but a self-propelled gun ... but certainly Russian.

      http://en.wikipedia.org/wiki/ISU-122

      (This is not the first time that I've lamented that El Reg doesn't provide larger versions of the thumbnail images used in its lists of articles.)

      However, like the statistical analysis in the article, it did play a part in reducing German tank numbers, so it isn't entirely irrelevant to the subject at hand.

      1. Potemkine Silver badge

        Re: The tank used for the illustration ...

        I was thinking at first it was an ISU-152, but you're right, the cannon doesn't seem to be that big, must be an ISU-122

      2. Paul_Murphy

        Re: The tank used for the illustration ...

        I was wondering if I were going to be the first to notice and post :-) not by a long shot,

        Concur - ISU122 sp assault gun

        That's why I read the Reg comments..

        1. Stevie

          Re: The tank used for the illustration ...

          You are all wrong. The vehicle shown is in fact a Bolo Mk XXIV Continental Siege Engine.

          Understandable mistake, there being no figure in shot to give it scale (those roadwheels are, in fact, over thirty feet or about twelve metres in diameter), but if you know where to look you can make out the unit's name (Restartus) on the glacis underneath and to the right of the ball mantlet of the rail gun.

          1. mmeier

            Re: The tank used for the illustration ...

            The Mk XXIV does have four tracks not two. So this must be a lower Mark...

        2. Getriebe

          Re: The tank used for the illustration ...

          Yes, the ISU 122 built on a KV1 chassis. Strictly a tank destroyer. But found a niece use in the Battle for Berlin. The infantry would ask them to blow a hole through a row of houses so they could avoid the dangerous street.

          1. Ian 55

            Re: The tank used for the illustration ...

            I always knew I should be respectful to my siblings' daughters...

      3. as2003

        Re: The tank used for the illustration ...

        > This is not the first time that I've lamented that El Reg doesn't provide larger versions of the thumbnail images used in its lists of articles.

        Here's the original

    2. James O'Shea

      oops. should have read the comments.

  7. Tom 38

    Danger!

    Your competitors' websites can be a valuable hunting ground.

    Yes and no. Say your competitor has accidentally leaked 0.1% of their records on their homepage, and you notice that by clever manipulation of the URL you can make it also reveal the other 99.9% (0.1% at a time), should you then go on to extract their entire database?

    Common sense says that they have published this data, but the law commonly comes down on those who extract databases in this way - just ask weev.

    1. Anonymous Coward
      FAIL

      Re: Danger!

      That's not the same thing at all. URL manipulation is a form of hacking. It may only work in the presence of appallingly lax security (like accessing a system with a default user name and password), but it's still an attempt to access data you're not supposed to have access to. And the main point of law here that the 'free data' crowd tend to gloss over is that 'exposing' is not the same as 'publishing'.

      With this method, one can demonstrate to any enquiry that all the source data was openly available (assuming you haven't been skimming numbers off other people's orders).

      1. Tom 38

        Re: Danger!

        When you take data that is not available and make it available to people, it is called publishing.

        If you accidentally publish and distribute 10,000 incorrect leaflets, it does not stop being publishing because it was a mistake.

        1. Anonymous Coward
          FAIL

          Re: Danger!

          I knew you wouldn't get it - in fact, I said as much.

          When you break a law you don't understand you are still a criminal.

          1. Tom 38

            Re: Danger!

            I get it perfectly well - the law says that when you accidentally give access to information to someone not authorized, you're not publishing the data, and when the unauthorized person accesses that data it is unauthorized access to a computer.

            The law is a fucking ass. Putting something online is publishing, allowing someone access to data is authorizing them to access it. The law says that these things are not publishing nor authorization, and so the law is - obviously - wrong.

            It does not matter that you did it accidentally - don't have bad processes.

            It does not matter that the "someone" is an unidentified anonymous internet user - that is who you authorized to access it.

            Businesses and courts don't like this because it made their lives difficult, so instead they made the law difficult. Much better to redefine what "published" and "authorized" mean in newspeak than to properly secure your data.

            Anyway, the whole point of this was not about the vagaries of URL manipulation - TFA suggests you can infer information from your competitors, and indeed you very often can.

            Just be wary when you realise you can extract a great deal of information from them and think about the legal implications before you fire up a script to capture all that lovely information - it might be illegal to retrieve the information they have "published" and "authorized" you to access, for the reasons listed above.

            1. Anonymous Dutch Coward

              Re: Danger!

              @Tom38: totally agreed with your reasoning. If as a company you don't want people to know things, don't put them unprotected on the internet. IMO, the members of the legal profession who ignore this (and interpret the law as done by your esteemed partner in this discussion) just don't get it or, as you indicate, probably don't want to get it.

              Unfortunately, they get to make the rules...

              1. Anonymous Coward
                WTF?

                Re: Danger!

                So what you're saying is, I was absolutely right, you are in breach of the law when uncovering data that you were not intended to access; that inadvertently making data available does not constitute publishing it, but because you don't happen to like the law as it stands, you're still going to downvote me and try and claim some sort of moral superiority?

                Re: "the whole point of this was not about the vagaries of URL manipulation" - err - yes it is, because that's exactly and entirely what your original post, and my rebuttal, was based on. We do in fact agree that the law, as it stands, does not support what you call 'common sense', but then it's not 'common sense' just because you say it is.

                The real matter here is that inferring information from published data is perfectly legal. Extracting data by unauthorised means is not.

                1. Roland6 Silver badge

                  Re: Danger!

                  >I was absolutely right, you are in breach of the law when uncovering data that you were not intended to access

                  It depends on the meaning of "were not intended to access". I find it interesting just what you can get Google to uncover through a well constructed set of search terms: In my regular web searches on various aspects of IT, I keep encountering hoards of information which on investigation seem to be totally inaccessible via the normal html homepage. Given that Google uses 'signposts' erected by the website owner to uncover content, it puts a different spin on what is meant by "intent".

            2. Anonymous Coward

              Re: Danger!

              > I get it perfectly well

              For values of 'well' very close to 'not at all'.

            3. Stoneshop
              Boffin

              Re: Danger!

              The crux of the article is that you can often extrapolate, fill-in-the-gaps, connect-the-dots, the data you actually want/need from other data someone has intentionally made public, or has exposed as part of a business flow (like the engine serial numbers). It is not about extracting that information straight from the source; URL manipulation would be akin to sending in a spy to take a peek at the weekly internal production reports from the tank factories. Sure, you can do that, but you risk your spy getting shot or your data sniffing being caught. Using maths and statistics on legally available incomplete data doesn't carry the risk of being hauled before the beak.

  8. DavCrav

    Also we are all going to die

    For this same argument applied to the human race, see the Carter catastrophe here.

    1. sisk
      Pint

      Re: Also we are all going to die

      Ah yes, the so-called doomsday argument. I've read that almost everyone who comes across it sees a flaw in it. Apparently if you actually get down to putting serious study into it you end up changing your mind about what the flaws are. In essence there's a consensus that it's wrong but no one can agree on WHY it's wrong.

      Which, frankly, is a bad sign for the future of the human race. Better have a beer now.

      1. Michael Wojcik Silver badge

        Re: Also we are all going to die

        I've read that almost everyone who comes across it sees a flaw in it. Apparently if you actually get down to actually putting serious study into it you end up changing your mind about what the flaws are.

        Perhaps you read that in Randall Munroe's What-If? It's a nice discussion of the Doomsday Argument [1], and his phrasing is similar to yours.

        Which, frankly, is a bad sign for the future of the human race.

        Maybe so (though I personally find myself unable to care about hypothetical long-term survival of the species), but it's a good sign for each of us as individuals, since by the same argument we're most likely not living in the End Times. So that's one fewer thing to worry about.

        [1] In reference to Twitter and hypothetical web-page height, naturally.

        1. sisk

          Re: Also we are all going to die

          Perhaps you read that in Randall Munroe's What-If?

          I couldn't remember where I'd read it, but if it's been in What-If then that's most likely it. I make pretty regular visits there. Math, physics and logic applied to answer silly questions with Randall's brand of humor....what more could a nerd want?

  9. James O'Shea

    err...

    The pic you use with this article... it appears to be an assault gun, not a tank. (No turret...) Further, it seems to be a _Soviet_ assault gun, possibly an ISU-122, the pic's not real clear and is from a funny angle.

    Could y'all at least use a nice Jagdpanther?

  10. Anonymous Coward

    1+1/E seems dubious for low values of E

    On the far end, for big values of E it makes sense, you're increasing your estimation by a small amount, the higher the sample you get the smaller the correction you make.

    But for small samples you're increasing your estimate by a large factor (100% with one sample, 50% when E=2, about 33% when E=3), which does not seem very reasonable. Is there some lower bound for the number of samples?
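
One way to probe this question is to average the estimate over every possible sample from a small, known population. The sketch below assumes the textbook form of the estimator, m(1 + 1/E) - 1, where m is the largest serial seen; the trailing -1, the population of 20 and the function names are assumptions made for the illustration rather than anything taken from the article.

```python
from fractions import Fraction
from itertools import combinations


def estimate_total(serials):
    """Estimate the largest serial issued as m * (1 + 1/E) - 1.

    m is the largest serial seen and E is the number of serials seen.
    """
    m, e = max(serials), len(serials)
    return Fraction(m * (e + 1), e) - 1


def exact_mean_and_spread(n_true, e):
    """Average the estimate over every possible sample of size e from 1..n_true.

    Returns the exact mean (a Fraction) and the standard deviation (a float).
    """
    estimates = [estimate_total(s) for s in combinations(range(1, n_true + 1), e)]
    mean = sum(estimates) / len(estimates)
    variance = sum((x - mean) ** 2 for x in estimates) / len(estimates)
    return mean, float(variance) ** 0.5


if __name__ == "__main__":
    # Hypothetical true population of 20 serials, small enough to enumerate.
    for e in (1, 2, 3, 5):
        mean, spread = exact_mean_and_spread(20, e)
        print(f"E={e}: mean estimate = {mean}, standard deviation = {spread:.1f}")
```

Under these assumptions the estimate averages out to the true total for every E, including E=1; what changes with small samples is the spread, which is very wide for one or two serials and narrows as E grows.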

    1. imanidiot Silver badge

      Re: 1+1/E seems dubious for low values of E

      You have a very large uncertainty when dealing with small sample sizes, so I find this entirely reasonable. Basically, if you have one sample, you just assume the one number you found is somewhere in the middle of the range. (The chances of finding one in the first or last quarter of the range is much smaller than finding one in the middle somewhere). Thus you double the number and call it a day. You'll only get any decent sort of estimate with larger sample sizes. I'd say at least E=5 to be a lower limit for any sort of "accuracy", but that doesn't make lower-sample guesstimates any less relevant, nor does it mean higher sample sizes are very accurate.

      1. Mark 65

        Re: 1+1/E seems dubious for low values of E

        "The chances of finding one in the first or last quarter of the range is much smaller than finding one in the middle somewhere"

        If we're talking about stumbling upon a piece of data, why is it more likely for it to be from the centre rather than the tail i.e. what, other than it occurring more frequently in nature, makes us assume a normal rather than uniform distribution? Would the tanks not be a uniform distribution?

        Just curious.

        1. imanidiot Silver badge

          Re: 1+1/E seems dubious for low values of E

          Old tanks are more likely to have been converted to scrap already. New tanks are more likely to still be at the factory, en route or deployed to particularly strategic locations. Which means the general population is more likely to come from the middle segment. (Of course, if you start looking at tanks in those particular locations that just received a shipment of new tanks this might skew the data) Overall it's just a decent assumption to take the number you have and double it if you have just a single sample. Once you get 2, you get slightly more confidence, etc.
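
On the uniform-versus-middle question raised in this sub-thread, a short exact calculation may help. The sketch below assumes every serial from 1 to N is equally likely to be the one you find (the uniform case asked about above); the value N = 1000 and the function names are made up for the example.

```python
from fractions import Fraction


def quarter_probabilities(n_true):
    """Probability that one uniformly drawn serial from 1..n_true lands in each quarter."""
    counts = [0, 0, 0, 0]
    for serial in range(1, n_true + 1):
        quarter = min(3, (serial - 1) * 4 // n_true)
        counts[quarter] += 1
    return [Fraction(c, n_true) for c in counts]


def expected_doubling_estimate(n_true):
    """Average of 2*m - 1 over every possible single serial m in 1..n_true."""
    return Fraction(sum(2 * m - 1 for m in range(1, n_true + 1)), n_true)


if __name__ == "__main__":
    print(quarter_probabilities(1000))
    print(expected_doubling_estimate(1000))
```

Under the uniform assumption each quarter of the range is equally likely, so the middle is not favoured. Doubling a single serial still works on average because the expected position of one uniform draw is the midpoint, not because central serials turn up more often; if scrapping or deployment really does thin out the ends of the range, that is a separate effect the simple formula does not model.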

    2. DanDanDan

      Re: 1+1/E seems dubious for low values of E

      There's no "lower bound" as such, but you want to have enough samples to be confident in your estimation of N.

      You're looking for information on "Confidence Interval". Check out the best answer on the page below. It details how to find the confidence interval of the maximum likelihood estimators for "a" and "b", where a and b are the lower and upper bound of the distribution.

      http://stats.stackexchange.com/questions/20158/determining-sample-size-for-uniform-distribution
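
For the serial-number case specifically, a one-sided upper bound can be computed directly. The sketch below is not the method from the linked answer; it assumes serials run from 1 to N, that the E serials seen are distinct and equally likely, and it uses the fact that the chance of the largest of them being no more than m is C(m, E) / C(N, E). The function name and the 5% level are choices made for the illustration.

```python
from fractions import Fraction
from math import comb


def upper_confidence_bound(largest_seen, sample_size, alpha=Fraction(1, 20)):
    """Largest N for which a maximum this small is still plausible.

    For sample_size distinct serials drawn uniformly from 1..N,
    P(max <= largest_seen) = C(largest_seen, k) / C(N, k).
    Returns the largest N where that probability is at least alpha.
    """
    if sample_size < 1 or largest_seen < sample_size:
        raise ValueError("need at least one serial, and largest_seen >= sample_size")
    ways_observed = comb(largest_seen, sample_size)
    n = largest_seen
    # Step N upward until the observed maximum would be too unlikely.
    while Fraction(ways_observed, comb(n + 1, sample_size)) >= alpha:
        n += 1
    return n


if __name__ == "__main__":
    # Hypothetical case: the largest serial seen is 100.
    for e in (1, 2, 5, 10):
        bound = upper_confidence_bound(100, e)
        print(f"E={e}: N is between 100 and {bound} at the 95% level")
```

The lower end of the interval is simply the largest serial seen, since N cannot be smaller than that. The upper end shrinks quickly as E grows, which is one way of putting a number on "enough samples".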

  11. Infury8r

    1. Many companies' products' serial numbers incorporate production year and/or month. [Not necessarily known by the researcher] so 1204**** may, just may, relate to April (20)12 production.

    2. Honda 450: 1965 - 1968 serial numbers began CB450-1000001; but 1968 - 1969 serial numbers began CB450-3000001. Squaddies searching blown-up bits might find 1000321 & 3000198 but not know the year. If they were tank numbers, bulk orders for white flags would ensue.

    1. imanidiot Silver badge

      Not necessarily

      In the first example, if you get a decent sample of serial numbers production date incrementation becomes quite obvious.

      In the second example, if you have a sampling from the CB450-1xxxxxx range and a sampling from the 3xxxxxx range, it'll become obvious they come from 2 different series. Missing any data from an intervening 2xxxxxx range means you can assume that range doesn't exist.

      1. phuzz Silver badge
        Joke

        Re: Not necessarily

        Maybe the 2xxxxxx range is reserved for the stealth motorbikes.

    2. J P

      1930's Aston Martins started with the month production started as a letter (A for Jan, B for Feb etc) then a single digit for the year of the decade (2 for 1932, 3 for 1933 etc) followed by the actual chassis number for the model, and a suffix letter for type. So eg C2/201/S is short chassis 201, built in March 1932. G7/722/L would be long chassis 722, laid down in July 1937. However, they don't seem to have built the chassis in order - so we have H7/717, C7/719 (5 months earlier) B40/720, (3 years later) F9/721, (back a year) A9/722, A7/730 B7/736 (finally moving forward again) and so on. Even if you spot the lack of repetition of 3 digit numbers, you'd still be caught out as there were frequent jumps to the next hundred when a new model came out/new owner bought the company. (The earlier cars up to number 74 were much easier; S for sports and T for Touring with no indication of date. Apart from MS1, a polished chassis built specially for motor show display. Subsequently though the number 273 appeared on at least 3 sets of records, but does not appear to have ever actually left the factory.)

  12. phil dude
    Joke

    biology....

    of course similar maths is used in biology too - estimating species etc.

    But an interesting twist is the knock on effect. The component supply train (in biology the food chain) is also part of the analysis.

    You should try it on the Tube some day, spotting the "missing link" as we call it....

    P.

  13. Voland's right hand Silver badge

    Not just german and not just math

    Reminds me of France declaring Mendeleev (the periodic table guy) persona non-grata 100+ years ago. He guessed correctly that their super-secret advanced smokeless gunpowder is indeed trinitrocellulose by counting the railway wagons with cotton, sulphuric acid and potash going into the plants.

    He also quite correctly predicted that it will all end up in tears (due to degradation of highly nitrated and non-inhibited cellulose over time). And indeed it did: http://en.wikipedia.org/wiki/French_battleship_Libert%C3%A9

    1. Michael Wojcik Silver badge

      Re: Not just german and not just math

      Yup. Just one of many historical examples of a side-channel (aka "covert channel") attack.

      When I were a lad, I read a novel in the Danny Dunn series where Danny and friends learn to heuristically interpret product codes and the like. I thought it was great fun, though the plot mostly revolves around mundane exploits like finding expired product on the shelves of a local store. Gave me a lasting appreciation of side channels.

      These days they're well-known in computer cryptanalysis for things like the timing and power attacks described by Kocher and others. (Kocher's original timing attack against RSA and other systems was particularly important as it demonstrated using these side channels to break security was feasible; blinding is now considered a requirement for timing- and power-sensitive algorithms.)

  14. Anonymous Coward

    Bag tag numbers are not allocated sequentially for security reasons.

    +++++

    When I first went to USA and opened a bank account with a cheque book, a colleague advised me to rip out the first 50 or so cheques. The reason - he said retailers would be suspicious of a low serial number cheque, indicating a.... recently opened cheque account.

    1. Gene Cash Silver badge

      retailers would be suspicious of a low serial number cheque

      Yeah, I got that reaction once. I gave her the gimlet eye and said "yeah it is, what of it?" and let the awkward pause continue very uncomfortably. She finally rang it up and never made eye contact again.

      Of course, I'm also antisocial enough that when ATMs were invented, I immediately thought "oh! I no longer have to deal with making inane small talk with dumbass condescending bank tellers!! thank ******ing god!"

  15. John Smith 19 Gold badge
    Unhappy

    Please understand it's about interpretation of information that is *legally* visible.

    It's a technique of inference.

    Like the old CIA section that dealt with "crateology," the study of shipping containers.

    Keep in mind that incrementing counts are used because they are cheap to track and to generate ("Is SN 37256 one of ours?" Well if the SN counter is up to 40000 probably).

    May be important, may not be

  16. fnj

    Claim is wrong

    >Intelligence sources suggested the production was about 1,000 to 1,500 tanks per month. In practice it was nowhere near that high

    Bull. Germany produced 18,956 tanks in 1944 alone - that is an average of 1580 per month.

    1. JLV

      Re: Claim is wrong

      Correct if you are looking at the entire tank line production. However if you read the Wikipedia article mentioned higher up, you'll see that the investigation, as described, concerned one particular tank model.

      Which kind of makes sense intuitively, because serial numbers would not necessarily carry across model lines.

      But kudos for applying a bit of sanity-checking to a numerical claim. People often have no grasp of numbers, take things at face value and fail to see gaping holes in the most stupid claims.

      Awesome article! Encore! Encore!

  17. A.P.Richelieu

    A friend working at Intel once (1998) told me that "tomorrow there would be a press release in all the morning newspapers", so I told him the subject of the press release.

    He looked a little surprised, and asked why I believed this.

    So I told him that it had to be something really special for the morning newspapers to pick it up, and unless Intel was doing something completely different from what they were doing, the only thing that would make the morning newspapers would be a highly integrated x86 processor.

    Intel releasing a 2 x size flash memory would not be interesting to the general public, and would only make it into Electronics Magazines.

    True enough, the next day the i386SL was released.

  18. Britt Johnston

    Each domain is relevant

    This used to be a big question in the mainframe age.

    I maintained material numbers between 100000 - 153000, of which those below 125000 were migrated from two 20th C. systems.

    Then a new application split the materials by type and gave each a domain - from 700000 for sales products, 800000 for raw materials, 600000 for manufactured parts, etc., keeping the old "mixed" ones as legacy.

    An external customer sees numbers from 100000 to 730000 on his shipping papers, and has less overview of what is going on.
