How a tiny leap-day miscalculation trashed Microsoft Azure

As soon as Microsoft's cloudy platform Azure crashed to Earth, and stayed there for eight hours, on 29 February, every developer who has ever had to handle dates immediately figured it was a leap-day bug. Now the software biz behemoth has put its hands up and admitted in a detailed dissection of the blunder how a calendar …

COMMENTS

This topic is closed for new posts.

  1. Bob Vistakin
    FAIL

    Even a £5 watch gets the leap year right!

    Azure. Clippy. Metro. Bob. Vista. Kin.

    1. Anonymous Coward
      Anonymous Coward

      Re: Even a £5 watch gets the leap year right!

      Wow, Bob, you've changed your tune; you usually love MS... oh hang on...

      This is a massive balls up, but it's one that deals with cryptographically signed certificates which are used to secure an information channel between distributed virtual and physical systems. This is not a £5 digital watch, of which BTW many got the 2000 leap year wrong.

      1. ElReg!comments!Pierre
        WTF?

        @AC

        I don't care who made the mistake or who is commenting on it: it is a pretty huge blunder, and a stupid one, too. It's one of the things that could have been, and should have been, predicted and avoided. From one of the biggest software vendors in the world, it does look pretty amateurish.

        1. Anonymous Coward
          Anonymous Coward

          Re: @AC

          Pierre, did you miss the bit where I said "This is a massive balls up"?

          The point I was making is that while this is a massive balls up, it is not as simple as was suggested by the OP, who is a perennial MS basher.

          Time is complex, especially on globally distributed systems. Yes, it should have been caught, but a lot of people think time is simple, and it isn't.

          1. Bob Vistakin
            Facepalm

            Re: @AC

            Watching shills squirm and blame everyone but themselves after being caught red-handed is a marvel to behold. Why, it's almost as entertaining as seeing them get caught using Google search results in Bing and then deny it: http://goo.gl/Bi0JH

            1. Anonymous Coward
              Anonymous Coward

              Re: @AC

              Again with the reading comprehension skills:

              This was a massive balls up, but it is significantly more complex than you initially made it out to be. That hardly makes me a shill; you, however, seem to be attacking a big straw man.

              1. Richard 12 Silver badge
                Mushroom

                Re: @AC

                No it wasn't.

                It was a total and utter **** up that is only possible if you genuinely have no idea what you are doing.

                The reason is simple: This failure is only possible if you're processing the date as three independent numbers.

                Listen very carefully Microsoft, I will scream this into your ear only once:

                DATES ARE NOT THREE NUMBERS.

                DATES ARE NOT TEXT.

                A datetime is a number of intervals after an epoch. Never anything else.

                Feel free to pick your interval (either days or seconds would be sensible in this case) and your epoch, but doing anything else is sheer insanity that should result in instant termination because no programmer working with dates in any capacity should be that ****ing stupid.

                I've known this since I was 12. Yes, this is quite literally a childish blunder.

                The worst part is that you have to deliberately make this mistake these days, because every single modern framework comes with a Date or DateTime object that handles it for you. (Though 1900 and 2100 might be a problem in some.)

                Heck, even Excel handles it!
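
                For illustration, a minimal Python sketch of the epoch-counting idea described above (not the Azure code; the dates are made up):

                from datetime import date

                issued = date(2012, 2, 29)
                days = issued.toordinal()            # proleptic day number: 1 Jan of year 1 is day 1

                # Arithmetic happens on a plain integer; converting back can only ever
                # produce a valid calendar date, leap day or not.
                print(date.fromordinal(days + 365))  # 2013-02-28
                print(date.fromordinal(days + 366))  # 2013-03-01

                Whether a certificate issued on the leap day should expire on 28 February or 1 March the following year then becomes a policy decision rather than a parsing accident.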

                1. Elmer Phud

                  Re: Excel

                  Excel used to have problems handling negative time - unless you changed to the date and time format used for Macs.

              2. Bob Vistakin
                Facepalm

                Re: @AC

                You're making my point for me - please continue, it's really entertaining. What you're saying is *Microsoft* alone finds date computation "significantly more complex". Linux doesn't. £5 watches don't. All the other posters in this thread show exactly why, too.

                1. Anonymous Coward
                  Anonymous Coward

                  Re: @AC

                  Yeah Bob, all the other posters think something is simple, so it must be... I, however, have designed and implemented a mainframe-to-desktop global time synchronisation service for a FTSE 100 corporation, which synced z/OS, Tandem, AS/400, various UNIXes, Linux, physical access systems, and Windows servers and desktops. Let me assure you time is not simple. It is not unheard of for major corporations to have change freezes around daylight saving changes, for example, because the risk of a screw-up is so high.

                  And no, crappy £5 digital watches don't handle leap years properly, no matter how many times you say they do.

                  1. Anonymous Coward
                    Anonymous Coward

                    Re: @AC

                    > I however, have designed and implemented a Mainframe to desktop global time synchronisation service for a FTSE100 corporation,

                    Err... so have I, on 400+ servers (mixed OSes) located in over 30 different sites. It was trivial and it is called NTP. The most difficult part was ensuring the relevant UDP port was allowed by the firewalls, and that was a network problem, not a time problem.

                    Oh yeah, and a couple of the servers were running a version of Solaris that had to be patched because a dodgy NTP service let the time drift out of sync.

                    1. Vic

                      Re: @AC

                      > The most difficult part was ensuring the relevant UDP port was allowed

                      To be fair, I have encountered a situation where it would be a good idea not to permit code changes across a DST change.

                      Many years ago, I inherited a project that used Visual SourceSafe as its revision control system. I found an interesting feature. If two users committed the same file, the order of commits on the server would not be the order in which they were received - it would be according to the timestamp placed on the file by the *client* machine doing the commit. I had one PC with a bit of a clock drift that kept rolling back other people's changes...

                      I have no idea if this has been fixed - I haven't used that product since then, and I have no intention to do so in the future.

                      Vic.

                  2. Bob Vistakin
                    Pint

                    Re: @AC

                    Time to pull up an armchair, crack open a sixpack and enjoy the entertainment this fool is giving everyone by digging his hole ever deeper.

                    1. Anonymous Coward
                      Anonymous Coward

                      Re: @AC

                      Yeah NTP is simple, I'm a fool, that's all there is to time sync.

                      In the words of Ben Goldacre: I think you'll find it's a bit more complicated than that.

                      1. Anonymous Coward
                        Anonymous Coward

                        Re: @AC

                        > Yeah NTP is simple,

                        Yes it is. The concept is simple, the configuration is simple, the implementation is simple, securing it with key exchanges is simple, starting and stopping it is simple, ensuring it skews time instead of steps it is simple, monitoring it is simple, setting up the stratum zero clocks is simple (okay that can be complicated).

                        The bureaucracy involved in deploying it to 400+ servers is not simple, but that is bureaucracy and not the technical aspects.

                        1. Vic

                          Re: @AC

                          > The bureaucracy involved in deploying it to 400+ servers is not simple

                          It is if you use puppet or similar.

                          Vic.

                          1. Anonymous Coward
                            Anonymous Coward

                            Re: Vic

                            > It is if you use puppet or similar.

                            You are joking? How would puppet help with the bureaucracy? Do you even know what bureaucracy means?

                            The bureaucracy means getting the owners of the various platforms, who are usually PHBs without a frigging clue, to approve the change request to either give you access to their systems or get one of their own people to follow the instructions on the idiot sheet you will provide them with. Of course, the PHB will often get the department idiot, whose shoes have Velcro straps because he cannot tie his shoelaces, to implement the change (who would have thought that you would have to explicitly state in the idiot sheet that "copy the file" does not mean print it out and photocopy it?). Then six weeks later you have to attend a critical incident phone conference because their server crashed for the eighth time this year and this one "must be because of your change".

                            So no, puppet won't help, because bureaucracy means dealing with living, breathing people and not some multi-platform configuration system.

          2. Richard 12 Silver badge
            FAIL

            Re: @AC

            Time is actually very easy:

            Store and process in UTC.

            Displaying time to the user and parsing user input is harder, but once you're always storing and processing in UTC it is no longer critical to the operation of the machine.

            I've long since lost count of the number of failures caused by storing and processing in local time.

            Local time changes.
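
            As a rough sketch of that discipline (Python purely for illustration; zoneinfo ships with the standard library from 3.9, and the timestamp is invented): store and process the instant in UTC, and convert only at the edge when showing it to a user.

            from datetime import datetime, timezone
            from zoneinfo import ZoneInfo

            # Stored and processed in UTC...
            event = datetime(2012, 2, 29, 1, 30, tzinfo=timezone.utc)

            # ...converted to a local wall-clock time only for display.
            print(event.astimezone(ZoneInfo("Europe/London")))     # 2012-02-29 01:30:00+00:00
            print(event.astimezone(ZoneInfo("Pacific/Auckland")))  # 2012-02-29 14:30:00+13:00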

          3. Anonymous Coward
            Anonymous Coward

            Re: @AC

            Time isn't complex, you just need to decide on a sensible way of counting it. Microsoft made a huge mistake by using "local" time instead of UTC. Every sensible system uses UTC, and hence this works without problems:

            $ date --date="29 feb 2012 +1 year"
            Fri Mar 1 00:00:00 CET 2013

  2. Tom 38

    So, what MS are telling us is that their programmers use their APIs like this (pseudocode):

    mydate = date.today()
    mydate.year += 1

    instead of this:

    mydate = date.today()
    mydate += delta(years=1)

    Awesome. Makes you wonder what other shitnuggets Azure has yet to shake free.
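
    That pseudocode maps almost directly onto real Python; a hedged sketch of the two approaches using the third-party dateutil package (an illustration, not the Azure code):

    from datetime import date
    from dateutil.relativedelta import relativedelta  # pip install python-dateutil

    today = date(2012, 2, 29)

    # The naive version: treat the year as an independent integer. date objects
    # are immutable, so this has to be spelt as rebuilding the date, and on a
    # leap day that rebuild is an invalid date.
    try:
        expiry = date(today.year + 1, today.month, today.day)
    except ValueError as err:
        print("broken:", err)                  # day is out of range for month

    # The calendar-aware version clamps to the last valid day instead.
    print(today + relativedelta(years=1))      # 2013-02-28

    Note that libraries pick different conventions for the awkward case: relativedelta clamps to 28 February, while GNU date (quoted earlier in the thread) rolls forward to 1 March. Either is defensible; an invalid 29 February 2013 is not.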

    1. Steve Knox

      Your pseudocode seems to imply that they used a date object, which I doubt. Since a date object is usually represented internally as a count (often in milliseconds) from an epoch, adding to the year property simply increments that core value by the correct number of milliseconds or whatever unit; it does not depend on the calendar date, and there would not likely have been a problem.

      More likely, the date was stored as a calendar date in integer or text format, and they manipulated the year portion of that less intelligent data type directly.

      That fail for an integer format would look something like this:

      intValidDate = getCertificateDate()
      /* Certificate date is stored as an integer in YYYYMMDD format, so all we have to do is... */
      intExpireDate = intValidDate + 10000

      If text, they probably had a delimiter like "/" and parsed the pieces into integers, added 1 to the year, and concatenated them back to text.

      Either way, this is exactly why you should use a well written date object rather than try a shortcut.
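
      A quick demonstration of why the integer shortcut fails, as a hypothetical Python sketch (the values and the YYYYMMDD convention are taken from the pseudocode above):

      from datetime import datetime

      valid = 20120229          # leap-day issue date in YYYYMMDD form
      expire = valid + 10000    # "add one year" by arithmetic on the integer

      # Round-tripping the result through a real parser exposes the problem:
      # 20130229 is not a date that exists.
      try:
          datetime.strptime(str(expire), "%Y%m%d")
      except ValueError as err:
          print(expire, "is invalid:", err)    # day is out of range for month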

      1. ElReg!comments!Pierre

        date object, yes.

        Or, failing that, at the very least a check for leap years.

      2. Kanhef
        FAIL

        Even if dates are stored in a discrete year/month/day format, a competent programmer would never have let this happen. Any function that creates or modifies such a date should normalize it into a valid form. (For example, a user should be able to add 60 days to a date and get the correct result.) This is not difficult:

        While day is greater than numDaysInMonth: subtract numDaysInMonth from day, increment month.

        Proper handling of invalid months is left as an exercise for the reader; it should take about 5 minutes. Add another 5 if you want to make it bulletproof and handle negative values as well. First-year CS students can do this; for a company such as Microsoft to screw it up requires sheer incompetence.
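
        One possible answer to that exercise, as a small Python sketch (illustrative only; calendar.monthrange supplies the month lengths, so leap years come along for free):

        import calendar

        def normalise(year, month, day):
            """Roll an out-of-range month/day forward into a valid calendar date."""
            # Fold surplus (or negative) months into the year first.
            year, month = year + (month - 1) // 12, (month - 1) % 12 + 1
            # Then walk the day forward one month at a time.
            while day > calendar.monthrange(year, month)[1]:
                day -= calendar.monthrange(year, month)[1]
                month += 1
                if month > 12:
                    year, month = year + 1, 1
            return year, month, day

        # Adding 60 days to 29 Feb 2012 naively, then normalising:
        print(normalise(2012, 2, 29 + 60))   # (2012, 4, 29)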

      3. Peter Fox

        Sorry to go on about this but...

        'Well written date object' Eh? If it is based on a timeline then it isn't.

        What date objects can represent 12 Mar 2012, Mar 2012, 12 Mar, 2012, Not-known, and End-of-time/unknown-in-the-future?

        See http://vulpeculox.net/day/index.htm for the answer.

      4. Jonathan Richards 1
        Boffin

        Code leak

        People are posting pseudocode, whereas a hack into the Azure version control system (notepad text objects with local timestamps in the 8.3 filename) reveals the ACTUAL code at fault:

        10 Y$=RIGHT$(D$,4)
        20 Y=VAL(Y$)
        30 Y=Y+1
        40 RETURN

      5. Michael Wojcik Silver badge

        @Steve Knox

        > Your pseudocode seems to imply that they used a date object, which I doubt.

        Why? This was in the generation of Azure transfer certificates. That code is very likely written in C++, if it's native, or C#, if it's managed. Microsoft already have certificate-generation APIs for both native and managed code, so it seems likely they used them.

        > More likely, the date was stored as a calendar date in integer or text format

        Why? Even if the transfer-cert generation code is written in C, the date-manipulation code should be using either the standard C library time-manipulation structures and functions, or the Windows FILETIME ones. They'd use one or the other to get the current date in the first place, so they'd already have a struct tm or similar. And canonical date editing with mktime and friends is standard and well-documented.

        There is ABSOLUTELY NO REASON for the certificate-generation code to have manipulated the year portion of the date directly. At some point the original ("today's") date was almost certainly in some form suitable for canonical manipulation: a FILETIME, a .NET DateTime object, a struct tm, a time_t, etc.

        > they manipulated the year portion of that less intelligent data type directly

        Well, yes, that's exactly what happened. But it's vanishingly unlikely that the programmer who did so, did it because the current date wasn't already available in a format suitable for canonical manipulation. This isn't a case of reading a textual date from some source and then adding a year to it; they had to get the current date in the first place.

        I've generated certificates using the .NET Framework (and using OpenSSL, etc, but I doubt the Azure infrastructure uses OpenSSL). It does nearly all the work for you. This screwup is perverse; it's harder than doing it the right way.

    2. Anonymous Coward
      Anonymous Coward

      Well...

      This is Microsoft we're talking about here. Well known for their crappy software.

    3. bazza Silver badge

      @Tom 38

      Yep, it'll be something like that, possibly they've done it as direct manipulation of some time string. I've not read their report.

      Yet again some programmers somewhere have been shown to be a bunch of lazy ******s. Symantec had a similar problem with their antivirus software updater thinking that the year 2010 came before the year 2009... And are Apple devices capable yet of setting an alarm off properly at the appointed time? I suspect not.

      I honestly don't know what goes on in such programmers' heads. If they cared to take even a casual glance at the reference manuals for things like the ANSI C library, Java class libraries, etc., they would find a wealth of functions that a bunch of careful people spent time and effort on so as to make it easy for other programmers to avoid this sort of mistake. Why don't they just ******g use those well thought out routines instead of thinking "I know, I'll do it all over again myself in my own code, how hard can it be, I'm sure a string will do"? It's unbelievable madness. Who supervises these idiots, reviews their code, and designs their systems? Sure, the purpose of the routines available in the libraries may be a bit tricky to fully understand, but then time measurement systems (e.g. UTC plus the various local timezones) are not a trivial topic. But that's no excuse to ignore the complexity.

      1. bazza Silver badge

        Aha, a downposter!

        Clearly someone in favour of poor programming and buggy software.

  3. ratfox
    Angel

    Not the first, not the last...

    It is a rare software company that never had an embarrassing leap year bug... This one is still going to follow Microsoft for a while, though.

    1. Bob Vistakin
      FAIL

      Re: Not the first, not the last...

      Well, yeah, sure there are plenty of horror stories around, but is this the first time the British Government chose a partner so clueless they literally didn't know what fucking day it was?

      1. dogged

        Re: Not the first, not the last...

        > implying the British Government is less stupid

        Azure also runs about 50% of iCloud, so presumably Apple are that stupid too. Enjoy the moment, Bob. It'll make you feel better when MS cough out another record-breaking set of sales figures for Win7 later in the year.

        1. Bob Vistakin
          FAIL

          Re: Not the first, not the last...

          "Later in the year"? Hopefully you're not using one of these comedy calendars - that could mean anything from next week to sometime in the next decade.

        2. Richard Plinston

          Re: Not the first, not the last...

          > record-breaking set of sales figures for Win7 later in the year.

          I am in two minds about what will happen.

          Either the Osborne effect will kick in and people will delay buying new computers, or upgrading their XP machine, until they have evaluated Windows 8 released versions.

          Or they will rush out and buy Windows 7 so that they don't have to move to Windows 8 and can wait for W8 SP2 (to be called Windows 9).

    2. Anonymous Coward
      Anonymous Coward

      Re: Not the first, not the last...

      > It is a rare software company that never had an embarrassing leap year bug

      Name them

  4. Anonymous Coward
    Thumb Up

    Whoever designs systems that have changeable clocks will always have problems

    Time is relative, but as it's historically far from perfect we ended up with a system that has bits added on and days added on here and there, and on top of that we change the time twice a year because of some fetish for having cockerels crowing away in the early hours of the morning.

    We then take all these sun-following fetishes and impose them upon computers, which logically couldn't care less whether the sun is up or not and only care about things being in order. This is where we have the issue: when we start taking that time used to control the order and jump it backwards or forwards in a large chunk, we can end up with that level of order getting a little bit out of step, and this, as we all know, upsets programs. Now you can check for TZ changes in your code and cater for these types of exceptions, but that's a lot of checks for what is only going to happen in a few small windows during the year.

    Personally I wish computers had two clocks: one that is set and just goes, and is used by data processing in code; and another that does all the human quirks, with a log used to map it onto the computer one, so it ends up with a new entry of the computer time and the old/new human time at every TZ change/leap year etc, and is only needed to convert for any reports/display/input from the users. You can then do any processing without a care about TZ changes and handle all the mapping in the input/output. But there is always an exception to the rule.

    This is what makes computers fun and keeps people employed. Maybe not today, but one day there will be somebody out there who has the job title of Digital Timezone consultant. It will be a sad day, especially for those sysadmins who fill out BCS forms and realise they're actually doing 20 different job roles :).

    1. Yet Another Anonymous coward Silver badge

      Re: Whoever designs systems that have changeable clocks will always have problems

      "Digital Timezone consultant" - I spent a week doing that once.

      We had to coordinate some environmental observations done by schools/volunteer groups all over the world.

      Someone in East Timor says they measured it at 8:00am on 21 Mar: what timezone do they use, are they in daylight saving time, when did they change clocks, do they change clocks at all?

      Multiply that by a thousand observations!

      1. Vic

        Re: Whoever designs systems that have changeable clocks will always have problems

        > I spent a week doing that once.

        A week?

        > Someone in East Timor say they measured it at 8:00am on 21Mar

        [vic@OldEmpire ~]$ TZ=Asia/Dili date +%s -d "21 Mar 2012 08:00"
        1332284400
        [vic@OldEmpire ~]$ date -u -d @1332284400
        Tue Mar 20 23:00:00 UTC 2012

        There's probably a simpler way...

        Vic.
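
        For comparison, the same conversion done in-process, as a Python sketch using the standard-library zoneinfo module (Python 3.9+):

        from datetime import datetime, timezone
        from zoneinfo import ZoneInfo

        reading = datetime(2012, 3, 21, 8, 0, tzinfo=ZoneInfo("Asia/Dili"))
        print(reading.astimezone(timezone.utc))   # 2012-03-20 23:00:00+00:00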

        1. Anonymous Coward
          Anonymous Coward

          Re: Whoever designs systems that have changeable clocks will always have problems

          Linux + GNU is not UNIX/Solaris/AIX/BSD*...

          Always quote the format string. Use the time libraries in C or Perl.

          Don't be smart online unless you know the facts and what you are doing. Know your time libraries and use man(1). Assume no more than POSIX and XPG4.

          Perhaps he was working with Windows or a real UNIX and not using GNU date(1):

          Sun Microsystems Inc., SunOS 5.10 (sun4u), Generic_138888-08 CSS 2.1-IB
          $ TZ=Asia/Dili date +%s -d "21 Mar 2012 08:00"
          %s
          $ date -u -d @1332284400
          date: illegal option -- d
          usage: date [-u] mmddHHMM[[cc]yy][.SS]
                 date [-u] [+format]
                 date -a [-]sss[.fff]
          $ which date
          /usr/bin/date
          $

          1. Anonymous Coward
            Anonymous Coward

            Re: Whoever designs systems that have changeable clocks will always have problems

            Would that be the same "Linux + GNU is not UNIX/Solaris/AIX/BSD" that has to use tables produced by a bunch of astrologers to get their time correct?

    2. Anonymous Coward
      Anonymous Coward

      Re: Whoever designs systems that have changeable clocks will always have problems

      "Personly I wish computers had two clocks, one thats set and just goes and is used by data processing in code and another one that does all the human quirks and a log is used to map that onto the computer one so it ends up with a new entry of the computer time and the old/new human time every TZ/leap year etc and is only needed to be converted for any reports/display/input from the users."

      That's exactly what most real computers do: maintain the system clock in UTC, and only convert for display. Except Microsoft OSes, of course, although it seems that Windows 7 has finally learned to get it right.

      Wouldn't have prevented this snafu, though. This one was just down to lazy programming. I remember learning how to program a computer to convert dates, allowing for leap years, in 4th form, and that was 40 years ago.

      1. the spectacularly refined chap

        Re: Whoever designs systems that have changeable clocks will always have problems

        I think he's talking about something a little more substantial than simply tracking UTC and converting it whenever necessary to the local timezone. Simply tracking UTC doesn't really buy you anything in terms of simplicity: you still have leap years and leap seconds. The only real motivation for tracking UTC as opposed to local time is for genuinely multi-user systems where users may reside in differing timezones; it adds complexity rather than removing it. When you look in detail at the complexities of the way "real" systems do it, it looks more and more like an awkward fudge: for example, some parts of POSIX allow for there to be 59, 60 or 61 seconds in a minute. Others _require_ that there only ever be 60 seconds in a minute. Squaring the circle between those two contradictory measures requires double-think of the worst kind.

        There are simple, monotonically rising time scales for when they are required: TAI, for example. Nobody really uses it outside the scientific community since it doesn't bear any real relationship to the real world.

        1. Richard 12 Silver badge

            Re: Whoever designs systems that have changeable clocks will always have problems

          Storing and processing in UTC removes >99% of the complexity.

          You're left with the two (and only two) issues of leap year and leap second which happen roughly every four years and 1-7 years respectively.

          As opposed to using local time, which for most people changes twice every year as well as the above leap years and leap seconds, and doesn't stay the same year-to-year either.

          On top of that, most people who can afford computers also travel, so that's additional local time changes.

          So, what to do? Store UTC and handle leap years and leap seconds, or local time and handle leap years, leap seconds, DST, political timezone changes, travel etc?

          TAI would be better, but no common OS uses it internally and neither does the general Internet, making it more likely to be wrong.
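
            A concrete illustration of the difference, as a Python sketch with made-up readings either side of the UK clock change on 25 March 2012: the offset-aware (effectively UTC) subtraction gives the true elapsed time, the naive local wall-clock subtraction does not.

            from datetime import datetime
            from zoneinfo import ZoneInfo

            london = ZoneInfo("Europe/London")

            # Two readings taken 30 real minutes apart, straddling the spring-forward
            # when 01:00 GMT jumped straight to 02:00 BST.
            t1 = datetime(2012, 3, 25, 0, 45, tzinfo=london)
            t2 = datetime(2012, 3, 25, 2, 15, tzinfo=london)

            print(t2 - t1)                                            # 0:30:00  (true interval)
            print(t2.replace(tzinfo=None) - t1.replace(tzinfo=None))  # 1:30:00  (wall clock)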

          1. mad physicist Fiona
            Facepalm

              Re: Whoever designs systems that have changeable clocks will always have problems

            "Storing and processing in UTC removes >99% of the complexity."

              You've obviously never done any of this work. How does defining everything to UTC solve the problem of six different units in customary use, seven if you include the week (which you need in order to accommodate daylight saving)? Determining a time offset from UTC or any other time zone is a matter of one extra term in an expression referring to a lookup table. Add automatic determination of daylight saving and you are still only looking at a dozen or so lines of code.

              Compare that dozen lines to the amount needed simply to deal with all those units. Leap years and leap seconds notwithstanding, months are still not the same length, and yes, handling leap years alone (let alone leap seconds) takes vastly more than the 0.1 line of code allowed for in your 99% assertion. An assertion now shown as the ignorant crap it really is, to the extent that you have demonstrated that not only do you not understand the problem, you fail to grasp even its dimensions.

              As a final case in point: how many systems actually deal in UTC internally? I'll give you a clue: it isn't ALL "real" systems; in fact I can't think of any more recent than MS-DOS. Windows doesn't and Unix certainly doesn't; both use simple synthetic escalating timers of the nature suggested by the Refined Chap. They convert to natural units as and when needed. What do you know that the designers of those systems don't? Or, more realistically, what did they realise based on factors that you haven't even considered?

            1. Richard 12 Silver badge
              Facepalm

                Re: Whoever designs systems that have changeable clocks will always have problems

              mad physicist Fiona, you appear to have spectacularly missed my point.

                You aren't describing processing; that is all formatting. You'll need to do that no matter how you store the time internally, but you don't need to do it very often.

              It's not as complex as the existential questions you get from storing and processing in local time - that way you don't know what time it was by the time it's stored to disk, because the local time definitions may have changed. Thus any stored local time also needs the definition of local time at the time to be stored alongside it to use in all future processing.

              UTC changes much less often than local time, so that processing lookup table is much smaller - it will have 35 entries in total as of the end of 2012, all of which are +1 second and published in advance.

              As opposed to the local time tables which are complicated enough to be worth defending a copyright claim over and change regularly on the whim of world politicians!

                Storing local times means that your data set is dependent on those local time tables, and every single data point must state the timezone it was recorded in for the data to be useful for any purpose at all.

              Storing UTC means you can do almost all processing with no lookup tables at all and be fairly accurate about intervals - only 35 seconds out in 50 years - or have one adjustment lookup table that is valid for all data points.

              Yes, you still need those complex lookups to display to the user but you don't need them for your data set to be useful.

              Incidentally, Windows has been UTC internally since Vista, although its monotonic clock remains irritatingly 32-bit. (49.7 days is a magic number)
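
                (The magic number, for anyone reaching for a calculator: a 32-bit millisecond counter wraps after 2^32 ms.)

                print(2**32 / (1000 * 60 * 60 * 24))   # 49.71... days before a 32-bit ms counter wraps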

              1. mad physicist Fiona

                  Re: Whoever designs systems that have changeable clocks will always have problems

                  No, it is not me that has missed the point: if you define an internal representation in terms of a monotonic counter, that is no longer UTC, even if the epoch is defined in terms of UTC. That simply calibrates the meaning of the counter against the real world; it is _not_ itself UTC, and it is not immune to this kind of snafu. Windows uses such a monotonic timer internally, not UTC. Nothing uses UTC; at least, you haven't given an example of a system that does yet.

          2. the spectacularly refined chap

              Re: Whoever designs systems that have changeable clocks will always have problems

              Simply using UTC would not have solved this problem. Nor, as you would admit, would it solve the leap second/year problems. Nor would it solve the simple problem of months being different lengths. Nor would it solve the simple problem of non-normalised units.

              Why do you think the POSIX time_t type exists, for example? It and similar artificial measures exist simply to get away from the real-world complexities that inevitably introduce these kinds of corner cases and the flakiness that comes with them. Simply adopting UTC internally does nothing to solve that, and indeed can introduce subtleties of its own, such as the same instant being on a different date in two nearby pieces of code.

  5. Adrian Challinor
    Boffin

    Oracle has some interesting date calcs

    In Oracle DB, the DateAdd function has an interesting little tweak. If you add a month to 29-Feb-2012, you get 30-Mar-2012. It does special processing on the last day of the month to give you the last day of the next month. That can catch people out too.

    1. Phil O'Sophical Silver badge
      Coat

      Catch people out

      Yes, especially the ones who expect March to end on the 31st..

    2. Anonymous Coward
      Anonymous Coward

      Re: Oracle has some interesting date calcs

      SELECT ADD_MONTHS( DATE '2012-02-29', 1 ) last_day_of_march
      FROM dual;

      LAST_DAY_OF_MARCH
      2012-03-31

      Can you be clearer?
