Even a £5 watch gets the leap year right!
Azure. Clippy. Metro. Bob. Vista. Kin.
As soon as Microsoft's cloudy platform Azure crashed to Earth, and stayed there for eight hours, on 29 February, every developer who has ever had to handle dates immediately figured it was a leap-day bug. Now the software biz behemoth has put its hands up and admitted in a detailed dissection of the blunder how a calendar …
Wow, bob you've changed your tune, you usually love MS, oh hang on...
This is a massive balls up, but it's one that deals with cryptographically signed certificates which are used to secure an information channel between distributed virtual and physical systems. This is not a £5 digital watch, of which BTW many got the 2000 leap year wrong.
Pierre, did you miss the bit where I said "This is a massive balls up"?
The point I was making is that while this is a massive balls up, it is not as simple as was suggested by OP who is a perennial MS basher.
Time is complex, especially on globally distributed systems. It should have been caught, but a lot of people think of time as simple. It's not.
No it wasn't.
It was a total and utter **** up that is only possible if you genuinely have no idea what you are doing.
The reason is simple: This failure is only possible if you're processing the date as three independent numbers.
Listen very carefully Microsoft, I will scream this into your ear only once:
DATES ARE NOT THREE NUMBERS.
DATES ARE NOT TEXT.
A datetime is a number of intervals after an epoch. Never anything else.
Feel free to pick your interval (either days or seconds would be sensible in this case) and your epoch, but doing anything else is sheer insanity that should result in instant termination because no programmer working with dates in any capacity should be that ****ing stupid.
I've known this since I was 12. Yes, this is quite literally a childish blunder.
The worst part is that you have to deliberately make this mistake these days, because every single modern framework comes with a Date or DateTime object that handles it for you. (Though 1900 and 2100 might be a problem in some.)
Heck, even Excel handles it!
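To make the "intervals after an epoch" idea concrete, here is a minimal Python sketch. The epoch (1 January 1970) and the interval (one day) are assumptions picked for illustration, and the helper names are hypothetical.

```python
from datetime import date, timedelta

# Assumption for illustration: the epoch is 1 January 1970, the interval is one day.
EPOCH = date(1970, 1, 1)

def to_day_number(d: date) -> int:
    """Return the number of whole days between EPOCH and d."""
    return (d - EPOCH).days

def from_day_number(n: int) -> date:
    """Convert a day count back into a calendar date."""
    return EPOCH + timedelta(days=n)

leap_day = to_day_number(date(2012, 2, 29))
# Adding 365 days is plain integer arithmetic, so the result is always a real date.
one_year_later = from_day_number(leap_day + 365)
print(leap_day, one_year_later)  # 15399 2013-02-28
```

Whether "one year" should mean 365 days, 366 days or "same day next year" is still a policy decision, but with a day count the answer can never be a day that does not exist.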
Yeah Bob, all the other posters think something is simple, so it must be... I however, have designed and implemented a Mainframe to desktop global time synchronisation service for a FTSE100 corporation, which synced z/OS, Tandem, AS/400, various UNIXes, Linux, physical access systems, Windows server and desktops. Let me assure you time is not simple. It is not unheard of for major corporations to have change freezes around daylight saving changes, for example, because the risk of a screw-up is so high.
And no, crappy £5 digital watches don't handle leap years properly, no matter how many times you say they do..
> I however, have designed and implemented a Mainframe to desktop global time synchronisation service for a FTSE100 corporation,
Err... so have I, on 400+ servers (mixed OSes) located in over 30 different sites. It was trivial and it is called NTP. The most difficult part was ensuring the relevant UDP port was allowed by the firewalls and that was a network problem not a time problem.
Oh yeah, and a couple of the servers were running a version of Solaris that had to be patched because a dodgy NTP service let the time drift out of sync.
> The most difficult part was ensuring the relevant UDP port was allowed
To be fair, I have encountered a situation where it would be a good idea not to permit code changes across a DST change.
Many years ago, I inherited a project that used Visual SourceSafe as its revision control system. I found an interesting feature. If two users committed the same file, the order of commits on the server would not be the order in which they were received - it would be according to the timestamp placed on the file by the *client* machine doing the commit. I had one PC with a bit of a clock drift that kept rolling back other people's changes...
I have no idea if this has been fixed - I haven't used that product since then, and I have no intention to do so in the future.
> Yeah NTP is simple,
Yes it is. The concept is simple, the configuration is simple, the implementation is simple, securing it with key exchanges is simple, starting and stopping it is simple, ensuring it skews time instead of steps it is simple, monitoring it is simple, setting up the stratum zero clocks is simple (okay that can be complicated).
The bureaucracy involved in deploying it to 400+ servers is not simple, but that is bureaucracy and not the technical aspects.
> It is if you use puppet or similar.
You are joking? How would puppet help with the bureaucracy? Do you even know what bureaucracy means?
The bureaucracy means getting the owners of the various platforms, who are usually PHBs without a frigging clue, to approve the change request to either give you access to their systems or get one of their own people to follow the instructions on the idiot sheet you will provide them with. Of course, the PHB will often get the department idiot, whose shoes have Velcro straps because he cannot tie his shoelaces, to implement the change (who would have thought that you would have to explicitly state in the idiot sheet that "copy the file" does not mean print it out and photocopy it). Then six weeks later you have to attend a critical incident phone conference because their server crashed for the eighth time this year and this one "must be because of your change".
So no, puppet won't help because bureaucracy means dealing with living, breathing people and not some multi-platform configuration system.
Time is actually very easy:
Store and process in UTC.
Displaying time to the user and parsing user input is harder, but once you're always storing and processing in UTC it is no longer critical to the operation of the machine.
I've long since lost count of the number of failures caused by storing and processing in local time.
Local time changes.
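A rough Python sketch of "store and process in UTC, convert only for display". The two display zones are hypothetical fixed offsets chosen to keep the example self-contained; real code would look them up in a tz database so that DST and political changes are handled.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical display zones as fixed offsets (assumption: summer-time offsets).
LONDON_SUMMER = timezone(timedelta(hours=1), "BST")
NEW_YORK_SUMMER = timezone(timedelta(hours=-4), "EDT")

def record_event() -> datetime:
    """Capture 'now' as a timezone-aware UTC timestamp for storage."""
    return datetime.now(timezone.utc)

def display(stored_utc: datetime, zone: timezone) -> str:
    """Convert a stored UTC timestamp to a local string only at the edge."""
    return stored_utc.astimezone(zone).strftime("%Y-%m-%d %H:%M %Z")

stored = datetime(2012, 7, 1, 12, 0, tzinfo=timezone.utc)
print(display(stored, LONDON_SUMMER))    # 2012-07-01 13:00 BST
print(display(stored, NEW_YORK_SUMMER))  # 2012-07-01 08:00 EDT

# Intervals are computed on the stored UTC values, never on the display strings.
elapsed = datetime(2012, 7, 2, 12, 0, tzinfo=timezone.utc) - stored
print(elapsed)  # 1 day, 0:00:00
```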
Your pseudocode seems to imply that they used a date object, which I doubt. Since a date object is usually represented internally as a count (usually in milliseconds) from an epoch, and adding to the year property simply increments the core value by the correct number of milliseconds or whatever unit, it would not be dependent upon calendar date, and there would not likely be a problem.
More likely, the date was stored as a calendar date in integer or text format, and they manipulated the year portion of that less intelligent data type directly.
That fail for an integer format would look something like this:
intValidDate = getCertificateDate()
/* Certificate Date is stored as an integer in YYYYMMDD format, so all we have to do is... */
intExpireDate = intValidDate + 10000
If text, they probably had a delimiter like "/" and parsed the pieces into integers, added 1 to the year, and concatenated them back to text.
Either way, this is exactly why you should use a well written date object rather than try a shortcut.
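The integer shortcut and the date-object alternative can be sketched in Python as follows. The function names are hypothetical, and rolling 29 February forward to 1 March is just one possible policy, not necessarily what any real certificate code does.

```python
from datetime import date, datetime

def naive_add_year(yyyymmdd: int) -> int:
    """The shortcut: bump the year digits of a YYYYMMDD integer."""
    return yyyymmdd + 10000

def is_real_date(yyyymmdd: int) -> bool:
    """Check whether a YYYYMMDD integer names a day that exists in the calendar."""
    try:
        datetime.strptime(str(yyyymmdd), "%Y%m%d")
        return True
    except ValueError:
        return False

def add_year(d: date) -> date:
    """Add one year with a date object; 29 Feb rolls to 1 Mar (a policy choice)."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # Only 29 February can fail here; this sketch picks 1 March.
        return date(d.year + 1, 3, 1)

expiry = naive_add_year(20120229)
print(expiry, is_real_date(expiry))   # 20130229 False
print(add_year(date(2012, 2, 29)))    # 2013-03-01
```

Note that Python's `date.replace` refuses to produce 29 February 2013 and raises `ValueError` instead, so the caller has to decide what "one year later" means rather than getting a silently invalid value.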
Even if dates are stored in a discrete year/month/day format, a competent programmer would never have let this happen. Any function that creates or modifies such a date should normalize it into a valid form. (For example, a user should be able to add 60 days to a date and get the correct result.) This is not difficult:
While day is greater than numDaysInMonth: subtract numDaysInMonth from day, increment month.
Proper handling of invalid months is left as an exercise for the reader; it should take about 5 minutes. Add another 5 if you want to make it bulletproof and handle negative values as well. First-year CS students can do this; for a company such as Microsoft to screw it up requires sheer incompetence.
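For what it's worth, the normalisation loop above, including the "exercise" of out-of-range months and negative values, might look like this in Python. The function name is made up for the example; `calendar.monthrange` is from the standard library and supplies the days in each month.

```python
import calendar
from typing import Tuple

def normalize(year: int, month: int, day: int) -> Tuple[int, int, int]:
    """Fold out-of-range month and day values into a valid (year, month, day)."""
    # Fix the month first; divmod also handles zero and negative months.
    year_shift, month_index = divmod(month - 1, 12)
    year, month = year + year_shift, month_index + 1
    # Borrow days from preceding months while day is too small.
    while day < 1:
        month -= 1
        if month < 1:
            year, month = year - 1, 12
        day += calendar.monthrange(year, month)[1]
    # Carry days into following months while day is too large.
    while day > calendar.monthrange(year, month)[1]:
        day -= calendar.monthrange(year, month)[1]
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return year, month, day

print(normalize(2013, 2, 29))      # (2013, 3, 1)
print(normalize(2012, 1, 1 + 60))  # (2012, 3, 1)
```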
> Your pseudocode seems to imply that they used a date object, which I doubt.
Why? This was in the generation of Azure transfer certificates. That code is very likely written in C++, if it's native, or C#, if it's managed. Microsoft already have certificate-generation APIs for both native and managed code, so it seems likely they used them.
> More likely, the date was stored as a calendar date in integer or text format
Why? Even if the transfer-cert generation code is written in C, the date-manipulation code should be using either the standard C library time-manipulation structures and functions, or the Windows FILETIME ones. They'd use one or the other to get the current date in the first place, so they'd already have a struct tm or similar. And canonical date editing with mktime and friends is standard and well-documented.
There is ABSOLUTELY NO REASON for the certificate-generation code to have manipulated the year portion of the date directly. At some point the original ("today's") date was almost certainly in some form suitable for canonical manipulation: a FILETIME, a .NET DateTime object, a struct tm, a time_t, etc.
> they manipulated the year portion of that less intelligent data type directly
Well, yes, that's exactly what happened. But it's vanishingly unlikely that the programmer who did so, did it because the current date wasn't already available in a format suitable for canonical manipulation. This isn't a case of reading a textual date from some source and then adding a year to it; they had to get the current date in the first place.
I've generated certificates using the .NET Framework (and using OpenSSL, etc, but I doubt the Azure infrastructure uses OpenSSL). It does nearly all the work for you. This screwup is perverse; it's harder than doing it the right way.
Yep, it'll be something like that, possibly they've done it as direct manipulation of some time string. I've not read their report.
Yet again some programmers somewhere have been shown to be a bunch of lazy ******s. Symantec had a similar problem with their antivirus software updater thinking that the year 2010 came before the year 2009... And are Apple devices capable yet of setting an alarm off properly at the appointed time? I suspect not.
I honestly don't know what goes on in such programmers' heads. If they cared to take even a casual glance at the reference manuals for things like the ANSI C library, Java class libraries, etc. they would find a wealth of functions that a bunch of careful people spent time and effort on so as to make it easy for other programmers to avoid this sort of mistake. Why don't they just ******g use those well thought out routines instead of thinking "I know, I'll do it all over again myself in my own code, how hard can it be, I'm sure a string will do?". It's unbelievable madness. Who supervises these idiots and reviews their code, designs their systems? Sure, the purpose of the routines available in the libraries may be a bit tricky to fully understand, but then time measurement systems (e.g. UTC plus the various local timezones) are not a trivial topic. But that's no excuse to ignore the complexity.
> implying the British Government is less stupid
Azure also runs about 50% of iCloud, so presumably Apple are that stupid too. Enjoy the moment, Bob. It'll make you feel better when MS cough out another record-breaking set of sales figures for Win7 later in the year.
> record-breaking set of sales figures for Win7 later in the year.
I am in two minds about what will happen.
Either the Osborne effect will kick in and people will delay buying new computers, or upgrading their XP machine, until they have evaluated Windows 8 released versions.
Or they will rush out and buy Windows 7 so that they don't have to move to Windows 8 and can wait for W8 SP2 (to be called Windows 9).
Time is relative, but as it's historically far from perfect we ended up with a system that has bits added on and days added on here and there, and on top of that we change the time twice a year because of some fetish to have cockerels cocking away in the early hours of the morning.
We then take all these sun-following fetishes and impose them upon computers, which logically couldn't care less if the sun is up or not and only care about things being in order. This is where we have the issue: when we take the time used to control that order and jump it backwards or forwards in a large chunk, we can end up with that order getting a little bit out of step, and this, as we all know, upsets programs. Now you can check for TZ changes in your code and cater for these types of exceptions, but that's a lot of checks for what is only going to happen in a few small windows during the year.
Personally I wish computers had two clocks, one that's set and just goes and is used by data processing in code and another one that does all the human quirks and a log is used to map that onto the computer one so it ends up with a new entry of the computer time and the old/new human time every TZ/leap year etc and is only needed to be converted for any reports/display/input from the users. You can then do any processing without a care about TZ changes and handle all the mapping in the input/output. But there is always an exception to the rule.
This is what makes computers fun and people employed. Maybe not today, but one day there will be somebody out there who has the job title - Digital Timezone consultant. It will be a sad day, especially for those sysadmins who fill out BCS forms and realise they're actually doing 20 different job roles :).
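Something close to the "two clocks" wish is already available in most languages. As a small Python sketch (the `timed` helper is hypothetical): `time.monotonic()` is a steady counter that cannot go backwards and is unaffected by system clock updates, while the UTC wall clock is kept only for showing a human when something happened.

```python
import time
from datetime import datetime, timezone

def timed(work):
    """Run work() and return (result, elapsed_seconds, utc_start_for_humans)."""
    started_wall = datetime.now(timezone.utc)  # human-facing clock, may be stepped
    started_mono = time.monotonic()            # steady clock, never goes backwards
    result = work()
    elapsed = time.monotonic() - started_mono
    return result, elapsed, started_wall

result, elapsed, when = timed(lambda: sum(range(1000)))
print(f"ran at {when.isoformat()} and took {elapsed:.6f}s")
```

Ordering and durations come from the monotonic reading; the wall-clock value is only there for logs and reports.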
"Digital Timezone consultant" - I spent a week doing that once.
We had to coordinate some environmental observations done by schools/volunteer groups all over the world
Someone in East Timor says they measured it at 8:00am on 21 Mar: what timezone do they use, are they in daylight saving time, when did they change clocks, do they change clocks?
Multiply by 1,000 observations!
> I spent a week doing that once.
> Someone in East Timor says they measured it at 8:00am on 21 Mar
[vic@OldEmpire ~]$ TZ=Asia/Dili date +%s -d "21 Mar 2012 08:00"
1332284400
[vic@OldEmpire ~]$ date -u -d @1332284400
Tue Mar 20 23:00:00 UTC 2012
There's probably a simpler way...
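For anyone without GNU date to hand, roughly the same conversion in Python. This is a sketch that assumes East Timor is a fixed UTC+9 with no daylight saving; a tz database lookup for Asia/Dili would be more robust, and the `local_to_epoch` helper is made up for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: East Timor observes UTC+9 all year with no daylight saving.
DILI = timezone(timedelta(hours=9), "Asia/Dili (assumed +09:00)")

def local_to_epoch(year, month, day, hour, minute, zone):
    """Interpret a wall-clock reading in `zone` and return Unix epoch seconds."""
    local = datetime(year, month, day, hour, minute, tzinfo=zone)
    return int(local.timestamp())

epoch = local_to_epoch(2012, 3, 21, 8, 0, DILI)
as_utc = datetime.fromtimestamp(epoch, tz=timezone.utc)
print(epoch)                         # 1332284400
print(as_utc.isoformat(sep=" "))     # 2012-03-20 23:00:00+00:00
```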
Linux + GNU is not UNIX/Solaris/AIX/BSD*....
Always quote the format string. Use the time libraries in C or Perl.
Don't be smart online unless you know the facts and what you are doing. Know your time libraries and use man(1). Assume no more than POSIX and XPG4.
Perhaps he was working with Windows or a real UNIX and not using GNU date(1),
Sun Microsystems Inc., SunOS 5.10 (sun4u), Generic_138888-08 CSS 2.1-IB
$ TZ=Asia/Dili date +%s -d "21 Mar 2012 08:00"
$ date -u -d @1332284400
date: illegal option -- d
usage: date [-u] mmddHHMM[[cc]yy][.SS]
date [-u] [+format]
date -a [-]sss[.fff]
$ which date
"Personally I wish computers had two clocks, one that's set and just goes and is used by data processing in code and another one that does all the human quirks and a log is used to map that onto the computer one so it ends up with a new entry of the computer time and the old/new human time every TZ/leap year etc and is only needed to be converted for any reports/display/input from the users."
That's exactly what most real computers do. Maintain the system clock in UTC, and only convert for display. Except Microsoft OSes, of course, although it seems that Windows 7 has finally learned to get it right.
Wouldn't have prevented this snafu, though. This one was just down to lazy programming. I remember learning how to programme a computer to convert dates allowing for leap years in 4th form, and that was 40 years ago.
I think he's talking about something a little more substantial than simply tracking UTC and converting it whenever necessary to the local timezone. Simply tracking UTC doesn't really buy you anything in terms of simplicity - you still have leap years and leap seconds. The only real motivation for tracking UTC as opposed to local time is for genuinely multi-user systems where users may reside in differing timezones: it adds complexity, not removes it. When you look in detail at the complexities of the way "real" systems do it, it looks more and more like an awkward fudge: for example some parts of POSIX allow for there to be 59, 60 or 61 seconds in a minute. Others _require_ that there only ever be 60 seconds in a minute. Squaring the circle between those two contradictory measures requires double-think of the worst kind.
There are simple, monotonically rising time scales for when they are required: TAI for example. Nobody really uses it outside the scientific community since it doesn't really bear any relationship with the real world.
Storing and processing in UTC removes >99% of the complexity.
You're left with the two (and only two) issues of leap year and leap second which happen roughly every four years and 1-7 years respectively.
As opposed to using local time, which for most people changes twice every year as well as the above leap years and leap seconds, and doesn't stay the same year-to-year either.
On top of that, most people who can afford computers also travel, so that's additional local time changes.
So, what to do? Store UTC and handle leap years and leap seconds, or local time and handle leap years, leap seconds, DST, political timezone changes, travel etc?
TAI would be better, but no common OS uses it internally and neither does the general Internet, making it more likely to be wrong.
"Storing and processing in UTC removes >99% of the complexity."
You've obviously never done any of this work. How does defining everything to UTC solve the problem of six different units in customary use - seven if you include the week (which you need to to accommodate daylight savings)? Determining a time offset from UTC or any other time zone is a matter of one extra term in an expression referring to a lookup table. Add automatic determination of daylight savings and you are still only looking at a dozen or so lines of code.
Compare that dozen lines to the amount needed simply to deal with all those units. Leap years and leap seconds notwithstanding, months are still not the same length, and yes handling leap years alone (let alone leap seconds) takes vastly more than that 0.1 line of code that is allowed for in your 99% assertion. An assertion now shown as the ignorant crap it really is, to the extent that you have demonstrated that not only do you not understand the problem, but fail to grasp even its dimensions.
As a final case in point: how many systems actually deal in UTC internally? I'll give you a clue - it isn't ALL "real" systems, in fact I can't think of any more recent than MS-DOS. Windows doesn't and Unix certainly doesn't, both use simple synthetic escalating timers of the nature suggested by Refined Chap. They convert to natural units as and when needed. What do you know that the designers of those systems don't? Or more realistically, what did they realise based on factors that you haven't even considered?
mad physicist Fiona, you appear to have spectacularly missed my point.
You aren't describing processing, that is all formatting. You'll need to do that no matter how you store the time internally, but you don't need to do it very often.
It's not as complex as the existential questions you get from storing and processing in local time - that way you don't know what time it was by the time it's stored to disk, because the local time definitions may have changed. Thus any stored local time also needs the definition of local time at the time to be stored alongside it to use in all future processing.
UTC changes much less often than local time, so that processing lookup table is much smaller - it will have 35 entries in total as of the end of 2012, all of which are +1 second and published in advance.
As opposed to the local time tables which are complicated enough to be worth defending a copyright claim over and change regularly on the whim of world politicians!
Storing local times means that your data set is dependent on those local time tables, and every single data point must state the timezone it was recorded in, for the data to be useful for any purpose at all.
Storing UTC means you can do almost all processing with no lookup tables at all and be fairly accurate about intervals - only 35 seconds out in 50 years - or have one adjustment lookup table that is valid for all data points.
Yes, you still need those complex lookups to display to the user but you don't need them for your data set to be useful.
Incidentally, Windows has been UTC internally since Vista, although its monotonic clock remains irritatingly 32-bit. (49.7 days is a magic number)
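The 49.7 figure falls straight out of the arithmetic, assuming the 32-bit counter ticks in milliseconds. A quick Python check, plus a hypothetical `elapsed_ms` helper showing the usual way to stay correct across a single wrap:

```python
MS_PER_DAY = 24 * 60 * 60 * 1000

def wrap_days(bits: int) -> float:
    """Days until an unsigned millisecond counter of the given width wraps to zero."""
    return 2 ** bits / MS_PER_DAY

def elapsed_ms(start: int, now: int, bits: int = 32) -> int:
    """Elapsed ticks between two counter readings, correct across a single wrap."""
    return (now - start) % (2 ** bits)

print(round(wrap_days(32), 1))      # 49.7
print(elapsed_ms(2 ** 32 - 10, 5))  # 15
```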
No, it is not me that has missed the point: if you define an internal representation in terms of a monotonic counter that is no longer UTC, even if the epoch is defined in terms of UTC: that simply calibrates the meaning of the counter against the real world: it is _not_ itself UTC and immune to this kind of snafu. Windows uses such a monotonic timer internally, not UTC. Nothing uses UTC, at least you haven't given an example of a system that does yet.
Simply using UTC would not have solved this problem. Nor as you would admit would it solve the leap second/year problems. Nor would it solve the simple problem of months being different lengths. Nor would it solve the simple problem on non-normalised units.
Why do you think the POSIX time_t type exists, for example? It and similar artificial measures exist simply to get away from these real-world complexities that inevitably introduce these kinds of corner cases and the flakiness that comes with them. Simply adopting UTC internally does nothing to solve that and indeed can introduce subtleties of its own, such as the same instant falling on a different date in two nearby pieces of code.
> Err, how about treating that at 1 March 2013, perhaps?
X.509 certificates are used to provide security functions. Security measures are usually designed to fail secure: when an error is detected, and the system can't verify secure operation, it denies the request / terminates the action / etc. (That's not always what "secure" means, of course. In some cases it might be more secure to reset to a set of hard-coded defaults, for example.)
So generally speaking, the recipient of an X.509 certificate that has an invalid date should reject that certificate. It's hard to see what attack modes would produce a certificate that has an invalid expiration date but is otherwise valid; but that doesn't mean there aren't any.
More generally, there's always a tension between best-effort design principles (like the Postel Interoperability Principle), where the recipient tries its best to determine what the sender wanted, on the one hand; and strict-adherence design principles, where the recipient insists on well-formed data, on the other. The former allow for sloppy implementations and occasional misinterpretation in exchange for making it easier to get things working. The latter make it harder for legitimate use, but they also make the system harder to exploit.
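As a toy illustration of the fail-secure side of that trade-off (not how any real X.509 library is implemented), a strict recipient might look like the Python sketch below. It assumes the validity dates arrive as strings in the YYMMDDHHMMSSZ style; both function names are hypothetical.

```python
from datetime import datetime, timezone

UTCTIME_FORMAT = "%y%m%d%H%M%SZ"  # e.g. "130301000000Z"

def parse_utctime(text: str) -> datetime:
    """Strictly parse a validity timestamp; raises ValueError for impossible dates."""
    return datetime.strptime(text, UTCTIME_FORMAT).replace(tzinfo=timezone.utc)

def accept_certificate(not_before: str, not_after: str, now: datetime) -> bool:
    """Fail secure: any unparseable or inconsistent validity period is rejected."""
    try:
        start, end = parse_utctime(not_before), parse_utctime(not_after)
    except ValueError:
        return False
    return start <= now <= end

now = datetime(2012, 6, 1, tzinfo=timezone.utc)
print(accept_certificate("120229000000Z", "130229000000Z", now))  # False
print(accept_certificate("120229000000Z", "130301000000Z", now))  # True
```

A best-effort recipient would instead try to guess what "29 February 2013" was meant to be, which is friendlier to sloppy senders but gives an attacker more room to play with.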
I remember having an issue with Exchange 2007 and a leap year. I spent a whole day looking at that issue and ended up calling Microsoft.
They are not the only ones with time issues: Apple iOS has had issues when changing to summer time, and Symantec Backup Exec does funny things as well.
Man, trips down memory lane. I remember some crazy ass date time bugs...
Just about everyone gets dates wrong, but Calendar appointments while appearing simple wind up being complex with recurring appointments being particularly vulnerable. The fix cannot tell if items were created before or after the fix, and there is nowhere in the metadata to keep things straight. Updated or not? Of course these are the items that bite folks in the butt since they are one hour late or early if something goes wrong. Especially if a recurring appointment such as a Birthday gets flung into the next day. Or scheduled tasks like backup, archiving runs, replication schedules. When is the 13th, 14th, 15th, month of the year? Hardcoded MM/DD/YYYY makes me want to send entire teams of developers to internationalization jail.
The Microsoft apologists and excusers are overlooking the fact that the situation which caused the crash was entirely foreseeable: not a weird combination of circumstances but an inherent part of what the routine was supposed to do. It was apparently not tested to see what would happen on a leap year, and if that is the case, it is inexcusable.
As Twain (Mark, not MS) said, "It ain't what you don't know that gets you into trouble, it's what you know for sure that ain't so." In this case, the software decided that there's a hardware fault without actually having any hardware monitoring flag a problem. I've seen banks with whole mainframes dedicated to testing. They roll the clocks forward and backwards to test what happens over time with their applications before deploying to a live environment. It appears that MS doesn't run such tests. That's a little scary. They don't even check their hosted, must-be-up-at-all-costs cloud software for leap-year date problems.
Anyone can make date handling mistakes, the question is whether the testing is done and architecture is right and fault isolation (or even diagnosis) is baked into the design. I guess that's why people buy mid-range unix systems and mainframes. Better hardware design and diagnostics and a real reluctance to imagine that hitting the reset button is a valid solution. This might be ok in SMEs but it is just not fit for the enterprise.
Thank-you for participating in the MS "Train the Software Release Manager," "Train the Designer" and "Train the Coder" Programs. Your data is appreciated. Please hold.
Try coding the Monthly Expiry Dates on a Pay As You Go insurance policy ;)
Neither the "End of next month" nor the "1st March following year" approaches, mentioned in comments above, are correct. This applies to most systems where you may need to increment more than once - if you don't always add n period to the start date, you end up with the end date continually creeping forwards, which is seldom an appropriate solution.
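A short Python sketch of why the expiry should always be computed from the start date. The `add_months` helper and its clamp-to-month-end policy are assumptions for the example. With clamping the chained dates drift earlier; with a roll-into-next-month policy they creep forwards instead, as described above.

```python
import calendar
from datetime import date

def add_months(start: date, n: int) -> date:
    """Return start plus n months, clamping to the last day of a short month."""
    month_index = start.month - 1 + n
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))

start = date(2012, 1, 31)

# Always computed from the original start date: no drift.
anchored = [add_months(start, n) for n in range(1, 4)]

# Computed from the previous expiry: the day-of-month is lost after February.
chained, current = [], start
for _ in range(3):
    current = add_months(current, 1)
    chained.append(current)

print([d.isoformat() for d in anchored])  # ['2012-02-29', '2012-03-31', '2012-04-30']
print([d.isoformat() for d in chained])   # ['2012-02-29', '2012-03-29', '2012-04-29']
```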
Just read the summary linked to in the article. They did indeed just take the date and add 1 to the year, and not with a date object and the dateadd function.
Unbelievable that this happened on their production service. If I was a customer I'd be migrating to another platform right now...
If I was a customer I'd be migrating to another platform right now...
That was my first thought.
But is it a bit like flying after a plane crash - this will be a shake up that makes sure everything gets checked properly the way it should have been in the first place?
I mean I don't know whether their competitors have equally stupid assumptions programmed in that just haven't come to light yet.
These occasional IT bush fires can be good for clearing away some of the useless clutter.
A few years ago I spent many hours trying to get Exchange 2003 installed and it kept falling over. Several hours in my search on the error message started producing hundreds of additional results from countries several hours ahead of me.
That's right. It couldn't be installed on Feb 29th. I had to wait until March 1st.
"Now the software biz behemoth has put its hands up and admitted in a detailed dissection of the blunder how a calendar glitch trashed its server farm. It's also a handy guide to setting up your own wholesale-sized cloud platform."
Surely that should be "a handy guide on how not to set up your own wholesale-sized cloud platform?"
Don't use a cloud. The end. The reason this time was Leap Day, what will it be next time?
If someone says "cloud" in your organization, squash it. Explain to them that there is no such a thing as a cloud, only mirrored data centers, otherwise known as off site storage and thin client services that are pay as you go. It is good for no one (well maybe "cloud" providers), least of all for on site techs (soon to be outsourced). There were reasons we went away from thin clients years ago, those reasons are still there.
Although everyone blames the guy who thought he could just do year++, it was the fancy timestamp-style date handling that actually made it crash. If the whole system had used the naive method, it would have worked fine. The certificate would have been valid on 2013-02-28 and invalid on 2013-03-01.
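That claim is easy to check with a toy Python version of the naive method, assuming dates are held as YYYYMMDD integers throughout and only ever compared, never converted (the function name is invented for the example):

```python
def naive_is_valid(today: int, expiry: int) -> bool:
    """Compare YYYYMMDD integers directly, never converting them to real dates."""
    return today <= expiry

expiry = 20120229 + 10000  # 20130229, a day that does not exist
print(naive_is_valid(20130228, expiry))  # True
print(naive_is_valid(20130301, expiry))  # False
```

The nonexistent date only becomes a problem at the point where some other component insists on turning it into a real timestamp.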
Biting the hand that feeds IT © 1998–2019