*Delta knocks on the door of another airline*
"Hi, I was just going to check if there was a power cut, but I saw one of your planes fly overhead with its lights on"
A computer outage has caused worldwide delays for thousands of passengers using Delta Air Lines. The US carrier tweeted about the issues on Monday morning, blaming delayed and cancelled flights on a “computer outage”. Delta, based in Atlanta, Georgia, subsequently blamed the crash on a massive power cut at 2.38am ET (7.38am …
This is why there is so much pressure to kill off leap seconds. The ITU recently kicked that can down the road for another few years, but personally I don't see why this is still such a problem. We've had leap seconds for decades, and computer time protocols have been designed to signal upcoming leap seconds for a *long* time. It does involve the strange concept that a specific minute at the end of June or December has 61 seconds. The specs also allow for a 59-second minute, but that now seems unlikely ever to happen.
You could write the software to deal with a step, but Google decided instead to slow down the computer's notion of time in a controlled fashion for several hours, so that after the leap second actually occurs the clocks are exactly back in sync. That is probably much friendlier to existing applications.
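That smear can be sketched in a few lines. This is only a toy illustration of the linear-ramp idea; the window length, the ramp shape, and the epoch value are my assumptions, not Google's actual parameters.

```python
# Sketch of a linear "leap smear": instead of stepping the clock at the leap
# instant, spread the extra second over a long window ending at the leap
# boundary, so the smeared clock rejoins UTC exactly when the leap second
# has passed. Window length and linear ramp are illustrative assumptions.

SMEAR_WINDOW = 20 * 3600  # smear over the 20 hours before the boundary (assumption)

def smeared_offset(t, leap_boundary):
    """Seconds to subtract from the raw clock at raw time t."""
    start = leap_boundary - SMEAR_WINDOW
    if t <= start:
        return 0.0                      # before the window: clocks agree
    if t >= leap_boundary:
        return 1.0                      # after the boundary: full second absorbed
    return (t - start) / SMEAR_WINDOW   # ramp linearly from 0 to 1 second

def smeared_time(t, leap_boundary):
    return t - smeared_offset(t, leap_boundary)

boundary = 1_483_228_800  # 2017-01-01 00:00:00 UTC, just after a real leap second
print(smeared_offset(boundary - SMEAR_WINDOW, boundary))      # 0.0
print(smeared_offset(boundary - SMEAR_WINDOW / 2, boundary))  # 0.5
print(smeared_offset(boundary, boundary))                     # 1.0
```

The point is that no individual reading ever jumps: applications see a clock that runs imperceptibly slow for a while, not a minute with 61 seconds.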
And long before that we had ephemeris time (1952), and then TDT (1976), and then GPS from 1980 using continuous time with a leap-second offset rather like a time-zone.
As I keep saying, IT IS A KNOWN FEATURE, and if your code can't handle it gracefully you are incompetent, due to either:
1) Not using tested system libraries to handle time, delays, etc.
2) Writing or modifying said libraries without knowing what you are doing.
And most of all, NOT TESTING YOUR DAMN CODE! Really: just set up a fake NTP time server, have it generate leap seconds regularly, backwards and forwards, and see if your code works.
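Short of standing up a fake NTP server, you can at least unit-test the parsing side. Below is a hypothetical `parse_tolerant` helper (not a standard-library function) showing one graceful way to accept a `:60` leap-second timestamp, since Python's `datetime` cannot represent second 60:

```python
# Handle a leap-second timestamp like "2016-12-31 23:59:60" without crashing.
# parse_tolerant is an illustrative helper, not part of any standard library.
from datetime import datetime, timedelta

def parse_tolerant(s):
    """Parse 'YYYY-MM-DD HH:MM:SS', folding a :60 leap second into the next minute."""
    if s.endswith(":60"):
        # datetime can't hold second=60, so parse :59 and add one second
        return datetime.strptime(s[:-3] + ":59",
                                 "%Y-%m-%d %H:%M:%S") + timedelta(seconds=1)
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S")

print(parse_tolerant("2016-12-31 23:59:60"))  # 2017-01-01 00:00:00
```

Feed your own parsing path a few of these and you'll know in minutes whether you're in category 1 or category 2 above.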
Uh, wouldn't that be because a warmer atmosphere expands, moving mass away from the centre of the Earth, so conservation of angular momentum demands that the Earth's rate of spin slows to compensate? The same goes for the seas, which, although they wouldn't move as much, are considerably denser. I'd be surprised if that's sufficient to affect the rotation by as much as several seconds in just a few years, but I can't be bothered to do the maths.
IIRC it's not the expansion of the atmosphere, but a reduction in viscosity that allows atmospheric tides to counterbalance lunar torque. However, I'm being disingenuous: that was coming out of an ice age, and at a time when the Moon was a lot closer (600 Myr ago, a 21-hour day). And while the corner in the delta-T curve is striking, the Earth's mass distribution is changing all the time and it's far more likely that's the cause.
The length of the mean solar day *was* 86400 SI seconds around the middle of the 19th century. That rotation rate was embodied in the astronomical observations that were used to define Universal Time in the late 19th century. When UT was replaced as the best measure of time by Ephemeris Time and then by International Atomic Time in the 20th century, both ET and TAI were defined to have the same length of second as UT. And that's why we have a problem with leap seconds: the SI second reflects the rotation speed of the Earth almost 200 years ago, not today.
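The arithmetic behind the accumulation is simple. Taking a rough excess of 2 ms per mean solar day (an illustrative average; the actual excess wanders, which is exactly why leap seconds can't be scheduled far in advance):

```python
# Back-of-envelope: if today's mean solar day runs about 86400.002 SI seconds
# because the SI second matches the Earth's rotation of ~200 years ago, the
# excess accumulates to a whole second roughly every 500 days. The 2 ms
# figure is a rough illustrative average, not a measured constant.
excess_per_day = 0.002                      # seconds of rotation lag per day (assumption)
days_per_leap_second = 1.0 / excess_per_day
print(days_per_leap_second)                 # 500.0
```

Which matches the observed cadence of a leap second every year or two since 1972.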
"Will people be ready for that one?"
Well, the one that followed the aircraft-bothering incident went by with practically no issues at all, simply because folk had woken up and tested things for the inevitable occurrence of another leap second.
In fact the Linux bug mentioned had been created by somebody modifying already-working time-related code and not testing the damn thing for this situation. As others have already said, leap seconds and the means to deal with them have been with us for decades, so it's not new stuff. But every new generation of code monkeys seems to be able to break things...
"In fact the Linux bug mentioned had been created by somebody modifying already-working time-related code and not testing the damn thing for this situation."
Was it actually a Linux bug? I find that hard to believe given the tens of millions of installations of Linux in back-end server systems, not to mention embedded systems, around the world. An OS timing bug would have caused more problems than just an airline's reservation system going down. Far more likely an application bug that was conveniently blamed on the OS. Also, what application crashes just because of a one-second difference, even if the OS was at fault?
While we're at it, it's Sabre, not "Sebre", and Virgin Australia are on the Sabre system, not the Altea system.
Oh and Delta bought the code rights to the mainframe they are on (Deltamatic). It is still managed by Travelport (for infrastructure). http://www.travelmarketreport.com/articles/Delta-Reacquires-Res-Operations-Systems-From-Travelport
It's not really clear what data centre was hit and whether it was the one that houses the mainframe. Delta reservations appeared to be okay.
Grumpy old mainframer...
Really? They have their main business-critical system at a single datacenter, without geographical redundancy, so a power cut at that datacenter can bring down the whole thing?
I would have hoped such an essential system would be spread over 2 or 3 sites, so that losing one site has no impact on operations.
Reasoning which holds up well until the catastrophe occurs and you see the bill for repairs. More often than not, you will then reevaluate your opinion of what "makes sense" as far as investments are concerned.
True story: at an important government-level organization I will not name further, there was a kerfuffle when a senior engineer warned, in writing, all the way up the hierarchy, that the then-current PC upgrade process was an open invitation to viruses and expensive downtime.
He was hauled into his manager's office for a right chewing-out, which, being a senior engineer in a position from which nobody could oust him, he met with a verbal barrage of his own (likely containing many words such as "idiotic", "moronic", "abysmally stupid" and so on; I don't know, I wasn't there, but I damn well hope so). Still, he was told that the investment "wasn't worth it" and that he should "stop making waves".
As fate would have it, the tsunami hit later that year. An outdated PC piloted by a nincompoop got infected, the infection spread to the servers, and everything was shut down for at least 3 days. That's over 500 people with no more PCs for 24 work hours. You do the math.
He did the math, and presented the cleanup bill with a scathing "I told you so" that, curiously, all the managers took quite meekly.
The PC upgrade schedule was changed after that. Unbelievable, ain't it?
I've been involved in several incidents over the years where I've said to the boss: "We need to spend X to replace an aging/failing system." I'd be asked: "Is it currently broken or about to fail?", and when I replied "No", was told to forget about it.
Later on, the system in question would die and management would complain about people not being able to do their jobs. A blank cheque was usually swiftly provided to replace said faulty system.
Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know. And if one looks throughout the history of our country and other free countries, it is the latter category that tend to be the difficult ones.
> UPS+backup generator=no problem?
Then your single-points-of-failure include the UPS, the generator, the power distribution units, and the emergency-power-cut-off buttons. UPSs, generators and PDUs all break. Idiots switch things off for maintenance without thinking of the consequences, or press emergency-power-cut-off buttons by mistake.
You can have multiple generators or UPSs, although you still have the risk of a design flaw taking them all out when they're needed (e.g. http://www.zdnet.com/article/365-main-details-sf-outage-problems/ ).
There have been plenty of stories about datacenter power outages on The Register, despite the standard UPS+generator.
I remember, about 15 years back, having my terminal screen wink out while working on a system a thousand or so miles distant at a US military data center. Not coincidentally, others nearby working on various other systems there had the same experience at the same time, and ensuing discussions with the SA revealed that all power to the main computer building had dropped because a contractor (WHO HAD BEEN TOLD) severed the cables from the outbuilding containing the substation, the redundant UPSs, and the backup generators. Power was restored around 6 hours later.
It's actually possible to RAID your UPS systems and run everything via the UPS rather than introduce a switchover break (large systems use a flywheel, which has the effect of supplying conditioned power to the site).
It's also possible to RAID the generators that back the UPSes.
As someone else has mentioned, the problem is managers looking to get a bonus for cutting costs who end up ripping resilience out of the systems. I wonder if they'd be so keen if they were made liable for the costs of system failures that trace back to their cost-cutting.
Worked at a privately owned company where the owner decided that the backup generator hadn't been needed in years, so he had it moved and connected to his house. I'll leave it to the reader to visualize what happened 6 months later when the power failed at the plant/office. Two months later, we were getting new dual UPSs and two diesel generators.
The problem tends to be that nobody ever has the rights and the guts to actually force a real live test. I suppose it's just as well that they don't do live tests of nuclear rockets, 8-inch floppy disks or not. What I have seen more than once is companies who think their backups are fine, until they find out a restore was never tried and the backups have been total rubbish for years (typically backing up stuff from the wrong place, one that was the right place long ago).
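The cheapest fix is a restore test you actually run. A minimal sketch, assuming you can restore the latest backup into a scratch directory and compare it against the live tree (paths and layout are illustrative):

```python
# Backup sanity check: hash every file (path + contents) under a directory
# and compare the live tree's digest against a freshly restored copy.
# Directory paths below are illustrative; a real check restores to an
# isolated host, not the production box.
import hashlib
import pathlib

def tree_digest(root):
    """One SHA-256 over all files under root, in a deterministic order."""
    h = hashlib.sha256()
    for p in sorted(pathlib.Path(root).rglob("*")):
        if p.is_file():
            h.update(p.relative_to(root).as_posix().encode())  # include the path
            h.update(p.read_bytes())                           # and the contents
    return h.hexdigest()

# After restoring the latest backup into /scratch/restored:
# assert tree_digest("/srv/live") == tree_digest("/scratch/restored")
```

If that assertion has never run, you don't have backups; you have a tape-shaped hope.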
Go and look up the Netflix Chaos Monkey and its parent, the Simian Army, which does just that: it generates different failures in various tiers of their applications within Amazon AWS to ensure that the applications and infrastructure fail over in the correct manner when things break.
See Simian Army
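The principle is easy to demonstrate in miniature: break a random replica and check the system still answers. The following is a toy sketch, nothing like Netflix's real tooling, and all the names are made up:

```python
# Toy chaos test: inject a failure into one of N redundant "replicas" and
# assert that a read still succeeds via failover. Real Chaos Monkey kills
# AWS instances; here a "replica" is just a callable that might raise.
import random

def read_with_failover(replicas):
    """Return the first successful replica read, or raise if all fail."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except ConnectionError as e:
            errors.append(e)
    raise RuntimeError(f"all {len(replicas)} replicas down: {errors}")

def make_replica(value, broken=False):
    def read():
        if broken:
            raise ConnectionError("chaos monkey killed this replica")
        return value
    return read

# Kill one replica at random; the read must still succeed.
victim = random.randrange(3)
replicas = [make_replica("row-42", broken=(i == victim)) for i in range(3)]
print(read_with_failover(replicas))  # row-42
```

Run it in a loop in CI and the first person to remove the failover path finds out before the customers do.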
Worst. Airline. I. Have. Ever. Flown. With.
LHR to ATL about 20 years ago. Rude staff, shoddy aircraft, and the code-share they had with Virgin coming back was a joke; not knowing which check-in desk to use at JFK was bonkers. At least we flew back with Virgin, which was one of the best flights I've ever had (although that might be because we got bumped...)
Biting the hand that feeds IT © 1998–2019