Leap second bug cripples Linux servers at airlines, Reddit, LinkedIn

The leap second inserted at the weekend crippled Linux-powered servers running one of the world’s largest airline reservation systems - delaying and cancelling flights. Machines running the mighty Amadeus Altea system were brought down soon after an extra second was added to Coordinated Universal Time (UTC) at midnight on …

COMMENTS

This topic is closed for new posts.
  1. jai

    simple solution

    can't we just trigger all the volcanoes on one side of the world to erupt at the same time, thereby adding some extra spin to the planet, negating the slow-down, and negating the need to keep on adjusting the time by a second? c'mon scientists! why haven't you sorted this out yet??

    1. Anonymous Coward
      Anonymous Coward

      Re: simple solution

      Just string a row of Bussard ramjets along a meridian...

    2. joejack
      Megaphone

      Re: simple solution

      Took us out as well. This plus cycling the app servers fixed us:

      sudo date -s "`date -u`"

  2. Tom 7

    Its a kernel bug

    that only affects Java?

    1. Nigel 11

      Re: Its a kernel bug

      Don't know the details here, but it's not impossible. Kernel documented to do X, actually does Y which is subtly different. Java is the only widespread app which does something noticeably bad as a consequence. Everything else "just works" much the same under X or Y.

    2. diodesign (Written by Reg staff) Silver badge

      Re: Its a kernel bug

      As I understand it, the system ideally should be a multicore machine on a 2.6.32-3.3 production kernel for the bug to occur. It strikes if one core grabs a high-resolution kernel timer lock while setting the time via the adjtime() system call and another core holds two other related timer locks. This will livelock the kernel, manifesting in services burning up all available CPU time.

      I'm not entirely sure so I'd rather not say for certain at this stage; last time I looked there was a lot of discussion on exactly how the crash happens. You have to be fairly unlucky to encounter it, but if you've got a server farm, you will see machines dropping off.

      C.

      1. Destroy All Monsters Silver badge
        Unhappy

        Re: Its a kernel bug

        Fairly unlucky ... at 66% hitrate.

      2. AdamWill
        Stop

        Re: Its a kernel bug

        "As I understand it, the system ideally should be a multicore machine on a 2.6.32-3.3 production kernel for the bug to occur"

        The RHEL advisory says it affects only RHEL 4 and RHEL 5, though, which are running *far* older kernels (like, 2.6.9 and 2.6.18 or something; ancient history). It specifically says the RHEL 6 kernel - which is newer - isn't affected. That suggests you're wrong.

        Disclaimer: I work for RH but on Fedora, not RHEL, and I know nothing about this issue beyond a vague impression that it only affects really old kernels, which is why it's older enterprise installs that are having trouble.

        1. diodesign (Written by Reg staff) Silver badge

          Re: AdamWill

          "That suggests you're wrong"

          With respect to version numbers, probably - it was a range some had suggested through experience rather than a triaged range. The reason why is more or less there - see the LKML.

          C.

          1. Ramazan
            Holmes

            @diodesign

            BTW, Instagram also went down yesterday, and this was probably caused by the same bug.

            P.S. Will we see anyone from the Linux community falling on their sword?

    3. pete23
      Holmes

      Re: Its a kernel bug

      Heh. I have one where a combination of DTrace and the JVM results in spiralling startup times for short-lived Java processes on Solaris 10u7. Any suitably complex systems can interact in unfathomably chaotic ways...

  3. Anonymous Coward
    Anonymous Coward

    Indian's again?

    I remember back in the day Amadeus was one of my first experiences of outsourcing to India.

    Speaking to there support team was great fun... "Oh what's that? Internet Explorer isn't working with Amadeus?... Now sir, turn off all security in Internet Explorer this will make it work"...

    Ok, that was for the end user app.. But I guess the developers are the same.

    1. Anonymous Coward
      Anonymous Coward

      Re: Indian's again?

      I have met some Indians with very good English though.

      They know that Indian's again should be spelt Indians again and there support should be their support.

      1. Anonymous Coward
        Anonymous Coward

        Re: Indian's again?

        I think you just got trolled.

  4. Frederic Bloggs
    Linux

    Linux bug? Really?

    We have linux boxes of various ages (some more than 5 years old) in some fairly critical places running messaging software with NTP direct from gps hardware. They all simply carried on working perfectly happily. Mind you all the apps are written in boringly old fashioned C.

    Java? I've heard of it.

    1. BobaFett

      Re: Linux bug? Really?

      Has anyone figured out why the Java VM running on Linux is so adversely affected?

      1. Oninoshiko
        Boffin

        Re: Linux bug? Really?

        From what I've read elsewhere the problem has to do with how it handles threading. Java was apparently not the only thing affected, just one of the most common things. I've seen reports of moderately to heavily loaded MySQL instances being affected also.

        Java just happens to be widely deployed, and very likely to create the proper situation for this bug to manifest, but Java is not doing anything which is outside of what the system says should work (ergo, it's not a Java bug). Furthermore, this problem did not manifest on non-Linux systems running Java.

    2. This post has been deleted by its author

  5. Anonymous Coward
    Anonymous Coward

    Paging Bob Vistakin

    We now need 4 months worth of shit jokes about Linux being a massive turd because some systems crashed.

    Sauce for the goose....

    1. Number6

      Re: Paging Bob Vistakin

      I have a vague recollection of an OS/2 time bug that crashed ATMs across the world at the same time. If you're going to screw up, you might as well do it in a high-profile manner and make it worthwhile.

      Anyway, highly-polished turd, if you please.

    2. Anonymous Coward
      Anonymous Coward

      Re: Paging Bob Vistakin

      I also notice that we've not heard from eulampios... When MS had their certificate/Leap year problem they were really vocal, maybe they're both on holiday...

  6. Anonymous Coward
    Anonymous Coward

    Phew...

    ...looks like I somehow engineered a miraculous escape. Three Linux-based systems at my house (nettop, netbook and a Raspberry Pi), and I didn't find a single one of them on its back on Sunday morning, having joined the choir invisible.

    Maybe I should've lent my Pi to these folks...

    1. Anonymous Coward
      Anonymous Coward

      Re: Phew...

      Well, clearly you are running skanky hardware like me, nothing too fancy, so you'll be fine ;)

    2. Peter Gathercole Silver badge

      Re: Phew...

      According to various write-ups, you needed a multi-CPU system for the problem to show itself.

      1. Greg J Preece

        Re: Phew...

        Like that multi-CPU desktop I've got at home. Dreading booting it up tonight now.

        1. GreyWolf
          Go

          Re: Phew...

          Fear not - the bug only manifested if your machine was (a) running at midnight (b) using more than one core AND had a really old kernel.

      2. Vic

        Re: Phew...

        > you needed a multi-CPU system for the problem to show itself.

        I'm running quite a lot of those, across multiple sites. I didn't have any machines lock up.

        Of course, those machines run very little Java. That's probably unrelated.

        Vic.

  7. Steve Hosgood
    WTF?

    Leap seconds: not a one-off unique event

    How did this ever happen anyway? There's a leap second about once every 18 months IIRC, so it's not like Y2K where no-one had ever experienced the event before it happened!

    1. Brewster's Angle Grinder Silver badge
      Coat

      Re: Leap seconds: not a one-off unique event

      The last one was 31st December 2008.

      (Amusingly, my copy of Bulletin B arrived while I was reading this.)

    2. Anonymous Coward
      Anonymous Coward

      Re: Leap seconds: not a one-off unique event

      There hadn't been one since 2009, which is longer than some of those big-name websites have been running on their current infrastructure.

      I support a major industrial / scientific application and have been shitting myself for the last couple of weeks, when I found out accidentally about the leap second. I really should read those NANUs more often :-P

      It also hit a major GPS vendor in a similar way as us: by the time they found there was a problem in their firmware, there was not enough time to fix it and push out the patches, so they had to email users with work-around instructions instead. Our problem was slightly different in that we didn't realise about the leap second until two weeks ago (they are published at least six months in advance), so we didn't have time to test--in the end, none of our users have reported anything amiss that I'm aware of, so all's well.

      The point here is that, while not unique, it's infrequent enough that a lot of us don't give it the attention it deserves, particularly since high-precision timing has a lot more users nowadays than it had only a few years ago.

      Be interesting to see what will happen when/if the first negative leap second is introduced.

      1. Anonymous Coward
        Anonymous Coward

        Re: Leap seconds: not a one-off unique event

        Incidentally, that Google blog post mentioned in the article makes for an interesting read. A proper engineering team, their 7Ps figured out, and learned from previous mistakes.

        URL: http://googleblog.blogspot.co.uk/2011/09/time-technology-and-leaping-seconds.html

    3. Sureo
      Coat

      Re: Leap seconds: not a one-off unique event

      Why not add 2 seconds so we can skip the next one?

      1. Boris the Cockroach Silver badge
        Thumb Down

        Re: Leap seconds: not a one-off unique event

        Easy... because the next big mega thrust earthquake will speed up the earth's rotation slightly meaning you'll be 3 seconds out

        Or the difference between hitting the atmosphere of Mars at the right angle or missing completely

    4. Paul Crawford Silver badge
      Linux

      Re: Leap seconds: not a one-off unique event

      I don't know why they hung. All of our Linux machines ran quite happily (as they have done for years before, including this event) using NTP and Trimble GPS for precise time-keeping.

      It is not like the folk behind NTP don't know about this, it has been supported and documented for a long time:

      http://www.eecis.udel.edu/~mills/leap.html

      Sounds like something that was not tested during the last leap second event, but still, I fail to understand why it would take the system down for more than 1 second?

      1. Anonymous Coward
        Anonymous Coward

        Re: Leap seconds: not a one-off unique event

        Looks like almost all Linux servers/desktops/embedded systems were NOT affected by this. Shows the strength of a heterogeneous system

  8. Destroy All Monsters Silver badge
    Mushroom

    Yeah...

    Sod this. Red Hat 6 servers going haywire *simultaneously* from 02:00 CEST (Red Hat 5 holding though), with load average at around 150 or so. JBoss was gummed up, restart was not helping, everything dying a crawling slow death, everything waiting for some futex. The only thing I could think of was that mcelogd is triggered from cron at that time. Bad lead.

    That was not a good night.

    Who injects leap seconds on weekends, at night, during the Euro football??!?

    1. Steve Knox
      Meh

      Who injects leap seconds on weekends, at night, during the Euro football??!?

      People who realize that doing it during the week is likely to cause even more business disruption.

      As for your other two time references, it's always "at night" somewhere, and there's pretty much always a popular sporting event going on. So those references are moot.

      1. Destroy All Monsters Silver badge
        Facepalm

        Re: Who injects leap seconds on weekends, at night, during the Euro football??!?

        > People who realize that doing it during the week is likely to cause even more business disruption.

        Really I would like to see those people. Probably managers. Or freshmen.

        Definitely not people who have to pay for the sysops to be called in pronto.

        1. Anonymous Coward
          Anonymous Coward

          Re: Who injects leap seconds on weekends, at night, during the Euro football??!?

          Conversely, who schedules a football championship or a weekend, during a leap second event?

          The time of the day issue can be solved by temporarily moving to a different time zone.

          Btw, am I the only sad git who looked for this line in his /var/log/messages?

          Jun 30 23:59:59 refaim kernel: [215399.847017] Clock: inserting leap second 23:59:60 UTC
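
For anyone wanting to run the same check across a batch of saved logs, here is a minimal sketch. It assumes the kernel message is worded exactly like the line quoted above (other kernel versions may phrase it differently), and the names `LEAP_LINE` and `find_leap_insertions` are made up for illustration.

```python
import re

# Pattern modelled on the single log line quoted in the thread;
# the exact wording on other kernels is an assumption.
LEAP_LINE = re.compile(r"Clock: inserting leap second (\d{2}:\d{2}:\d{2}) UTC")


def find_leap_insertions(lines):
    """Return the leap-second timestamps announced in an iterable of log lines."""
    found = []
    for line in lines:
        match = LEAP_LINE.search(line)
        if match:
            found.append(match.group(1))
    return found


sample = [
    # Line quoted in the comment above.
    "Jun 30 23:59:59 refaim kernel: [215399.847017] Clock: inserting leap second 23:59:60 UTC",
    # Invented filler line, only here to show that non-matching lines are skipped.
    "Jun 30 23:59:59 refaim example: unrelated message",
]
print(find_leap_insertions(sample))
```

To scan a real file, pass an open file object instead of `sample`, since the function only needs something it can iterate over line by line.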

        2. Steve Knox
          Thumb Down

          Re: Who injects leap seconds on weekends, at night, during the Euro football??!?

          Sysops aren't users or customers. Inconvenience is why they're paid. If you don't want to work weekend or late hours, IT support is not the field for you.

    2. Number6

      Re: Yeah...

      OK, maybe it did bite me. I woke up on Sunday morning to find my main machine with all four cores at 100% and hitting 80C and tripping alarms. The main machine runs Mint 12 but it's running a VirtualBox instance of Centos 6. I wonder if the time bug caused it? I had to reboot the whole shebang, just restarting the VM didn't fix it.

    3. Velv
      Headmaster

      Re: Yeah...

      "Who injects leap seconds on weekends, at night, during the Euro football??!?"

      Boffins who wear socks and sandals, have no concept of day, night or weekend (they work when it interests them), and who avoided the Euro football by more than 14400 seconds so you should count yourself lucky.

    4. J.G.Harston Silver badge
      FAIL

      Re: Yeah...

      "Who injects leap seconds on weekends, at night, during the Euro football??!?"

      By definition, leap seconds are always added (or removed) at the end of the last day of December or the end of the last day of June.
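
The rule as stated above is easy to turn into a lookup. This is only a sketch of the comment's own description (end of June, end of December); the function name `leap_second_candidates` is hypothetical, and the code says nothing about whether a leap second is actually scheduled in a given year.

```python
from datetime import date


def leap_second_candidates(year):
    """Days whose final UTC minute may carry a leap second, per the rule above."""
    return [date(year, 6, 30), date(year, 12, 31)]


# Print each candidate day for 2012 together with its weekday name.
for day in leap_second_candidates(2012):
    print(day.isoformat(), day.strftime("%A"))
```

Because the candidate days are fixed calendar dates, whether one lands on a weekend is down to the calendar rather than to anyone's scheduling choice.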

  9. Khaptain Silver badge
    Joke

    One second

    When Linus was presenting the finger to nVidia, he was not being rude. He was just showing them the 1 second that he forgot to account for in the kernel ........

  10. Anonymous Coward
    Anonymous Coward

    My Windows servers never missed a beat :-P

    Oh come on, the Penguin fanbois never usually miss a chance to poke fun at MS stuffware! :-)

    1. Greg J Preece
      Trollface

      Re: My Windows servers never missed a beat :-P

      Were they too busy remote-installing Skype across your network? :-p

      1. Anonymous Coward
        Anonymous Coward

        Re: My Windows servers never missed a beat :-P

        No, you see, I didn't leave the out of the box settings without changing them.

        Also, I've had updates from RedHat knacker bits of my systems, I can't think of any software company who is perfect in this record.

  11. wheelybird

    Not just java...

    MySQL was also hit by this bug. Specifically MySQL on Debian based servers (and Slackware).

    Here's a fix (which would have been handy to know yesterday);

    /etc/init.d/ntp stop;
    date;
    date `date +"%m%d%H%M%C%y.%S"`;
    date;
    /etc/init.d/ntp start

    This solves the problem.
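
The least obvious part of that fix is the argument handed to `date`: the format string produces month, day, hour, minute, four-digit year, then a dot and the seconds, so the command simply sets the clock to the time it already shows. A small sketch of the same layout follows. It only builds the string and does not touch the system clock; `date_argument` is a made-up name, and `%Y` stands in for the `%C%y` pair.

```python
from datetime import datetime


def date_argument(moment):
    """Format a datetime as MMDDhhmmCCYY.ss, the layout used in the fix above."""
    return moment.strftime("%m%d%H%M%Y.%S")


# Five seconds past midnight on 1 July 2012.
print(date_argument(datetime(2012, 7, 1, 0, 0, 5)))
```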

    1. Anonymous Coward
      Anonymous Coward

      Re: Not just java...

      What problem?

      My Linux server running MySQL sailed through the weekend with no problems at all..

      1. JEDIDIAH
        Linux

        Re: Not just java...

        Yeah. What problem?

        My two multi-core machines that run mysql were not affected.

        Also, none of the machines at the office were impacted.

        If not for these hysterics in the press it would never have occurred to me that leap seconds are anything to worry about at all.

    2. Tom 7

      Re: Not just java && Kernel bug?

      OK I'll change that to:

      Does it affect only machines running Oracle software...

    3. Anthony Cartmell
      WTF?

      Re: Not just java...

      Apache Solr on one of my servers appears to have been affected by this, resulting in two threads running for a long period with 50% CPU each. Kernel 2.6.35.7

      Nagios monitoring shows the problem started an hour or so after midnight on 30 June.

      Doing the above magic incantation at the command line appears to have fixed the problem :)

  12. Tom 38
    Devil

    <<-- Smug twat

    All our FreeBSD servers were unaffected, apart from the clock getting corrected from NTP.

    Now, if this were only true all the time. Linux 3453453 - FreeBSD 1 (but we're catching)

    1. Anonymous Coward
      Anonymous Coward

      Re: <<-- Smug twat

      Go on then. This is something that I never hear: Why (Free)BSD over Linux (or anything else)?

      I honestly have never looked into it.

      1. Tom 38
        Devil

        Re: <<-- Smug twat

        You cannot compare Linux and BSD. One is a disparate set of software packages cobbled together by a distributor, and the other is an operating system lovingly crafted since 1977.

        But in the real world, the main differences are:

        (Free|Open|Net)BSD are all complete OS, rather than a set of base packages.

        Less hardware support for BSD.

        BSD is fully documented, Linux, not so much.

        BSD is not tainted by FSF dogma.

        No/little GPL code - ever decreasing amounts.

        ZFS support (see 'No FSF dogma above').

        DTrace support (ditto).

        Linux tends to have better package management tools*.

        BSD has jails, which are like VMs, but without the overhead.

        The biggest plus TBH is that we've been using it for so long we know where everything is. I don't doubt we could use Linux just as effectively, but we would have to learn it all again.

        * One plus for BSD in package management is that it doesn't do that brain dead linux tradition of splitting packages up into 'libfoo' and 'libfoo-dev' - what kind of fucked up brain thinks not installing the header files for a library is a good idea - y'know, the stuff that actually allows you to use the API presented by the library.

  13. Anonymous Coward
    Anonymous Coward

    MS?

    Good job there's no bugs in Microsoft Operating systems or the world would go haywire!

  14. Ben Liddicott
    Pint

    For interest, windows does the opposite to google

    Windows ignores leap second indicator, and treats the updated time after the event as clock skew.

    This means in practice it adjusts gradually after the leap second.

    People: Do one of those things. Your junior devs will *never* be good enough to cope with leap seconds in their time calculations. Push the problem down the stack.
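
Both options boil down to spreading the one-second step over a stretch of time instead of repeating a second. A rough sketch of that idea, assuming a plain linear ramp and a caller-chosen window length; neither Windows nor Google is claimed to use this exact curve, and `smeared_offset` is a hypothetical name.

```python
def smeared_offset(elapsed, window):
    """Fraction of the extra second applied `elapsed` seconds into a smear window.

    Assumes a linear ramp: 0.0 at the start of the window, 1.0 at the end,
    and clamped outside it.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    clamped = min(max(float(elapsed), 0.0), float(window))
    return clamped / window


# Halfway through a (hypothetical) 1000-second window.
print(smeared_offset(500, 1000))
```

Application code above this layer never sees a 23:59:60 timestamp, which is the point of pushing the problem down the stack.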

  15. Anonymous Coward
    Anonymous Coward

    Well...

    It's a good job that a whole load of Linux types didn't round on a certain Redmondian company when their cloud service admin systems were brought down by a leap year related certificate error...

    Just saying...

  16. trollied

    Killed my Ubuntu desktop too

    Got to work this morning and the load average on my Ubuntu desktop was ~30.

    A fair few of our servers went kaput. Thankfully I wasn't on call! ;)

  17. Anonymous Coward
    Anonymous Coward

    Major ISP running Huawei at the Edge

    Email goes round, "expecting some Huawei boxes to crash...."

    Now I know they run Lunix.

    1. Anonymous Coward
      Anonymous Coward

      Re: Major ISP running Huawei at the Edge

      That would be TalkTalk, yes?

  18. Bela Lubkin
    Black Helicopters

    timetard(is)

    This is obviously an outrageous attack on OSS by Microsoft! They deliberately slowed the rotation of the Earth in order to insert a leap second, thus sabotaging dozens of services relying on Linux.

  19. Mike G
    Trollface

    I thought open source was supposed to be superior because it means everybody could see and fix bugs, there's no way bugs could go unfixed for years like in bad proprietary software?

    1. Paul Crawford Silver badge
      Facepalm

      Sadly I am troll-feeding but it's worth noting this bug was introduced around the time of the last leap-second.

      Real problem is lack of testing of such rare events. Windows is no better really, it ignores the leap second so clock is simply wrong for a while which can be a problem for transaction systems.

  20. Paul Crawford Silver badge
    Boffin

    How to test?

    Why can't someone set up an NTP server called skippy or something that, every two days, adds an anomalous +1 leap second and then, two days later, a matching -1 leap second?

    That way you could set up a test machine that is more-or-less on time, but makes sure that your kernel and any Java, etc, updates are all happy with the concept of a clock shift.
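
The schedule such a server would follow can be sketched in a few lines. This models only the timetable (alternating +1 and -1 steps two days apart), not NTP itself or how a real server would announce the steps; `skippy_schedule` and its parameters are invented for illustration.

```python
from datetime import datetime, timedelta
from itertools import islice


def skippy_schedule(start, interval_days=2):
    """Yield (when, step) pairs: +1 second, then -1 second, every `interval_days`."""
    step = 1
    when = start
    while True:
        when = when + timedelta(days=interval_days)
        yield when, step
        step = -step  # the next event undoes the previous one


# First four events of a schedule starting on 1 July 2012.
events = list(islice(skippy_schedule(datetime(2012, 7, 1)), 4))
net = sum(step for _, step in events)
for when, step in events:
    print(when.date().isoformat(), f"{step:+d}")
```

After every pair of events the net offset is back to zero, which is what keeps the test machine more-or-less on time.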
