Oracle tripped up by 'leap second'

Database giant Oracle has issued fixes after its Cluster Ready Services (CRS) software failed to cope with the so-called “leap second” added by scientists at the end of 2008. The Earth Orientation Centre is responsible for calculating when a leap second should be added or subtracted because the Earth doesn’t always orbit …

COMMENTS

This topic is closed for new posts.
  1. Tom

    The Earth always orbits perfectly according to the laws of physics

    It's the maths of the boffins keeping the calendar that isn't perfect. Something about the equations not being solvable for more than two bodies, and there being considerably more than that in the solar system, as I recall.

  2. John Savard

    The Good Old Days

    Back before they started having leap seconds, all this was handled by having the clocks just run a bit slower, and nobody but a few scientists working in specialized areas was affected by this.

    Maybe going back is not the answer, because the Internet is so precisely timed that it is necessary (or, with advancing technology, will soon become necessary) to synchronize to such a high accuracy that even frequent 50-millisecond leap increments would be noticed. Waiting longer, and having leap hours, is not really an option.

    But actually making every second longer - not running our clocks, which tell civil time, by Ephemeris Time, letting their seconds tick away a bit slower than SI seconds would - would seem very unobtrusive, if managed right. Quartz clocks would just adjust themselves to an external time standard which would differ less from the old one than their own inherent limitations, and one which only had the seconds in it you would expect to see.

  3. dervheid
    Happy

    "Cluster Ready Services"?

    Could this possibly be the new name for "Cluster F**K Services"?

    Nah, surely not.

  4. gabor
    Thumb Down

    Oracle vs Zune

    Not sure if they play in the same championship. Zune pisses off a few music listeners that spent some hundred quid on that player.

    Oracle - well.

  5. Anonymous Coward

    Why?

    I've always wondered why a computer would care about an external time source to such an extent. Even if it's wrong, it still has its own internal clock, I thought. Can somebody explain why a system would reboot or freeze up?

  6. Anthony Wood

    Leap second is due to slowing of Earth's spin, not changes in its orbit

    "the Earth doesn’t always orbit perfectly". Is that so? It does wobble about a bit but I'm sure the Earth would claim that it always orbits exactly as it should.

    I thought the reason why we had to add the extra second is because the Earth's rotation is slowing (i.e. the days are getting longer), rather than anything to do with the orbit of the Earth around the Sun. We add the extra second to avoid us gradually getting towards the situation where the Sun rises at midday.

  7. Simon Neill

    @AC

    "Can somebody explain why a system would reboot or freeze up?"

    I imagine the contradiction of the time going 57 58 59 60 00 rather than 58 59 00 would cause a reboot, unhandled exception or whatever you want to call it.

    Odd how the y2k bug that everyone panicked about caused no real issues, but a leap second brings Oracle to its knees.
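
To illustrate the "59, 60, 00" point, here is a minimal sketch of how a ":60" timestamp can surface as an exception. Python's standard `datetime` type only accepts seconds from 0 to 59, so the leap second itself cannot be constructed. `parse_utc` and `try_parse` are hypothetical helpers for the example and say nothing about how Oracle's code handles time.

```python
from datetime import datetime, timezone

def parse_utc(year, month, day, hour, minute, second):
    """Hypothetical helper: build a UTC datetime from clock fields."""
    return datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)

def try_parse(*fields):
    """Return the datetime, or None if the fields are rejected."""
    try:
        return parse_utc(*fields)
    except ValueError:
        # datetime refuses second=60; uncaught, this would abort the caller
        return None

ordinary = try_parse(2008, 12, 31, 23, 59, 59)
leap = try_parse(2008, 12, 31, 23, 59, 60)  # the inserted leap second
```

Code that forgets the `except` branch fails only when a leap second is actually announced, which is why such paths tend to go untested.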

  8. Anthony Wood

    Leap second is due to slowing of Earth's spin, not changes in its orbit

    I thought the reason why we had to add the extra second is because the Earth's rotation is slowing (i.e. the days are getting longer), rather than anything to do with the orbit of the Earth around the Sun. We add the extra second to avoid us gradually getting towards the situation where the Sun rises at midday.

    Incidentally: "the Earth doesn’t always orbit perfectly". Is that so? It does wobble about a bit but I'm sure the Earth would claim that it always orbits exactly as it should.

  9. Anonymous Coward

    Re: The Good Old Days

    > "Waiting longer, and having leap hours, is not really an option."

    Sounds like a good option to me, since the crashes would only happen once every few tens of millennia (that's a lot of nines!)

  10. Anonymous Coward

    @Simon 15:15

    "Odd how the y2k bug that everyone panicked about caused no real issues, but a leap second brings oracle to its knees."

    Yes, it is odd isn't it? One would almost think that hundreds of millions of hours had been spent prior to y2k making sure that the problem wouldn't materialise...

  11. Phil Endecott

    My Linux server crashed

    My Linux web server went down at the stroke of midnight because of this (or a related) bug. (I believe this Oracle system is Linux-based).

    As I understand it, the NTP daemon calls into the kernel to tell it to do the leap second thing. The kernel code that this triggers does something that can cause a crash with some probability low enough not to be detected during testing; I think perhaps it called printk() to output a message, but did this with a lock held which printk() itself needs to take, leading to a deadlock.
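
The deadlock pattern described above can be sketched in user space. This is only an analogy, assuming a non-reentrant lock that the logging routine needs; `log`, `insert_leap_second_buggy` and `insert_leap_second_fixed` are hypothetical names, not kernel functions. A timeout is used so the sketch reports the deadlock instead of hanging.

```python
import threading

log_lock = threading.Lock()  # non-reentrant; stands in for the lock the logger needs
messages = []

def log(message, timeout=0.1):
    """Hypothetical logger: needs log_lock; returns False if it cannot get it."""
    if not log_lock.acquire(timeout=timeout):
        return False  # without the timeout this call would block forever
    try:
        messages.append(message)
        return True
    finally:
        log_lock.release()

def insert_leap_second_buggy():
    """Logs while already holding the lock the logger needs."""
    with log_lock:
        return log("Clock: inserting leap second")

def insert_leap_second_fixed():
    """Finishes the locked work first, then logs."""
    with log_lock:
        pass  # the clock adjustment would happen here
    return log("Clock: inserting leap second")

buggy_logged = insert_leap_second_buggy()
fixed_logged = insert_leap_second_fixed()
```

The buggy variant never gets its message out because the same thread already owns the lock; the fixed variant logs once the lock has been released.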

    Moral of the story? I think that the particular variant that I hit had actually been fixed a while ago, so I could have avoided it by updating. On the other hand, new versions may introduce new problems. More fundamentally, infrequently-exercised code paths are bad: wouldn't it be better to do nothing special for the leap second, and let NTP apply a one second adjustment in the normal way over the next few hours? This would make your clock a second out for a while, but if your system is sensitive to that then I imagine you'll encounter other problems.
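
As a rough sketch of the "adjust in the normal way" idea, the function below models a clock being slewed at a bounded rate instead of stepped. The 500 ppm limit is the figure commonly quoted for ntpd and is treated here as an assumption; the function names are made up for the example.

```python
def slewed_offset(elapsed_s, total_offset_s=1.0, max_rate_ppm=500.0):
    """Correction applied so far when a clock is slewed at a bounded rate.

    max_rate_ppm is an assumed limit (500 ppm is commonly quoted for ntpd).
    """
    rate = max_rate_ppm / 1_000_000
    return min(total_offset_s, elapsed_s * rate)

def slew_duration_s(total_offset_s=1.0, max_rate_ppm=500.0):
    """Time needed to absorb the whole offset at the bounded rate."""
    return total_offset_s / (max_rate_ppm / 1_000_000)

halfway = slewed_offset(1000.0)   # correction after 1000 s of slewing
duration = slew_duration_s()      # seconds needed to absorb one full second
```

Under that assumed rate a one-second error takes roughly half an hour to absorb, during which the clock never jumps or repeats a second.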

    It may also be that there are other bugs that cause more frequent crashes, but we don't notice them because they don't all occur simultaneously. If you have a cluster of machines with fail-over, you should try to avoid common-mode failure by diversification, especially anything related to time.

    More about my crash here; the links in comment #4 are useful:

    http://www.debian-administration.org/users/endecotp/weblog/6

  12. M. Burns
    Boffin

    @Tom

    It's not the calculations. The major reason is that angular momentum conservation in the Earth-Moon system causes the Earth's rotation to slow down as tidal forces transfer angular momentum from the Earth's rotation to the Moon's orbit. In other words, the Earth's rotation slows and the Earth-Moon distance increases over time.

  13. David Harper

    Ask an astronomer

    @Anthony Wood:

    "I thought the reason why we had to add the extra second is because the Earth's rotation is slowing (i.e. the days are getting longer)"

    That's correct, up to a point. When the length of the second was standardised in 1967 (as part of the SI system of units) in terms of the oscillations of a caesium atom, the value was chosen to match the previous international time standard called Ephemeris Time, which was based on the orbital motion of the Moon and planets.

    Unfortunately, that timescale had been defined in such a way that 86,400 seconds were a very good approximation to the length of the day circa the early 19th century. The Earth's rotation had slowed somewhat in the intervening 150 years, so a day in 1970 was several milliseconds longer than a day in 1820. Those milliseconds add up over the course of a year to give an excess of a whole second, and hence the need for a leap-second every so often.
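
The "milliseconds add up" arithmetic can be checked with a few lines. The 2 ms per day figure below is an illustrative assumption, not a measured value; the real excess varies from year to year.

```python
def yearly_drift_s(excess_ms_per_day, days_per_year=365.25):
    """Seconds of accumulated difference after one year."""
    return excess_ms_per_day * days_per_year / 1000.0

def days_per_leap_second(excess_ms_per_day):
    """Days until the accumulated difference reaches one full second."""
    return 1000.0 / excess_ms_per_day

assumed_excess_ms = 2.0  # illustrative assumption only
drift = yearly_drift_s(assumed_excess_ms)
interval_days = days_per_leap_second(assumed_excess_ms)
```

With that assumed excess the drift is a little under three quarters of a second per year, so a leap second would be due roughly every 500 days.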

    @John Savard:

    "But actually making every second longer ... would seem very unobtrusive, if managed right."

    High-precision timekeeping now pervades our lives to such an extent that it would be utterly impractical to adjust the length of the SI second.

    In any case, the Earth's rotation will continue to slow down, so even if we re-defined the SI second to match 1/86,400 of the current length of the day, we would only be making trouble for the future.

  14. Anonymous Coward
    Flame

    @Simon Neill

    "Odd how the y2k bug that everyone panicked about caused no real issues"

    Actually, it caused a bloody great shitload of issues. The difference is that we fixed them before they could cause a problem.

    I'm sick of ignorant people saying it was just a big IT con.

    Nothing happened because of an awful lot of hard work by people like me.

  15. Johnny Five
    Paris Hilton

    Earth's rotation slows due to the moon.

    Earth's rotation is slowing, but the slowing isn't consistent, which is why they cannot predict when a leap second will happen several years in advance. The slowing of Earth's rotation has to do with the Moon and the tidal forces it creates. Give Earth enough time, and her rotation will have synched to the rotation of the Moon.

    BTW, does Oracle's fix mean that the next time there's a leap second at the end of 2008, servers won't reboot first thing in 2009?

  16. Peter Gathercole
    Paris Hilton

    time and reference

    I'm sure that the reason why leap seconds are necessary is that the universe is only perfect if treated holistically.

    I don't believe that any scientist or mathematician is arrogant enough to claim that they can take into account all of the significant gravitational objects that could affect either the Earth's orbit or its rotation. I'm fairly certain that a large asteroid strike, volcanic eruption or earthquake could affect the rotation of the Earth, and the solar wind pressure may perturb the orbit enough to introduce a measurable variance in it. And that's not to mention Andromeda.

    So no, the Earth's orbit is not mathematically perfect.

    And anyway, who told God that the rotation and orbit of the Earth should follow some nice fixed relationship? It's all a coincidence.

    Is Paris enough of a heavenly body to shake the world?

  17. Anonymous Coward
    Boffin

    "the contradiction ... causes a reboot" vs NTP ???

    Surely any system with a properly implemented NTP setup should have been able to cope properly with the extra second, because the NTP stuff is designed to automagically correct for differences between time on the NTP client (based on a local and not necessarily reliable ticker) and time provided by an external known-good server (such as another more definitive NTP server, or a GPS, or whatever)? Any half decent network/system admin knows this, surely?
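
For readers wondering how an NTP client estimates the difference mentioned above, this is a small sketch of the usual four-timestamp offset and round-trip delay calculation. The timestamps in the example are invented, and the sketch ignores filtering, leap-second flags and everything else a real daemon does.

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """Estimate clock offset and round-trip delay from one exchange.

    t0: client send time (client clock)
    t1: server receive time (server clock)
    t2: server send time (server clock)
    t3: client receive time (client clock)
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay

# Invented example: client is 1 s behind, 0.05 s each way, 0.01 s server processing.
offset, delay = ntp_offset_and_delay(100.00, 101.05, 101.06, 100.11)
```

The estimate only says how far off the client is; whether the correction is then stepped or slewed is a separate decision, and a leap second announcement takes yet another path.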

    So, what did Oracle systems/sysadmins get so wrong?

    Even my DSL router at home uses NTP, and it didn't get conf

    CARRIER LOST

  18. Anonymous Coward

    Stop it!

    Things should rely on the number of milliseconds since 1970 (or whatever), not the calendar date and time. It's just like floating point rounding: you keep the precision, but you round before you display to the user. Thus you can always map from ms since 1970 to the date, but you can't necessarily map from the date to ms since 1970. Base your programs on the former, and you will be OK.
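
A minimal sketch of that approach: keep a millisecond count internally, do arithmetic on the count, and convert to a calendar date only for display. The helper names are hypothetical. One caveat worth stating: POSIX-style epoch time does not count leap seconds, so the count is only as leap-second-proof as the clock that produced it.

```python
from datetime import datetime, timezone

def ms_to_utc(ms_since_epoch):
    """Convert milliseconds since 1970-01-01T00:00:00Z to a UTC datetime for display."""
    return datetime.fromtimestamp(ms_since_epoch / 1000.0, tz=timezone.utc)

def elapsed_ms(start_ms, end_ms):
    """Interval arithmetic stays on the raw counts, never on calendar fields."""
    return end_ms - start_ms

new_year_2009_ms = 1_230_768_000_000
new_year_2009 = ms_to_utc(new_year_2009_ms)
last_minute = elapsed_ms(new_year_2009_ms - 60_000, new_year_2009_ms)
```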

  19. Steve

    Caused a reboot?

    Probably being very naive here, but why on earth would an application crashing on a server cause the OS to shit itself and reboot?

    Word crashed a few days ago and Windows seemed to live through it...?

    Or have Oracle taken the final step to raping your budget by developing their own Linux based OS for it's DB app?

  20. Phil Endecott

    Re: Caused a reboot?

    > why on earth would an application crashing on a server cause the OS to shit itself and reboot?

    The bug was in the kernel, not an application.

  21. Phil Endecott

    Re: Caused a reboot?

    > why on earth would an application crashing on a server cause the OS to shit itself and reboot?

    The bug was in the kernel, not an application.

    > Or have Oracle [... developed] their own Linux based OS for it's[sic] DB app?

    Yes, that's what they've done.

  22. pAnoNymous
    Happy

    so everyone running Oracle on Windows was fine?

    by the sounds of it everyone should have stuck to Windows in the first place :)

    not sure why they had to do it this way anyway - servers/pcs get out of sync all the time - that's why you have ntp. they should have just updated the central time servers.

  23. Philip Teale

    CRS

    The bug was in CRS, which, as anyone who has a RAC infrastructure will know, is responsible for the control of the nodes in the cluster. If CRS detects that there is a problem with one of the cluster nodes, it will restart it to maintain cluster integrity. This is why a problem with CRS could cause a server reboot.
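
To make the "restart the node to maintain integrity" idea concrete, here is a hedged sketch of a heartbeat monitor. It is not a description of Oracle's actual bug or of how CRS is implemented; `HeartbeatMonitor`, `FakeClock` and the 0.5 s timeout are all invented for illustration. It only shows one way a stepped wall clock could make a healthy node look dead, and why a monotonic clock avoids that.

```python
import time

class HeartbeatMonitor:
    """Hypothetical monitor: evict a node if no heartbeat arrives within timeout_s."""
    def __init__(self, timeout_s, clock):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = clock()

    def beat(self):
        self.last_seen = self.clock()

    def should_evict(self):
        return self.clock() - self.last_seen > self.timeout_s

class FakeClock:
    """Settable stand-in for a wall clock that can be stepped."""
    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

wall = FakeClock()
monitor = HeartbeatMonitor(timeout_s=0.5, clock=wall)
wall.now += 0.2                       # 0.2 s really passes
healthy = monitor.should_evict()
wall.now += 1.0                       # clock is stepped by one second; no real time passed
evicted_after_step = monitor.should_evict()

# A monotonic clock is unaffected by steps to the wall clock.
steady = HeartbeatMonitor(timeout_s=0.5, clock=time.monotonic)
steady_evicts = steady.should_evict()
```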

  24. Gunther Vermeir
    Boffin

    please get your facts right

    This CRS bug is solved in 10.2.0.4 and 11g, the latest version/patchset, which has already been out for some time. Oracle Support subscribers can check "Note 759143.1 NTP leap second event causing CRS node reboot" on Metalink for more information.

  25. Bob

    The reason for the earth slowing

    Americans get fat and move south to Florida. Conservation of angular momentum means that the earth's rotation slows as a result.

  26. Chris
    Thumb Up

    @ Philip Teale

    Finally, someone who knows what they are talking about.

    I was having a right old laugh reading some of these.
