OVH data centres go TITSUP: Power supply blunders blamed

Power outages have brought some OVH data centres to their knees, and unspecified issues have broken optical cable routing in Europe. OVH boasts it is used by "155 out of the 1000 largest European companies," and "20 out of the 500 largest international companies." According to the outage monitoring service downdetector.com, …

  1. wolfetone

    Yet again the idea of having all your eggs in one basket comes to mind.

    Sure, it's bad they've been down for 3 hours. But why haven't the affected site owners thought about contingencies for such an event? The company I work for has its sites primarily on Linode, but I know what happened with them before, so the sites are mirrored on another provider separate to Linode. If they go down, we can switch our sites quickly. I haven't had to yet, but at least if something goes wrong I've protected my company's assets.

    Question yourself as to how much you value your website, not how a company can remain powerless for 3 hours.

    1. Anonymous Coward

      Haven't had to yet?

      So have you actually done it to ensure it works?

      1. wolfetone

        I have tried it a few times to make sure it works. And it does work.

        I've never had to use it in anger though.

        1. Anonymous Coward

          Top marks, so often I hear about contingency plans and DR which nobody has ever bothered to dry run!

    2. btrower

      I have sites hosted at OVH and they are up right now. However, I have had trouble with them in the past.

      I have had sites hosted on more than half a dozen different hosts and every one has eventually failed me -- more than one of them catastrophically.

      My stuff is not mission critical, but if it was, the only way I could feel comfortable they would stay up would be to host on three entirely different unrelated hosts with an elaborate fail-over mechanism.

      You would think in 2017 you would not have endless failures in basic services.

      1. Lee D Silver badge

        That is indeed the only way to keep services up.

        I know in my workplace, we've taken to having most stuff in-house because of the sheer number of issues with external suppliers, and then rent things like OVH dedicated servers to mirror to should the worst happen.

        There is no one company I would trust - not even Google or Microsoft - to keep a business operational nowadays. You have to be able to be independent and able to continue when, say, AWS or Google Docs goes down, even if it's not ideal, because even the big guys can't guarantee anything.

        The weakest point is actually things like DNS. You only need one bad DNS host and you can be in for a world of hurt even with all the failovers and contingency plans in the world.

        But, it has to be said, like "backups", redundant services need to be redundant. If you are hosting with OVH and it means anything at all to your business (e.g. 1000 audio streams for a radio station!), then you need a way to spread those across SOME OTHER ENTIRELY UNRELATED COMPANY. Whether that's your own in-house site, another host, etc. Even "same company, data centre B" is a silly thing to do, because when their A is down, their B site is going to take an enormous hit too even if that's just people failing-over.

        You have to have multiple, redundant product offerings from different people, hosted in different places and accessed by different lines, or you're really just wasting your time or "playing" at doing the IT properly.

        If I ran a business reliant on an online presence, you can be sure that DNS round-robin, distributed filesystems and databases, and redundant machines would be the first thing I'd set up. Hell, surely you have to do this for basic backup systems, no? No point having your backups all in the same building. So why wouldn't you do it for your live systems too if you're expecting at least one live system to stay up.
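        The failover idea running through this sub-thread (mirrors at unrelated providers, switched over when one stops answering) can be sketched roughly as below. This is only an illustration, not anyone's actual setup: the function names `http_probe` and `pick_live_mirror` and the example hostnames are hypothetical, and a real deployment would still have to repoint DNS or a load balancer at whichever mirror is chosen.

```python
import http.client
import urllib.request
from typing import Callable, Iterable, Optional


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, DNS failure, timeout, HTTP error, malformed URL...
        return False


def pick_live_mirror(
    mirrors: Iterable[str],
    probe: Callable[[str], bool] = http_probe,
) -> Optional[str]:
    """Return the first mirror (in priority order) that passes the probe.

    Each entry is assumed to live at a different, unrelated provider;
    returns None if every mirror fails the probe.
    """
    for url in mirrors:
        if probe(url):
            return url
    return None


if __name__ == "__main__":
    # Hypothetical hostnames, listed in order of preference.
    candidates = [
        "https://www.primary-host.example/health",
        "https://www.second-provider.example/health",
        "https://www.in-house.example/health",
    ]
    print(pick_live_mirror(candidates))
```

        The probe is passed in as a parameter so the selection logic can be dry-run without touching the network, which is the same "actually test it" point made earlier in the thread.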

  2. smudge Silver badge
    Holmes

    "Small Earthquake In Peru - Not Many Injured" *

    OVH boasts it is used by "155 out of the 1000 largest European companies" and "20 out of the 500 largest international companies".

    15.5% and 4%. When companies mention these things, they usually have much higher figures.

    * the classic of its type, from 1998 - http://news.bbc.co.uk/1/hi/world/americas/73785.stm

    1. LDS Silver badge

      Re: "Small Earthquake In Peru - Not Many Injured" *

      They could boast about how many spammers they host, especially French ones, but I guess they do only on specialized forums...

      1. andy bird

        Re: "Small Earthquake In Peru - Not Many Injured" *

        I wonder if there was a global drop in spam attacks and botnet attacks. I am sick of seeing compromised OVH servers trying to spam our network.

    2. Anonymous Coward

      Re: "Small Earthquake In Peru - Not Many Injured" *

      "OVH is the world's third largest internet hosting company with 260,000 servers in 20 data centres in 17 countries hosting some 18 million web applications."

      https://www.theregister.co.uk/2017/07/13/watercooling_leak_killed_vnx_array/

    3. Anonymous IV

      Re: "Small Earthquake In Peru - Not Many Injured" *

      The BBC headline you quote http://news.bbc.co.uk/1/hi/world/americas/73785.stm is a slight and much later rework of the supposed Times headline written by Claud Cockburn (1904-1981) "Small earthquake in Chile, not many dead" thereby winning their "dullest headline" competition.

      He is also known for the truism: "Never believe anything until it has been officially denied."

  3. Anonymous Coward

    DRP 1 v OVH 0

    How many cloud providers will suffer this way?

  4. anthonyhegedus Silver badge

    il est down?! Excellent!

  5. A K Stiles Silver badge
    Joke

    Phew!

    Well, at least Commitstrip.com is back up now!

    https://twitter.com/CommitStrip/status/928546536472088576

  6. RealBigAl

    155 out of 1000 is a bit of a weird boast. 845 of the 1000 largest European companies don't use our services...

  7. Tim Brown 1
    Facepalm

    Back up now but...

    The CEO has just tweeted his explanation of the incident

    https://twitter.com/olesovhcom/status/928592231807713280

    Claims it was two unconnected major incidents, one power failure and one optical fibre control f*ckup. Though how he expects anyone to believe that the first didn't lead to the second I've no idea.

    And as for their 'failover' system, I tried to use it to move an IP address from the datacentre with the power failure to an unaffected one, but the move task just got stuck in their API. The move task is still there, with no way I can see to delete it, so now I may have problems at some unspecified time in the future if the move does eventually happen when I don't want it to.

  8. Anonymous Coward

    The cloud

    Lol.

    1. Anonymous Coward

      Re: The cloud

      "Just someone else's computer."

    2. Roj Blake Silver badge

      Re: The cloud

      You do know that in-house IT systems fail from time to time as well, right?

      1. Fatman Silver badge

        Re: The cloud

        <quote>You do know that in-house IT systems fail from time to time as well, right?</quote>

        yes, we do.

        But, and this is a huge but, you know exactly who to scream at, and who needs to get 'kicked in the nuts'.

        If manglement does not provide the funding to achieve business continuity in the name of increasing the executive bonus pool; then some manglement asses need to be kicked to the curb when foreseeable events occur.

      2. LDS Silver badge

        "You do know that in-house IT systems fail from time to time as well, right?"

        Yes. but their effect is confined to a few systems...

      3. Anonymous South African Coward Silver badge

        Re: The cloud

        I have said it somewhere else, and will repeat it here.

        An in-house team gives you the ability to select the people you want to work with, and if you treat them well, they will be instantly available to attend to your IT snafu. It may cost you a bit more in the long run.

        An outsourced team, well... you don't have a say in who you want on that team, and when you have a meltdown, the outsourced team may take a while to respond to your problem. Over the weekend will be more problematic. Even more problematic will be if the outsourced team is already responding to somebody else's brown trouser incident. Remember, outsourced = shared.

        In-house IT does fail from time to time, due to unforeseen circumstances and Mr Murphy. The key is how fast you will be able to respond to downtime, and fix the problem before your customers start cussing at you.

  9. Anonymous Coward

    Good news for my weekly report

    "Avoided live service failure by fortuitously failing to plan my migration to OVH properly last week"

    1. Anonymous South African Coward Silver badge

      Re: Good news for my weekly report

      Procrastination saved the day there :)

  10. Anonymous Coward

    A few milliseconds away from you ! *

    *Courtesy of the OVH website.

  11. Anonymous Coward

    Ah, that explains it..

    I did notice a sudden and substantial drop in spam and attempts to breach our websites. That explains at least the fewer script kiddie attacks, now I have to find out what lowered the spam count and maybe encourage that to happen more often :).

    As far as I'm concerned, OVH is welcome to its outage. The longer the better.

    1. Kevin McMurtrie Silver badge

      Re: Ah, that explains it..

      I've already blacklisted OVH so I got a flood of spam yesterday when spammers moved elsewhere.

      1. Anonymous Coward

        Re: Ah, that explains it..

        If you have a list of all their IP ranges and AS numbers I would love to see it, because I would do that too..

    2. LDS Silver badge

      They are online again...

      "Maisons du Monde" started to spam me again....

  12. AceQuery

    Translated Statement from the CEO

    http://travaux.ovh.net/?do=details&id=28256

    Hello,

    Before all the details, two initial points.

    This morning we had 2 separate incidents that have nothing to do with each other. The first incident affects our Strasbourg site (SBG) and the 2nd Roubaix (RBX). On SBG we have 3 datacentres in operation and 1 under construction. On RBX, we have 7 datacentres in operation.

    SBG:

    On SBG we had an electrical problem. Power has been restored and services are restarting. Some customers are UP and others not yet.

    If your service is not yet UP, the recovery time is between 5 minutes and 3-4 hours. Our monitoring system allows us to know which customer is still impacted and we are working to fix them.

    RBX:

    We had a problem on the optical network that connects RBX to the interconnection points we have in Paris, Frankfurt, Amsterdam, London and Brussels. The origin of the problem is a software bug on the optical equipment, which caused the loss of the configuration and cut the connection to our RBX site. We restored the backup of the software configuration as soon as we diagnosed the source of the problem, and the DC is reachable again. The incident on RBX is closed. With the manufacturer, we are looking for the origin of the software bug and also how to avoid this kind of critical incident.

    We are in the process of retrieving the details to provide you with information on the SBG recovery time of all services / customers. Also, we will give all the technical details on the origin of these 2 incidents.

    We are sincerely sorry. We have just experienced 2 simultaneous and independent events that impacted all RBX customers between 8:15 am and 10:37 am and all SBG customers between 7:15 am and 11:15 am. We continue to work on clients who are not yet UP at SBG.

    Regards

    Octave

  13. I Am Spartacus
    Coat

    Thats what happens when kiddies run data centres

    It's some years since I had to run or design a data centre. But I do recall a number of key points:

    3m Separation: That is the distance by which two independent network links had to be kept apart at all times. Also, making sure that they both didn't go to the same alternative end point. It's called triangularisation, and it ensures that one remote DC doesn't have a shit fit and take out an adjacent DC by trashing its comms gateway. Then you have layers of redundant comms, partially as firewalls, but also to ensure there was separation and segregation of comms protocols.

    Power. Redundant power feeds, battery standby, M-G sets. Oh, and test them. We did find on one test that a supplier had put the wrong type of diesel in the tank, and it had waxed.

    This was in the days of mainframes. Nowadays we have reduced the mainframe to a couple of oversized PCs and we call it a data centre. Is it only us old farts who can recall building real data centres? Ones that it would take a well-placed tactical nuke to take out? (Actually, I could talk about the underground one we built, but I think I am still covered by certain legal aspects, so I won't).

    The key word is multiple redundancy. It's obvious when you say it, but bloody difficult to do well - and it's not cheap.

    Mine's the one with the book on painful data centre memories in the pocket.

    1. Anonymous Coward

      Re: Thats what happens when kiddies run data centres

      I'm glad to see there are still people here from the era where stuff had to work. If we had an outage it would be publicity for weeks, questions in government and very uncomfortable interviews with key staff, now sites drop for a few hours and it's "meh, sorry, gadget x failed and we'll make sure THAT never fails again" (read: but we're unsure about other things) in the press and everyone just accepts it.

      Call me old-fashioned, but when I say "we implemented redundancy", I don't just mean what I say, it's backed up by real-life tests (like backups are). That said, we worked for people who would not accept any less, and were willing to pay for it. In a way, that may be the key issue: everyone wants the Rolls Royce service for Citroen 2CV money and is actively avoiding the question of how the low price was achieved.

  14. DJV Silver badge

    "trying to restart generators"

    So, they are properly tested regularly in the event of such "mishaps", then?

    1. I Am Spartacus

      Re: "trying to restart generators"

      Yes. It was on a test that we found that problem.

      DR was tested every six weeks: One time one of the alternate centres would come to us, or we would go to one of the alternates. We were the largest, so we tested more often.

      Once we did a full test: A suitably authorised person went in to a DC, locked the doors on the gate behind him, and announced to the DC supervisor: "There has been a fire. You are dead. Everyone in the DC is dead. All tapes, documents and procedures are gone. The phones are dead."

      ... And they waited until it was noticed, and the world woke up to having to recover. This was in the days of 2400' tapes as the backup and before fibre. An interesting time: Everything came back in 32 hours, against the stated target of 48.

      Apart from one application (a crucial one) which was not backed up. Despite the PM's written statement that it had been tested for DR, it wasn't. He and his team were locked in a room to come up with a plan. When they came back, the board examined the plan, considered it viable, then told the PM it had all been an exercise. PM was mad that he had lost a weekend over this, but he was even madder when he was fired for lying about having tested DR when clearly he hadn't.

      No-one EVER forgot to actually test their apps on the regular DR failover tests after that!

      1. happy but not clappy

        Re: "trying to restart generators"

        You fired the Prime Minister? Now that's power right there. Not so easy these days.

      2. Wayland Bronze badge

        Re: "trying to restart generators"

        "Once we did a full test: A suitably authorised person went in to a DC, locked the doors on the gate behind him, and announced to the DC supervisor: "There has been a fire. You are dead. Everyone in the DC is dead. All tapes, documents and procedures are gone. The phones are dead.""

        Sounds like Dr Strangelove.

    2. Tim Brown 1

      Re: "trying to restart generators"

      There was an article a while back (I think it might have been one of the on-call ones, or maybe during the big BA snafu recently) about the perils of testing your disaster recovery systems in live environments. Who wants to be the one who admits to crippling a datacentre because their failover testing failed!

      I had my own mini problem with this outage, since my recovery plan relied on OVH not losing ALL connectivity throughout Europe...

  15. PGregg

    15:37 - still down

    My machine is still down at 15:38 UK time.

  16. gwharton.zimma

    Still Down 8 Hours later

    Strasbourg Data Centre still down for a lot of customers 8 hours after initial outage.

    Their visual status monitor is a mess.

    http://status.ovh.net/vms/index_sbg1.html

    Kinda glad I don't work there, as it looks like they are having to manually bring the servers back online one by one (judging by how quickly they are coming back online).

    1. Tim Brown 1

      Re: Still Down 8 Hours later

      At least the status monitor is showing presumably valid info now. When I looked at it shortly after power to the datacentre had been restored it was showing all machines up when they clearly weren't

    2. Tim Brown 1

      Re: Still Down 8 Hours later

      The CEO has just tweeted an update (Thursday, 09 November 2017, 17:53), lots of issues still

      The following are still not working:

      2100 dedicated servers

      1500 PCI instances

      25000 VPS

      300 PCC hosts

  17. dnicholas Bronze badge

    Config bug after power outage... Hmmm....

    "The origin of the problem is a software bug on the optical equipment which caused the loss of the configuration"

    So someone forgot the write memory command?

  18. Cliff Stanford

    Unbelievable

    More than twelve hours after it went down and the only update is:

    Still down:

    - 2100 dedicated servers

    - 1500 PCI instances

    - 25000 VPS

    - 300 PCC hosts

    OVH really shouldn't be able to survive this! Or at least they don't deserve to.

  19. Anonymous Coward

    I chose ovh because...

    I chose ovh because I felt it was EXACTLY the right supplier for testing a DR plan.

  20. Victor Ludorum

    Update

    Octave has just posted an update...

    status.ovh.com/?do=details&id=15162#comment18119

    Basically, SBG1 was a container datacentre that outgrew itself, and they built SBG2 using the same power supply...

    1. Anonymous South African Coward Silver badge

      Re: Update

      Basically, SBG1 was a container datacentre that outgrew itself, and they built SBG2 using the same power supply...

      d'oh

    2. I Am Spartacus
      Thumb Up

      Re: Update

      Honesty is always the best policy.

  21. Anonymous Coward

    "4) closing SBG1/SBG4 and the uninstallation of the shipping containers."

    I guess I need to move my VPS to a different location within OVH's network, wonder if they'll let me keep my ip addresses?...

  22. punk4evr

    Thats what happens when...

    When you overhype the redundancies you have, and the capacity you are really capable of handling. I have worked in a couple of data centers, and I can tell you the hype men selling them are way overselling their bill of deliverables! In the centers I was in, it's barely controlled chaos. They have only just enough staff to keep the status quo. If there is an emergency, they are screwed. That's their whole plan. Pay the minimum for the fewest people you can get away with to maximize profitability, and then when disaster strikes! Ha! Good luck... watch them race to cash out before the mayhem! The places I worked for had no way to handle critical failures, and the staff were just barely keeping things running as it was! That's what they do!

Biting the hand that feeds IT © 1998–2020