Microsoft's Azure cloud goes a bit wobbly in West Europe

Admins in Microsoft's Azure cloud data centre for West Europe in Amsterdam, Netherlands, have spent the morning battling severe problems in the gear that supports Redmond's main cloud service. Problems with the core Compute and Storage components were first reported at 9:39am UTC on Thursday, according to the Windows Azure …

COMMENTS

This topic is closed for new posts.
  1. hplasm
    Happy

    DaaS

    Downtime as a Service.

    Continuing Innovation from Microsoft.

  2. dogged

    In before

    the inevitable Bob Vistakin claim that nobody else ever has downtime.

  3. John 104

    Small Scale

    Still, an outage of much smaller scale than the last time around for MS and AWS.

  4. Anonymous Coward

    Get it at the Windows Shop

    The Azure YoYo.

  5. Proud Father
    FAIL

    Huh?

    I thought these wonderful cloud systems were supposed to be highly reliable?

    Like system redundancy, fail safe, multiple copies of data etc etc so if something fails it just keeps going with the remaining working resources?

    Isn't that the whole point of these cloud systems?

    1. Anonymous Coward

      Re: Huh?

      "I thought these wonderful cloud systems were supposed to be highly reliable?"

      Nope - circa 99.9% uptime is the quoted norm. Azure has historically been a bit more reliable than say Amazon S3 though.

      "Like system redundancy, fail safe, multiple copies of data etc etc so if something fails it just keeps going with the remaining working resources?"

      That's why Azure has multiple regions - so that you can create applications that are resilient to a local issue.

      1. Levente Szileszky

        Re: Huh?

        "Nope - circa 99.9% uptime is the quoted norm. Azure has historically been a bit more reliable than say Amazon S3 though."

        Ahahahaha, thanks for the afternoon laugh.

      2. Anonymous Coward

        Re: Huh?

        Nope - circa 99.9% uptime is the quoted norm.....

        And the powers that be want telephony in the cloud. HA!

        5 9's minimum or no go.
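
The gap between the quoted 99.9% and "five nines" is easier to judge as a yearly downtime budget. A minimal sketch of the arithmetic, assuming a 365-day year; the function name is made up for illustration:

```python
# Yearly downtime budget implied by an availability target.
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours of downtime per year that a target still allows."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.999):
    hours = downtime_hours_per_year(target)
    print(f"{target}% uptime allows {hours:.2f} hours ({hours * 60:.1f} minutes) of downtime a year")
```

On those assumptions, 99.9% leaves roughly 8.76 hours a year, while 99.999% leaves a little over five minutes.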

      3. Dan 55 Silver badge
        Trollface

        Re: Huh?

        "That's why Azure has multiple regions - so that you can create applications that are resilient to a local issue."

        Funny how the whole lot regularly dies on its arse though.

    2. Tom 38

      Re: Huh?

      I thought these wonderful cloud systems were supposed to be highly reliable?

      No, they are supposed to be cheaper in capex.

      Everyone's comments here are proof that it is possible to build a reliable service on top of an unreliable service, TCP being a reliable service that is implemented over IP, an unreliable service. The idea of clouds is that lower capex costs allow you to dynamically scale your loads, allowing you to provide a reliable service to your users that is built on commodity cloud servers that may be unreliable.

      Not seen one done right so far though, and if you are in business long enough, the benefit of lower capex is quickly extinguished by the massive increase in opex.
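
The TCP-over-IP point can be illustrated with a toy stop-and-wait transfer: the sender keeps retransmitting over a lossy channel until it sees an acknowledgement, and the receiver drops duplicates. This is a simplified sketch, not real TCP; the loss rate, seed, and all names are assumptions for illustration.

```python
import random


def unreliable_send(packet, rng, loss_rate):
    """Deliver the packet, or silently drop it, like a best-effort network."""
    return packet if rng.random() >= loss_rate else None


def reliable_transfer(messages, loss_rate=0.3, max_attempts=50, seed=1):
    """Send messages in order over a lossy channel using retransmission and acks."""
    rng = random.Random(seed)
    received = []
    transmissions = 0
    for seq, payload in enumerate(messages):
        for _ in range(max_attempts):
            transmissions += 1
            delivered = unreliable_send((seq, payload), rng, loss_rate)
            if delivered is None:
                continue  # data packet lost: the sender times out and retransmits
            # Receiver accepts only the next expected sequence number,
            # so a retransmitted duplicate is ignored but still acknowledged.
            if delivered[0] == len(received):
                received.append(delivered[1])
            ack = unreliable_send(delivered[0], rng, loss_rate)
            if ack == seq:
                break  # acknowledgement arrived: move on to the next message
        else:
            raise RuntimeError(f"gave up on message {seq}")
    return received, transmissions


comments = ["first", "second", "third", "fourth"]
delivered, sent = reliable_transfer(comments)
print(delivered == comments, "after", sent, "transmissions")
```

The cost of the reliability shows up as extra transmissions, which is the same trade the comment describes for building on unreliable commodity servers.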

  6. Steve Knox
    Trollface

    It's your fault

    If you'd allow MS to store copies of all of your data here in the US, they could just redirect affected customers to a US mirror.

    1. Phil_Evans

      Re: It's your fault

      That's right, Steve, then we could all have a single geo-lo and no data redundancy at all. I'm not going to fall for the old NSA/Orwellian barb there, since there are so many good jokes to make about msft before we even get to that :-)

    2. Anonymous Coward

      Re: It's your fault

      Azure has 2 regions in Europe. No need to send your data all the way to the colonies.

  7. BlueGreen

    I wonder if it's down to their love of complexity

    MS are addicted to complexity. If you use their site, it needs JScript. If you use hotmail it uses an utter ton of JS. If you use their products they tie in together to make a Gordian knot that can't be cut. It's deliberate; put one foot into their garden and they try to tie you down forever. So, they are addicted to complexity. The downside is bugs and failure. I wonder if that's the ultimate root of their cloud problems, as well as their desktop flakiness[*].

    [*] Am learning SSAS 2008, just working through tutorials. Just basic stuff and I've managed to crash it outright once and have had over half a dozen internal errors (for which Google has been much more helpful than MS), which have cost me hours. Utter crap.

    1. John P

      Re: I wonder if it's down to their love of complexity

      "If you use their site, it needs JScript. If you use hotmail it uses an utter ton of JS"

      As is the case with about 99% of sites on the internet, what's your point?

      1. BlueGreen

        Re: I wonder if it's down to their love of complexity @John P

        (sigh) It's not about JS per se. My point is that it is unnecessary. gmail works fine without it. But they use it by the shovel load for no good reason (UI prettiness isn't a good reason IMO). They're addicted to More, not Simpler, and if I'm right that attitude may have worked its way into their datacentres and is causing them problems. Clear enough?

        1. This post has been deleted by its author

          1. BlueGreen

            Re: I wonder if it's down to their love of complexity @John P @Pascal

            > What are you talking about, gmail.com is very, very heavily loaded up with js.

            Disable your JS and try it. It works. That's how I use it. Disabling JS on hotmail just redirects you to a page telling you to enable it.

            >>>> BUT the point is not JS but complexity. I mentioned JS overuse as a proxy, not the main point. <<<<

  8. Dave 15

    Maybe the service is having trouble uploading the data to the nsa

    Why don't American corporations cut the middle man out and ask the NSA to provide the storage, that way at least there is only one copy of all the data not 2.

    1. Anonymous Coward

      Re: Maybe the service is having trouble uploading the data to the nsa

      The NSA mostly don't copy it - that would be very inefficient. They just index it...

  9. Destroy All Monsters Silver badge
    Terminator

    Well, it *is* "International Workers' Day"

    These servers just decided to perform some "labour action". ROTM?

    (In the US, this day is called either "Law Day", "Americanization Day" or "Loyalty Day". WTF?)

    1. Mephistro

      Re: Well, it *is* "International Workers' Day"

      In the US, this day is called either "Law Day", "Americanization Day" or "Loyalty Day". WTF?

      I think that could be one of the side effects of McCarthyism. Apparently, "Labour Day" sounds too commie for America.

  10. Apemantus
    Trollface

    Azure Ads

    As I read this article it is surrounded by Azure cloud adverts, the whole background, the top and a box in the middle.

    I want my storage on their advertising cloud, that'll never go down.

  11. Tom 38
    Stop

    It's good to know that The Register is following the highest standards of journalism possible, as practised by the BBC, viz that it is not news unless you can find two arbitrary people complaining about it on Twitter.

    Fuck yeah! Digital engagement!

    1. Destroy All Monsters Silver badge
      Go

      They also attack technological cripples in tech autism land that are differently abled. THAT'S UNFAIR!

  12. blondebier

    RCA

    We were affected by this outage in West Europe... The RCA report we received is as follows :

    Incident Title Storage and Compute in West Europe : Partial Service Interruption

    Service(s) Impacted Azure Compute (Service Management), IaaS, Azure Service Management, Storage, Azure Web Sites

    Incident Start Date and Time

    5/1/2014 2:39:00 AM (Pacific Time)

    Date and Time Service was Restored

    5/1/2014 3:40:00 PM (Pacific Time)

    Summary

    On May 1st, Customers may have experienced timeouts or errors with their Compute or Storage services in West Europe sub-region. The root cause of this interruption was an unexpected power outage during scheduled maintenance in the datacenter.

    A set of racks lost power, affecting compute and storage services running there. Most racks recovered automatically once power was back; however, some needed a reboot of their chassis to recover. Once mitigation and verification steps were executed on all clusters, full functionality of all Azure services was restored.

    Customer Impact

    Customers may have experienced timeouts or errors with their Compute or Storage services in West Europe sub-region. Storage account creation may have failed during the impacted window.

    Affected sub-regions

    Region | Sub-Region

    Europe | West Europe

    Timeline

    Time | Event

    5/1/2014 02:39 AM PST | The Microsoft Azure team received the first alert of a power outage. The investigation was initiated promptly.

    5/1/2014 02:40 AM PST | Power restored to impacted racks.

    5/1/2014 03:08 AM PST | Majority of services were restored automatically once power was back. Automated repair process (Service healing) started repairing offline instances.

    5/1/2014 03:40 AM PST | The Microsoft Azure team identified some racks needed a reboot of their chassis to recover. Mitigation steps were validated and executed over the next hours.

    5/1/2014 11:25 AM PST | All services were fully restored but Azure team kept monitoring and verifying that the restoration proceeded as expected.

    5/1/2014 03:40 PM PST | The Microsoft Azure team confirmed full recovery of all Microsoft Azure services.

    Root Cause

    A power outage due to a human error during scheduled maintenance in the datacenter.

    Next Steps

    We are continuously taking steps to improve the Microsoft Azure Platform and our processes to ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):

    • Improve validation process during maintenance to prevent human errors.

    • Investigate and repair server hardware that encountered additional reboot failures, working closely with our partners.

    • Tooling and automation improvements to minimize time to recovery.

    We apologize for any inconvenience.

    ---------------------------------------------------------------------------------------------------------------------------------------------
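
The report's timestamps can be sanity-checked against the 9:39am UTC start time given in the article. A small sketch, assuming the times are really Pacific Daylight Time (UTC-7), which was in force on 1 May 2014 even though the report labels them "PST"; the helper name is made up:

```python
from datetime import datetime, timedelta, timezone

# Assumption: Pacific Daylight Time (UTC-7) applied on 1 May 2014,
# even though the report labels its timestamps "PST".
PACIFIC_DAYLIGHT = timezone(timedelta(hours=-7), "PDT")


def parse_report_time(text: str) -> datetime:
    """Parse a timestamp written like '5/1/2014 02:39 AM' as Pacific Daylight Time."""
    return datetime.strptime(text, "%m/%d/%Y %I:%M %p").replace(tzinfo=PACIFIC_DAYLIGHT)


first_alert = parse_report_time("5/1/2014 02:39 AM")
services_restored = parse_report_time("5/1/2014 11:25 AM")
recovery_confirmed = parse_report_time("5/1/2014 03:40 PM")

print("First alert in UTC:", first_alert.astimezone(timezone.utc).strftime("%H:%M"))
print("Alert to services restored:", services_restored - first_alert)
print("Alert to confirmed recovery:", recovery_confirmed - first_alert)
```

With a UTC-7 offset the first alert lands at 09:39 UTC, matching the article, and the report's own window runs 8 hours 46 minutes to restoration and 13 hours 1 minute to confirmed recovery.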

    Work experience boy was allowed in to the data centre!

    Microsoft were slow to respond and we were without compute and storage services for over 6 hours.

    You may be interested to know that geo-failover did not occur. Why not, you say? Isn't that one of the main attractions of the cloud?

    Apparently, for a Microsoft Azure data centre, a “major disaster” would be a complete data centre going off-line. Microsoft felt that this was not a complete data centre outage, as the majority of their other worldwide customers were not affected. Since the entirety of the services in the data centre were not affected, the geo-failover process was not invoked.

    Our future involvement with Azure will now be very limited. There is no service-level redundancy. Data is copied from one site to another, but you aren't in control of it and you can't access it in the event of a disaster. If you want service-level redundancy in Azure, you need to provision additional services yourself, effectively duplicating all your systems in case an apprentice unplugs a row of server racks. This makes the entire Microsoft Azure offering uneconomic, and we'd be better placed expanding our current data centre, where we are in full control.
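
Provisioning the redundancy yourself usually means the application, not the platform, decides when to switch regions. A minimal sketch of that idea, with plain functions standing in for two separately provisioned deployments; none of these names are Azure APIs, and both the regions and the error type are hypothetical:

```python
from typing import Callable, Sequence, Tuple


class RegionUnavailable(Exception):
    """Raised by a (hypothetical) regional deployment that cannot serve a request."""


def call_with_failover(
    backends: Sequence[Tuple[str, Callable[[str], str]]], request: str
) -> Tuple[str, str]:
    """Try each deployment in order and return (region, response) from the first that answers."""
    errors = []
    for region, handler in backends:
        try:
            return region, handler(request)
        except RegionUnavailable as exc:
            errors.append(f"{region}: {exc}")  # remember why this region was skipped
    raise RuntimeError("all regions failed: " + "; ".join(errors))


# Stand-ins for two deployments that the customer pays for and runs separately.
def primary(request: str) -> str:
    raise RegionUnavailable("racks lost power")


def secondary(request: str) -> str:
    return f"handled {request!r}"


region, response = call_with_failover(
    [("primary", primary), ("secondary", secondary)], "GET /status"
)
print(region, response)
```

The sketch only covers request routing. Keeping the second deployment's data current is the part that doubles the bill, which is the cost objection raised above.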
