The TRUTH behind Microsoft Azure's global cloud mega-cock-up

Windows Azure suffered a global meltdown at the end of October that caused us to question whether Microsoft had effectively partitioned off bits of the cloud from one another. Now we have some answers. After a bit of prodding, Redmond sat us down with Windows Azure general manager Mike Neil, who explained to us why a sub- …

COMMENTS

This topic is closed for new posts.
  1. Henry Wertz 1 Gold badge

    Wow...

    It seems odd to me that if this system can only have a single front end... it's doing something else (scheduling or moving jobs maybe) that is not really the job of a front end. Hopefully it's not an indicator of further architectural faults.

    I was going to insert a Microsoft bash, but *shrug*. It's true, it takes time to work out the bugs on complex systems. IBM mainframes are reliable as can be *now* (and probably the last 30-40 years), but they apparently also had crashes aplenty through the 1960s and 1970s.

    1. Roland6 Silver badge

      Re: Wow...

      "There are three truths of cloud – machines will fail, software has bugs, people will make mistakes,"

      So it would seem this one goes down as a people mistake, as only a person could of designed Azure so that there was only one functioning RDFE worldwide at any one time.

      1. Anonymous Coward

        Re: Wow...

        "...only a person could of designed Azure..."

        ...only a person could HAVE designed Azure... I'll put that down as a people mistake :-)

    2. Anonymous Coward

      Re: Wow...

      Azure doesn't have a single Front End or a single Point of Failure in this regard.

      Multiple systems were updated with the problematic software...Potential for a similar issue exists in any cloud system with a consistent OS, fabric, controller, firmware, etc, etc...

      Perhaps roll outs should have been cascaded globally on a slower basis, etc, but that still doesn't make it a single point of failure....

  2. Nate Amsden

    Soooooooooooooooo

    The fabric controller started spewing gibberish?

    http://forums.theregister.co.uk/forum/1/2013/11/08/microsoft_azure_networking_feature/

    1. Anonymous Coward

      Re: Soooooooooooooooo

      Quick!! Give it a job in marketing!

  3. doronron

    GM Foods

    Be thankful.

    If this was a genetically modified food rollout, they wouldn't be able to do a recall, and it would self replicate.

    You may make bugs and failures in software and hardware but at least you can do product recalls and upgrades.

    Asparagus apocalypse is real and it's coming to a farm near you!

    1. Steve Knox

      Re: GM Foods

      If this was a genetically modified food rollout,

      Okay, completely different industry, but let's see where this goes...

      they wouldn't be able to do a recall,

      Yes they would. See, for example, http://en.wikipedia.org/wiki/Starlink_corn_recall and http://www.organicconsumers.org/gefood/canolarecall.cfm

      and it would self replicate.

      Seasonally, at a pace which would allow for relatively easy control.

      All of this ignores the fact that GMO are much more regulated than software and tech. services. See http://en.wikipedia.org/wiki/Regulation_of_the_release_of_genetically_modified_organisms for starters.

      1. Anonymous Coward

        @Steve Knox - Re: GM Foods

        Regulated my leg! Tell Monsanto about that!

        1. Destroy All Monsters Silver badge

          Re: @Steve Knox - GM Foods

          > Tell Monsanto about that!

          But Monsanto happens to be the regulator.

      2. doronron

        Re: GM Foods

        They issue the recall, the GM crop just continues to produce seeds regardless. It's almost as if the duplication mechanism doesn't care about the recall notice.

        You can't stop invasive plants spreading across Britain, so you can't stop invasive GM plants either.

        "GMO are much more regulated than software"

        Create a law making it illegal for invasive species to spread across Britain, does it work? Of course not, they're just words if there's no way to enforce those words.

        1. Destroy All Monsters Silver badge

          Re: GM Foods

          > You can't stop invasive plants spreading across Britain, so you can't stop invasive GM plants either.

          Is there a need? Won't these plants just die on their arse when challenged by the natural variety in their usual habitat?

          1. Destroy All Monsters Silver badge
            Thumb Up

            Re: GM Foods

            > http://www.cracked.com/article_18503_how-biotech-company-almost-killed-world-with-booze.html

            Haha! This is like Frederik Pohl's "Shaffery Among the Immortals".

            Analysis that's not from the weekend science press requested. One would suppose that this kind of trick might be found by mother nature a few times per day, greygooing your garden through purely natural means (100% green!). That doesn't routinely happen though. Maybe it just happens every 300 years...

            1. Captain DaFt

              Re: GM Foods

              Yeah, that was one that 'could've happened', but luckily, didn't.

              Here's one that did: Native rapeseed weeds gain immunity to herbicides from closely related GM canola plants.

              I'm not against GM crops where the changes enhance the quality or nutritional value of crops, but changes that're simply for the benefit of herbicide companies, or rendering vital foodstocks sterile to ensure the value of patents (Seriously, no patents for living things, ya greedy bastards!) should be a big no-no!

          2. Roo

            Re: GM Foods

            "> You can't stop invasive plants spreading across Britain, so you can't stop invasive GM plants either.

            Is there a need? Won't these plants just die on their arse when challenged by the natural variety in their usual habitat?"

            I wouldn't bet my house on that, especially if the GM variety is specifically engineered to be more resistant to pests/diseases and pesticides. Of course if the variety is also engineered to grow faster or yield more heavily the chances are it will starve the native species of sunlight. This kind of thing happens already with varieties bred by more traditional methods.

          3. BongoJoe

            Re: GM Foods

            Is there a need? Won't these plants just die on their arse when challenged by the natural variety in their usual habitat?

            Like Japanese Knotweed, American Stink Cabbage, Himalayan Balsam to mention just three?

            One of the problems with these invasive plants is that the insects which would normally be expected to devour them haven't yet been imported/designed.

          4. Michael Wojcik Silver badge

            Re: GM Foods

            Is there a need? Won't these plants just die on their arse when challenged by the natural variety in their usual habitat?

            Often, yes, in which case there's no problem. But also often, no, because they find a niche to which they're already well-adapted; or no, because the ecological processes that formerly prevented species with similar adaptations from establishing themselves have been disrupted.

            The last is what happened with prairie grasses in most of the US grasslands, for example. They are typically tall, slow-growing plants with very deep root systems. When established, they block most of the sun from fast-growing invasives; the ones that aren't controlled that way generally were done in either by drought (didn't have the roots to survive it) or burn-off. Burn-off, caused by lightning strikes, was particularly important - the native grasses could burn to the ground and regrow quickly from their root systems, while the burn would wipe out other plants.

            When people started building permanent dwellings, cutting down the tall grasses, irrigating fields, and suppressing wildfires, the European invasive plants they brought (because guess which people these were) easily found niches in the modified ecosystem.

            Invasive-species ecology is a whole complex field of study. Many people have made careers out of a single species (Asian carp, kudzu, Formosan termites ...).

      3. Tom 7

        Re: GM Foods seasonally?

        seasonally alas no - once the change gets into other plants then they can replicate it every few weeks.

        And recalling grain in grain elevators is of no use when it's in the wild plant population. GM modifications have been found in almost all wild maize in Mexico.

      4. Anonymous Coward

        Re: GM Foods

        It would be very interesting to know if the three up votes for Steve Knox came from an IP address range belonging to Monsanto...

      5. Anonymous Coward

        Re: GM Foods

        >> Seasonally, at a pace which would allow for relatively easy control.

        Yes, like mosquitos.

    2. Captain DaFt

      Re: GM Foods

      Ah, it's not likely that could ev... Oh, Fuuuu...

  4. Steve Knox
    Facepalm

    Really?

    "One of the most difficult problems for us to address ... is the software itself,"

    This ... from a software company.

    1. Anonymous Coward

      Re: Really?

      "This ... from a software company."

      No, Microsoft !

  5. A Non e-mouse Silver badge

    Release procedures

    So they tested their new software on a few servers, then hit the big button to roll it out worldwide at once? I've never run an operation the size of Azure, but I'm fairly certain I wouldn't do everywhere all at once.

    Read the news reports of Google, Amazon & Facebook releasing new features, and you see them slowly release their new features across their different geographical locations.
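The gradual, region-by-region release described in the comment above can be sketched roughly as follows. This is only an illustration of the general idea, not how Azure or any other provider actually deploys: `staged_rollout`, `deploy`, `healthy` and the region names are all hypothetical.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    regions: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Update one region at a time and stop at the first unhealthy one.

    Returns the regions that were updated and passed their health check.
    All names here are hypothetical placeholders for illustration.
    """
    updated: List[str] = []
    for region in regions:
        deploy(region)           # push the new release to this region only
        if not healthy(region):  # check it before touching the next region
            break                # halt: later regions keep the old release
        updated.append(region)
    return updated


if __name__ == "__main__":
    touched: List[str] = []
    ok = staged_rollout(
        ["us-west", "europe", "asia"],
        deploy=touched.append,
        healthy=lambda region: region != "europe",  # pretend europe breaks
    )
    print("healthy after update:", ok)
    print("regions touched:", touched)
```

The point of the design is that a bad release only reaches the regions processed before the failed check; everything later in the list stays on the previous version.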

    1. keithpeter Silver badge
      Holmes

      Re: Release procedures

      @ A Non e-mouse

      As the Microsofty said in the article, they have to run just the one instance of this front end of theirs. So effectively there is only one region for Azure as regards the front end.

      Perhaps a larger sample for testing will be used next time.

    2. Hans 1
      Windows

      Re: Release procedures

      Sure, but Google, Amazon, and Facebook don't have three big teams of office cleaners, the first tasked to write software, the second with system administration, and the third walks around Redmond campus with brooms ...

      I wonder how many more single points of failure they have architectured into this crapCloud™ - we already had the two certificate cock-ups.

      BTW, I live in a place called "Cote d'Azure" ... nothing to do with Redmond Washing, though.

      1. Anonymous Coward

        Re: Release procedures

        Erm, you know Azure has had fewer major outages than Amazon?

      2. Neil Greatorex

        Re: Release procedures

        "they have architectured"

        Please go and wash your mouth out!

  6. Anonymous Coward

    Azure asked for it - front end 'bricked it'

    As someone experienced producing bootloaders for flash-based embedded systems - you never make your 'frontend' (the bootloader/firmware updater/configurator) depend in any way on the application (what the bootloader is supposed to load and run), because if the application fails, then you've no way to update the system - hence 'bricking it'.

    This lack of separation would be painfully obvious to anyone developing low-level firmware - because if you make this mistake, it'll only be once in your lifetime and you'll never ever forget it. Surprising that a company of MS's monstrous size can make such an elementary design error on such a huge system.

  7. Will Godfrey Silver badge

    Translation

    We're new to this sort of thing, but think we know better than everyone else.

  8. John Smith 19 Gold badge
    Unhappy

    "Red Dog Front End"

    Somehow I always recall the phrase from the film "Ransom," of a "Lying assed dog."

    Funny how memory works?

  9. Anonymous Coward

    proof that microsoft really don't have a fucking clue what they're doing

  10. oldcoder

    It also means that there is still a single point of failure in their system...

    So what else is new? They can't program for reliability at all.

  11. Dan 55 Silver badge
    Holmes

    How many worldwide wobbles are we up to now?

    If I could be bothered to go back over the headlines I could find out but they seem to be a fairly regular occurrence.

    1. Will Godfrey Silver badge

      Re: How many worldwide wobbles are we up to now?

      ... don't they claim 99.999% uptime.

      ... starting 2020

    2. John P

      Re: How many worldwide wobbles are we up to now?

      Already posted this once, for some reason it didn't make it!

      I think we're up to 2 or 3 this year, 2 major ones which caused a lot of downtime. This one was just a deployment inconvenience, not even remotely a "meltdown", but definitely a "cock-up"

      Complex, distributed global systems are hard, no surprise there. All cloud providers have had down time in recent years, it would be useful if someone kept a tally of number/severity of outages so we could see who was faring worse.

      "Cloud" is a trade off, you exchange the convenience of not having to deal with managing your own infrastructure for potentially less up time if their systems have problems. To me this is a worthy trade-off, to others possibly not.

  12. SVV

    Single Point of Failure

    "One of the most difficult problems for us to address which creates a single point of failure in the system is the software itself," Neil said.

    If this is the case then the real single point of failure lies somewhere in the management who let this architecture loose. I've worked in several large hosting environments, and the idea that you'd rely on a software based solution to manage access to all server instances without massively comprehensive and time consuming testing of new versions of the software smacks of poor risk management and trying to do things on the cheap - a complete no no for this type of service.

    But hey! The cloud is the latest cool IT trend! My boss read in a magazine that everybody has to move to the cloud!

    1. Hans 1
      Facepalm

      Re: Single Point of Failure

      >My boss read in a magazine that everybody has to move to the cloud!

      Same here ...

      Sad, sad world ...

    2. Roland6 Silver badge

      Re: Single Point of Failure

      "One of the most difficult problems for us to address which creates a single point of failure in the system is the software itself," Neil said.

      It would seem that this proves the wisdom in using software developed by different teams...

      1. Anonymous Coward

        Re: Single Point of Failure

        Not really, because you still have a single system, whether it's developed by different teams or not. This is what he is getting at - the software itself is an individual system, therefore it represents a single point of failure. Albeit one which an extraordinary amount of effort has gone into making reliable. The only way you're going to get out of the software as a SPOF is if you take an Aeroplane style development system where you have different code written on different hardware and using different languages in a voting system, but this is obviously overkill and going to be extremely expensive for a cloud based system.
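The "voting system" mentioned in the comment above is usually a majority vote over the outputs of independently written implementations. A minimal sketch, assuming the implementations return directly comparable values; `majority_vote` is a hypothetical name and real avionics voters are far more involved:

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of implementations.

    Raises RuntimeError when no value has more than half of the votes,
    so that a disagreement is reported instead of silently guessed.
    """
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among implementations")
    return value


if __name__ == "__main__":
    # Three hypothetical independent implementations; one of them is buggy.
    outputs = [42, 42, 41]
    print(majority_vote(outputs))
```

A single faulty implementation is outvoted, but the scheme only helps when the implementations fail independently, which is why the comment talks about different code, hardware and languages.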

    3. Getriebe

      Re: Single Point of Failure

      "If this is the case then the real single point of failure lies somewhere in the management who let this architecture loose. I've worked in several large hosting environments, and the idea that you'd rely on a software based solution to manage access to all server instances without massively comprehensive and time consuming testing of new versions of the software smacks of poor risk management and trying to do things on the cheap - a complete no no for this type of service."

      mmmm .... I do not think the reason for the failures is as simple as this

      We have many servers in Azure and they continued to work. No one has reported to me any downtime for our customers I have to report to them.

      I am surprised if the Pink Poodle is the same installation for all of their groups of servers, and so perhaps one or two were fucked but others weren't. But I am speculating.

  13. Anonymous Coward

    This will continue to happen as there is a fault in the implementation of NSPOF.

    "Due to the way Azure is built, there can only be one RDFE functioning worldwide at any one time"

    Why do some companies never do things right? Is it that they can't be arsed or is it "good enough" to earn some cash?

  14. tirk
    Unhappy

    "...infrastructure that scales"

    ...the bugs scale too, apparently.

  15. Parax

    infrastructure that scales

    Infrastructure that scales.

    1. Will Godfrey Silver badge
      Happy

      Re: infrastructure that scales

      Oh very good sir. May your skies be cloudless
