back to article The TRUTH behind Microsoft Azure's global cloud mega-cock-up

Windows Azure suffered a global meltdown at the end of October that caused us to question whether Microsoft had effectively partitioned off bits of the cloud from one another. Now we have some answers. After a bit of prodding, Redmond sat us down with Windows Azure general manager Mike Neil, who explained to us why a sub- …

COMMENTS

This topic is closed for new posts.
Gold badge

Wow...

It seems odd to me that if this system can only have a single front end... it's doing something else (scheduling or moving jobs maybe) that is not really the job of a front end. Hopefully it's not an indicator of further architectural faults.

I was going to insert a Microsoft bash, but *shrug*. It's true, it takes time to work out the bugs on complex systems. IBM mainframes are reliable as can be *now* (and probably the last 30-40 years), but they apparently also had crashes aplenty through the 1960s and 1970s.

4
0
Bronze badge

Re: Wow...

"There are three truths of cloud – machines will fail, software has bugs, people will make mistakes,"

So it would seem this one goes down as a people mistake, as only a person could of designed Azure so that there was only one functioning RDFE worldwide at any one time.

1
0
Anonymous Coward

Re: Wow...

Azure doesn't have a single Front End or a single Point of Failure in this regard.

Multiple systems were updated with the problematic software...Potential for a similar issue exists in any cloud system with a consistent OS, fabric, controller, firmware, etc, etc...

Perhaps roll outs should have been cascaded globally on a slower basis, etc, but that still doesn't make it a single point of failure....

0
0

Re: Wow...

"...only a person could of designed Azure..."

...only a person could HAVE designed Azure... I'll put that down as a people mistake :-)

1
0
Bronze badge

Soooooooooooooooo

The fabric controller started spewing gibberish?

http://forums.theregister.co.uk/forum/1/2013/11/08/microsoft_azure_networking_feature/

0
0

Re: Soooooooooooooooo

Quick!! Give it a job in marketing!

7
0

GM Foods

Be thankful.

If this was a genetically modified food rollout, they wouldn't be able to do a recall, and it would self replicate.

You may make bugs and failures in software and hardware but at least you can do product recalls and upgrades.

Asparagus apocalypse is real and it's coming to a farm near you!

2
9
Silver badge

Re: GM Foods

If this was a genetically modified food rollout,

Okay, completely different industry, but let's see where this goes...

they wouldn't be able to do a recall,

Yes they would. See, for example, http://en.wikipedia.org/wiki/Starlink_corn_recall and http://www.organicconsumers.org/gefood/canolarecall.cfm

and it would self replicate.

Seasonally, at a pace which would allow for relatively easy control.

All of this ignores the fact that GMO are much more regulated that software and tech. services. See http://en.wikipedia.org/wiki/Regulation_of_the_release_of_genetically_modified_organisms for starters.

7
5
Anonymous Coward

@Steve Knox - Re: GM Foods

Regulated my leg! Tell Monsanto about that!

4
0
Silver badge

Re: GM Foods

Ah, it's not likely that could ev... Oh, Fuuuu...

6
0

Re: GM Foods

They issue the recall, the GM crop just continues to produce seeds regardless. It's almost as if the duplication mechanism doesn't care about the recall notice.

You can't stop invasive plants spreading across Britain, so you can't stop invasive GM plants either.

"GMO are much more regulated that software"

Create a law making it illegal for invasive species to spread across Britain, does it work? Of course not, they're just words if there's no way to enforce those words.

6
1
Silver badge

Re: GM Foods seasonally?

seasonally alas no - once the change gets into other plants then they can replicate it every few weeks.

And recalling grain in grain elevators is of no use when it's in the wild plant population. GM modifications have been found in almost all wild maize in Mexico.

4
0
Silver badge

Re: GM Foods

> You can't stop invasive plants spreading across Britain, so you can't stop invasive GM plants either.

Is there a need? Won't these plants just die on their arse when challenged by the natural variety in their usual habitat?

0
0
Silver badge
Thumb Up

Re: GM Foods

> http://www.cracked.com/article_18503_how-biotech-company-almost-killed-world-with-booze.html

Haha! This is like Frederik Pohl's "Shaffery Among the Immortals".

Analysis that's not from the weekend science press requested. One would suppose that this kind of trick might be found by mother nature a few times per day, greygooing your garden through purely natural means (100% green!). That doesn't routinely happen though. Maybe it just happens every 300 years...

0
0
Silver badge

Re: @Steve Knox - GM Foods

> Tell Monsanto about that!

But Monsanto happens to be the regulator.

3
0
Anonymous Coward

Re: GM Foods

It would be very interesting to know if the three up votes for Steve Knox came from an IP address range belonging to Monsanto...

0
0
Roo
Bronze badge

Re: GM Foods

"> You can't stop invasive plants spreading across Britain, so you can't stop invasive GM plants either.

Is there a need? Won't these plants just die on their arse when challenged by the natural variety in their usual habitat?"

I wouldn't bet my house on that, especially if the GM variety is specifically engineered to be more resistant to pests/diseases and pesticides. Of course if the variety is also engineered to grow faster or yield more heavily the chances are it will starve the native species of sunlight. This kind of thing happens already with varieties bred by more traditional methods.

0
0
Bronze badge

Re: GM Foods

Is there a need? Won't these plants just die on their arse when challenged by the natural variety in their usual habitat?

Like Japanese Knotweed, American Stink Cabbage, Himalayan Balsam to mention just three?

One of the problems with these invasive plants is that the insects which would normally be expected to devour them haven't yet been imported/designed.

0
0
Anonymous Coward

Re: GM Foods

>> Seasonally, at a pace which would allow for relatively easy control.

Yes, like mosquitos.

0
0
Bronze badge

Re: GM Foods

Is there a need? Won't these plants just die on their arse when challenged by the natural variety in their usual habitat?

Often, yes, in which case there's no problem. But also often, no, because they find a niche to which they're already well-adapted; or no, because the ecological processes that formerly prevented species with similar adaptations from establishing themselves have been disrupted.

The last is what happened with prairie grasses in most of the US grasslands, for example. They are typically tall, slow-growing plants with very deep root systems. When established, they block most of the sun from fast-growing invasives; the ones that aren't controlled that way generally were done in either by drought (didn't have the roots to survive it) or burn-off. Burn-off, caused by lightning strikes, was particularly important - the native grasses could burn to the ground and regrow quickly from their root systems, while the burn would wipe out other plants.

When people started building permanent dwellings, cutting down the tall grasses, irrigating fields, and suppressing wildfires, the European invasive plants they brought (because guess which people these were) easily found niches in the modified ecosystem.

Invasive-species ecology is a whole complex field of study. Many people have made careers out of a single species (Asian carp, kudzu, Formosan termites ...).

0
0
Silver badge

Re: GM Foods

Yeah, that was one that 'could've happened', but luckily, didn't.

Here's one that did: Native rapeseed weeds gain immunity to herbicides from closely related GM canola plants.

I'm not against GM crops where the changes enhance the quality or nutritional value of crops, but changes that're simply for the benefit of herbicide companies, or rendering vital foodstocks sterile to ensure the value of patents (Seriously, no patents for living things, ya greedy bastards!) should be a big no-no!

0
0
Silver badge
Facepalm

Really?

"One of the most difficult problems for us to address ... is the software itself,"

This ... from a software company.

17
0
Anonymous Coward

Re: Really?

"This ... from a software company."

No, Microsoft !

9
0
Bronze badge

Release procedures

So they tested their new software on a few servers, then hit the big button to roll it out worldwide at once? I've never run an operation the size of Azure, but I'm fairly certain I wouldn't do everywhere all at once.

Read the news reports of Google, Amazon & Facebook releasing new features, and you see them slowly release their new features across their different geographical locations.
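For what it's worth, the region-by-region approach described above can be sketched in a few lines. This is only an illustration under stated assumptions: `staged_rollout`, `deploy` and `is_healthy` are hypothetical names, not anything Microsoft, Google, Amazon or Facebook have published, and a real pipeline would also need soak time and rollback.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    regions: Iterable[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy to one region at a time and stop at the first unhealthy one.

    Returns the regions that were updated and passed their health check.
    All names here are hypothetical, for illustration only.
    """
    updated: List[str] = []
    for region in regions:
        deploy(region)
        if not is_healthy(region):
            # Halt here so the remaining regions keep running the old version.
            break
        updated.append(region)
    return updated
```

The point of the design is that a bad build only ever reaches one region beyond those already verified, instead of everywhere at once.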

5
0
Bronze badge
Holmes

Re: Release procedures

@ A Non e-mouse

As the Microsofty said in the article, they have to run just the one instance of this front end of theirs. So effectively there is only one region for Azure as regards the front end.

Perhaps a larger sample for testing will be used next time.

0
0
Bronze badge
Windows

Re: Release procedures

Sure, but Google, Amazon, and Facebook don't have three big teams of office cleaners, the first tasked to write software, the second with system administration, and the third walks around Redmond campus with brooms ...

I wonder how many more single points of failure they have architectured into this crapCloud™ - we already had the two certificate cock-ups.

BTW, I live in a place called "Cote d'Azure" ... nothing to do with Redmond Washing, though.

0
0
Anonymous Coward

Re: Release procedures

Erm, you know Azure has had fewer major outages than Amazon?

0
0

Re: Release procedures

"they have architectured"

Please go and wash your mouth out!

0
0
Anonymous Coward

Azure asked for it - front end 'bricked it'

As someone experienced producing bootloaders for flash-based embedded systems - you never make your 'frontend' (the bootloader/firmware updater/configurator) depend in any way on the application (what the bootloader is supposed to load and run), because if the application fails, then you've no way to update the system - hence 'bricking it'.

This lack of separation would be painfully obvious to anyone developing low-level firmware - because if you make this mistake, it'll only be once in your lifetime and you'll never ever forget it. Surprising that a company of MS's monstrous size can make such an elementary design error on such a huge system.
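A rough sketch of that separation, in Python rather than real firmware code: the updater decides what to boot using only its own integrity check and never calls into the application. The function name `choose_boot_target` and the CRC-based check are assumptions made for illustration; actual bootloaders differ and typically use stronger verification.

```python
import zlib


def choose_boot_target(app_image: bytes, expected_crc: int) -> str:
    """Decide what to run using only checks the updater owns.

    Returns "application" if the stored image verifies, otherwise "updater",
    so a broken application can never lock out the update path.
    Hypothetical sketch; not taken from any real bootloader.
    """
    # An empty image is treated as missing, even though its CRC-32 is 0.
    if app_image and zlib.crc32(app_image) == expected_crc:
        return "application"
    return "updater"
```

Because the decision depends only on the image bytes and a stored checksum, a crashed or corrupt application still leaves the device able to accept a new image.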

6
0
Silver badge

Translation

We're new to this sort of thing, but think we know better than everyone else.

8
0
Gold badge
Unhappy

"Red Dog Front End"

Somehow I always recall the phrase from the film "Ransom," of a "Lying assed dog."

Funny how memory works?

1
0
Anonymous Coward

proof that microsoft really don't have a fucking clue what they're doing

6
0

It also means that there is still a single point of failure in their system...

So what else is new? They can't program for reliability at all.

4
0
Silver badge
Holmes

How many worldwide wobbles are we up to now?

If I could be bothered to go back over the headlines I could find out but they seem to be a fairly regular occurrence.

4
0
Silver badge

Re: How many worldwide wobbles are we up to now?

... don't they claim 99.999% uptime.

... starting 2020

2
0

Re: How many worldwide wobbles are we up to now?

Already posted this once, for some reason it didn't make it!

I think we're up to 2 or 3 this year, 2 major ones which caused a lot of downtime. This one was just a deployment inconvenience, not even remotely a "meltdown", but definitely a "cock-up"

Complex, distributed global systems are hard, no surprise there. All cloud providers have had down time in recent years, it would be useful if someone kept a tally of number/severity of outages so we could see who was faring worse.

"Cloud" is a trade off, you exchange the convenience of not having to deal with managing your own infrastructure for potentially less up time if their systems have problems. To me this is a worthy trade-off, to others possibly not.

2
0
SVV

Single Point of Failure

"One of the most difficult problems for us to address which creates a single point of failure in the system is the software itself," Neil said.

If this is the case then the real single point of failure lies somewhere in the management who let this architecture loose. I've worked in several large hosting environments, and the idea that you'd rely on a software based solution to manage access to all server instances without massively comprehensive and time consuming testing of new versions of the software smacks of poor risk management and trying to do things on the cheap - a complete no no for this type of service.

But hey! The cloud is the latest cool IT trend! My boss read in a magazine that everybody has to move to the cloud!

5
0
Bronze badge
Facepalm

Re: Single Point of Failure

>My boss read in a magazine that everybody has to move to the cloud!

Same here ...

Sad, sad world ...

0
0
Bronze badge

Re: Single Point of Failure

"One of the most difficult problems for us to address which creates a single point of failure in the system is the software itself," Neil said.

It would seem that this proves the wisdom in using software developed by different teams...

0
0
Bronze badge

Re: Single Point of Failure

"If this is the case then the real single point of failure lies somewhere in the management who let this architecture loose. I've worked in several large hosting environments, and the idea that you'd rely on a software based solution to manage access to all server instances without massively comprehensive and time consuming testing of new versions of the software smacks of poor risk management and trying to do things on the cheap - a complete no no for this type of service."

mmmm .... I do not think the reason for the failures is as simple as this

We have many servers in Azure and they continued to work. No one has reported to me any downtime for our customers I have to report to them.

I am surprised if the Pink Poodle is the same installation for all of their groups of servers, and so perhaps one or two were fucked but others weren't. But I am speculating.

0
0
Anonymous Coward

Re: Single Point of Failure

Not really, because you still have a single system, if it's developed by different teams or not. This is what he is getting at - the software itself is an individual system, therefore it represents a single point of failure. Albeit one which an extraordinary amount of effort has been made to make it not reliable. The only way you're going to get out of the software as a SPOF is if you take an Aeroplane style development system where you have different code written on different hardware and using different languages in a voting system, but this is obviously overkill and going to be extremely expensive for a cloud based system.
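To make the voting idea concrete, here is a minimal sketch assuming several independently written implementations of the same function. `majority_vote` is a hypothetical helper, not an aviation or Azure API, and it only works when results are hashable and comparable for equality.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_vote(
    implementations: Sequence[Callable[[int], Hashable]],
    request: int,
) -> Hashable:
    """Run every implementation and return the answer a strict majority agrees on.

    An implementation that raises casts no vote. Raises RuntimeError when no
    answer is backed by more than half of all implementations.
    """
    results = []
    for impl in implementations:
        try:
            results.append(impl(request))
        except Exception:
            # A crashed implementation is simply outvoted by the others.
            continue
    if results:
        value, count = Counter(results).most_common(1)[0]
        if count * 2 > len(implementations):
            return value
    raise RuntimeError("no majority among implementations")
```

With three implementations, one buggy or crashed version is outvoted by the other two, which is exactly why the approach costs roughly three times the development effort.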

0
0
Anonymous Coward

This will continue to happen as there is a fault in the implementation of NSPOF.

"Due to the way Azure is built, there can only be one RDFE functioning worldwide at any one time"

Why do some companies never do things right? Is it that they can't be arsed or is it "good enough" to earn some cash?

0
0
Unhappy

"...infrastructure that scales"

...the bugs scale too, apparently.

1
0
Bronze badge

infrastructure that scales

Infrastructure that scales.

2
0
Silver badge
Happy

Re: infrastructure that scales

Oh very good sir. May your skies be cloudless

0
0