The insane thing about it is...
that most of the things people do on that cloud are things you could do at home with an extremely modest server from 20 years ago. E-Mail and storage aren't particularly hard things to do.
Microsoft has explained how a cascading series of cockups left some of its Northern European Azure customers without access to services for nearly seven hours. On September 29, the sounds of "Sacré bleu!" "Scheisse!" and "What are the bastards up to now?" were, we're guessing, heard from Redmond's Euro clients after key …
You’ve never worked with Exchange Server have you?
First off, many companies consider email mission critical. Email is customer-facing, and it's used by everyone. I'd rather have the accounting system down for a few hours than email.
Second, Exchange Server is complicated... When you have hundreds or thousands of users it becomes a b1tch to maintain. Managing spam, phishing, legal archival, backups, etc. is a pain in the a$$.
"Second, Exchange Server is complicated... "
Sure - it's an enterprise grade solution.
"When you have hundreds or thousands of users it becomes a b1tch to maintain. Managing spam, phishing, legal archival, backups, etc. is a pain in the a$$."
It's easier than any other on-site option I am aware of for doing all those things!
Exchange is way more than just email.
I've recently moved from Exchange in my home office to Office 365. So pleased to get rid of the admin overhead of Exchange. Every OS upgrade was fingers crossed and way too complicated.
Yes, Exchange is too complicated for an email solution, but if that's all you think it does, you're missing the point.
"Yes an exchange server = email done the WRONG way."
You're looking at it the wrong way. Exchange is a multi-user calendar system that just happens to also do email, and if you'd ever tried getting a working calendar system for more than a few hundred users, you'd understand why Exchange still makes money.
Well said.
I was an Exchange 2003 admin once, and spam became a real PITA. Fortunately (or unfortunately?) the system b0rked itself after a power outage, and the company decided to outsource the email to a hosted Exchange.
Benefits
- somebody else's problem with dealing with spamz0rz and haxx0rz
- somebody else's problem dealing with the DataStore on Exchange
- somebody else's problem dealing with backups
Drawbacks
- adding users may take a bit longer
- some issues take longer to address
But in general it is a great deal better as I don't have to waste my time dealing with Exchange and its quirks anymore, and can focus more on other matters.
FWIW Exchange is a good product, and is reliable when set up properly. It went downhill with Exchange 2010 and higher, which is a pity.
"It went downhill with Exchange 2010 and higher, which is a pity."
Nope, the newer 2010, 2013 and 2016 versions are very good and there are many design, scalability, resilience, maintenance and functionality improvements. 2007 was very flaky and scalability limited in comparison.
Exchange 2010 was the pinnacle of the product for on-premises customers. 2013 and beyond were designed solely to meet Microsoft's cloud needs. It's a great solution if you have multiple datacenters and thousands of servers that you buy by the truckload, and can afford a dozen or more copies of each database.
"Exchange Server 2013 and 2016 however, are different beasts. They are now designed and optimized for Microsoft's use in the cloud and are not fit for most on-premises use."
As someone who is Exchange certified and architects and runs large installs, I can tell you that you are wrong. There are many onsite advantages to the newer Exchange versions.
OMG! A few hundred or thousands of users?
Ha ha ha ha I WISH we had that TINY TINY LOAD!
I deal with PETABYTES of data PER DAY!
I deal with 500,000 64-kilobyte input/output requests PER SECOND PER SERVER!
I deal with files that are 100 Terabytes in size!
I can have TEN MILLION SIMULTANEOUS real connections and another few MILLION computer-simulated virtual users on an in-house platform.
40 Gigabit connections are TOOO TINY to fit my needed bandwidth!
I use MANY CUSTOM terabit fibre interconnects and THOUSANDS OF GPUs as mini-HTML/SQL servers!
Your data server requirements are PIPSQUEAK SMALL compared to some people!
So YES, email/HTML/SQL for 1000 users can ALL be done in-house with a $2000 (1500 Euro) server and some GPU cards to offload the tasks to!
"things you could do at home with an extremely modest server from 20 years ago"
Such as:
Redundant power
Redundant A/C
Redundant Internet connections
Low latency transport links to backbone
Physical Security
Audited and certified systems and procedures
Sneering at the cloud is really easy until you actually think about it.
Sneering at the cloud is really easy until you actually think about it.
Totally correct. Nobody does this at home except the youthful hacker clubbe and people who think they are consultancy-grade but are actually lacking lots of clues.
If you have the money, you might want to stay off the public cloud and rent a few racks in a secure datacenter in the 'burbs, but then it's up to you to manage the hardware/software, which actually costs a bunch of money, especially if you want to harden it against lots of failure modes.
Not shutting down the AC during a fire event is the best way to spread the fire while feeding it fresh oxygen. Should also have fire dampers that close off all the ducting so fire does not spread through them.
In this case it was the wrong thing to do; they should have burnt it to the ground and started over. I gave you an upvote, as Azure sucks big time.
Er....? Unless you know exactly where your 'Cloud Service' is being served on every second of every day then how do you know that it is physically secure?
Come on now, there must be a PhD or two in verifying Cloud Physical Security.
How do you know that the backup to that swanky Azure (other cloud services are available) is not a few old P4's housed in the back of Achmed's Kebab Shop in Kentish Town? (Other kebab shops are available)
Do you really know for sure and not what the cloud snake oil salesmen tell you?
A modest server hardly requires redundant A/C - and basically all the stuff in your list that actually matters in real life is readily achievable for even quite a small company with nothing more than a dedicated, well-ventilated IT room.
Redundant power - UPS capable of handling several hours of outage is easy to get and even a small petrol generator could easily be kept on hand in the unlikely case of an outage lasting any longer than that.
Redundant Internet connections - Easy. (And even a 3/4G last ditch option would be plenty for an Email server)
Low latency links to backbone... hardly necessary for the majority of companies, especially if the bulk of their IT is based on one site.
Physical security... really?
Audited and certified systems and procedures... whatever. In practice, plenty of companies get along much better with just a bit of personal responsibility and good old common sense. If your IT staff consists of a handful (or fewer) of reliable, competent individuals that work well together they'll make sure that nothing too stupid is likely to happen.
You can keep your cloud. It's just your data on a pile of other people's computers, managed by fallible humans you can't speak to, the whole edifice waiting to fall over when any one of the billion or so sequences of events occurs that wasn't covered by the "certified procedures".
I'm sort of with Christian Berger on this.
It depends on your use case. If it's just a family server (or servers), even tho' I'm no IT bod, I'd do it in house.
A few hours/days of inconvenience isn't a biggie.
If you're a very small SME then you could get away with off-site cloudy email and backup and a decent landline/mobile backup.
I think you see where I'm going.
@garetht t, you are, of course, spot on too for many use cases. PP
ICON> If you need resilience and guaranteed uptime and it doesn't work.
Doesn't have to be a server. I've just bought and installed new fans on this Mid 2010 Macbook Pro inherited from my daughter. The fan noise was becoming VERY distracting. The surgery was really quite simple. I've done far harder.
But oh, the silence! the lack of vibration! Bliss.
> But who wants a SAN-attached, loud-fan-blowing server running 24/7 in their home.
I have a self-built VMWare ESXi server and a NAS running 24x7. The HP MicroServer that runs the NAS is give-or-take silent, the PSU fan in the VMWare server is quiet enough that I don't notice it.
The noise factor has stopped me getting a cheap ex-corporate server off eBay though. We once powered up a de-racked ProLiant DL-something in the office, damn that thing was loud. Lots of tiny screaming fans.
Re: sprinklers
Not a terrible idea if you have an inert gas suppression system. The gas should knock down any major fire long before the sprinklers trigger. The sprinklers act as a backup, so if the crap really hits the fan, you will have some soggy hard drives from which to extract data instead of a crispy pile of melted parts.
in theory,
read the article, they're low level, physical, real world problems,
but clouds are so ethereal, each level of abstraction increases complexity.
If you didn't use the azure cloud, you wouldn't have a problem.
KISS
oh, azure is the colour of a clear sky, nothing to do with clouds,
MS cockup again
To be fair, I don't think we'd hear about those at all. I imagine most cloud hosting sites would rather not let the customers know there had been a problem.
My experience has been that most PHBs I've been involved with would rather pretend there are no problems than tell the customer every time there's been a problem which hasn't impacted the customer. It's the same mindset, I guess, that thinks those 9s come from writing the SLA, not good design and careful planning.
We have a cloud supplier who has a global presence. As Admin on services we sell on that platform I get 6 emails from them whenever an "issue threshold" is breached. For clarity I'm in the UK.
1/ There have been reports of issues with X in Shanghai/Hong Kong.
2/ We're investigating issues with x in Shanghai/Hong Kong.
3/ We have identified a probable cause and applied a fix for x in Shanghai/Hong Kong.
4/ We're monitoring x in Shanghai/Hong Kong.
5/ No further incidents of x have occurred in Shanghai/Hong Kong or anywhere else.
6/ Issue is now resolved.
Naturally, I shrug and go ho hum, but should one of our users call and say "I'm trying to do x with Shanghai/Hong Kong", here in the trenches I can say I know and it's being worked on. In my experience emails 1-6 rarely take more than 45 mins.
All cloud providers have outages somewhere in the services they provide. It's how you communicate that down the channel that counts.
I'm sure you've all had outages and been unable to make progress on investigating and fixing because your colleagues/customers keep ringing to tell you, you have an outage. :(
PP
So, let's count down the failures:
- VMs were axed
- Backup vaults were not available
- Azure Site Recovery lost failover ability
- Azure Scheduler and Functions dropped jobs
- Azure Monitor and Data Factory experienced pipeline errors
- Azure Stream Analytics went on the fritz
Apart from that, the Cloud is marvelous, never fails you and you can always access your data.
Except when it FUBARs and no backup is working any more, but the salespeople will never tell you that.
"the Cloud is marvelous, never fails you and you can always access your data."
That's a strawman - you're attacking a claim nobody actually made just so you can knock it down.
The cloud doesn't guarantee anything except possible failure, and you are massively encouraged to architect your systems against failure. High-availability systems across availability zones, backup systems in different geographic regions.
The people highest on their horse on this page against the cloud are the people who know the least. How infuriating!
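To put the "architect against failure" point in concrete terms, here is a minimal sketch of region failover: try a primary endpoint, fall back to a secondary. The region names and URLs are made up, and the transport is injected as a callable so only the failover logic is shown.

```python
# Hypothetical regional endpoints -- names/URLs are illustrative only.
ENDPOINTS = [
    "https://eu-north.example.com/api",  # primary region (assumed)
    "https://eu-west.example.com/api",   # secondary region (assumed)
]

def fetch_with_failover(endpoints, fetch):
    """Return the first successful fetch; on failure, try the next region."""
    last_error = None
    for url in endpoints:
        try:
            return fetch(url)
        except OSError as err:  # ConnectionError and friends subclass OSError
            last_error = err    # this region is down; fall through to the next
    raise RuntimeError(f"all regions failed: {last_error}")
```

Injecting `fetch` also makes the failover path trivially testable without taking a real region down, which is exactly the kind of testing the thread is arguing about.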
Most likely those folks know that architecting for failure in the cloud is a pretty rare thing - just look at how many customers have outages when the cloud goes down.
Hell, I have seen developers complain about TCP connections being dropped during an LB failover (takes about 1 second) because their app couldn't even handle that without a restart. And this is for a new application stack, not something designed 10 or 15 years ago. I could go on and on with other real scenarios easily.
Building apps with single points of failure is very common still.
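For what it's worth, surviving that one-second LB failover is a small amount of client code. A minimal sketch (the `connect` and `operation` callables are stand-ins for whatever client library is in use; retry counts and backoff are made up):

```python
import time

def with_reconnect(operation, connect, retries=3, backoff=0.5):
    """Run operation(conn); on a dropped connection, reconnect and retry.

    A one-second LB failover should cost one retry, not an app restart.
    """
    conn = connect()
    for attempt in range(retries + 1):
        try:
            return operation(conn)
        except ConnectionError:
            if attempt == retries:
                raise                            # out of retries; give up
            time.sleep(backoff * (attempt + 1))  # brief backoff before retrying
            conn = connect()                     # re-establish the connection
```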
I remember, what was it, a decade ago or so: a fire at a data center in Seattle, a facility that had had at least annual power outages for 2 or 3 years prior. The Bing travel site was in that data center. It was down for a long time. Maybe MS got it online before the datacenter came back up with external generator trucks about 40 hours later, not sure (this was a colo facility, not an MS datacenter).
The point is, 10 years ago isn't that long, and a company with the size and resources of MS wasn't willing or able to do it for Bing travel at the time (hell, even I had the foresight to move the company I was with at the time out of that DC 2 years before the big outage), so it doesn't surprise me that companies a fraction of the size still can't figure it out today. It's not as if it's impossible, it is just very difficult to do, and most talk the talk but won't walk the walk when it comes down to it.
Same situation applies to security of applications.
High-availability systems across availability zones, backup systems in different geographic regions.
In Theory, maybe, problem is, Slurp held it wrong, else customers would not have noticed.
What I do not understand is why do people go with AWS or Azure ?
Multiple providers offer OpenStack; you can get service from two or three of them to do ultra-high availability and disaster recovery - same stack, MUCH easier to implement... if you really wanna go cloud, that is. What are the chances of two or three OpenStack vendors failing at the same time, vs AWS or Azure?
The people highest on their horse on this page against the cloud are the people who know the least. How infuriating!
Generalization, not good.
If you back Azure, your opinion does not count.
"same stack, MUCH easier to implement "
You are kidding, right? OpenStack is WAY more complex and fiddly to implement and use than, say, Azure. You have to edit text files to store config for a start - how prehistoric and insecure. For instance, how do you control ACLs for, and audit changes to, just one setting in a text file?!
AC as details
Not working for a huge company, so for us the cloud has one great advantage: the ability to automatically "spin up" additional resources if required (dealing with activity spikes).
Yes, that could be done "on site", but it would mean a lot of (expensive) kit doing very little much of the time, just sitting there waiting for an activity spike.
Other advantages - let's talk Azure here - include the Azure SQL "Point in Time" restore functionality, which removes all that db backup burden, and the geographical replication/failover stuff (which protects against some cloud failures).
If you are a huge company then enough onsite "iron" for those rare peaks is probably viable, and multiple geographically distributed replicating data centres is viable but not for many smaller outfits: Cloud is not perfect, but it's useful for some of us.
The lesson for today is that you should never assume a cloud provider's operations are 100%. I hate having to explain to people why we need to have an instance of our service in more than one region. "But it's so expensive! My cloud salesman assured me that each region is interconnected data centers miles apart and they are nearly incapable of failing!"
It's all just computers and data centers, even if it's very much software-defined and very resilient. If humans and computers are involved, something will eventually go wrong.
It's all just computers and data centers, even if it's very much software-defined and very resilient. If humans and computers are involved, something will eventually go wrong.
Just to simplify things: Murphy's Law applies to everything. Manglement seems to forget that.
This all happened because you lost an AHU? (I'm assuming not all of the AHUs in the data centre were stopped, just some, in the allegedly fire-affected region)
So the rack temperature starts to rise, quite rapidly, because you no longer have moving air, to carry away your excess heat. At what point, do you think, it might be a good idea to have graceful shutdowns of the affected racks, that have lost their conditioned air. You know, triggered by some kind of flow sensor, or a delta-p switch across the AHU fan?
I look after many dozens of air-handling systems. Even with two motors to each fan, and multiple belts, they do break down occasionally.
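The interlock described above can be sketched in a few lines: a delta-p switch across the AHU fan plus a rack inlet temperature sensor feeding a shutdown decision. The thresholds and sensor readings here are purely illustrative, not real data-centre setpoints.

```python
SHUTDOWN_TEMP_C = 45.0   # rack inlet temperature limit (made-up value)
MIN_DELTA_P_PA = 50.0    # below this pressure drop, assume airflow is lost

def should_shut_down(inlet_temp_c, fan_delta_p_pa):
    """True when the rack has lost conditioned air or is already too hot."""
    airflow_lost = fan_delta_p_pa < MIN_DELTA_P_PA
    too_hot = inlet_temp_c >= SHUTDOWN_TEMP_C
    return airflow_lost or too_hot

# A real controller would debounce the sensors and invoke the platform's
# orderly-shutdown hook for the affected racks, not just return a flag.
```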
I think large-scale graceful shutdown in this situation is probably really complicated, as these systems operate as a cluster: as nodes shut down, other things likely kick in to try to restore availability, maybe moving resources to other nodes. At some point you probably have to set a flag in the entire system saying it is down and take it all offline (at which point "graceful", from a customer standpoint, is out the window).
I think this happened during that semi recent big S3 outage.
Not as if these are just racks and racks of standalone web servers with local storage.
First thing I thought of too. As far as failures go, thermal ones are as gentle as failures can possibly get* - they're not instant, and you get a warning they're happening. If your cloud can't even handle that gracefully, what the ever-loving fuck is it good for, exactly...?
* ...well, unless the heatsink itself falls off your CPU. You know, because the retaining bracket snapped. And you only realize it because the fan suddenly snaps to full throttle for no good reason. At which point you remember an old YouTube video you once saw about an AMD CPU frying in milliseconds (the Intel one just throttled way down) due to the exact same cause, and you bash the power switch mightily. Yes, it survived - new bracket, I'm still using it...
This post has been deleted by its author
"It was easier to make up the whole fire extinguisher thing than to own up about the real reason: forced Windows updates borked the cloud, then all the servers got confused uploading telemetry information to themselves"
the sad thing is that at the time of writing this, 14 loons had upvoted the above statement - I suspect only because they dislike Microsoft. Or would they like to share the evidence to back the statement up?
'The problems started when one of Microsoft's data centers was carrying out routine maintenance on fire extinguishing systems, and the workmen accidentally set them off. This released fire suppression gas, and triggered a shutdown of the air con to avoid feeding oxygen to any flames and cut the risk of an inferno spreading via conduits. This lack of cooling, though, knackered nearby powered-up machines, bringing down a "storage scale unit."'
I thought the 'cloud' was immune to failures at a single location. When a VM instance fails at one location, another is started up elsewhere. What happens to 99.999% up time when you have a real fire?
Where is the checklist?
Where is the failsafe switch?
Where is the oversight?
Every DC I worked in had a master control to switch off before work started and everyone was accompanied while working on site to prevent outages.
Seems like it's time to find a new supplier and manager.
The more I think about Achmed's Kebab Shop in Kentish Town, the more I think we're all being fooled.
Not only does Achmed help serve up BT's local OpenZone service (if the shop uses a BT HomeHub), but if the business owner's pc has BitTorrent installed then there is a possibility he is a contributor to a film you may be watching (I'm sure I read somewhere that Microsoft are using BitTorrent techniques to serve up updates since the advent of W10). How do we know that Azure/AWS does not "sub-contract" in a similar way? AFAIK there is no agreement between BT and Achmed as to whether BT can use Achmed's Broadband connection for providing BT's Public WiFi service - BT being a big company y'know. Plus (I'm sure I've said this before), do Azure rent capacity from AWS and vice versa?
Interesting comments on this thread. I wouldn't say I'm against hosting, there have been some very good examples given here of the benefits. However, I think that when moving from on site to hosted the potential issues are not planned for. Multiple geographically spread instances with redundant networking (yours and theirs) should be a minimum requirement you'd think?
We've hosted services with Azure and I'm not aware of them having had outages either, which is good - either not in an affected location or resilient enough to keep going.
This post has been deleted by its author
Every time there is some cock-up with Azure, Microsoft give a load of pathetic assurances that it will not happen again and that they are always improving. In our case, under very specific circumstances, there was data loss on a VM.
If this, or even a far more minor event (a single VM host falling over), had happened in our data centre, there would have been people screaming from the rooftops. But this, because it is in Azure, is just accepted - not even a whisper from upstairs.
On site is constantly under scrutiny and has to provide a far better service and then there are complaints about the cost.
It always seems to be the testing that brings these things down; testing of any kind carries a risk that it won't go to plan and fail over gracefully. Mitigating that risk by not testing isn't an option, and having on-prem doesn't make you immune from a technician/inspector accidentally pushing the big red button.
How certain are you? Cool, go on then, go and set off your fire systems and it'll be like a ballet: watching everything seamlessly and gracefully fail over, migrate and shut down :)
Interesting, we have clients in NE and they were not affected.
I presume the referred-to Microsoft services are all distributed across the data centre, so this event, while disruptive to some, only disrupted a limited number of services for a limited number of clients.
The funny thing is when reading comments or talking to people about Cloud, as soon as they implement a service in Azure they automatically have an expectation of 100% availability and nothing will ever fail.
A lot of this comes back to lack of understanding, if you want availability, you still need to architect your service correctly, even on a public cloud, and this will ultimately increase costs.
Probably? Not a certainty then.
The other point is whether your data had been replicated to that other geo. How can you guarantee that the data you are now looking at through that different conduit is current? And how would you know for certain that it is up to date?
Replication was I believe one of the issues with Lotus Notes. Have lessons been learned?
"The other point is whether your data had been replicated to that other geo. How can you guarantee that the data you are now looking at through that different conduit is current? And how would you know for certain that it is up to date?"
By say using synchronous replication and time stamps as one of several options.
"And how would you know for certain that it is up to date?"
Because transactions will only be committed once replicated to all live sites.
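That commit rule can be sketched very simply: a transaction only reports success once every live replica has acknowledged it. The replica objects here are hypothetical stand-ins for whatever replication transport a real system would use.

```python
def commit(txn, replicas):
    """Apply txn to all live replicas; commit only if every one acknowledges."""
    acks = [replica.apply(txn) for replica in replicas]
    if all(acks):
        for replica in replicas:
            replica.mark_committed(txn)  # safe: everyone has the data
        return True
    return False  # at least one replica failed -- the txn is not durable
```

Real systems add quorums and timeouts on top of this, since waiting for literally every replica stalls commits the moment one site goes dark - which is the trade-off behind synchronous replication.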
"The note about dirty shutdowns indicates that there was no communication between the cooling system and the servers."
Quite probably so. The failsafe is that the servers will shut down at a critical temperature - which is likely a better solution in most cases, as stuff that doesn't get too hot won't shut down.
Shutting a massive cloud system down cleanly in a hurry is simply not likely to be possible in under tens of minutes anyway, so that's likely another reason why they don't do it.
AWS have been running Availability Zones (AZs) for years.
Fully isolated zones (clusters of data centers) with low-latency connections. This simply wouldn't happen in AWS. Microsoft have been in such a rush to expand their footprint that they have not done a great job here - a single isolated event takes out an Azure region. This is terrible.