The insane thing about it is...
that most of the things people do on that cloud are things you could do at home with an extremely modest server from 20 years ago. E-Mail and storage aren't particularly hard things to do.
Microsoft has explained how a cascading series of cockups left some of its Northern European Azure customers without access to services for nearly seven hours. On September 29, the sounds of "Sacré bleu!" "Scheisse!" and "What are the bastards up to now?" were, we're guessing, heard from Redmond's Euro clients after key …
You’ve never worked with Exchange Server have you?
First off, many companies consider email mission critical. Email is customer-facing, and it's used by everyone. I'd rather have the accounting system down for a few hours than email.
Second, Exchange Server is complicated... When you have hundreds or thousands of users it becomes a b1tch to maintain. Managing spam, phishing, legal archival, backups, etc. is a pain in the a$$.
"Second, Exchange Server is complicated... "
Sure - it's an enterprise grade solution.
"When you have hundreds or thousands of users it becomes a b1tch to maintain. Managing spam, phishing, legal archival, backups, etc. is a pain in the a$$."
It's easier than any other on-site option I am aware of for doing all those things!
Exchange is way more than just email.
I've recently moved from Exchange in my home office to Office 365. So pleased to get rid of the admin overhead of Exchange. Every OS upgrade was fingers crossed and way too complicated.
Yes, Exchange is too complicated for an email solution, but if that's all you think it does, you're missing the point.
"Yes an exchange server = email done the WRONG way."
You're looking at it the wrong way. Exchange is a multi-user calendar system that just happens to also do email, and if you'd ever tried getting a working calendar system for more than a few hundred users, you'd understand why Exchange still makes money.
Well said.
I was an Exchange 2003 admin once, and spam became a real PITA. Fortunately (or unfortunately?) the system b0rked itself after a power outage, and the company decided to outsource the email to a hosted Exchange.
Benefits
- somebody else's problem with dealing with spamz0rz and haxx0rz
- somebody else's problem dealing with the DataStore on Exchange
- somebody else's problem dealing with backups
Drawbacks
- adding users may take a bit longer
- some issues take longer to address
But in general it is a great deal better as I don't have to waste my time dealing with Exchange and its quirks anymore, and can focus more on other matters.
FWIW Exchange is a good product, and is reliable when set up properly. It went downhill with Exchange 2010 and higher, which is a pity.
"It went downhill with Exchange 2010 and higher, which is a pity."
Nope, the newer 2010, 2013 and 2016 versions are very good and there are many design, scalability, resilience, maintenance and functionality improvements. 2007 was very flaky and scalability limited in comparison.
Exchange 2010 was the pinnacle of the product for on-premises customers. 2013 and beyond were designed solely to meet Microsoft's cloud needs. It's a great solution if you have multiple datacenters and thousands of servers that you buy by the truckload, and can afford a dozen or more copies of each database.
"Exchange Server 2013 and 2016 however, are different beasts. They are now designed and optimized for Microsoft's use in the cloud and are not fit for most on-premises use."
As someone who is Exchange certified and architects and runs large installs, I can tell you that you are wrong. There are many onsite advantages to the newer Exchange versions.
OMG! A few hundred or thousands of users?
Ha ha ha ha I WISH we had that TINY TINY LOAD!
I deal with PETABYTES of data PER DAY!
I deal with 500,000 64-kilobyte input/output requests PER SECOND PER SERVER!
I deal with files that are 100 Terabytes in size!
I can have TEN MILLION SIMULTANEOUS real connections and another few MILLION computer-simulated virtual users on an in-house platform.
40 Gigabit connections are TOOO TINY to fit my needed bandwidth!
I use MANY CUSTOM terabit fibre interconnects and THOUSANDS OF GPUs as mini-HTML/SQL servers!
Your data server requirements are PIPSQUEAK SMALL compared to some people!
So YES, email/HTML/SQL for 1000 users can ALL be done in-house with a $2000 (1500 Euro) server and some GPU cards to offload the tasks to!
"things you could do at home with an extremely modest server from 20 years ago"
Such as:
Redundant power
Redundant A/C
Redundant Internet connections
Low latency transport links to backbone
Physical Security
Audited and certified systems and procedures
Sneering at the cloud is really easy until you actually think about it.
Sneering at the cloud is really easy until you actually think about it.
Totally correct. Nobody does this at home except the youthful hacker clubbe and people who think they are consultancy-grade but are actually lacking lots of clues.
If you have the money, you might want to stay off the public cloud and rent a few racks in a secure datacenter in the 'burbs, but then it's up to you to manage the hardware/software, which actually costs a bunch of money, especially if you want to harden it against lots of failure modes.
Not shutting down the AC during a fire event is the best way to spread the fire while feeding it fresh oxygen. Should also have fire dampers that close off all the ducting so fire does not spread through them.
In this case it was the wrong thing to do; they should have burnt it to the ground and started over. I gave you an upvote, as Azure sucks big time.
Er....? Unless you know exactly where your 'Cloud Service' is being served on every second of every day then how do you know that it is physically secure?
Come on now, there must be a PhD or two in verifying Cloud Physical Security.
How do you know that the backup to that swanky Azure (other cloud services are available) is not a few old P4's housed in the back of Achmed's Kebab Shop in Kentish Town? (Other kebab shops are available)
Do you really know for sure and not what the cloud snake oil salesmen tell you?
A modest server hardly requires redundant A/C - and basically all the stuff in your list that actually matters in real life is readily achievable for even quite a small company with nothing more than a dedicated, well-ventilated IT room.
Redundant power - UPS capable of handling several hours of outage is easy to get and even a small petrol generator could easily be kept on hand in the unlikely case of an outage lasting any longer than that.
Redundant Internet connections - Easy. (And even a 3/4G last ditch option would be plenty for an Email server)
Low latency links to backbone... hardly necessary for the majority of companies, especially if the bulk of their IT is based on one site.
Physical security... really?
Audited and certified systems and procedures... whatever. In practice, plenty of companies get along much better with just a bit of personal responsibility and good old common sense. If your IT staff consists of a handful (or fewer) of reliable, competent individuals that work well together they'll make sure that nothing too stupid is likely to happen.
You can keep your cloud. It's just your data on a pile of other people's computers, managed by fallible humans you can't speak to, the whole edifice waiting to fall over when any one of the billion or so sequences of events occurs that wasn't covered by the "certified procedures".
I'm sort of with Christian Berger on this.
It depends on your use case. If it's just a family server (or servers), even tho' I'm no IT bod, I'd do it in house.
A few hours/days of inconvenience isn't a biggie.
If you're a very small SME then you could get away with off-site cloudy email and backup and a decent landline/mobile backup.
I think you see where I'm going.
@garetht t, you are, of course, spot on too for many use cases. PP
ICON> If you need resilience and guaranteed uptime and it doesn't work.
Doesn't have to be a server. I've just bought and installed new fans on this Mid 2010 Macbook Pro inherited from my daughter. The fan noise was becoming VERY distracting. The surgery was really quite simple. I've done far harder.
But oh, the silence! the lack of vibration! Bliss.
> But who wants a SAN-attached, loud-fan-blowing server running 24/7 in their home.
I have a self-built VMWare ESXi server and a NAS running 24x7. The HP MicroServer that runs the NAS is give-or-take silent, the PSU fan in the VMWare server is quiet enough that I don't notice it.
The noise factor has stopped me getting a cheap ex-corporate server off eBay though. We once powered up a de-racked ProLiant DL-something in the office, damn that thing was loud. Lots of tiny screaming fans.
Re: sprinklers
Not a terrible idea if you have an inert gas suppression system. The gas should knock down any major fire long before the sprinklers trigger. The sprinklers act as a backup, so if the crap really hits the fan, you will have some soggy hard drives from which to extract data instead of a crispy pile of melted parts.
in theory,
read the article, they're low level, physical, real world problems,
but clouds are so ethereal, each level of abstraction increases complexity.
If you didn't use the azure cloud, you wouldn't have a problem.
KISS
oh, azure is the colour of a clear sky, nothing to do with clouds,
MS cockup again
To be fair, I don't think we'd hear about those at all. I imagine most cloud hosting sites would rather not let the customers know there had been a problem.
My experience has been that most PHBs I've been involved with would rather pretend there are no problems than tell the customer every time there's been a problem which hasn't impacted the customer. It's the same mindset, I guess, that thinks those 9s come from writing the SLA, not good design and careful planning.
We have a cloud supplier who has a global presence. As Admin on services we sell on that platform I get 6 emails from them whenever an "issue threshold" is breached. For clarity I'm in the UK.
1/ There have been reports of issues with X in Shanghai/Hong Kong.
2/ We're investigating issues with x in Shanghai/Hong Kong.
3/ We have identified a probable cause and applied a fix for x in Shanghai/Hong Kong.
4/ We're monitoring x in Shanghai/Hong Kong.
5/ No further incidents of x have occurred in Shanghai/Hong Kong or anywhere else.
6/ Issue is now resolved.
Naturally, I shrug and go ho hum, but should one of our users call and say "I'm trying to do x with Shanghai/Hong Kong", here in the trenches I can say I know and it's being worked on. In my experience emails 1-6 rarely take more than 45 mins.
All cloud providers have outages somewhere in the services they provide. It's how you communicate that down the channel that counts.
I'm sure you've all had outages and been unable to make progress on investigating and fixing because your colleagues/customers keep ringing to tell you, you have an outage. :(
PP
So, let's count down the failures:
- VMs were axed
- Backup vaults were not available
- Azure Site Recovery lost failover ability
- Azure Scheduler and Functions dropped jobs
- Azure Monitor and Data Factory experienced pipeline errors
- Azure Stream Analytics went on the fritz
Apart from that, the Cloud is marvelous, never fails you and you can always access your data.
Except when it FUBARs and no backup is working any more, but the salespeople will never tell you that.
"the Cloud is marvelous, never fails you and you can always access your data."
That's a strawman - you're attacking a claim nobody actually made just so you can knock it down.
The cloud doesn't guarantee anything except possible failure, and you are massively encouraged to architect your systems against failure. High-availability systems across availability zones, backup systems in different geographic regions.
The people highest on their horse on this page against the cloud are the people who know the least. How infuriating!
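To put the "architect against failure" point in concrete terms, here is a minimal sketch of region failover: try a primary endpoint, fall back to a secondary. The region names and URLs are made up, and the transport is injected as a callable so only the failover logic is shown.

```python
# Hypothetical regional endpoints -- names/URLs are illustrative only.
ENDPOINTS = [
    "https://eu-north.example.com/api",  # primary region (assumed)
    "https://eu-west.example.com/api",   # secondary region (assumed)
]

def fetch_with_failover(endpoints, fetch):
    """Return the first successful fetch; on failure, try the next region."""
    last_error = None
    for url in endpoints:
        try:
            return fetch(url)
        except OSError as err:  # ConnectionError and friends subclass OSError
            last_error = err    # this region is down; fall through to the next
    raise RuntimeError(f"all regions failed: {last_error}")
```

Injecting `fetch` also makes the failover path trivially testable without taking a real region down, which is exactly the kind of testing the thread is arguing about.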
Most likely those folks know that architecting for failure in the cloud is a pretty rare thing - just look at how many customers have outages when the cloud goes down.
Hell, I have seen developers complain about TCP connections being dropped during an LB failover (takes about 1 second) because their app couldn't even handle that without a restart. And this is for a new application stack, not something designed 10 or 15 years ago. I could go on and on with other real scenarios easily.
Building apps with single points of failure is very common still.
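For what it's worth, surviving that one-second LB failover is a small amount of client code. A minimal sketch (the `connect` and `operation` callables are stand-ins for whatever client library is in use; retry counts and backoff are made up):

```python
import time

def with_reconnect(operation, connect, retries=3, backoff=0.5):
    """Run operation(conn); on a dropped connection, reconnect and retry.

    A one-second LB failover should cost one retry, not an app restart.
    """
    conn = connect()
    for attempt in range(retries + 1):
        try:
            return operation(conn)
        except ConnectionError:
            if attempt == retries:
                raise                            # out of retries; give up
            time.sleep(backoff * (attempt + 1))  # brief backoff before retrying
            conn = connect()                     # re-establish the connection
```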
I remember, what was it, a decade ago or so: a fire at a data center in Seattle, a facility that had had at least annual power outages for 2 or 3 years prior. The Bing travel site was in that data center. It was down for a long time. Maybe MS got it online before the datacenter came back up with external generator trucks about 40 hours later, not sure (this was a colo facility, not an MS datacenter).
The point is, 10 years ago isn't that long, and a company with the size and resources of MS wasn't willing or able to do it for Bing travel at the time (hell, even I had the foresight to move the company I was with at the time out of that DC 2 years before the big outage), so it doesn't surprise me that companies a fraction of the size still can't figure it out today. It's not as if it's impossible, it is just very difficult to do, and most talk the talk but won't walk the walk when it comes down to it.
Same situation applies to security of applications.
High-availability systems across availability zones, backup systems in different geographic regions.
In Theory, maybe, problem is, Slurp held it wrong, else customers would not have noticed.
What I do not understand is why do people go with AWS or Azure ?
Multiple providers offer OpenStack; you can get service from two or three of them to do ultra-high availability and disaster recovery - same stack, MUCH easier to implement... if you really wanna go cloud, that is. What are the chances of two or three OpenStack vendors failing at the same time, vs AWS or Azure?
The people highest on their horse on this page against the cloud are the people who know the least. How infuriating!
Generalization, not good.
If you back Azure, your opinion does not count.
"same stack, MUCH easier to implement "
You are kidding, right? OpenStack is WAY more complex and fiddly to implement and use than, say, Azure. You have to edit text files to store config for a start - how prehistoric and insecure. For instance, how do you control ACLs for, and audit changes to, just one setting in a text file?!
AC as details
Not working for a huge company, so for us the cloud has one great advantage: the ability to automatically "spin up" additional resources if required (dealing with activity spikes).
Yes, that could be done "on site", but it would mean a lot of (expensive) kit doing very little much of the time, just sitting there waiting for an activity spike.
Other advantages - let's talk Azure here - include the Azure SQL "Point in Time" restore functionality, which removes all that db backup burden, and the geographical replication/failover stuff (which protects against some cloud failures).
If you are a huge company then enough onsite "iron" for those rare peaks is probably viable, and multiple geographically distributed replicating data centres is viable but not for many smaller outfits: Cloud is not perfect, but it's useful for some of us.
The lesson for today is that you should never assume a cloud provider's operations are 100%. I hate having to explain to people why we need to have an instance of our service in more than one region. "But it's so expensive! My cloud salesman assured me that each region is interconnected data centers miles apart and they are nearly incapable of failing!"
It's all just computers and data centers, even if it's very much software-defined and very resilient. If humans and computers are involved, something will eventually go wrong.
It's all just computers and data centers, even if it's very much software-defined and very resilient. If humans and computers are involved, something will eventually go wrong.
Just to simplify things: Murphy's Law applies to everything. Manglement seems to forget that.
This all happened because you lost an AHU? (I'm assuming not all of the AHUs in the data centre were stopped, just some, in the allegedly fire-affected region)
So the rack temperature starts to rise, quite rapidly, because you no longer have moving air, to carry away your excess heat. At what point, do you think, it might be a good idea to have graceful shutdowns of the affected racks, that have lost their conditioned air. You know, triggered by some kind of flow sensor, or a delta-p switch across the AHU fan?
I look after many dozens of air-handling systems. Even with two motors to each fan, and multiple belts, they do break down occasionally.
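The interlock described above can be sketched in a few lines: a delta-p switch across the AHU fan plus a rack inlet temperature sensor feeding a shutdown decision. The thresholds and sensor readings here are purely illustrative, not real data-centre setpoints.

```python
SHUTDOWN_TEMP_C = 45.0   # rack inlet temperature limit (made-up value)
MIN_DELTA_P_PA = 50.0    # below this pressure drop, assume airflow is lost

def should_shut_down(inlet_temp_c, fan_delta_p_pa):
    """True when the rack has lost conditioned air or is already too hot."""
    airflow_lost = fan_delta_p_pa < MIN_DELTA_P_PA
    too_hot = inlet_temp_c >= SHUTDOWN_TEMP_C
    return airflow_lost or too_hot

# A real controller would debounce the sensors and invoke the platform's
# orderly-shutdown hook for the affected racks, not just return a flag.
```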
I think large-scale graceful shutdown in this situation is probably really complicated, as these systems operate as a cluster: as nodes shut down, other things likely kick in to try to restore availability, maybe moving resources to other nodes. At some point you probably have to set a flag in the entire system saying it is down and take it all offline (at which point "graceful", from a customer standpoint, is out the window).
I think this happened during that semi recent big S3 outage.
Not as if these are just racks and racks of standalone web servers with local storage.
First thing I thought of too. As far as failures go, thermal ones are as gentle as failures can possibly get* - they're not instant, and you get a warning they're happening. If your cloud can't even handle that gracefully, what the ever-loving fuck is it good for, exactly...?
* ...well, unless the heatsink itself falls off your CPU. You know, because the retaining bracket snapped. And you only realize it because the fan suddenly snaps to full throttle for no good reason. At which point you remember an old YouTube video you once saw about an AMD CPU frying in milliseconds (the Intel one just throttled way down) due to the exact same cause, and you bash the power switch mightily. Yes, it survived - new bracket, I'm still using it...
This post has been deleted by its author
"It was easier to make up the whole fire extinguisher thing than to own up about the real reason: forced Windows updates borked the cloud, then all the servers got confused uploading telemetry information to themselves"
the sad thing is that at the time of writing this, 14 loons had upvoted the above statement - I suspect only because they dislike Microsoft. Or would they like to share the evidence to back the statement up?
'The problems started when one of Microsoft's data centers was carrying out routine maintenance on fire extinguishing systems, and the workmen accidentally set them off. This released fire suppression gas, and triggered a shutdown of the air con to avoid feeding oxygen to any flames and cut the risk of an inferno spreading via conduits. This lack of cooling, though, knackered nearby powered-up machines, bringing down a "storage scale unit."'
I thought the 'cloud' was immune to failures at a single location. When a VM instance fails at one location, another is started up elsewhere. What happens to 99.999% up time when you have a real fire?
Where is the checklist?
Where is the failsafe switch?
Where is the oversight?
Every DC I worked in had a master control to switch off before work started and everyone was accompanied while working on site to prevent outages.
Seems like it's time to find a new supplier and manager.
The more I think about Achmed's Kebab Shop in Kentish Town, the more I think we're all being fooled.
Not only does Achmed help serve up BT's local OpenZone service (if the shop uses a BT HomeHub), but if the business owner's pc has BitTorrent installed then there is a possibility he is a contributor to a film you may be watching (I'm sure I read somewhere that Microsoft are using BitTorrent techniques to serve up updates since the advent of W10). How do we know that Azure/AWS does not "sub-contract" in a similar way? AFAIK there is no agreement between BT and Achmed as to whether BT can use Achmed's Broadband connection for providing BT's Public WiFi service - BT being a big company y'know. Plus (I'm sure I've said this before), do Azure rent capacity from AWS and vice versa?
Interesting comments on this thread. I wouldn't say I'm against hosting, there have been some very good examples given here of the benefits. However, I think that when moving from on site to hosted the potential issues are not planned for. Multiple geographically spread instances with redundant networking (yours and theirs) should be a minimum requirement you'd think?
We've hosted services with Azure and I'm not aware of them having had outages either, which is good - either not in an affected location or resilient enough to keep going.
This post has been deleted by its author
Every time there is some cock-up with Azure, Microsoft give a load of pathetic assurances that it will not happen again and that they are always improving. In our case, under very specific circumstances, there was data loss on a VM.
If this, or even a far more minor event (a single VM host falling over), had happened in our data centre, there would have been people screaming from the rooftops. But this, because it is in Azure, is just accepted - not even a whisper from upstairs.
On site is constantly under scrutiny and has to provide a far better service and then there are complaints about the cost.
It always seems to be the testing that brings these things down; testing of any kind carries a risk that it won't go to plan and fail over gracefully. Mitigating that risk by not testing isn't an option, and having on-prem doesn't make you immune from a technician/inspector accidentally pushing the big red button.
How certain are you? Cool, go on then, go and set off your fire systems and it'll be like a ballet: watching everything seamlessly and gracefully fail over, migrate and shut down :)
Interesting, we have clients in NE and they were not affected.
I presume the referred-to Microsoft services are all distributed across the data centre, so this event, while disruptive to some, only disrupted a limited number of services for a limited number of clients.
The funny thing is when reading comments or talking to people about Cloud, as soon as they implement a service in Azure they automatically have an expectation of 100% availability and nothing will ever fail.
A lot of this comes back to lack of understanding, if you want availability, you still need to architect your service correctly, even on a public cloud, and this will ultimately increase costs.
Probably? Not a certainty then.
The other point is whether your data had been replicated to that other geo. How can you guarantee that the data you are now looking at through that different conduit is current? And how would you know for certain that it is up to date?
Replication was I believe one of the issues with Lotus Notes. Have lessons been learned?
"The other point is whether your data had been replicated to that other geo. How can you guarantee that the data you are now looking at through that different conduit is current? And how would you know for certain that it is up to date?"
By say using synchronous replication and time stamps as one of several options.
"And how would you know for certain that it is up to date?"
Because transactions will only be committed once replicated to all live sites.
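That commit rule can be sketched very simply: a transaction only reports success once every live replica has acknowledged it. The replica objects here are hypothetical stand-ins for whatever replication transport a real system would use.

```python
def commit(txn, replicas):
    """Apply txn to all live replicas; commit only if every one acknowledges."""
    acks = [replica.apply(txn) for replica in replicas]
    if all(acks):
        for replica in replicas:
            replica.mark_committed(txn)  # safe: everyone has the data
        return True
    return False  # at least one replica failed -- the txn is not durable
```

Real systems add quorums and timeouts on top of this, since waiting for literally every replica stalls commits the moment one site goes dark - which is the trade-off behind synchronous replication.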
"The note about dirty shutdowns indicates that there was no communication between the cooling system and the servers."
Quite probably so. The failsafe is that the servers will shut down at a critical temperature - which is likely a better solution in most cases, as stuff that doesn't get too hot won't shut down.
Shutting a massive cloud system down cleanly in a hurry is simply not likely to be possible in under tens of minutes anyway, so that's likely another reason why they don't do it.
AWS have been running Availability Zones (AZs) for years.
Fully isolated zones (clusters of data centers) with low-latency connections. This simply wouldn't happen in AWS. Microsoft have been in such a rush to expand their footprint that they have not done a great job here - a single isolated event takes out an Azure region. This is terrible.