DaaS
Downtime as a Service.
Continuing Innovation from Microsoft.
Admins in Microsoft's Azure cloud data centre for West Europe in Amsterdam, Netherlands, have spent the morning battling severe problems in the gear that supports Redmond's main cloud service. Problems with the core Compute and Storage components were first reported at 9:39am UTC on Thursday, according to the Windows Azure …
"I thought these wonderful cloud systems were supposed to be highly reliable?"
Nope - circa 99.9% uptime is the quoted norm. Azure has historically been a bit more reliable than, say, Amazon S3, though.
"Like system redundancy, fail safe, multiple copies of data etc etc so if something fails it just keeps going with the remaining working resources?"
That's why Azure has multiple regions - so that you can create applications that are resilient to a local issue.
I thought these wonderful cloud systems were supposed to be highly reliable?
No, they are supposed to be cheaper in capex.
Everyone's comments here are proof that it is possible to build a reliable service on top of an unreliable one: TCP is a reliable service implemented over IP, an unreliable one. The idea of the cloud is that lower capex costs allow you to scale your loads dynamically, letting you provide a reliable service to your users that is built on commodity cloud servers which may individually be unreliable.
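The TCP-over-IP point boils down to a simple client-side pattern: retry the request against multiple replicas until one answers. A minimal sketch in Python (the endpoint names and the `fetch` callback here are purely illustrative, not any real Azure API):

```python
def fetch_with_failover(endpoints, fetch, max_attempts=3):
    """Try each replica in turn; treat any ConnectionError as a
    transient failure of that endpoint and move on to the next."""
    last_error = None
    for _ in range(max_attempts):
        for endpoint in endpoints:
            try:
                return fetch(endpoint)
            except ConnectionError as exc:
                last_error = exc  # this replica is unreliable; try another
    raise RuntimeError("all replicas failed") from last_error

# Hypothetical flaky replica set: only "eu-north" ever answers.
def flaky_fetch(endpoint):
    if endpoint != "eu-north":
        raise ConnectionError(f"{endpoint} is down")
    return f"200 OK from {endpoint}"

print(fetch_with_failover(["eu-west", "eu-north"], flaky_fetch))
# prints "200 OK from eu-north"
```

In the example only "eu-north" is up, so the first attempt against "eu-west" fails and the helper falls through to the healthy replica - the reliable-over-unreliable idea in miniature.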
I've not seen one done right so far, though, and if you are in business long enough, the benefit of lower capex is quickly extinguished by the massive increase in opex.
MS are addicted to complexity. If you use their site, it needs JScript. If you use hotmail it uses an utter ton of JS. If you use their products they tie in together to make a Gordian knot that can't be cut. It's deliberate; put one foot into their garden and they try to tie you down forever. So, they are addicted to complexity. The downside is bugs and failure. I wonder if that's the ultimate root of their cloud problems, as well as their desktop flakiness[*].
[*] Am learning SSAS 2008, just working through tutorials. Just basic stuff, and I've managed to crash it outright once and have had over half a dozen internal errors (for which Google has been much more helpful than MS), which have cost me hours. Utter crap.
(sigh) It's not about JS per se. My point is that it is unnecessary. gmail works fine without it. But they use it by the shovel load for no good reason (UI prettiness isn't a good reason IMO). They're addicted to More, not Simpler, and if I'm right that attitude may have worked its way into their datacentres and is causing them problems. Clear enough?
This post has been deleted by its author
> What are you talking about, gmail.com is very, very heavily loaded up with js.
disable your JS and try it. It works. That's how I use it. Disabling JS on hotmail just redirects you to a page telling you to enable it.
>>>> BUT the point is not JS but complexity. I mentioned JS overuse as a proxy, not the main point. <<<<
We were affected by this outage in West Europe... The RCA report we received is as follows:
Incident Title: Storage and Compute in West Europe - Partial Service Interruption
Service(s) Impacted: Azure Compute (Service Management), IaaS, Azure Service Management, Storage, Azure Web Sites
Incident Start Date and Time: 5/1/2014 2:39:00 AM (Pacific Time)
Date and Time Service was Restored: 5/1/2014 3:40:00 PM (Pacific Time)
Summary
On May 1st, customers may have experienced timeouts or errors with their Compute or Storage services in the West Europe sub-region. The root cause of this interruption was an unexpected power outage during scheduled maintenance in the datacenter.
A set of racks lost power, affecting the compute and storage services running there. Most racks recovered automatically once power was back; however, some needed a chassis reboot to recover. Once mitigation and verification steps had been executed on all clusters, full functionality of all Azure services was restored.
Customer Impact
Customers may have experienced timeouts or errors with their Compute or Storage services in the West Europe sub-region. Storage account creation may have failed during the impacted window.
Affected sub-regions
Region: Europe
Sub-Region: West Europe
Timeline
5/1/2014 02:39 AM PST - The Microsoft Azure team received the first alert of a power outage. The investigation was initiated promptly.
5/1/2014 02:40 AM PST - Power was restored to the impacted racks.
5/1/2014 03:08 AM PST - The majority of services were restored automatically once power was back. The automated repair process (service healing) started repairing offline instances.
5/1/2014 03:40 AM PST - The Microsoft Azure team identified that some racks needed a chassis reboot to recover. Mitigation steps were validated and executed over the following hours.
5/1/2014 11:25 AM PST - All services were fully restored, but the Azure team kept monitoring and verifying that the restoration proceeded as expected.
5/1/2014 03:40 PM PST - The Microsoft Azure team confirmed full recovery of all Microsoft Azure services.
Root Cause
A power outage caused by human error during scheduled maintenance in the datacenter.
Next Steps
We are continuously taking steps to improve the Microsoft Azure Platform and our processes to ensure such incidents do not occur in the future; in this case these include (but are not limited to):
• Improving the validation process during maintenance to prevent human errors.
• Investigating and repairing server hardware that encountered additional reboot failures, working closely with our partners.
• Improving tooling and automation to minimize time to recovery.
We apologize for any inconvenience.
---------------------------------------------------------------------------------------------------------------------------------------------
The work experience boy was allowed into the data centre!
Microsoft were slow to respond and we were without compute and storage services for over 6 hours.
You may be interested to know that geo-failover did not occur. Why not, you say? Isn't that one of the main attractions of the cloud?
Apparently, for a Microsoft Azure data centre, a "major disaster" means a complete data centre going offline. Microsoft felt this incident did not qualify, since it was not a complete data centre outage and the majority of their other worldwide customers were not affected. Because the entirety of the services in the data centre were not affected, the geo-failover process was not invoked.
Our future involvement with Azure will now be very limited. There is no service-level redundancy. Data is copied from one site to another, but you aren't in control of it and you can't access it in the event of a disaster. If you want service-level redundancy in Azure, you need to provision additional services yourself, effectively duplicating all your systems in case an apprentice unplugs a row of server racks. This makes the entire Microsoft Azure offering uneconomic, and we'd be better placed expanding our current data centre, where we are in full control.
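For what it's worth, the do-it-yourself redundancy described above usually reduces to an active/passive pair behind a health check: you run duplicates in a second region and your own logic decides which one to use. A toy sketch of that selection logic (region names and the health probe are hypothetical; real deployments would hang this off DNS failover or a traffic-manager service rather than application code):

```python
def choose_region(regions, is_healthy):
    """Return the first healthy region in priority order.
    is_healthy is a probe callback, e.g. an HTTP health-check."""
    for region in regions:
        if is_healthy(region):
            return region  # active region, or the passive one taking over
    raise RuntimeError("no healthy region available")

# Simulate the May 2014 scenario: West Europe down, North Europe up.
down = {"West Europe"}
primary = choose_region(["West Europe", "North Europe"],
                        lambda r: r not in down)
print(primary)
# prints "North Europe"
```

The catch, as the post points out, is that you pay for the passive copy the whole time, which is exactly the duplicated cost that makes the offering look uneconomic.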