With clouds come rain
Who'd have thought it?
Oracle is making hay over last weekend's mega six-hour Amazon Web Services (AWS) cloud outage. "You get what you pay for," tweeted Oracle's Phil Dunn, with the caveat that all views are his and don't necessarily reflect those of Oracle. But you get the point. Yes, Amazon's been left with egg on its face and rivals will be …
If it's not AWS, then your second provider isn't sharing. Although since AWS is multi-homed, and your other provider is multi-homed, you may end up using the same network provider.
The issue many forget is the cost involved in shipping data out of AWS to another provider. Very expensive.
If you have, like the author suggests in this article, your compute nodes on AWS and your storage on Google, there is a BIG chance - in fact, almost a certainty - that such a setup will not guard you against any failure. In fact, it's even worse: you will experience an outage whenever _one of the two_ has problems.
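To put that commenter's point in numbers, here is a minimal sketch. The 99.95% figures are illustrative assumptions, not either provider's actual SLA:

```python
# Rough availability arithmetic for the "compute on AWS, storage on
# Google" setup described above. Both figures are made up.

aws = 0.9995     # assumed availability of the compute provider
gcp = 0.9995     # assumed availability of the storage provider

# If the service needs BOTH clouds up (a serial dependency), the
# combined availability is the product of the two, i.e. worse than
# either provider alone.
split = aws * gcp

hours_per_year = 24 * 365
print(f"split across both clouds: {split:.6f} "
      f"(~{(1 - split) * hours_per_year:.1f} h downtime/year)")
```

Multiplying availabilities assumes the two providers fail independently, which (as another comment notes) may not even hold if they share network providers.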
As folks on the other side of the pond have a tendency to say when they really, really agree with someone: This!
These days it is not enough to check whether your "independent" provider is or is not financially in bed with someone you wish to limit your exposure to. You also have to check what systems they share/are dependent upon - even if, as far as the law is concerned, they are completely different companies with no shared ownership.
Ok, let's say you wanna go with two providers... can you mirror shit on Azure and AWS or Google or whatever easily? I doubt it.
OpenStack has many vendors, same tech, same stack - much easier to go with them, and the certainty that you can just "switch over" to another vendor in case of an emergency with one of them.
However, my hands bleed (how many times have I typed this this week?): don't put sensitive data on a cloud, and never EVER put critical data on a cloud.
AWS has egg on its face for violating its availability zones. It will learn and improve and make more money.
Heads will not roll at Netflix because they are making money and getting profitable stuff done.
Dual providers will not improve the cost/reliability ratio. Done right (big if), and at considerable cost, one might increase availability by a few hours every few years (the author's own numbers). It ain't worth it for movies, music, games, or shopping. If you're running a bank or brokerage, well, good for you.
Just a year ago, their 'CRM on the Cloud' product had an outage:
But hey! If I roll my own Oracle then I at least can guarantee uptime...
From a couple of years back... Primary database problems lead to data loss for Salesforce customers.... (Salesforce use Oracle for their primary RDBMS)
What can we learn from this?
1) Cloud versus non-cloud is not a discussion of failure versus reliability
2) People jumping with glee on the misfortune of a competitor rarely draw attention to their own mistakes.
3) It is formally impossible to guarantee 100% uptime, and *any* architecture is subject to changing conditions which might increase the chances of failure... without anyone noticing till it's too late.
Let's face it, Netflix and such are luxury, non-essential services. If they are offline, someone can't watch a movie!! Let's get real... that's not life-threatening. Some gobshite has to get off the couch or change a channel. A few hours offline?? Not a biggie really. How many people are going to cancel a subscription for that? A very, very small percentage really.
So the decision is based on cost of dual setup vs number of cancellations due to a few hours outage.
My bet is that it's cheaper to stay as they are.
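That bet can be sketched as a back-of-the-envelope expected-cost comparison. Every number below is invented for illustration; plug in your own:

```python
# Hedged sketch of "cost of dual setup vs cancellations" from above.
# All figures are made-up assumptions, not Netflix's actual numbers.

subscribers = 50_000_000
monthly_fee = 10.0                 # $/subscriber (assumed)
cancel_rate_per_outage = 0.0005    # fraction who quit after an outage (assumed)
outages_per_year = 1               # assumed

# Yearly revenue lost to outage-driven cancellations:
lost = (subscribers * cancel_rate_per_outage * outages_per_year
        * monthly_fee * 12)

dual_cloud_cost = 20_000_000       # assumed extra yearly cost of a dual setup

print(f"lost revenue: ${lost:,.0f} vs dual-cloud cost: ${dual_cloud_cost:,.0f}")
```

With these (invented) inputs, the lost revenue is well under the dual-cloud cost, which is exactly the commenter's bet.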
Big difference between Netflix offline vs PayPal/WorldPay/Sage/Salesforce, which businesses rely on.
Let's not make mountains out of molehills over not being able to watch a TV show for a few hours.
Reality is, clients want a cheap service, so the company providing it has to keep costs low to match. If everyone voted that they would pay Netflix $/£/€150 a month, then yeah, you would expect them to have a better setup.
Will someone not think of the children!!
If you are doing stuff at scales similar to Netflix, you can't just easily run with two clouds. Each cloud has its own APIs (loads of them, to be precise), certain things you can or cannot do, or which you have to engineer just a little bit differently due to unsupported features on one side, which exist on the other, and vice versa. The same would be true if you had two independent providers both running on OpenStack, for example, because OpenStack is still a rather messy affair, and which features a provider offers is at their discretion.
So in theory running dual cloud is all dandy. In practice it will also increase costs, because your deployment process, sometimes even the application design, will differ significantly. Not to mention the additional logic of keeping them in sync, making sure that fail-over (and recovery!) between entire clouds is handled correctly. It's really not quite as straightforward in practice.
(And as others pointed out, splitting parts of the application infrastructure between two clouds actually at least doubles your chance for error. What you do want to do is *duplicate* it.)
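The split-versus-duplicate point can be put in numbers. A minimal sketch, with a made-up round figure for a single cloud's availability:

```python
# Splitting an app across two clouds (both must be up) vs duplicating
# it (either suffices). The 99.9% figure is an illustrative assumption.

a = 0.999  # assumed availability of a single cloud

split = a * a                # both clouds must be up -> worse than one
duplicate = 1 - (1 - a)**2   # either cloud suffices -> better than one

print(f"single: {a}, split: {split:.6f}, duplicate: {duplicate:.6f}")
```

The duplication figure assumes the two clouds fail independently; shared dependencies (networks, DNS, upstream providers) erode that gain in practice.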
It's a business decision whether increasing your costs by a fair margin upfront is the better option (to be protected against such failures), or if you just accept that even in clouds like AWS shit can happen -- albeit not very often and so far never on a global scale (they do have several DC locations world-wide, three alone in the US if I'm not mistaken; so there's your same-API disaster plan, if you really want one, without increasing your costs).
I guess we all know what most businesses (or their bean counters) would opt for: take the risk and reduce your costs from the onset.
This is still much better than running your own data centers and getting bogged down with Infrastructure and Network teams that block progress and innovation. If Netflix had such legacy Network Teams, they would still be trying to talk the Network team into opening up firewalls for port 443.
"Despite the embarrassment of Netflix this time around..."
"Heads should roll at Netflix for its over-dependence on AWS. Increasingly, its status as an all-in-one-AWS pioneer is hurting it."
I'm not sure where you're getting this info on "embarrassment" and such... According to Netflix's own blog post from 9/25 (which you conveniently didn't link to) they experienced a "brief availability blip in the affected Region, but [they] sidestepped any significant impact..."
Doesn't sound like anyone over there is regretting their decision to move to AWS.
This article seems contradictory to me (or at least confusing). On the one hand you take the time to say...
"The best way to avoid going dark is to architect your service to fail over to different nodes within a region. Even better, different regions. The AWS outage was centered on the giant's US-East region – it has eight others across the planet."
Yes, if you had proper redundant services in other regions you should have been "pretty much okay", If you did not take the time to build in proper fault tolerance you get to suffer (when is that not the case?).
The contradictory or confusing part is at the very end... "Heads should roll at Netflix for its over-dependence on AWS. Increasingly, its status as an all-in-one-AWS pioneer is hurting it." This could be more accurately stated as: "Heads should roll at Netflix if they in fact suffered a meaningful outage as a result of not spinning up redundant services in a different AWS region; they are increasingly seen as Amazon's reference customer and should definitely be making use of the best practices Amazon has described for years to increase reliability." But I guess that's not such a fun way to sign off, is it?
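For what it's worth, the multi-region redundancy being discussed can be sketched client-side like this. The endpoints are hypothetical; real deployments usually handle this with health-checked DNS or a load balancer rather than in the client:

```python
# Minimal client-side region failover sketch: try a primary region
# first, fall back to others. Endpoint URLs are hypothetical.
import urllib.request
import urllib.error

REGION_ENDPOINTS = [
    "https://us-east.example.com/health",   # primary (hypothetical)
    "https://us-west.example.com/health",   # fallback (hypothetical)
    "https://eu-west.example.com/health",   # fallback (hypothetical)
]

def first_healthy(endpoints, timeout=2.0):
    """Return the first endpoint that answers 200, or None if all fail."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except (urllib.error.URLError, OSError):
            continue  # region unreachable, try the next one
    return None
```

Client-side retry like this only papers over an outage; the data and state behind each endpoint still has to be replicated across regions for the failover to mean anything.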
This isn't a good reason to give up on outsourced IT. Yeah sure it's annoying when (very very rarely) AWS goes down. However, if you think you can do better with a homebrew solution you're deluding yourself.
Trust in the stats, and if you can't beat 99.999% uptime, stay with someone who can. It's true that 20 minutes of downtime when all you can do is refresh the status page may seem like forever to you, while a couple of hours executing your own emergency restore might fly by. However, no-one gives a crap how long it feels to you. They care about how much business is lost, and rightly so.
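For scale, here is what those availability figures translate to as a yearly downtime budget (simple arithmetic, nothing provider-specific):

```python
# Yearly downtime allowed at each availability level.

minutes_per_year = 365 * 24 * 60  # 525,600

for label, avail in [("three nines", 0.999),
                     ("four nines", 0.9999),
                     ("five nines", 0.99999)]:
    budget = (1 - avail) * minutes_per_year
    print(f"{label}: {budget:.1f} minutes of downtime per year")
```

Five nines allows only about five minutes a year, which is why a six-hour outage against a 99.999% promise is such a talking point.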
Obvious exceptions: reactor control software etc., or basically any other situation where your company's area of expertise is reliability and hardware (unlike Netflix, whose area of expertise is negotiating with studios).
"...having convinced themselves they are the companies who know best"
Well, good luck with outsourcing the business itself and all the mitigation of operational risk to someone who knows nothing about your business (in the case of Netflix and others). This in itself underpins the 'good luck' service that customers can (and do) expect of technology services. Yes it's cool and yes it works and yes it's cheap, but much like life insurance you don't 'get' any value from it.
All the cloud providers effectively own the risk of a huge and growing number of services (some commercial, some public). So you can expect issues like the ones outlined here to begin impacting otherwise rock-solid government services like 'UK Passports', which are heading the same way.
Like most clouds, there one minute, gone the next.
Biting the hand that feeds IT © 1998–2019