Netflix is pointing the finger at Amazon’s cloud for the terribly-timed outage that left millions of US customers unable to access the service on Christmas Eve. Amazon’s ‘Elastic Load Balancing’ snapped, affecting a number of Amazon Web Services customers in the USA – with Netflix among them and, by extension, more than 20 …
You just can't buy advertising like that.
And YOU want to tryst your data with the cloud?
Castles in the clouds
And YOU want to trust your data with the cloud?
Trust and trustworthiness are only related in reality.
And YOU want to tryst your data with the cloud?
Nice Freudian slip there!
this could never happen in the cloud
oh , hold on....
Re: this could never happen in the cloud
Very, very succinct observation.
Constant availability and such buzzwords are just that: words. Cloud without data is useless and data and interconnects are subject to laws of physics (and we're talking tens or hundreds of PB for a locality with cloud providers). And even with cloud you have to know what you are doing. Exactly.
The difference between cloud and your own datacenter is just the fact that you outsource the IT service at a particular level (infrastructure, platform or software). You yourself are responsible for the design of all higher-level services built from those cloudy building blocks (and for all built-in redundancies and all SLAs), so Netflix actually IS responsible for the outage, not Amazon*. They didn't construct their service according to reality; they had a SPOF in their path. So it is their fault.
Everything else is BS.
*) if a non-clustered server fails in your datacenter and your service to customers is unavailable as a result, you alone are responsible, not the server vendor. You failed your due diligence.
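To put rough numbers on the SPOF argument, here is a minimal sketch of the usual availability arithmetic. The figures (a 99.9% load balancer, 99% app servers) are made up for illustration, not Amazon's or Netflix's real numbers, and the replicas are assumed to fail independently, which a region-wide outage would violate.

```python
from math import prod

def serial(*availabilities):
    """Every component in the path must be up, so availabilities multiply."""
    return prod(availabilities)

def redundant(availability, copies):
    """Up if at least one of `copies` independent replicas is up."""
    return 1 - (1 - availability) ** copies

# Hypothetical figures, chosen only to illustrate the shape of the result.
app_tier = redundant(0.99, 3)        # three app servers behind the balancer
single_lb = 0.999                    # one load balancer: the SPOF

with_spof = serial(single_lb, app_tier)
without_spof = serial(redundant(single_lb, 2), app_tier)

print(f"single load balancer: {with_spof:.6f}")
print(f"redundant balancers:  {without_spof:.6f}")
```

However many app servers sit behind it, the path can never be more available than the single load balancer in front of them.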
Um nope. This was Layer 3 stuff. It belongs to Amazon.
Re: this could never happen in the cloud
I love that argument - blame the customer not the provider. You didn't build your app so it could handle the cloud outage so it's your fault.
I'm trying to think of another scenario in IT where that is the case but am struggling to do so. The closest I can come is to blame the management that chose a service provider that is not high quality, perhaps because they are cheap bastards and only looked at price (which also comes into play with Amazon, since they are competitive on price).
I believe Netflix, more than most, has done a ton of work to try to deal with this built to fail model. They have even released this 'chaos monkey' app to randomly kill shit. I wonder if at some point they say fuckit and go entirely on their own. I suspect not until there is an IT management change. They got their heads buried up their own asses, relying on a competitor for their critical services.
Unfortunately the vast majority of people looking to use cloud, especially those that use Amazon, have absolutely no idea that this is the case. I have worked at two organizations in a row that designed their apps from day 1 to run in Amazon, and neither of them did a wink of work to deal with the built to fail model. In both cases they had many single points of failure and inflexible configurations, and were sensitive to things that you can't take for granted in EC2. Fortunately the most recent one moved out almost a year ago (was in EC2 just a few months for production). Oh my god how much better it has been since from a technical standpoint, not to mention the really quick ROI. I have more than 2 years of experience in EC2 and related services and it was an absolute nightmare. Even IF the app was cloud designed, the level of service EC2 provides is so terrible and so ass backwards I wouldn't want to use it anyways.
It's like living in the 90s with APIs bolted on.
* No pooling of resources??
* No billing based on what you USE (based on what you PROVISION)??
* Fixed instance sizes ??
* No live migration of VMs off of degraded servers???
* No thin provisioning???
* Persistent networking configuration ????
* How about booting a VM with an ISO image ??????
I wrote a document about 1.5 yrs ago "reasons I wouldn't use EC2 even if it was free", it was 4 pages long of ~12 point text.
Things enterprise IT has had for more than half a decade show no sign of showing up in Amazon any time soon.
All of the other software development companies are similar- none had apps that were built for the built to fail model. It's just not a priority. They'd rather develop features that customers want than make their software resilient to failure. That is the smart way to go, unless you have nothing better to do than to make things globally distributed. The amount of overhead in such designs is terribly huge as well.
The best solution at a small scale does not equate to the best solution at a massive scale. Unfortunately for Amazon customers you are forced to design for massive scale regardless. If you do not then you get massive pain & frustration.
If more customers realized this they would not be using Amazon to begin with, so I do my best to educate folks wherever possible.
This is frankly irrelevant. I understand very well that the onus is *technically* with Amazon, but that's not the point. These guys (Netflix) are not Joe Public, they should've known better.
Your end customer doesn't give <censored> about which of your suppliers YOU'VE CHOSEN in order to lower your own expenses is to blame*. When a company A outsources something to some offshore service company B and as a result customer service tanks, it is a failure of A's management (be it on a technical or financial level) and, as a result, of the whole of A from the customer's POV.
*) "flexibility", "on demand" etc. are only different names for expenses in various forms (incl. depreciation) and financial risk management
It's not exactly like that. Often it's: CIO of company A outsources to company B, collects bonus based on estimated savings, immediately resigns, new CIO assumes position, realizes he's totally screwed. The outsource is a fiasco, customers are complaining, and the cost to get out of the contract and hire and train a new set of workers is prohibitive. In the meantime, investors are asking "where are our savings?" and when the situation is explained to them, they retort "the previous CIO assured us that this would make us loads of cash. You must not be doing it right." New CIO looks like an idiot through no fault of his own.
Although management covered it up, it was already looking like the outsource had destroyed decades of experience and competence at my company and we were starting to fail badly by the time the CIO quit. When the company tried to promote someone into the position, there were no takers. A year later, the position is still not filled.
The point being, this is a situation where one person or a very few people in the company can lead an initiative to do something profoundly stupid that ultimately wrecks the company. And then, get out before the fall. It's not necessarily the company's fault.
Re: this could never happen in the cloud
Anyone who does their research knows that Amazon's cloud has LOTS of serious outages that turn into total outages for your service.
I am designing a service and have it structured so I use two small clouds, neither Amazon, neither with a major outage to date.
So, yes, if you put all your eggs in one basket and didn't do your research on how many show stopping outages Amazon had, it's your fault.
You say your company went with Amazon. Did they not Google for the major outages in the past year? Did they not view a comparison table or a detailed report on these outages? If they had done any of these things they would know not to use Amazon, as its failures are often catastrophic and ridiculously frequent. Go look at Heroku's reports on the issue, or the Netflix blog.
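For what "two clouds" can mean in practice, here is a minimal sketch of trying providers in order and falling back when one is down. The provider names and handler functions are hypothetical stand-ins; a real setup would also need data replicated to both sides and DNS or health-check based switching, which this does not show.

```python
def call_with_failover(providers, request):
    """Try each (name, handler) pair in order; return the first success."""
    errors = []
    for name, handler in providers:
        try:
            return name, handler(request)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append((name, exc))
    raise RuntimeError(f"all providers failed: {errors}")

# Hypothetical stand-ins for two independent cloud providers.
def cloud_a(request):
    raise ConnectionError("cloud A is down")  # simulated outage

def cloud_b(request):
    return f"served {request}"

result = call_with_failover(
    [("cloud-a", cloud_a), ("cloud-b", cloud_b)],
    "/index",
)
print(result)
```

The ordering makes cloud A the primary; the second provider only sees traffic once the first has failed.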
To paraphrase Homer Simpson
"The Cloud, the answer to, and cause of, most modern data problems."
Great way to generate revenue for your own film service.
Unlikely, I think.
A substantial part of Amazon's infrastructure is committed to supporting Netflix, an extremely high profile customer of theirs - AWS cites them all the time. The small amount of customer "churn" to Prime Movies resulting from the Christmas Eve outage would be wiped out many thousands of times over if Netflix decided to migrate to a different infrastructure provider as a result.
Also, consider the impact on the business. Just the outage itself resulted in a 4% loss in Amazon's market value today. The publicity, if Netflix defected, would wreck their share price, and investors don't like that very much. It would lead to Board level changes.
I really don't think Amazon deliberately downed Netflix. Sometimes an outage really is just an outage.
Pass the buck
It might be a faulty Amazon service, but what's the point in telling 'Joe customer' that? I'm no particular Amazon fan, but it smells like someone is trying to pass the buck.
Re: Pass the buck
Maybe not to "pass the buck", but to be honest and forthright about the issue. I have a SaaS running off of Rackspace and have had my share of issues with the cloud, almost ALL of them virtual switch related. In the end, many customers don't care; it is YOU who are at fault, regardless of what the actual issue was.
Do I regret the move to the cloud? In some ways, yes. I love the virtualization and the ability to ramp up horsepower as I need it. Do I TRUST it? Let's say I am cautiously optimistic. As long as there aren't any switch issues that take down internal communication between servers, it runs great and I don't have to run out at O'dark-thirty and drive an hour to kick a server that doesn't respond to anything other than a physical hard boot.
I also like the fact that I can scale as much as I need to without incurring more hardware costs. I also monitor various metrics on the hardware using munin and also paging when something is misbehaving. Rackspace won me over with their excellent support which is stellar, btw, so I plan on sticking with them. I believe what Ronald Reagan said about trust when it came to the Soviets ... Trust, but verify.
Thanks for your honesty.
As you regret the move to the cloud "in some ways" and don't really trust your core IT infrastructure, could you tell us what firm you're part of so we can make an informed decision whether to do business with it? For instance, if you're running my bank account, I'd like to change banks. If you're a marketing firm, not so bothered.
Prior to the move to the cloud I had a half dozen 2U Dell servers running my SaaS in a data center. Great pricing, but I did (and still do) all IT-related functions. I had a 1/2 rack and bandwidth in a semi-secure colo. During the 5 years I was using them, I had one (1) minor network outage and a few unscheduled hardware outages to my own hardware, a couple of which required me to drive an hour to hard cycle one of the boxes (same box every time). I generally trust unto myself and that greatly pisses me off. I have several machines on a load balancer that are constantly synced and an 'emergency' backup in case the load balancer goes tits-up or the internal network goes down. Unfortunately, the hardware was getting old and I didn't want to replace it all for my own virtualized setup, so I looked around and went with Rackspace.
My regret is that Rackspace had some virtual switch issues that really affected performance and in several cases caused outages. It seems like they have cleared things up in their last update and I am cautiously optimistic, but remain vigilant. I run a PostgreSQL 9.1 cluster with no real master/master solution. I opted not to do something like DRBD for the database data due to speed issues, and if I need to change to a new master all I have to do is kick off a script. That has never happened except while testing failover.
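The failover script itself isn't shown here, so this is only a guess at one step such a script might contain: picking the slave that has replayed the most WAL before promoting it. The host names and positions are invented; in a 9.x cluster the positions could come from running pg_last_xlog_replay_location() on each standby.

```python
def parse_lsn(text):
    """Turn a WAL location such as '16/B374D848' into a comparable tuple."""
    high, low = text.split("/")
    return int(high, 16), int(low, 16)

def pick_new_master(standbys):
    """Return the name of the standby with the most advanced replay position."""
    return max(standbys, key=lambda name: parse_lsn(standbys[name]))

# Hypothetical replay positions reported by three standbys.
replay_positions = {
    "slave1": "16/B374D848",
    "slave2": "16/B3750000",
    "slave3": "15/FFFFFFFF",
}

chosen = pick_new_master(replay_positions)
print(chosen)
```

Comparing the two hex halves as numbers matters: a plain string comparison would mis-order positions whose halves have different lengths.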
The bottom line is I believe in trusting unto myself. I'll happily use their hardware resources, but I don't like shared database environments so I have my own. I have several slaves in my DB cluster, have multiple forward-facing machines that are kept in sync and I keep a very close eye on things. Several times I have noticed larger than normal I/O issues that were being caused by someone else sharing the same hardware on one of my VMs. Sucky I/O really affects database performance a LOT and each time I notified Rackspace that they had a problem and they responded immediately, including after the 3rd time in the same morning, in which they disabled the VM of the idiot who was causing the issue until they could prove it was fixed. Their support is AWESOME and one of the main reasons I went with them.
My service is a small business solution geared towards the pet / dog daycare industry. There is a marketing component, but mainly scheduling and tracking pets. Short of the virtual switch issues Rackspace had with their Open Stack cloud, and the couple other instances, I have been more than pleased with their service.
You should be thankful that you chose Rackspace - at least there if you want to go physical or go hybrid they are more than happy to accommodate. About two years ago I worked briefly with a company who had a few SQL servers hosted at Rackspace and said "hey rackspace - give us fusion i/o cards", and they got them..
I believe rackspace also has multiple cloud offerings, the big one is openstack but I think they also have
With amazon it's their way or the highway.
Yeah, I don't regret my decision in that regard. We very well may go back to physical hardware for part of the database cluster (master and one slave) when we go to PostgreSQL 9.3, although we are currently pretty happy with where we are now for the time being.
I would swear that they had a problem earlier this year with the exact same AWS 'region'. Is there some reason that they haven't spent the intervening time implementing a back-up for this type of failure?
"Is there some reason that they haven't spent the intervening time implementing a back-up for this type of failure?"
It would cost money?
Cheaper to let it fail and blame Amazon.
No, it is cloud networking and the concept of SDN (as seen by Amazon) at play here.
That region is where most of their network interconnects (and clusterf***).
Those of us who did network routing protocols and implementations can say - not entirely unexpected. That is what happens when you try to reinvent the wheel in an area which _REALLY_ requires some knowledge of mathematics and not just software mongering even if it is being done by someone like James Hamilton. It will happen again... And again... And again...
GoPro were also impacted; they have a post about it on their Facebook page.
A system failure? Never heard of such a thing...
Live and learn.
This is a good example of where outsourcing is not a good idea. Netflix is relying on a company that competes with them to facilitate their service. There was nothing Netflix could do to resolve the problem.
I've had "business" partners that tried to get me to outsource certain pieces of production when I had a manufacturing company. I never saw the point since we were able to keep our employees busy all of the time and didn't have to rely on another company to meet our goals.
Outsourcing has been a fashionable thing for business executives, but doesn't make much sense in the real world. There is always a certain amount of outside services a company will use such as machining, plating and shipping, but there are many suppliers and switching from one to the next is quick and painless. Having a single source for something critical to your company is always a risk. If that single source is also a competitor, it's time to reconsider if your business is worthwhile.
"If that single source is also a competitor, ..."
That is a very scary thought. Who would let themselves get into that situation?
Think about it this way: When it all goes pear shaped they have an excuse.
Samsung, Sony, LG, HTC, Asus and any other Android vendor.
Any Windows tablet vendors.
Any web startup using Google.
Apple where they rely on Samsung components.
Even major players sometimes decide the risk is worth taking (or don't properly consider it).
It's a matter of business investments.
I think it's one many companies will face going forward, and certainly something I considered when "moving to the cloud".
When Netflix was starting out, they were probably just developers that needed a platform. Amazon was the company they chose (probably because they were the most current) and they had no thoughts of Amazon as a competitor. They were simply a third party by which to deliver their product.
The fact that Amazon now competes with them on their product is reason to consider a switch, but the question is, do you move it to another player? Your next best option is Rackspace, but are you getting the same deal?
Should you set up your own infrastructure? Now you need to invest in the people, equipment, and data-centers to make it happen.
I can still see more benefit for Netflix to stay with Amazon, namely, Netflix isn't going to commit the required resources to operate on their own, no matter how much "better" that may be.
I agree with your last statement, they aren't going to move until their management is replaced. I've seen time and time again the costs of EC2 FAR outweigh the costs of operating your own stuff even after staff overhead. In fact you need MORE expertise to operate in EC2 than you do on traditional enterprise equipment. Mainly because it is difficult to use, lacks features, built to fail, billed based on provisioning rather than utilization, limited training available, support sucks, etc etc... all of these drive costs.
Even when you toss out everything beyond basic EC2 costs and assume EC2 operates with the same level of features and reliability you can get elsewhere, the costs still blow out compared to doing it on your own in most cases.
EC2 and related services are like a roach motel, easy for developers to get in, very hard to get out.
S3 is not bad by contrast, I dislike amazon, but S3 is a halfway decent service which is true pay for utilization, is fairly reliable, costs are reasonable. The main thing S3 lacks of course is automated inter-region replication.
And still it managed to rain on their parade.
Eh-yup. Let's put all of our sensitive, proprietary data out on someone else's data network that's wide open to failure, compromise, and theft. "To the Cloud." More like, "To the moon, Alice."
It's the Netflix Christmas special!!
Starring Jeff Bezos as The Grinch :)
(I think it's extremely unlikely that the outage was an attempt to subvert Netflix in favor of Amazon Prime, but even so, Netflix might want to ask why they are relying on a competitor for key infrastructure when there are lots of other options out there.)
We seem to have an escalating situation where there are Amazon related outages. I wonder if somebody has over-reached themselves.
Oh well, back to my 1949 copy of The Wind in the Willows.
Cloud services are fine
As long as one backs up one's cloud data locally.
My DVDs and Blu-ray Discs worked just fine over Christmas.
Btw it always seems to be the east region. Amazon does need to sort that one out. If that gets more usage then up resources in the centre.
progress at last!
It was most annoying - here of course it was the early hours of Xmas day - and just like Netflix said - it didn't affect ALL devices just some of them. While my Nexus 4 could quite happily connect to Netflix, my Panasonic Freeview HD+ box could not - and I really REALLY wanted to watch some more of the 4400.....
I complained on Twitter and asked about the possibility of a free month - and hey presto, 2 minutes later it was streaming like nothing had happened...... guess I'm not getting a free month :(
You want a month's free service because there was an outage lasting less than 24 hours? Is your life so empty that you couldn't amuse yourself without Netflix for a few hours?
Well, look at that!
It seems that despite the hype of cloud computing and the outsourcing craze, that it still is necessary to sit down and work out the design of your systems, including figuring out how you will avoid compromising your business when the inevitable component failure (cloud provider outage) occurs.
Reality is still in charge.
Reality has teeth.
Reality does not take too kindly to being ignored.
Also, Canadians. This also affected Canadians - or at least this Canadian, in Eastern Canada.
First it prevented me from watching on my WDTV, then I tried my Xbox360 and PS3 with no luck. My Asus Transformer Infinity worked fine for another hour, then that too stopped working. I continued watching on my PC - which eventually got a little wonky, but did keep letting me watch stuff, if after a bit of a longer wait than usual.
(This was the Canadian Netflix, not the US Netflix with the DNS trick.)
same on the wet coast
Wii would not start Netflix. Could look at the website on the pc, could not get account details to come up.
This explains why, on every other movie my wife and I tried to load, it'd get to 7% and then just sit there in graybar limbo. Even more infuriating were the times that a movie load would get to 99% and then freeze up. This couldn't have happened to them at a worse time -- Christmas Eve, when everybody and their cat is wanting to watch It's A Wonderful Life or Miracle On 34th Street or The Waltons' Christmas episode, and all they get is "Loading... 7%..."
We finally were able to get three episodes of The Larry Sanders Show to play before Netflix finally flatlined for the rest of the night.
Once again, I was thankful for my local stash of DVDs and mp4's. My wife used to have a DVD rental account with Netflix but went to their streaming service a couple of years ago and, at last report, still swears by it. Eurgh.
Don't Amazon own Lovefilm?
No conflict of interest there then!
No more than any other sector really...
Apple buying parts from Samsung...
Nikon buying image sensors from Sony...
Conspiracy Theory Anyone?
Maybe Netflix should look for a new provider ehhh.
"...and the social media app Scope"
What idiot at Netflix thought it would be ok to host their service on a competitor's machines? I mean, what did you expect to happen? "Sorry, something broke, your service is not available. But our competing service is fine. What a coincidence."