If you’re thinking about heading to the cloud for über-reliability and an environment in which anything that happens to hardware is someone else’s problem, think again: Amazon Web Services sometimes replaces the hardware its virtual servers run on, and switches those servers off without elegant or accurate notification of what’s …
love that built to fail mantra
I haven't had to rebuild or reinstall a single server since moving off EC2. After two years of using it, EC2 was by far the most frustrating experience of my career.
I'd wager more than 98% of operators out there are not equipped to handle such a built-to-fail system, because they've never had to operate one themselves: they just repair and keep going, or, in the VM world, move the VMs to another host (perhaps automatically).
This sort of behaviour can be something of a culture shock. Contrary to what might be considered common belief, it actually takes far more skill to operate effectively in EC2 than on your own infrastructure (or on another cloud provider that uses more intelligent infrastructure).
I loved this article from El Reg:
It was (and still is) a masterpiece.
I knew EC2 was going to be a nightmare even before I started using it; it turned out to be far worse than I expected.
Built to fail makes sense at very large scale, but at smaller scales the overhead makes no sense. Most places are small scale (say 500-1,000 servers or fewer). Amazon should stop trying to shoehorn its operating model onto its customers, but I'm not holding my breath on that one.
I've worked at nothing but internet-facing software companies for the past 10 years, and nobody has ever really built software that could live in a built-to-fail environment (i.e. no single points of failure). Companies would rather focus on developing features for their customers than worry about making the application resilient to failure, and the level of automation required to handle such situations gracefully is extremely complex.
Re: love that built to fail mantra
One more thing (I had to verify this just to make sure): if your instance is "stopped" and you start it again, you get a new network configuration (IP address, hostname, etc.).
Some folks use Elastic IPs (these are somewhat static, though you pay extra for allocated IPs you don't use), but an Elastic IP only applies to the external interface, not the internal one. For most applications, if your servers talk to each other within the cloud you may want to use the internal IP: performance is probably better, and it may be cheaper, since you might incur bandwidth fees when talking over the external interfaces (I'm not certain). In any case, you can't preserve the internal IPs.
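Since the internal address can change on every stop/start, one common pattern is to discover the current one at boot from the instance metadata service rather than hard-coding it anywhere. A minimal sketch (the metadata endpoint is the real one; the surrounding structure is illustrative):

```python
import ipaddress
import urllib.request

# Real EC2 instance metadata endpoint, reachable only from inside an instance.
METADATA_URL = "http://169.254.169.254/latest/meta-data/local-ipv4"

def looks_private(ip: str) -> bool:
    """True if `ip` is a private/reserved address, as EC2 internal IPs are."""
    return ipaddress.ip_address(ip).is_private

def current_private_ip(timeout: float = 2.0) -> str:
    """Fetch this instance's *current* internal IP from the metadata service.

    Call this at boot and re-register the result with your peers,
    instead of baking the address into config files.
    """
    with urllib.request.urlopen(METADATA_URL, timeout=timeout) as resp:
        ip = resp.read().decode().strip()
    if not looks_private(ip):
        raise RuntimeError("unexpected non-private address: " + ip)
    return ip
```

The point is simply that nothing on the box, or on its peers, should assume the internal IP survives a stop/start cycle.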
Which reminds me: EC2 has got to be the largest waste of IPv4 addresses out there, since every instance gets its own external IP, when I'd wager that in 99% of cases it isn't required.
I could go on all day, but I feel like one of those people who were sounding the alarm right before the housing bust while nobody listened. It gets tiring.
Re: love that built to fail mantra
The way to work around having a dynamic internal IP address is for your VMs to communicate using the *external* DNS records of your Elastic IP: when they're resolved inside EC2, they point to the VM's current private/internal IP.
See here for more details:
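A rough way to see that split-horizon behaviour in action: resolve the instance's public DNS name and check which side of the fence the answer lands on. In this sketch the resolver is injectable so the logic can be exercised without touching the network; the hostname in the comment is hypothetical.

```python
import ipaddress
import socket

def resolve_side(public_dns_name, resolver=socket.gethostbyname):
    """Resolve an instance's *public* DNS name and report whether the
    answer is its internal (private) or external (public) address.

    Inside EC2, Amazon's split-horizon DNS returns the private IP,
    so peer-to-peer traffic stays on internal addresses automatically.
    """
    ip = resolver(public_dns_name)
    side = "internal" if ipaddress.ip_address(ip).is_private else "external"
    return ip, side

# e.g. resolve_side("ec2-....compute-1.amazonaws.com")  # hypothetical name
```

Run from a laptop you'd see "external"; run from another instance in the same region you'd see "internal", which is exactly why addressing peers by public DNS name works.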
Nothing to see here
Um, this is part of what cloud means, at least in AWS's vision of it. If you have something to "repair" when an instance goes away, you are doing it wrong.
Re: Nothing to see here
Exactly - you should have bought elastic storage, put a copy of your data on elastic ice and elasticated yourself in all other relevant elastic paraphernalia.
Oh, did that just double your estimate of how much running your business on EC2 would cost? Tough luck.
We finally moved our AWS instances to a colo and couldn't be happier: half the cost, 6x the performance, 1/30th the latency and incomparably higher MTBF. Back when I had 80 virtual servers with them, they would fail about twice a week. In comparison, scheduled outage notices were much rarer (less than 10% of cases) and thus lost in the noise of EC2's general crappiness.
If this bothers you, you don't understand EC2
As Werner Vogels (Amazon's CTO and AWS's CEO) is oft-quoted as saying: "Everything fails, all the time." And as "ysth" notes above, your infrastructure should be designed to be invisibly self-mending on failure. In fact, some AWS customers deliberately inflict damage on theirs - Netflix's "Chaos Monkey" and "Simian Army" kill instances and other infrastructure components at random, on the live system, to maintain a constant assurance of "failover" protection.
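The idea behind Chaos Monkey can be sketched in a few lines. To be clear, this is not Netflix's code, just a toy illustration; `terminate` stands in for whatever API call actually kills the instance.

```python
import random

def chaos_monkey(fleet, terminate, rng=random):
    """Pick one instance at random from `fleet` and terminate it.

    `fleet` is a list of instance IDs; `terminate` is a callback
    (in real life it would wrap an EC2 TerminateInstances call).
    Returns the victim's ID, or None if the fleet is empty.
    """
    if not fleet:
        return None
    victim = rng.choice(fleet)
    terminate(victim)
    return victim
```

Running something like this on a schedule against the live fleet keeps everyone honest: if killing one random instance ever causes a user-visible outage, the architecture, not the monkey, is at fault.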
The whole AWS ecology is designed to support fault-tolerant systems. If you use multiple Availability Zones properly and a whole AWS datacentre dies, another takes the strain, invisibly to your users. You don't hand-craft running systems: you create machine-image templates backed by Elastic Block Store, so that if an instance stops, a clone can fill its shoes. You don't store critical data locally on a running instance's "disk": you keep it in Simple Storage Service, DynamoDB or the Relational Database Service, where it's a lot safer and backed up. And you don't run one big server instance when you can run several smaller ones behind an Elastic Load Balancer with Auto Scaling to tailor supply to demand (including shutting down instances when things are quiet).
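The "several small instances behind a load balancer" pattern boils down to a simple capacity rule. The function below is an illustrative sketch of such a policy, not AWS's actual Auto Scaling configuration language; the floor and ceiling values are made up.

```python
import math

def desired_capacity(current_load, per_instance_capacity, min_n=2, max_n=20):
    """How many instances to run: enough to serve the current load,
    but never below a redundancy floor (min_n, so one failure is
    survivable) or above a cost ceiling (max_n)."""
    need = math.ceil(current_load / per_instance_capacity)
    return max(min_n, min(max_n, need))
```

Evaluated periodically against a load metric, this is the whole trick: scale out under demand, scale in when things are quiet, and never dip below the redundancy you need to shrug off an instance death.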
Yes, "Voland's right hand", that does mean extra cost over and above the basic instance requirements. But compared to the cost of running your own machine room (a single point of failure), it's still dirt cheap.
If you don't like the PAYG prices, bid on the Spot Market for EC2 instances. Bid a huge multiple of the usual PAYG price, and your instances won't get killed if the spot price spikes briefly, but most of the time you'll pay far less than the usual pricing, usually 50% or less.
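The economics of that bidding strategy are easy to simulate. A toy model (prices in cents per hour, numbers invented): the instance is reclaimed the first hour the spot price exceeds the bid, and until then you are charged the going spot price, never your higher bid.

```python
def spot_outcome(bid_cents, hourly_spot_cents):
    """Simulate one spot instance against a series of hourly spot prices.

    Returns (hours_run, total_cost_cents). The instance is killed the
    first hour the spot price exceeds the bid; each surviving hour is
    billed at the spot price, not the bid.
    """
    hours = cost = 0
    for price in hourly_spot_cents:
        if price > bid_cents:
            break  # outbid: instance reclaimed
        hours += 1
        cost += price
    return hours, cost
```

With a modest 50¢ bid a brief spike to 60¢ kills the instance, while a deliberately huge bid rides out the spike yet still pays only the spot rate for the quiet hours, which is the point the comment above is making.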
The thing is, if you're sensitive to single instance failure, you haven't designed your architecture correctly: it's not how it's supposed to be used. The AWS site has a vast amount of informative literature, covering everything from security to stability, explaining how to make best - and most economic, note - use of their infrastructure.
Bottom line: if you're happy with a single point (or location) of failure, put your server instances into a colo. If the colo's power supply, or net connection, is broken, or if the colo company goes bust and "forgets" to pay the bills, you're on your own without a fallback.
Instance MTBF should be irrelevant: it's the whole server infrastructure MTBF (including data) that's the only important statistic. Get that right, as have Netflix, NASA and a whole bunch of other AWS super-users, and you've got it cracked.
The thing that bugs me about this, be it Azure, AWS or AppEngine, is that engineering for failure means running more overhead (ideally distributed across multiple locations), and while it's usually still cheaper in the cloud, it's really hard to predict costs with the pricing models all of these guys have in place.
I've worked on a couple of projects recently where the customer didn't really know what their traffic and usage would look like (both new ventures), so in both cases we budgeted for success (i.e. took their estimates, modelled the compute load, and then doubled everything). One of them is paying about 30% of what we budgeted, partly thanks to offloading a lot of dynamic content to static pre-published pages and caching better (and they're actually above their revenue-per-user estimates). The other is about 10% over budget because the usage pattern doesn't fit nicely into that platform's billing model, but now that we've modelled and observed for a couple of months, we're tweaking the architecture to bring costs down.
In both cases, though, the random restarts and disappearances of instances have been problematic, and they required a lot of up-front investment in architecture and design that was hard to explain.
Hopefully, as cloud computing matures, the "design for failure" mantra will become easier to build around, with frameworks and toolkits becoming more readily available and better understood. Until then, a bit of up-front effort pays dividends in the long run.
am I missing something?
I thought the whole point of putting your biz/server requirements in the cloud was near-100% uptime, yet this article suggests that servers can go Pete Tong more regularly than people think.
If this is the case, I'm all for physical servers; at least a ticking HD gives you good warning that remedial action is required.
@Tezfair. If you are missing something, then I am missing the exact same thing.
This does not sound like what "cloud" is usually described as to me. It appears to be a layer of virtualisation over server resources that you still buy piecemeal, and that fails more often than if you'd just bought some servers.
I thought the requirement for (and, in most cases, lack of) software that can ride out failures by itself was one of the reasons the little guys were invited to put their stuff in the cloud: the cloud handles outages by shuffling stuff around as needed. Not so, it seems, though admittedly I speak without much knowledge or experience.
Makes no sense IMHO. It also explains why mysqueezebox.com has been so flaky in recent weeks, even though it's "in the cloud".