So much for fault injection testing!
" Hey Ravi - run this CLI ... that is what fixed it last time ... "
Amazon has provided the postmortem for Tuesday's AWS S3 meltdown, shedding light on what caused one of its largest cloud facilities to bring a chunk of the web down. In a note today to customers, the tech giant said the storage system was knocked offline by a staffer trying to address a problem with its billing system. …
"an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process," the team wrote in its message.
"Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended."
Wow they make manual command line changes that can impact lots of production systems?! Glad I don't use Amazon then. Such changes should be planned, change controlled, scripted in a file, and 4 eyed before pressing go....
"To guarantee a mess put a human in charge of said computer. Enough said."
And to guarantee a shitstorm of Diluvian proportions, put said barely-technical human in front of some automation they don't really understand - but hey it looked great in the management meeting.
It'll save us loads of money, they said.
It'll guarantee five nines availability, they said.
It's foolproof, they said...
Funny that you mention punchcards: I recently pulled one of my old boxes of code stacks out of the cellar, to let my grand-daughters make the quintessential early-70s craft project: the Punchcard Christmas Wreath.
I had forgotten the joys of card stacks, and the multiple marker and highlighter lines across the top of the deck to help quickly restore the deck if you dropped it.
Good times, good times. . .
The 1970s with their punch cards were good times, a peak in many ways for Canadians, and I'm not talking Fortran WATFOR or WATFIV.
Back then the average family income was about $10,000. That's about $65,000 today, which if you look up family income is still roughly about the middle of family incomes today. No real growth but apparently not much of a set back, until we look at where that income comes from and goes.
In the 1970s family income usually came from a single earner. Today almost all $65K families are at least dual income and, thanks to dramatic changes in Canadian taxes (who pays and how much is collected), they do not get to keep much of that. Even the US numbers show us what good times the past was when it came to growth and optimism.
"Expressed in 1950 dollars, U.S. median household income in 1950 was $4,237. Expenditures came to $3,808. Savings came to $429, or 10 per cent of income. The average new-house price was roughly $7,500 – or less than 200 per cent of income. By 1975, however, it took 300 per cent of median household incomes to buy a house; by 2005, 470 per cent."
Many more years in school and training are required to get a job, and all adults in a family have to work, most at jobs with much longer hours and often no benefits. Today it is almost impossible to get a detached house in a major Canadian city for even 10X the annual income of the average high school graduate.
When I look fondly at punch cards I am reminded that the good times were largely the result of citizens being "allowed" to share in the wealth they were creating.
They dig into this to an extent in the full statement. The command alone wasn't enough to do it. It was running a command designed for a much smaller scale of S3 over too many machines causing a bunch of systems subsequently layered over those machines to mutually screw each other up.
Critically it was the requirement to restart that really screwed them. The system hadn't been restarted in so long no one noticed the restart procedure took a really, really long time. Cheeky little humblebrag, methinks.
They also mention a full audit of existing operations to ensure sanity checks are in place. I for one look forward to the outage caused by being unable to effect a change to as many machines as actually needed, because sod's law's just like that.
Oh dear, that sounds like an event.
Not a process.
Which suggests they will find (and hopefully fix) all such issues this time round, but a whole new bunch will accumulate over time till the next one surfaces and borks them again.
Periodic review following significant (cumulative) changes should be SOP for such a large operation.
"Under what circumstances would you want to be able to (virtually) shut down a whole data centre with one (mis) executed command?"
Ultimately somebody has to have the power to do this because shutting down servers is a valid admin activity. However it should be made a multistep process with plenty of "Are You Sure?" type prompts (or even somehow require 2 people/keys, nuclear missile launch style), not something that can be done with a single mistyped command. In the end it's a balancing act between treating your admins like responsible professionals and not children who need to be hand-held, but also ensuring one tired person can't make an almighty cock-up.
"However it should be made a multistep process with plenty of Are You Sure? type prompts"
Not just "are you sure Y/N", but also "Here's exactly what is about to be done... is that correct and what you actually intended? Y/N", otherwise anyone would just assume the command they'd entered would do what THEY intended, not what the command was about to do.
This will shut down 1040 servers, please type 1040 to continue.
This will reduce capacity enough to cause a service failure for the following 8 services
Please type "8 SERVICE FAILURES" to continue.
My point exactly.
Yes servers have to be taken down. Yes sometimes clusters of servers have to be taken down. But it should be very rare that all need to be taken down at the same time.
And it should be impossible to do so without whoever's doing it realizing exactly what is about to happen.
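For what it's worth, the "type 1040 to continue" idea above is only a few lines of code. This is a rough sketch, not anything Amazon actually runs: the function name `confirm_shutdown`, its parameters, and the service names in the usage note are all made up for illustration. The `ask` parameter defaults to `input` so the prompt can be driven by a script in a test.

```python
def confirm_shutdown(servers, affected_services, ask=input):
    """Return True only if the operator types back the exact impact.

    Hypothetical helper: 'servers' is the list of hosts about to be removed,
    'affected_services' the services that would fail as a result.
    """
    count = len(servers)
    typed = ask(f"This will shut down {count} servers, please type {count} to continue: ")
    if typed.strip() != str(count):
        return False  # a bare "y" or a wrong number is refused

    if affected_services:
        phrase = f"{len(affected_services)} SERVICE FAILURES"
        print("This will cause a service failure for: " + ", ".join(affected_services))
        typed = ask(f'Please type "{phrase}" to continue: ')
        if typed.strip() != phrase:
            return False

    return True
```

Called as `confirm_shutdown(hosts, ["index", "placement"])`, it refuses to proceed unless the operator has read the number and typed it back, which a plain Y/N prompt never forces.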
> Ultimately somebody has to have the power to do this because shutting down servers is a valid admin activity. However it should be made a multistep process with plenty of Are You Sure? type prompts
How about "Please enter the shutdown validation GUID. This can be found on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying 'Beware of the Leopard'."
It's a super awesome convenience to be able to hit tons of machines in a big data center operation, but as you can see things can go wrong in a big way. It would be interesting to see a pseudo-syntax of what happened, whether this was a web GUI, a CLI, a script, what have you. I can tell you at the Yahoo! CEO shuffle I attended a few years back we could address wide swaths of machines, but most of the folks knew what not to do, and how to break up big jobs (ha!) into easy to handle tasks.
For instance, my first task was to run a script that fixed the storage issue with NetApp "moosehead" disks that would cause it to lose data and, the extra cool thing, not be able to recover from their RAID! Good times! This was on over 300 mail "farms", which were middle-tier mail handling clusters that did the sorting of mail vs junk/spam. The spam goes off to cheapo storage, and "good mail" goes to the main stores. Anyway, the IDs needed fixing to point users' mail to the new storage by running a script on close to 6000 machines, no VMs, all pizza boxes. No WAY was I to just go nuts and try and run them all at once, even though you could very well do that with Limo, their internal custom multi-host command tool, later replaced by a tool called Pogo. Clusters of machines could also be addressed with aliases, so I could say "all hosts in a group with a simple name"; turn off the flag to show availability to the VIP.
For the script work I was clued in via change management meetings, then I ran the script on one farm to make sure it worked and that we did not clobber any users, then we did 10 farms, then 100, and the rest (are here on Gilligan's Island!). No problem. My goal was to not cause any issue that would make it into the news. :P I had nothing to do with the security either, which is a big embarrassment to their new owners, I'm sure.
I was also in Search (AKA the Bing Gateway) and there we typically chose UTC midnight on Wednesdays to perform updates to the front end search servers. In the US there were two big data centers, each with two clusters of 110 hosts to handle the web facing search front end. For maintenance, you'd just choose a single host, take it out of the global load balancer, then update it, and drop it back in with extra monitoring turned up. If it did not crap itself, we could then take out half of a data center, do the update, put them back in, then repeat the process three more times for the other clusters, and that was that. But, yes, super easy to fuck up and take out every data center if you don't pay attention to your machine lists.
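The "one farm, then 10, then 100, then the rest" pattern described above could be sketched roughly like this. To be clear, this is not Limo or Pogo, whose interfaces I'm not reproducing here; `staged_batches`, `staged_rollout`, and the `apply`/`healthy` callbacks are hypothetical names, and the health check stands in for whatever monitoring you'd actually watch between stages.

```python
def staged_batches(hosts, stages=(1, 10, 100)):
    """Yield growing batches: first 1 host, then 10, then 100, then the rest."""
    start = 0
    for size in stages:
        if start >= len(hosts):
            return
        yield hosts[start:start + size]
        start += size
    if start < len(hosts):
        yield hosts[start:]


def staged_rollout(hosts, apply, healthy, stages=(1, 10, 100)):
    """Apply a change stage by stage; stop at the first unhealthy batch.

    'apply' and 'healthy' are caller-supplied callables (assumptions here).
    Returns (hosts changed so far, hosts that failed the health check).
    """
    changed = []
    for batch in staged_batches(hosts, stages):
        for host in batch:
            apply(host)
        changed.extend(batch)
        bad = [h for h in batch if not healthy(h)]
        if bad:
            return changed, bad  # halt before touching the next, larger stage
    return changed, []
```

With 300 farms the stages come out as 1, 10, 100 and 189, and a problem caught on the first farm means the other 299 are never touched.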
Quite honestly - if Bing, FB, Google, Yahoo, blah blah - disappeared would they really be missed?
They produce nothing other than hordes of advertising spam. Remember the days before that crap existed... young adults could actually have a face to face conversation, working meant doing something other than browsing the internet for links to share among co-workers...
6000 machines...so run 200 machines at a time for 30 times.
What is this obsession with 10, 100, 2000, rest and doing a massive population in 5 steps?
Even if 2110 machines worked fine how long would it take to fix the last 3900 machines if enough of them broke?
For failures it is not the number of times you have done it before but the size of the failure domain and how long it takes to fix.
It should be possible to roll out automatically in small batches and even have multiple upgrades rolling out at the same time on an automatic schedule, rippling across the farm!
If it is automated and scheduled who cares how many batches of upgrades are run?
You would catch errors with less impact that way as the failed batch size would be smaller and it would be minimal extra work if designed correctly.
This is the next stage in cloud service design - being able to have slower rolling upgrades with smaller batches!
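The failure-domain argument above is easy to put numbers on. A back-of-envelope sketch follows; the helper names are hypothetical, and "worst case" here simply means the largest single batch in the plan, i.e. how many machines you could break in one step.

```python
def fixed_batches(total, batch_size):
    """Batch sizes when rolling out in equal chunks, e.g. 200 at a time."""
    full, rest = divmod(total, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


def staged_plan(total, stages):
    """Batch sizes for a growing plan such as 10, 100, 2000, then the rest."""
    sizes, used = [], 0
    for size in stages:
        size = min(size, total - used)
        if size <= 0:
            break
        sizes.append(size)
        used += size
    if used < total:
        sizes.append(total - used)
    return sizes


def worst_case_failure_domain(plan):
    """Largest number of machines a single bad step could take out."""
    return max(plan) if plan else 0


small = fixed_batches(6000, 200)             # 30 batches of 200
grow = staged_plan(6000, (10, 100, 2000))    # 10, 100, 2000, 3890
print(len(small), worst_case_failure_domain(small))
print(len(grow), worst_case_failure_domain(grow))
```

For 6000 machines the equal-batch plan never risks more than 200 at once, while the growing plan's last step puts 3890 on the line. If the batches are automated and scheduled, the 30 steps cost nothing extra in operator effort.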
When it comes to spotting mistakes, the first guess is probably the correct one - and having had numerous requests for file recovery over the years, the 'extra space' problem is not that rare.
Perhaps oddly it seemed to be more common amongst people who did know what they are doing but didn't stop to re-inspect what they typed to see if they accidentally batted the space bar somewhere.
Though at the other end of the scale, someone trying to follow unfamiliar instructions printed in a poorly-selected font where they have been told 'do this exactly' and it sure as hell looks like that's meant to be a space there...
I did a similar thing about 2 months ago on my Mac while trying to tidy stuff up in the root drive.
sudo rm -rf User*
Hmm, that's taking an awful long time to delete some temporary crap....
CntlC CntlC CnltC
Luckily good old timemachine got me back to an hour before and I had a 'Users' directory again.
I have to say, in 10 years of mac ownership... one of the many many many times timemachine has got me out of a deep deep hole.
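The `User*` story above is the classic case for looking at what a wildcard expands to before handing it to anything destructive. A minimal sketch, assuming you know roughly what you expect the pattern to match; `preview_matches` and `safe_to_delete` are made-up helper names, and nothing here deletes anything.

```python
import glob
import os


def preview_matches(pattern, root="."):
    """Return, sorted, every path under root that the shell-style pattern matches."""
    return sorted(glob.glob(os.path.join(root, pattern)))


def safe_to_delete(pattern, expected, root="."):
    """Compare what the pattern really matches against what you meant to remove.

    Returns (ok, actual_names); ok is True only when the two sets are identical.
    """
    actual = {os.path.basename(p) for p in preview_matches(pattern, root)}
    return actual == set(expected), sorted(actual)
```

Something like `safe_to_delete("User*", ["UserTemp"], root="/")` would come back `False` with `Users` in the list, which is the moment to stop rather than reach for Ctrl-C.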
I also remember one time, about 20 years ago - working for a large UK telecom company...needed to reboot one of the live boxes that handled 30% of the load of UK non geographic phone calls (0845, 0800, etc)...
sudo shutdown now -r
hmm can't seem to connect to that... doesn't seem to be coming back up..
It was in an unmanned exchange 30 miles from the nearest engineer.... had to get one of 'em to go out there, and press the ON button again.
What Amazon left out, and what El Reg didn't mention in their article 12 hours ago, is Availability Zones. You're not supposed to have to go multi-region in order to be able to sustain a major AWS outage. Being in multiple AZs is supposed to allow you to survive a fat finger by an AWS employee.
The fact that Amazon's statement talks so casually about US-EAST-1 S3 makes it clear that there is no segmentation of S3 between AZs. If S3 isn't segmented that probably means other AWS services aren't either. Paid extra for multi-AZ RDS? Added extra EC2 instances for multi-AZ load balancing? It won't help at all if RDS and ELB are administered at the regional level anyway.
I think Amazon has some splaining to do. If their own services aren't redundant across AZs then what is the point of customers paying extra to be in multiple AZs? Is the only independent component of AZs the power source? That is a far cry from Amazon's selling points of multiple AZs.
> We didn't mention AZs because S3 doesn't use availability zones. That's for EC2.
Pretty much every service uses AZs except for S3. RDS, EBS, EFS, ElastiCache, ELB. Maybe S3 doesn't because it was one of their original services. But it's worth asking why they haven't upgraded it yet. If they had, most sites that were affected by the outage would probably have been fine.
Also they're still physically the same datacenter, so susceptible to combinations of backhoes, bad weather, and poorly performing power cutover systems, etc.
Using only one AWS region is a bad idea. Period. In fact, I'd argue (thanks, BGP hijacking!) that using only Amazon services is a bad idea. If that is too difficult to manage for you, then set the appropriate expectations with your business managers and users. Your product is too cheap to support that high of an uptime requirement.
Amazon fails sometimes, Google fails sometimes, Microsoft fails sometimes (and in at least one instance took weeks to restore!)... don't put all your eggs in one basket, people. Don't be that guy.
This whole fiasco is probably a good example of why developers should not be put in charge of the IT systems, no matter how "easy" they are... Operations teams tend to focus like a laser on uptime and stability, while developers are more interested in maximizing new features.
"If that is too difficult to manage for you, then set the appropriate expectations with your business managers and users. Your product is too cheap to support that high of an uptime requirement."
We keep hearing people saying things like this. And we have to keep replying that marketing has set inappropriate expectations with these very people who are the ones who make the decisions. They've been told that cloud (someone else's computer) is cheap and that it's resilient.
"This whole fiasco is probably a good example of why developers should not be put in charge of the IT systems"
To some extent I take objection to this. Back in the day it was possible to be in charge of development and operation and be paranoid about stability and uptime. It encouraged not developing what you knew you couldn't run. Times have changed and not, I think, for the better.
Cloud (someone else's computer) usage is shadow IT, paid for with a company credit card by people who don't see the need for all the costs and time needed for the detailed stuff which enables in-house developers and operations to combine to provide reliable systems. Don't assume either real developers or operations get anywhere near such deployments. Again, sales and marketing by providers have to take some responsibility here.
And whilst you're extolling operations, don't forget it seems to have been Amazon's operations staff who grew fat fingers in this instance.
> Also they're still physically the same datacenter
No, each region is made of many data centers. US-EAST-1 is spread across Northern Virginia.
> Using only one AWS region is a bad idea. Period. In fact, I'd argue (thanks, BGP hijacking!) that using only Amazon services is a bad idea. If that is too difficult to manage for you, then set the appropriate expectations with your business managers and users. Your product is too cheap to support that high of an uptime requirement.
> Amazon fails sometimes, Google fails sometimes, Microsoft fails sometimes (and in at least one instance took weeks to restore!)... don't put all your eggs in one basket, people. Don't be that guy.
Have fun living on your planet where everyone has the budget and time for multi-provider multi-region setups. It's one thing to chide people for not having proper backups or never considering HA, but expecting every site to launch their own satellite to maintain continuity in case the internet fails is pretty pointless.
Those that don't have the budget presumably are spending it on features? That's not about cost, that's about where one believes the revenue is, i.e. features. However, if you're down, your features don't get used. A similar argument can be made for time expended in building HA: you can expend engineering effort once, or support and admin effort every time you're down.
Ultimately, this is about your users. Do you care enough about them to put their fate and yours in another's hands or do you choose to use the available facilities (and if you used a DR style arrangement you could save much of the infrastructure cost until time of need, magic of elasticity) to protect everyone?
No doubt, for a fledgling company the choice has to be features but it should be a knowing choice. Amazon make it clear what needs doing for HA, choosing not to do it is on the respective business owner. For those with a decent paying user base the balance is somewhat different, all about how much you value your reputation. Blaming Amazon for your downfall will be limited consolation for your users. If you fall victim often enough you'll be paying the cost in lost revenue through inaction and support interactions. Alternatively you can pay the cost of moving clouds or developing your HA options.
"Suppose to cure this type of HUMAN fkck ups ?"
In a word, no.
Those tools and the processes they support are for automated testing of changes you plan to roll out, and automated deployment of those changes, hopefully after someone or something has approved them. They make replication of change across many environments simple, including setup of servers, environments and so forth.
The people in question were carrying out triage on a production performance issue. "Infrastructure as code" isn't really that helpful during triage. You usually have to dive in and run commands by hand. In such a situation, if what you are trying to resolve is related to production load and scale, you probably cannot replicate it on-demand in a test environment, even if you'd like to. That, in turn, can mean you can't really usefully test the command you plan to run.
Given the nature of AWS/S3, I'm quite sure the command line entered did something heavily automated at scale, and might well have been executed with their equivalent of something like Chef, but *what* it was told to do was likely derived from the triage efforts. You can bork your production environment just fabulously with the wrong command inputs to a tool like Chef. It will dutifully obey you if the command you give it is legit. (They mention that they will change their definition of what's legit based on this experience.)
I certainly do run what I perceive as "dangerous" commands in test environments before I run them in production, just to make sure I got them right. I can then copy-paste them exactly from dev into prod, at least where the command will be identical in either environment. But if I don't think the command is dangerous, possibly just because I've become used to running it without failure, I could conceivably type it out in full confidence and still screw it up. Triple-checking yourself before you hit "enter" is a matter of experience and, too often, not being over-tired or in a rush.
No. As long as someone uses a CLI he or she *will* make mistakes. Especially when the switches/parameters you have to set have a man page that looks like the Encyclopedia Britannica, and the average command line is just a little shorter than "The Rime of the Ancient Mariner".
I work for a large bank. I once took all of the ATMs off the air by entering a simple command to empty a load library on the mainframe. I was asked to do it by the application expert coz he knew what he was talking about and I had the access.
Oops. Came half a bee's dick from losing my job.
By the ability of Amazon to do a root cause so quickly and go public with it.
In most corporates I've worked with it would take them at least 3 months to figure this out, even with C-Suite backing, and they'd only admit it 2 years later, because lawyers or something.
Makes a breath of fresh air that they have kept us informed. Unlike, say, every bank ever, or TalkTalk, or Adobe. Although naturally they've used up 5 years of their standard 99.99% availability quota in a single day so I'm by no means advocating they get supplier of the year... Just that others might learn that this is the proper way to keep users informed after a crisis.
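The "5 years of quota" line is easy enough to sanity-check. A quick sketch, taking the 99.99% figure from the comment above at face value (it is the commenter's number, not something checked against any actual SLA) and assuming 365-day years:

```python
def allowed_downtime_minutes(availability_percent, years=1.0):
    """Downtime budget, in minutes, implied by an availability percentage."""
    minutes_per_year = 365 * 24 * 60  # 525,600; leap days ignored
    return (1 - availability_percent / 100) * minutes_per_year * years


per_year = allowed_downtime_minutes(99.99)             # about 52.6 minutes
five_years = allowed_downtime_minutes(99.99, years=5)  # about 262.8 minutes
print(f"{per_year:.1f} min/year, {five_years / 60:.2f} hours over 5 years")
```

So four nines works out to a bit under an hour a year, or roughly four and a half hours over five years. Whether "5 years in a single day" is exact depends on how long you count the outage as lasting.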
"By the ability of Amazon to do a root cause so quickly and go public with it"
If you're a CIO and the whole company is betting on your strategy and you choose to go with a provider who fails and it takes them 3 months to figure out and report on that failure, then you're gonna be asked to move off it pretty quickly and your credibility goes down the swanny.
If AWS want to maintain their leadership and gain their customers' trust, they have to be transparent and quick to resolve. And yes, it is impressive. But I don't think they get a choice.
Cloud servers, or rather Other People's Tin (OPT), are just as likely to go down as your own, due to Layer 8 errors.
The problem is their marketing machine says "don't worry if some idiot does it, we've got some more tin over here that we'll move your stuff to."
But as this demonstrates, no one told ops that.
"Those two subsystems handled the indexing for objects stored on S3 and the allocation of new storage instances. Without these two systems operating, Amazon said it was unable to handle any customer requests for S3 itself, or those from services like EC2 and Lambda functions connected to S3."
Move everything to the cloud yet have a single point of failure...nice.
That'll be the source of the problem. When such is in use, it can only mean one thing.
The person typing the commands, while "authorised" to do so, almost certainly hasn't got a clue what they actually do.
If they did then a) they wouldn't need someone else to have written it down for them and, more importantly, b) they'd have spotted the typo before hitting Enter.
First an admission, I have performed too many of the "fat finger" events over my lifetime. Back in the day when most of this was new it was accepted. Technology and its intricacies have changed dramatically and leaving the mere mortal to the capabilities of unrestricted command lines leaves us open to the next fat finger event.
Solutions to these problems have existed for a long time under many names. Currently it is called "service orchestration", but companies still need to invest in it and technicians have to embrace it. Simply put, it allows any technology's command facilities to be exposed, but in a controlled state, removing the potential for the next "fat finger". For any company where the technology is the company, there is no excuse in blaming the human. It is time for the company to own up and say: we didn't support the human, we provided the capability for this to happen, and as it always will, it happened.
So to Amazon and any other high tech company providing critical services I would ask - what are you doing to make certain this never happens again - get the cheque book out and remove the potential for it to happen in the first place.
BTST. Alternative good times.
These systems have probably been around since Amazon first needed a storage system. And never mind that the variables in the code all refer to books. That's historical legacy code, the original developers have left and you'd better not touch it.
Sorry, I really get annoyed by this "other people's tin" quark.
You cannot run Internet services on your own computers alone. If a major ISP mucks up their routers and goes dark, your precious customers will not be able to see your website, no matter how much disaster recovery you planned for.
AWS is a "cloud" hardware store. They stock some nice tools and might offer advice on your design questions. You're still responsible for building a stable system yourself.
It suddenly occurs to me the primary issue with cloud stuff.
Cloud is way cheaper than doing it on site. It's also comically, laughably more expensive than having it under your own roof. The differences are pretty simple: Doing it right is significantly more expensive than doing it wrong--to the point that the tools you're using aren't important. If your cloud stuff goes down because you don't have geographic failover redundant whatever etc because you cheaped out, you did it wrong. Holds precisely as true as the day the roof leaks and your rack emits a sad little pfzzt sound.
I do wish I lived in a world, or even a location, where this magical internet of perfect connectivity at wonderful speed was available, though. The last thing I want is any important data or processing being done on the other end of an unreliable, tiny, crooked straw. Someday, someday...
Biting the hand that feeds IT © 1998–2019