Reply to post:

Amazon S3-izure cause: Half the web vanished because an AWS bod fat-fingered a command

Anonymous Coward

It's a super awesome convenience to be able to hit tons of machines in a big data center operation, but as you can see things can go wrong in a big way. It would be interesting to see a pseudo-syntax of what happened, whether this was a web GUI, a CLI, or a script, what have you. I can tell you at the Yahoo! CEO shuffle I attended a few years back we could address wide swaths of machines, but most of the folks knew what not to do, and how to break up big jobs (ha!) into easy-to-handle tasks.

For instance, my first task was to run a script that fixed the storage issue with NetApp "moosehead" disks that would cause them to lose data and, the extra cool thing, not be able to recover from their RAID! Good times! This was on over 300 mail "farms", which were middle-tier mail-handling clusters that did the sorting of mail vs junk/spam. The spam goes off to cheapo storage, and "good mail" goes to the main stores. Anyway, the IDs needed fixing to point each user's mail at the new storage, which meant running a script on close to 6000 machines, no VMs, all pizza boxes. No WAY was I going to just go nuts and try to run them all at once, even though you could very well do that with Limo, their internal custom multi-host command tool, later replaced by a tool called Pogo. Clusters of machines could also be addressed with aliases, so I could say "all hosts in a group" with a simple name and, for example, turn off the flag that shows availability to the VIP.

For the script work I was clued in via change management meetings, then I ran the script on one farm to make sure it worked and that we did not clobber any users, then we did 10 farms, then 100, and the rest (are here on Gilligan's Island!). No problem. My goal was to not cause any issue that would make it into the news. :P I had nothing to do with the security side either, which is a big embarrassment to their new owners, I'm sure.
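Purely as an illustration of that 1 farm, then 10, then 100, then the rest pattern, here's a minimal Python sketch of the staged rollout. Limo and Pogo were internal Yahoo! tools, so everything below (the ssh-based run_fix_on() helper, the script path, the farm list shape) is a hypothetical stand-in, not what we actually ran.

    # Hypothetical sketch of the staged rollout: 1 farm, then 10, then 100,
    # then the rest. Limo/Pogo were internal tools, so plain ssh stands in here.
    import subprocess
    import sys

    def run_fix_on(host: str) -> bool:
        """Run the (hypothetical) fix script on one host over ssh."""
        result = subprocess.run(
            ["ssh", host, "/usr/local/bin/fix_storage_ids.sh"],
            capture_output=True, text=True,
        )
        return result.returncode == 0

    def staged_rollout(farms: dict) -> None:
        """farms maps a farm alias to its list of hosts, e.g. {'farm001': [...]}."""
        names = sorted(farms)
        # Widening batches: one farm, ten, a hundred, then everything left.
        for batch in (names[:1], names[1:11], names[11:111], names[111:]):
            for farm in batch:
                failed = [h for h in farms[farm] if not run_fix_on(h)]
                if failed:
                    # Stop the whole rollout before it can make the news.
                    sys.exit(f"{farm}: {len(failed)} hosts failed, halting")
            input(f"Batch of {len(batch)} farm(s) looks clean; Enter to continue...")

The point is the widening batches plus a hard stop on the first failure, not the particulars of the tooling.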

I was also in Search (AKA the Bing Gateway) and there we typically chose UTC midnight on Wednesdays to perform updates to the front-end search servers. In the US there were two big data centers, each with two clusters of 110 hosts to handle the web-facing search front end. For maintenance, you just chose a single host, took it out of the global load balancer, updated it, and dropped it back in with extra monitoring turned up. If it did not crap itself, we could then take out half of a data center, do the update, put those hosts back in, then repeat the process three more times for the other clusters, and that was that. But, yes, super easy to fuck up and take out every data center if you don't pay attention to your machine lists.
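For what it's worth, that canary-then-half-a-data-center dance looks roughly like this in Python. The drain()/enable()/update() calls are hypothetical placeholders for the global load balancer and push tooling, which obviously aren't public, and the soak time is just illustrative.

    # Hypothetical sketch of the rolling front-end update: one canary host,
    # then one cluster (half a data center) at a time, four clusters total.
    import time

    def drain(host: str) -> None:
        """Placeholder: mark the host unavailable in the global load balancer."""
        print(f"draining {host}")

    def enable(host: str, extra_monitoring: bool = False) -> None:
        """Placeholder: put the host back in rotation, watched harder if asked."""
        print(f"enabling {host} (extra monitoring: {extra_monitoring})")

    def update(host: str) -> bool:
        """Placeholder: push the new build and health-check it."""
        print(f"updating {host}")
        return True

    def rolling_update(clusters: list) -> None:
        """clusters: four lists of ~110 hosts, two per data center."""
        # Canary first: a single host, back in with extra monitoring.
        canary = clusters[0][0]
        drain(canary)
        if not update(canary):
            raise RuntimeError(f"canary {canary} failed, aborting")
        enable(canary, extra_monitoring=True)
        time.sleep(600)  # illustrative soak time before touching anything else

        # Then one whole cluster (half a DC) at a time, never more.
        for cluster in clusters:
            batch = [h for h in cluster if h != canary]
            for h in batch:
                drain(h)
            bad = [h for h in batch if not update(h)]
            if bad:
                raise RuntimeError(f"{len(bad)} hosts failed, stopping here")
            for h in batch:
                enable(h)

The whole trick is that at most half of one data center is ever out of rotation, so a bad push never looks like an outage from the outside.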
