Reply to post:

Amazon S3-izure cause: Half the web vanished because an AWS bod fat-fingered a command

donk1

6000 machines...so run 200 machines at a time for 30 times.

What is this obession with 10,100,2000,rest and doing a massive population in 5 steps?

Even if 2110 machines worked fine how long would it take to fix the last 3900 machines if enough of them broke?

For failures it is not the number of times you have done it before but the size of the failure domain and how long it takes to fix.

it should be possible to rollout automatically in small batches and even had multiple upgrades rolling out at the same time on an automatic schedule, ripple across the farm!

If it is automated and scheduled who cares how many batches of upgrades are run?

You would catch errors with less impact that way as the failed batch size would be smaller and it would be minimal extra work if designed correctly.

This is the next stage in cloud service design - being able to have slower rolling upgrades with smaller batches!

POST COMMENT House rules

Not a member of The Register? Create a new account here.

  • Enter your comment

  • Add an icon

Anonymous cowards cannot choose their icon

SUBSCRIBE TO OUR WEEKLY TECH NEWSLETTER

Biting the hand that feeds IT © 1998–2019