It's really not that difficult.
Resilience models are well known, understood and documented.
Monitoring tools are well known, understood and documented.
So why is this so hard for people to get right?
Or is it the quick change to fix X that ends up breaking Y because insufficient testing was performed?
If we're going to have to rely on Internet-based services to run our lives, then the companies making mega profits can at least do the right thing and build them to be rock solid.
Oh, and give us a workable Plan B for when you screw up: local branches, cash machines, you know, that sort of complicated stuff.
Yes, your profits might be a bit lower, but your customers will be able to get on with their lives when you screw up again.