Storing your company's commercial secrets on Google Cloud ...
... should be a sacking offence.
A couple of days ago Google's cloud went offline, just about everywhere, for 18 minutes. Now the Alphabet subsidiary has explained why and issued a personal apology penned by “Veep for 24x7” Benjamin Treynor Sloss. And yes, that is Sloss' real title. Sloss says the problem started when “engineers removed an unused Google …
Why? This issue didn't affect commercial confidentiality or data security; it affected availability. I'd be confident that my data is more secure in Google's data centre than it is in some poorly patched corporate data cupboard-under-the-stairs. I might not always be able to get access to it, but hey, if I can't then the h4xx0rs can't either.
Moreover, when something goes wrong they are pretty good about finding out what it was, fixing it and then telling everyone about it in detail. Working at large financial services organisations, I've seen any number of major outages where the root cause analysis was "dunno" and the follow-up action was "cross fingers".
Once you get past 4 9s or so, downtime is mostly related to human error. Of the outages the article lists for Google Cloud since August, only one was not human error. And assuming all the outages were of the same 18-minute magnitude as this one, Google Cloud is running at about 3 1/2 9s.
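The "3 1/2 9s" figure above can be sanity-checked with some rough arithmetic. This is a sketch under my own assumptions, not the article's: I've guessed a hypothetical count of a dozen incidents, each lasting 18 minutes, over roughly ten months (August onwards).

```python
# Rough availability arithmetic for the "3 1/2 nines" claim.
# Assumptions (mine, not the article's): 12 outages, each 18 minutes,
# over a window of about 10 months.

MINUTES_PER_MONTH = 30 * 24 * 60        # ~43,200 minutes
window = 10 * MINUTES_PER_MONTH         # ~432,000 minutes since August
outages = 12                            # hypothetical incident count
downtime = outages * 18                 # 216 minutes of unavailability

availability = 1 - downtime / window
print(f"availability: {availability:.5f}")  # ~0.99950, i.e. about 3.5 nines
```

For comparison, "four nines" (99.99%) over the same window would allow only about 43 minutes of downtime, i.e. two or three such outages.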
Once you get past 4 9s or so, downtime is mostly related to human error
Once you get past two nines you should not be dealing with an external vendor connected via the Internet, and even that's a stretch already. Even when said vendor can deliver 6 nines, it's pointless if you can't get to them by using a network that has so many intermediaries between you and the vendor that ANY kind of uptime statement is as reliable as your guess for next week's lotto results.
So, even if we assumed that a "let's throw a lot of things at the wall and see what sticks" vendor like Google would be able to achieve good uptimes for other than their spying activities, you're still nowhere near reliability unless you have multiple, diverse, dedicated circuits going to their data centre.
Q: Are my Cloud Services running?
A: Yes sir, your cloud services are running perfectly
Q: Then why can't I get at them?
A: We don't know, sir. Please try your network provider... (as the operator looks out the window at a JCB, a big hole, and several people all scratching their heads and saying 'That fibre optic cable should not be there').
This shows the weakness of "availability zones" from the same provider.
Better to take one availability zone from Google and one from Amazon (say). You end up paying for the traffic which goes from one to the other; but you get genuine resilience.
Alternatively, Google should manage their availability zones as if they were completely separate providers on separate networks - in particular with different AS numbers, and different management teams.
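The "one zone from each provider" idea above boils down to replicating every write to both clouds and reading from whichever one answers. A minimal sketch of that shape, with `Store` as a hypothetical stand-in for the real GCS and S3 client libraries (which is where the cross-provider traffic charges come in):

```python
# Sketch of the dual-provider idea: writes go to both clouds, reads fall
# back from one to the other. Store is a stand-in, not a real client.

class Store:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True

    def put(self, key, value):
        if not self.up:
            raise ConnectionError(self.name)
        self.data[key] = value

    def get(self, key):
        if not self.up:
            raise ConnectionError(self.name)
        return self.data[key]

def replicated_put(stores, key, value):
    # A write must land on every store, or the replicas diverge.
    for s in stores:
        s.put(key, value)

def resilient_get(stores, key):
    # Read from the first store that answers.
    for s in stores:
        try:
            return s.get(key)
        except ConnectionError:
            continue
    raise RuntimeError("no provider reachable")

gcp, aws = Store("gcp"), Store("aws")
replicated_put([gcp, aws], "secret", "formula")
gcp.up = False                               # simulate the 18-minute outage
print(resilient_get([gcp, aws], "secret"))   # prints "formula"
```

Even this toy version shows the catch the replies below pick up on: `replicated_put` is itself a new thing that can half-fail, leaving the two providers disagreeing about your data.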
You get genuine resilience if you can manage the nightmare of getting two different Cloud providers to handle the same data without bungling things.
Apparently, Cloud is hard enough as it is with ONE provider. Put another one in the mix and you just might become the poster child for a How Not To Do Cloud article.
But yeah, in theory redundancy is based on two of a thing.
I don't mean to criticise redundancy in principle, but it often turns out not to be all it's cracked up to be, at least in certain contexts. (We can all agree that putting data in a single place is silly.)
Imagine that, to avoid being hurt by the Google failure, you changed your architecture to run in parallel on two totally different providers (e.g. Google and AWS). Would it be more reliable? I'd suggest not necessarily.
Why?
Well, it involves extra complexity in the design and operation of your system, which in turn makes it more likely that a programming error on your part will bring down the edifice, rather than a problem at either Google or AWS, especially once half the team that designed it has moved on.
And then there's the huge cost of provisioning an infrastructure where the entire load can be carried by less than half of the total capacity. That's always a difficult sell, and in practice, over time, it tends to be ignored, meaning that when one bit does go titsup, the whole thing grinds to a halt anyway.
In practice, the only way to tell whether your system is reliable is to break it on purpose, regularly, and demonstrate that everything still works, and people are naturally reluctant to make breaking things on purpose their job! (Though years ago Netflix invented 'Chaos Monkey', which did precisely that, to ensure certain classes of failure could be handled gracefully.)
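The Chaos Monkey idea fits in a few lines. This is my own illustrative sketch of the technique (kill a component at random, then verify the service still answers), not Netflix's actual implementation; `Backend` and `route` are hypothetical names.

```python
# Minimal chaos-testing sketch: deliberately fail one backend at random
# and check the request path still works via failover.
import random

class Backend:
    def __init__(self, name):
        self.name = name
        self.up = True

    def handle(self, request):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"

def route(request, backends):
    # Try backends in order; fail over to the next on error.
    for b in backends:
        try:
            return b.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("total outage: no backend available")

backends = [Backend("primary"), Backend("secondary")]

# The "monkey": kill one backend at random, then verify service survives.
random.choice(backends).up = False
print(route("GET /health", backends))
```

The point is that the assertion runs continuously in production-like conditions, so a broken failover path is discovered by you, not by your customers during the next outage.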
So in short, it might be more reliable if you just rely on a single provider!
Yes, but the whole point of this "cloud" thing is that it's supposed to have redundancy and a complete lack of any single point of failure built in (i.e. "five nines" availability is implicit in the concept we were sold). It should be physically impossible for any human error to break the whole thing.
Using two clouds to make a reliable cloud is like owning a dog and barking for it. More to the point, you've just added a SPOF in the bit that handles which one your traffic's going to (or maybe we're thinking of hosting that on a third service???).
Wake me up when somebody comes up with an offering in this area that's actually bloody fit for purpose.
Better to take one availability zone from Google and one from Amazon (say). You end up paying for the traffic which goes from one to the other; but you get genuine resilience.
You really don't, you just make sure that multiple big vendors are a risk to your business.
If you want resilience run it in house, and fail over to AWS.