More CloudFog
Ah, internet redundancy... Or is it inter-dependency???
A fat-fingered human accidentally broke a transatlantic internet backbone that knackered Cloudflare's content delivery network in the US. Cloudflare – which props up loads of big names online – said that over the course of about fifteen minutes on Tuesday morning, it suffered a slowdown in traffic, meaning connections to …
...and this, boys and girls, is why people still build private networks, even though they are expensive. It does depend, though, on things like service providers having redundant route-reflectors in the core of their MPLS networks, and truly separate fibre paths between P-nodes. Don't ask me why these points are foremost in my mind.
Some large companies with deep pockets buy private networks from multiple suppliers to try to guarantee redundancy of equipment; but unless you buy pairs of separate fibre routes from each individual supplier, you can never guarantee that both providers haven't bought a fibre on the same path from a third-party supplier. Even then, good luck getting four geographically separate entry points into each of your buildings. If you have an anonymous building in a random business park on the edge of a typical large town, you can usually get, if you are lucky, only two paths. There's no money in providing very high degrees of diversity, separacy and redundancy to everywhere. There's actually a surprisingly small number of highly-connected datacentres around.
A lot of Cloudflare services depend on actually being able to contact the origin site behind them, as opposed to hosting it directly.
If the Intertubes are broken, they can't do that. This was also the sort of error that really breaks normal redundancy - the prefixes were still in BGP, but traffic wasn't being delivered.
There isn't really any way to protect against that automatically, 100% of the time. Certainly not when some traffic via the transit provider is getting through and other traffic isn't - you really need a human to make the decision in that case. Especially since the only way to fix it is to kill the entire transit connection, and doing so comes with its own risks (route-flap damping, for example).
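To make that concrete, here's a minimal sketch of the kind of check that can only escalate, never self-heal: it treats the public RIPEstat routing-status API as the control-plane view (the response field names are from memory, so verify them against the docs) and a plain ping as the data-plane view, and pages a human when the two disagree. The prefix and probe address are RFC 5737 documentation values, not real ones.

```python
#!/usr/bin/env python3
"""Sketch: spot 'advertised but black-holed' prefixes and page a human.

Hypothetical monitoring logic, not anyone's production tooling.
Control plane: the public RIPEstat routing-status API (response
field names assumed from memory -- verify before relying on them).
Data plane: a plain ICMP ping. If BGP still shows the prefix but
probes fail, we alert rather than auto-withdraw anything, because
automatically killing a transit session risks route-flap damping.
"""
import json
import subprocess
import urllib.request

RIPESTAT = "https://stat.ripe.net/data/routing-status/data.json?resource={}"


def visible_in_bgp(prefix: str) -> bool:
    """True if RIPE RIS collectors still see the prefix announced."""
    with urllib.request.urlopen(RIPESTAT.format(prefix), timeout=10) as resp:
        data = json.load(resp)["data"]
    # Assumed schema: the number of RIS peers that see the v4 prefix.
    return data["visibility"]["v4"]["ris_peers_seeing"] > 0


def reachable(probe_ip: str) -> bool:
    """Data-plane check: do three pings get through?"""
    result = subprocess.run(
        ["ping", "-c", "3", "-W", "2", probe_ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check(prefix: str, probe_ip: str) -> None:
    if visible_in_bgp(prefix) and not reachable(probe_ip):
        # Control plane says fine, data plane says dead: exactly the
        # failure mode that automation can't safely fix on its own.
        print(f"ALERT: {prefix} still in BGP but {probe_ip} unreachable - "
              "get a human looking at the transit session")


if __name__ == "__main__":
    check("203.0.113.0/24", "203.0.113.1")  # RFC 5737 documentation range
```

The point of the sketch is what it doesn't do: there's deliberately no code path that shuts the session down, because when some traffic through a transit is flowing and some isn't, that judgement call has to stay with a person.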
Our monitors picked this up (pingdom.com and check_mk distributed deployments) and we had escalations within 3 minutes.
http://status.pingdom.com/incidents/0n95zvcxb19m
I watched multiple routes reconverge, and as they did the lights went green. For example, we had one German monitor going via London/NYC on Telia.net with 99.3% loss; it re-routed and went via another carrier to some US deployments we had in IAD and DFW.
I have some screenshots if you want. This has nothing to do with Cloudflare itself if a user is routed via some carrier network that goes down.
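For anyone wanting to watch that kind of reconvergence themselves, here's a rough sketch along the same lines, using mtr's report mode to sample per-hop loss until the path goes clean again. The target hostname is a placeholder and the 1% threshold is arbitrary; mtr's report layout also drifts a bit between versions, so treat the parsing as an assumption to adapt rather than copy.

```python
#!/usr/bin/env python3
"""Sketch: watch a path re-converge by sampling per-hop loss with mtr.

Illustrative only -- the target is a placeholder and the threshold
arbitrary. mtr usually needs its setuid helper or root to run.
"""
import re
import subprocess
import time

TARGET = "example.com"  # placeholder; point at one of your own deployments


def worst_hop_loss(target: str) -> float:
    """Run mtr in report mode and return the highest per-hop loss %."""
    report = subprocess.run(
        ["mtr", "--report", "--report-cycles", "10", "-n", target],
        capture_output=True, text=True, check=True,
    ).stdout
    # The Loss% column is the only percentage in each hop line.
    losses = [float(m) for m in re.findall(r"(\d+(?:\.\d+)?)%", report)]
    return max(losses, default=0.0)


if __name__ == "__main__":
    # Poll until the carrier reroutes and the worst hop goes quiet.
    while True:
        loss = worst_hop_loss(TARGET)
        print(f"{time.strftime('%H:%M:%S')} worst hop loss: {loss:.1f}%")
        if loss < 1.0:
            print("path looks green again")
            break
        time.sleep(30)
```

Polling every 30 seconds is crude next to a proper distributed check_mk setup, but it's enough to see a 99.3%-loss hop disappear when the carrier reroutes.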