Transatlantic link typo by Sweden's Telia broke Cloudflare in the US

A fat-fingered human accidentally broke a transatlantic internet backbone link, knackering Cloudflare's content delivery network in the US. Cloudflare – which props up loads of big names online – said that over the course of about fifteen minutes on Tuesday morning it suffered a slowdown in traffic, meaning connections to …

  1. Anonymous Coward

    More CloudFog

    Ah, internet redundancy... Or is it inter-dependency???

    1. Anonymous Coward

      Re: More CloudFog

      "The issue is related to a specific transit provider and we are working on temporarily disabling this provider to route around the issue,"

      Sounds pretty redundant to me.

  2. Len Goddard

    Is there anywhere left?

    "In addition to connections in North America, the outage caused problems for Cloudflare services in South America, Europe, Asia, Africa, and the Oceania region."

    Antarctica, perhaps?

  3. Anonymous Coward

    True distributed computing ...

    .. is when a server you've never heard of and never use prevents you from getting your job done.

    As coined back in the days when DEC meant something much more than the last month in the year. Plus ça change ...

    1. Norman Nescio Silver badge

      Private Networks

      ...and this, boys and girls, is why people still build private networks, even though they are expensive. Although it does depend on things like service providers having redundant route-reflectors in the core of their MPLS networks; and truly separate fibre paths between P-nodes. Don't ask me why these points are foremost in my mind.

      Some large companies with deep pockets buy private networks from multiple suppliers to try to guarantee redundancy of equipment; but unless you buy pairs of separate fibre routes from each individual supplier, you can never guarantee that both providers haven't bought a fibre on the same path from a third-party supplier (a toy illustration of that trap follows this thread). Even then, good luck with getting four geographically separate entry points into each of your buildings. If you have an anonymous building in a random business park on the edge of a typical large town, you can usually get, if you are lucky, only two paths. There's no money in providing very high degrees of diversity, separacy and redundancy to everywhere. There are actually surprisingly few highly-connected datacentres around.

      1. Chairman of the Bored

        Re: Private Networks

        Outstanding observations. The sad thing is, as difficult as the technical issues are to address, the real challenge is to get sufficient resources from one's management to do the job well. After all, why spend money on some disaster that hasn't happened yet?
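
Norman Nescio's shared-duct point can be made concrete with a toy check: the only way to know two "diverse" circuits are genuinely diverse is to compare the physical segments each one rides. A minimal Python sketch, with every segment name invented for the example:

    # Toy check for the shared-fibre trap: two circuits bought from
    # different providers can still ride the same physical duct or
    # subsea cable. All segment identifiers below are invented; a real
    # check needs each supplier's actual route survey data.
    def shared_segments(path_a: set, path_b: set) -> set:
        """Return the physical segments two circuits have in common."""
        return path_a & path_b

    # Both providers happen to have bought capacity on the same
    # (hypothetical) transatlantic cable segment "TAT-X".
    provider_a = {"london-pop1", "slough-duct4", "TAT-X", "nyc-pop2"}
    provider_b = {"london-pop7", "reading-duct9", "TAT-X", "nyc-pop5"}

    overlap = shared_segments(provider_a, provider_b)
    if overlap:
        print("Not actually diverse - shared fibre:", sorted(overlap))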

  4. EveryTime

    "An increase in response time" generally means that the servers were completely unavailable, but that the vendor won't count the downtime against their "5 nines" marketing reliability.

    1. Richard 12 Silver badge

      Half hour ping.

      No, no, the server wasn't down for 20 minutes, the Intertubes were slow.
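
For anyone who hasn't run the five-nines numbers, the annual downtime budget is brutally small. A quick Python sketch of the allowance per number of nines:

    # Annual downtime budget for N nines of availability. A 15-minute
    # outage on its own is roughly three years of five-nines budget.
    SECONDS_PER_YEAR = 365 * 24 * 3600

    for nines in range(2, 6):
        availability = 1 - 10 ** -nines
        budget_minutes = SECONDS_PER_YEAR * (1 - availability) / 60
        print(f"{nines} nines: {budget_minutes:.1f} minutes of downtime per year")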

  5. Winkypop Silver badge

    Bork, bork, bork!

    Bork, bork, bork, bork, bork, bork, bork, bork, bork, bork,

  6. ckm5

    To be fair, I got a notice from Pingdom

    Their network was affected as well, as was Stackdriver.

  7. Anonymous Coward
    Anonymous Coward

    So 10 months ago Telia had a hiccup, and CloudFlail loudly and publicly shifted the blame to Telia.

    It's been 10 months and CloudFlail have still not managed to understand how CDNs are supposed to work.

    They're really a PR company with a thin shim of engineering around them.

    1. patrickstar

      A lot of Cloudflare services depend on actually being able to contact the site behind them, as opposed to hosting it directly.

      If the Intertubes are broken, then Cloudflare can't do that. This was the sort of error that really breaks normal redundancy as well - the prefixes were still in BGP but traffic wasn't delivered.

      There isn't really any way to automatically protect 100% against that. Certainly not when some traffic via the transit provider is getting through and other traffic isn't - you really need a human to make the decision in that case. Especially since the only way to fix it is killing the entire transit connection, and doing so comes with its own risks (route flap damping, for example). A sketch of exactly this case follows the thread.

    2. ckm5

      Not just Cloudflare

      A huge chunk of transatlantic traffic was affected. I understand it might be hard to grok for commentards but the Reg hacks have no excuse....
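
patrickstar's scenario is worth spelling out, because it is exactly the case naive failover misses: the control plane still carries the routes while the data plane drops the packets. A toy Python sketch, with provider names and the threshold invented (the 99.3 per cent loss figure is borrowed from jerky_rs's monitor below):

    # The failure mode that defeats ordinary redundancy: BGP still
    # announces the prefixes, but packets sent that way are dropped.
    # Provider names and the 50% threshold here are invented.
    from dataclasses import dataclass

    @dataclass
    class Transit:
        name: str
        route_in_bgp: bool     # control plane: prefix still announced?
        probe_loss_pct: float  # data plane: measured loss via this path

    transits = [
        Transit("telia", route_in_bgp=True, probe_loss_pct=99.3),
        Transit("other-carrier", route_in_bgp=True, probe_loss_pct=0.1),
    ]

    for t in transits:
        if t.route_in_bgp and t.probe_loss_pct > 50:
            # Killing the whole transit session risks route flap
            # damping, so page a human rather than act automatically.
            print(f"ALERT: {t.name} announces routes but drops "
                  f"{t.probe_loss_pct}% of probes - needs a human")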

  8. jerky_rs

    Our monitors picked this up (pingdom.com and check_mk distributed deployments) and we had escalations within 3 minutes.

    http://status.pingdom.com/incidents/0n95zvcxb19m

    I watched multiple routes reconverge, and as they did so the lights went green. For example, we had one German monitor going via London/NYC on Telia.net with 99.3% loss; it re-routed via another carrier to some US deployments we had in IAD and DFW.

    I have some screenshots if you want them. This has nothing to do with Cloudflare if a user is routed via some carrier network that goes down.
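
The kind of check jerky_rs describes doesn't need much machinery: probe each deployment, parse the loss figure, escalate over a threshold. A rough Python sketch (hostnames are placeholders and the threshold is arbitrary; Pingdom and check_mk do this properly):

    # Rough loss probe in the spirit of jerky_rs's setup. Hostnames
    # are placeholders; the 5% escalation threshold is arbitrary.
    import subprocess

    def packet_loss(host: str, count: int = 20) -> float:
        """Ping a host and return the percentage of lost packets."""
        result = subprocess.run(
            ["ping", "-c", str(count), "-q", host],
            capture_output=True, text=True,
        )
        for line in result.stdout.splitlines():
            if "packet loss" in line:
                # e.g. "20 packets transmitted, 0 received, 100% packet loss"
                return float(line.split("%")[0].split()[-1])
        return 100.0  # no summary line at all: treat as total loss

    for target in ["iad-deployment.example.com", "dfw-deployment.example.com"]:
        loss = packet_loss(target)
        print(f"{target}: {loss:.1f}% loss", "- ESCALATE" if loss > 5.0 else "- ok")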
