Telia engineer error to blame for massive net outage

Swedish infrastructure company Telia is to blame for a massive internet outage today after an engineer apparently misconfigured a key router and sent all of Europe's traffic to Hong Kong. The Tier 1 network provider is one of fewer than 20 companies that provides a basic foundation for much of the internet. It sent a note to …

  1. petur
    Meh

    phew

    I had just installed a new router this weekend and I was beginning to fear I configured something wrong when connectivity was showing issues.

    And now I also know our ISP uses Telia....

  2. Anonymous Coward

    Engineer mishap, but with idiot telco partners

    While a fairly crappy thing to have happen from Telia, it shows various connected telcos don't have rules in place to filter out bogus routes (which they should). That's absolutely on them, not Telia.

    People, please fix your route filtering so bogus external routes (including TLA-sponsored crap) don't automatically hook you.
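
    A minimal sketch of the kind of inbound route filter being asked for here, assuming a peer is only expected to announce a known set of prefixes. The allow-list, the maximum prefix length and the function name are illustrative assumptions, not any vendor's or operator's actual policy:

```python
import ipaddress

# Hypothetical allow-list: prefixes this peer is registered to announce.
# These are documentation ranges, used purely for illustration.
ALLOWED_PREFIXES = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

# Assumption: anything more specific than a /24 is rejected outright.
MAX_PREFIX_LEN = 24


def accept_route(prefix: str) -> bool:
    """Return True only if the announced prefix passes the filter."""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen > MAX_PREFIX_LEN:
        return False  # too specific
    # Accept only prefixes covered by an allow-list entry of the same IP version.
    return any(
        net.version == allowed.version and net.subnet_of(allowed)
        for allowed in ALLOWED_PREFIXES
    )


if __name__ == "__main__":
    for announced in ["198.51.100.0/24", "198.51.100.128/25", "8.8.8.0/24"]:
        print(announced, "accepted" if accept_route(announced) else "rejected")
```

    Real filters are written in router policy language and are usually generated from registry data; this only shows the accept/reject logic.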

    Cloudflare... hmmm... playing up to the media maybe? :(

    1. Alan Brown Silver badge

      Re: Engineer mishap, but with idiot telco partners

      > it shows various connected telcos don't have rules in place to filter out bogus routes (which they should).

      Given this is a fairly regular occurrence, you're spot on.

    2. Anonymous Coward

      Re: Engineer mishap, but with idiot telco partners

      Not so easy. That is the problem with BGP: it was never built for this scale or this kind of world. BGP leaks happen every day and they are difficult to stop, especially with big telcos that buy peering, transit and transport from each other. The protocol is based on trust; anything else, such as filters, is inefficient and only really manageable in small BGP implementations that aren't dynamic. There is an attempt to review the protocol exactly because of this; the Reg has something about it.

      Nevertheless, as the article suggests, if Telia is transporting the traffic for other ISPs and the route is wrong somewhere, there are no filters to resolve it. They are simply transporting the traffic to a black hole, because the next gateway has no clue what to do with the prefixes and just drops them.

  3. JeffyPoooh
    Pint

    Thus is revealed...

    Remember all that explanation about how cleverly self-organizing and self-repairing the internet is? How it can route around a damaged link? Remember all that?

    Bollocks. Now it's apparently manually controlled by fat fingered blokes. Might as well be levers and handles controlling pipes. Fry was right.

    Is this described under an RFC document? RFC 9995 'Fry's List of Levers and Handles to Control the Internet Pipes'

    Stupid.

    1. Terry 6 Silver badge

      Re: Thus is revealed...

      Yes.

      This is way outside my area of competence.

      So I'd always assumed that at this top level of function there was a top level of error checking. Both human and digital.

      Systems for checking that what is meant to go to x actually goes to X and systems that make sure humans don't press the wrong button. Or at least if they do that it's spotted PDQ.

      Engineers in other system critical areas seem to have this, usually. e.g. Aircraft engineers.

      1. sdunga

        Re: Thus is revealed...

        Most of these services are provisioned separately and over many years, so in reality there is little holistic view of everything and its dependencies and interdependencies.... Not to mention that half of the telcos have bought a number of other telcos, and there are a lot of skeletons hidden in the closets.

        In general these big networks are anything but well controlled, whatever the telco. The proof is the succession of big outages: no matter how big the name, they have all had their own episodes in the last 2 years.

        The reality is plain, simple and hard: all the protocols the Internet is based on are old, very old, and were never meant to scale out. Over the years people have implemented a number of best practices and workarounds to try to make inefficient protocols meet the new demands. IPv4, even TCP, DNS, SMTP.... all completely inadequate for today's needs, but all the baseline of everything we call the internet.

        Even IPv6, which is hardly implemented at all, dates from 1998 (RFC 2460), which is close to 18 years ago. BGP, the big engine of all of this, dates from the first half of the 90s.

        To make it more interesting, very few big telcos really have any automation in place in their cores. Because the networks are so big and have been built up over the years with all kinds of hidden stuff, no one dares to apply automation blindly and risk an outage. Automation is normally implemented at the edge, in customer services, and even there the complex nature of many services makes automation impossible. Normally DSL, cable and that kind of service is where you find the automation piece.

    2. Anonymous Coward

      Re: Thus is revealed...

      @jp

      If you absolutely need to route that way, you will. If you're sharing tables with customers, hopefully their kit will show a shorter path via their alternate service provider and avoid the long erroneous route; otherwise they are wholly reliant on their provider doing the right thing, and therein lies risk and danger to service availability.

      1. Evil Auditor Silver badge

        Re: Thus is revealed...

        ...via their alternate service provider...

        Absolutely. If it is business critical you have to have an alternate link and service provider. If you don't, stop crying.

        1. Anonymous Coward

          Re: Thus is revealed...

          Victim blaming?

        2. Anonymous Coward

          Re: Thus is revealed...

          But having an alternate service provider would not have helped. Telia didn't go down. The routes they injected told the whole world that these particular subnets were available (or withdrawn from) HERE. Any alternate providers would redirect that traffic to Telia. Any transit peers or tier 2s connected would still suffer. An alternate service provider would typically not help a business in this case.

    3. Yes Me Silver badge

      Re: Thus is revealed...

      It's common knowledge that BGP-4 misconfiguration errors can do things like this, and not all such errors can necessarily be filtered out automatically. I'm guessing that this was a route that was supposed to be in iBGP but got announced to BGP-4 peers as well.

      (Not defending Telia in the least, but there are daily BGP-4 misconfigurations, it's just that most of them have more localised impact.)

  4. Anonymous Coward

    I do not understand how you're able to write this article, since the last update we had from Telia was:

    > We would like to let you know, issue experienced earlier it is now solved. Our engineers, have confirm the network performances it back to normal. You should not experience any more disturbances due to this outage.

    > We will provide you with an RFO once the information its available.

    Anyway, maybe the RFO will confirm your article.

    And a footnote: CloudFlare were affected - no other CDN mentioned this publicly. I'm just gonna point this out: http://28.media.tumblr.com/tumblr_labmz1v2QU1qd0a5no1_500.jpg

  5. Alan J. Wylie

    Cloudflare post-mortem on the Jun 20th incident

    Also mentions one on the 17th

    https://blog.cloudflare.com/a-post-mortem-on-this-mornings-incident/

  6. Unicornpiss
    Meh

    Wouldn't want to be in his shoes...

    I've certainly screwed things up over the years. But when you make a major change and part of the Internet goes dark, wouldn't you put 2+2 together and check your work? Or did he just hit enter and go on vacation?

    I remember once when I inadvertently tripped a facility's fire alarm. I threw a switch and ominously the alarm went off a second later. That couldn't be me... Could it? COULD it? Aarrgh.

    Where is the "oops" icon?

    1. Sir Runcible Spoon

      Re: Wouldn't want to be in his shoes...

      I had that kind of feeling when I tested the redundant power supplies on an HP rack - how was I to know they are paired corner-corner, not side by side!?!? :)

  7. allthecoolshortnamesweretaken

    Yes, an "Oops!" icon would be nice...

  8. joeldillon

    I, uh, would not want to be the poor sod who made that mistake and is quite possibly reading this article right now.

    1. DNTP

      Uhh… I think I found him in my building. He's the guy with his shirt off standing on a desk in the middle of the cube farm while a burning network switch illuminates him from below, drinking from a gallon jug of 200 proof we keep for cleaning purposes. Hang on, he is yelling something at the CIO.

      "I DID SOMETHING. I AFFECTED THE WORLD. I AM A PERSON WHO MATTERED TODAY."

      1. Mark 85

        Has he since been moved to the quarterdeck, stripped to the waist and lashed to a grating? I'm expecting the sergeant-at-arms to be carefully selecting the cat-o'-nine-tails at this point.

        Yeah.. he mattered... but someone high up is probably pissed about it.

      2. Unicornpiss
        Happy

        Clearly he was "engaged"

        The big thing right now is our corporate "Engagement Survey", scheduled to begin in July. Ugh.

  9. cmaurand

    One fat fingered mistake and the world goes to hell. That's the problem with commandline thinking in the Cisco/Juniper world. Too easy to make a typo on the commandline and sink a bunch of ships.

    1. Anonymous Coward
      WTF?

      Odd, the most reliable hardware in the world tends to be command line (hint our old phone system wasn't "rebooted" for close on 15 years). But still, stick a GUI on it for those that don't know what they are doing.

    2. LionelB Silver badge

      Yeah... hey, wonder what that red button with the "rm -rf /" does?

    3. John Brown (no body) Silver badge

      "Too easy to make a typo on the commandline and sink a bunch of ships."

      It's just as easy to get distracted and click the wrong button on a GUI too. Most GUIs are not well designed and those that are are usually quicker and easier to use from the keyboard anyway.

      Unless you are drawing, most GUIs will usually let you do pretty much everything from the keyboard but there does seem to be a trend to "mouse only" GUIs which can be a pain when you have to type some text in then get the mouse and click "submit" when the previous version let you use ALT-S without having to move your hand so far to find the bloody mouse!!!

      GUIs are great at keeping the learning curve shallow but can be draining on productivity once the user is familiar with the software. And beginners who need a hand-holding GUI probably should not be messing with tier 1 routing tables.

  10. FuzzyWuzzys
    Facepalm

    QA?

    I'm guessing the "plank" in question, accidentally loaded a config into a live box instead of a POC/dev/test box?

    I can't begin to imagine the "white fear" (the blood draining from your face) as it suddenly dawned on them what they had done! "Hello love? Yes it's me, I'm going to be working late...assuming I still have a job at the end of the day!"

  11. Anonymous Coward

    It's not all DevOps and Roses Yet

    As someone who's spent time working for one of those 20 or so companies (the lot I'd worked with were in the top 10 of that list), I can say this kind of configuration/testing is nigh-on impossible to do in labs. You can, and do, have multi-million pound labs for testing, but unfortunately you can't quite make a test lab anywhere near as big, or as complex, as a large chunk of the public internet.

    In theory, most things at this level are ok because normally they follow fairly rigorous change control procedures where changes are vetted by multiple R&S CCIEs and people who not only have the practical knowledge, but also the theory (very important, if you can't figure out what'll happen in your own mind, you'll make lots of mistakes).

    While at an Enterprise Level complex networking is a comparatively simple problem, at a Global Service Provider Level it's extremely complicated, and no vendor has yet got anywhere near the level of automation needed to prevent fat-finger mistakes.

    This one will continue to happen (very occasionally) for a long while to come I think. I would agree with others on here that have mentioned customers/CDNs should perhaps be filtering routes and doing a few other widely available tricks to be more in charge of their own destiny, however, it doesn't seem to be the case, and most get slammed when Tier 1 issues propagate downwards.

    1. Anonymous Coward

      Re: It's not all DevOps and Roses Yet

      Such changes are supposed to only be made under change control. Write down the commands that are to be implemented and why. Include tests to prove the change worked as expected. Include backout plan in case needed. Send to change panel for approval. Peers review, suggest updates, eventually approve. Follow implementation process, sorted.

      Even if wrongly typed, the post-change checks and tests would pick up on it (look at advertised routes in changed range before, and after. Are they as expected?)
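
      The before/after check on advertised routes could be sketched roughly like this. It assumes the advertised prefixes have already been collected as plain strings before and after the change; the function names and report keys are made up for illustration:

```python
def diff_advertised(before, after, expected_added=(), expected_removed=()):
    """Compare advertised prefixes before and after a change.

    Returns a report of anything that differs from what the change
    request said should happen.
    """
    before, after = set(before), set(after)
    expected_added = set(expected_added)
    expected_removed = set(expected_removed)

    added = after - before
    removed = before - after

    return {
        # Routes that appeared but were not part of the approved change.
        "unexpected_added": sorted(added - expected_added),
        # Routes that vanished but were not meant to.
        "unexpected_removed": sorted(removed - expected_removed),
        # Routes the change was supposed to add but did not.
        "missing_added": sorted(expected_added - added),
    }


def change_ok(report):
    """The change passes only if every list in the report is empty."""
    return not any(report.values())


if __name__ == "__main__":
    before = ["192.0.2.0/24", "198.51.100.0/24"]
    after = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24", "0.0.0.0/0"]
    report = diff_advertised(before, after, expected_added=["203.0.113.0/24"])
    print(report, "OK" if change_ok(report) else "BACK OUT")
```

      If the report is not empty, that is the cue to run the backout plan rather than go home.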

      So yes transit peers should filter, but Telia is a tier 1 and therefore ANY route from them (except those owned by the transit peer) can be trusted! That's the nature at tier 1, alternate routes to the same subnets are frequent and common!

  12. Andy The Hat Silver badge

    Misconfingered... again?

    is it just me or does this story sound familiar?

    http://www.theregister.co.uk/2014/05/19/eu_cable_outage/

  13. J.G.Harston Silver badge

    blah blah .co.hk no, that must be a typo, blah blah .co.uk there fixed it. Wait, what?

  14. Maty

    I remember a company director telling me that the scariest sound in the world is when the computer tech doing something vital to the system says 'oh' in a small, quiet voice.

    1. Jeffrey Nonken

      Scarier than your surgeon saying "Oops!"?

    2. Unicornpiss
      Meh

      That or...

      The comment: "Well, it's doing something..."

  15. Dave Bell

    There were no obvious direct effects for me, but I suspect a couple of services I use were hit by CDN issues over the weekend. CloudFlare is one of I don't know how many CDN networks.

    It ought to be clearing by now, but I am not sure it is.

    It would be nice to know just which CDN the service I am paying is using, but I can't really see this problem affecting just CloudFlare.

    What would the BOFH do?

    Ah, opening time, never mind.

  16. Commswonk

    I must remember...

    ...to use the expression Deprioritizing them until we are confident they've fixed their systemic issues. (with one or two very minor changes of word) in a domestic setting when DDO * does something to upset me. Could turn out to be a high risk strategy, of course.

    * Director, Domestic Operations, just in case you were wondering.

    1. Anonymous Coward

      Re: I must remember...

      *meaning that any route Cloudflare receives with any Telia AS numbers in the AS_PATH will have a few more of Cloudflare's own AS numbers prepended. A longer AS path reduces its likelihood of being accepted into the routing table ahead of other routes. But Telia will still be used as a last resort..
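
      As a rough illustration of that deprioritizing, here is a toy model where the best route is simply the one with the shortest AS path. Real BGP best-path selection compares other attributes (such as local preference) before path length, so treat this as a simplification. AS1299 is Telia; every other AS number is from the documentation range and purely hypothetical:

```python
TELIA_AS = 1299   # Telia's AS number
LOCAL_AS = 64500  # hypothetical local AS (documentation range)


def deprioritize(as_path, penalised_as, local_as, times=3):
    """Lengthen any path that crosses penalised_as by adding local_as in front."""
    if penalised_as in as_path:
        return [local_as] * times + list(as_path)
    return list(as_path)


def best_path(candidates):
    """Toy best-path rule: shortest AS path wins."""
    return min(candidates, key=len)


if __name__ == "__main__":
    via_telia = [TELIA_AS, 64501]
    via_other = [64502, 64503, 64501]

    # Without any penalty the Telia path is shorter and wins.
    print(best_path([via_telia, via_other]))

    # With the penalty applied, the other path wins...
    penalised = [deprioritize(p, TELIA_AS, LOCAL_AS) for p in (via_telia, via_other)]
    print(best_path(penalised))

    # ...but if Telia is the only path, it is still used as a last resort.
    print(best_path([deprioritize(via_telia, TELIA_AS, LOCAL_AS)]))
```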

  17. Bucky 2
    Happy

    Guilt Issues

    The first thing that happens to me when I read an article like this is a sudden wash of profound relief that it wasn't me.

    I suppose there's probably a complex named for it.

    1. Anonymous Coward

      Re: Guilt Issues

      Yes, it is https://en.wikipedia.org/wiki/Schadenfreude#In_popular_culture

  18. DanceMan

    Following the end of the Muppet Show, the Swedish Chef embarked on a new career..............

  19. Yes Me Silver badge
    Pint

    There's an RFC for that

    Published about an hour ago:

    RFC 7908

    Title: Problem Definition and Classification of BGP Route Leaks

    Author: K. Sriram, D. Montgomery, D. McPherson, E. Osterweil, B. Dickson

    Status: Informational

    Stream: IETF

    Date: June 2016

    URL: https://www.rfc-editor.org/info/rfc7908

    DOI: http://dx.doi.org/10.17487/RFC7908

  20. Anonymous Coward

    Ah the joys of Web based training.

    Just saying

  21. tim 13

    Never noticed

  22. simonorch

    Punching monkeys

    Would love to be a fly on the wall in the subsequent meetings at Telia.

    Given my experience the last year with another large Nordic ISP I'm not surprised.

  23. JohnG

    No change management at Telia?

    Do Telia not operate some kind of change management or is their change management system inadequate (e.g. checking the results of the change)?

    Do they really let individuals make it up as they go along, rather than stick to changes that have already been discussed?
