How four rotten packets broke CenturyLink's network for 37 hours, knackering 911 calls, VoIP, broadband

A handful of bad network packets triggered a massive chain reaction that crippled the entire network of US telco CenturyLink for roughly a day and a half. This is according to the FCC's official probe [PDF] into the December 2018 super-outage, during which CenturyLink's broadband internet and VoIP services fell over and stayed …

  1. Kernel Silver badge

    Not the first time this has happened

    I seem to recall that sometime last century a US telco (I think either Bell or AT&T) suffered a massive outage due to a very similar cause - in that case it was due to a corrupt SS7 signalling message being continuously propagated in the network.

    It stands out in my memory because the telco I was working for was just introducing SS7 at the time, and there was a flurry of patching in the NEAX61E exchanges we were using.

    1. Anonymous Coward

      Re: Not the first time this has happened

      Colt had an issue on their European network around ten years ago that resulted in a multiday outage, where a configuration/software bug resulted in traffic being sent across a reserved VLAN and bridged between devices.

      However, Colt's issue was with ADVA rather than NEC equipment.

  2. Will Godfrey Silver badge
    Facepalm

    Witchhunt time?

    They'll have to find the {cough} single rogue engineer {cough} and do nasty things to him/her, then they can all forget it.

  3. swm Bronze badge

    As someone at Xerox said (long before TCP/IP), "Networks propagate badness."

    1. Jellied Eel Silver badge

      LAN, meet WAN, with minimal supervision

      And Ethernet makes it easy!

      So take a protocol intended to run on a LAN, where damage could be limited to the number of devices drilled & tapped onto a chunky coax. Then came bridges, switches and routers, and of course broadcasts.. Which were (and still are) pretty much a caveat emptor thing. Broadcast storms can and will happen, and can now go global!*

      But that's ubiquity for you. Ethernet's everywhere, even if it's not very good, and broadcasts are a cheap way to flood messages to every Ethernet device on the network that may or may not care (hello, Microsoft!).

      *Way back, we developed one of the first Ethernet WAN services. And sales being sales, and mostly used to Internet stuff wanted a 95th percentile billing model. Which got some pushback, but sales being sales didn't understand the implications of broadcast storms generating lots of data, and disputes would be a PITA to show the traffic was due to customer misconfiguration, and thus billable. And sales would have to explain the invoice, and suffer any clawbacks of their commission. Luckily we convinced them to drop it after working out just how much it'd cost to try and bill this feature, and it'd be coming out of sales/marketing's budget.
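For anyone who hasn't met 95th percentile billing, here is a minimal sketch of why a broadcast storm matters to the invoice. It assumes the common scheme of sorting the usage samples, discarding the top 5% and billing on the highest sample left; the function name and the sample values are hypothetical, and real billing systems differ in the details.

```python
def percentile_95(samples):
    """Sort the samples, ignore the top 5%, and return the highest one left."""
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n) - 1, to avoid floating-point surprises.
    index = (95 * len(ordered) + 99) // 100 - 1
    return ordered[index]


# Hypothetical month of 100 samples, in Mbit/s.
short_storm = [10] * 95 + [1000] * 5   # storm fits inside the discarded 5%
long_storm = [10] * 94 + [1000] * 6    # storm spills past the discarded 5%

print(percentile_95(short_storm))
print(percentile_95(long_storm))
```

A storm that stays within the discarded 5% is free; one sample more and the whole bill is set by the storm, which is where the disputes would start.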

      1. Anonymous Coward

        Re: LAN, meet WAN, with minimal supervision

        And if you bridge any physical medium across multiple segments, you have the same issue.

        Maybe the physical medium isn't the issue?

        1. Jellied Eel Silver badge

          Re: LAN, meet WAN, with minimal supervision

          Depends what you mean by 'physical'. Bridging is a huge issue for Ethernet because of broadcasts, and because it was never designed to work beyond a traditional LAN. But it's become ubiquitous, and it's a bit late/hard to change the specs so broadcasts are less of a problem.

      2. tip pc Bronze badge

        Re: LAN, meet WAN, with minimal supervision

        We use routers to connect LANs together, keeping broadcasts within the broadcast domain and routing legit traffic through a gateway.

  4. Simon 49

    Control plane policing?

    This isn't a new thing. The mind boggles at how they could lack both completely OOB access and enough reserved CPU cycles to service admin access.

    1. Jove Bronze badge

      Re: Control plane policing?

      On the contrary; it is quite understandable if you have ever had to work with such businesses Stateside.

  5. elDog Silver badge

    Significant fines and jail time against the executives and massive shareholder hits?

    An answer to: What can be done to stop industry from ignoring normal safety practices?

    1. Stoneshop Silver badge
      Holmes

      Re: Significant fines and jail time against the executives and massive shareholder hits?

      With the current FCC regime that's just Pai in the sky.

  6. Anonymous Coward

    lastcenturylink

  7. Alister Silver badge
    Facepalm

    It took them three fucking days to kill off a packet flood, because some dickhead decided OOB management was too expensive.

    1. Kernel Silver badge

      "It took them three fucking days to kill off a packet flood, because some dickhead decided OOB management was too expensive."

      Infinera are a manufacturer of DWDM equipment, so it seems reasonable to assume that the inter-node comms are on the ITU standard optical supervisory channel (OSC) - which is a completely separate wavelength to those that are carrying services and is, in fact, sufficiently separate that the OSC is actually outside the optical amplifier passband. It is terminated at the end of each fibre span to provide access to the local node and a new OSC created for the next span. The node controllers have no access to the payload wavelengths as the only place they appear in an electrical form is in the payload mappers of the Optical Transponder cards at the terminal nodes where a client signal enters and leaves the network - everywhere else the payload is in optical form and therefore an analogue signal.

      You don't get more OOB than that.

      Might I suggest you break out your favourite search engine and do some research on Optical Transport Networks (OTN) before you continue to call some quite clever people "dickhead"?

      1. Fred Goldstein
        FAIL

        But the problem occurred because there was no working supervisory channel. Americans don't follow the ITU. It's quite possible that Infinera or CenturyLink (an old line telco not known for smarts, just a knack for buying shit up) thought that inband management would be sufficient, since everybody knows that DWDM networks have lots of bandwidth.

        1. TomS_

          Regardless of ITU, America, or whatever/whoever, all optical networks of this scale, no matter who makes the gear used on them, have OSC between nodes. If they don't, the nodes get isolated and can no longer be managed unless they have an entirely separate out-of-band management connection.

          As already stated, the OSC is entirely separate from all other traffic-carrying channels, running on its own dedicated wavelength and bypassing any active equipment such as amplifiers. They also tend to run at a much slower speed, 100-155 Mbit/s, for a number of reasons, including: greater unamplified reach, they don't need to carry that much traffic, and the processors that manage the nodes aren't very powerful anyway.

          Some of these networks cross areas of country where there may be no other means of connectivity around, e.g. across the Nullarbor in Australia. You aren't just going to pick up a DSL or other kind of connection to provide connectivity to your optical site - heck, a lot of them are even solar powered because you can't get a connection to the electrical grid. Therefore, OSC runs "in-band" on the same optical fibre used to carry the rest of the traffic, because that's the only way you'll get a management network out to these very remote sites.

      2. Alister Silver badge

        Might I suggest you break out your favourite search engine and do some research on Optical Transport Networks (OTN) before you continue to call some quite clever people "dickhead"?

        Ahem:

        Infinera provides its customers – including CenturyLink in this case – with the proprietary management channel enabled by default. CenturyLink was aware of the channel but neither configured nor used it

  8. G Mac
    Angel

    ...they were generated by a switching module in a node ... for reasons still yet unknown...

    A broadcast packet greater than 64 bytes with a valid header and checksum and no TTL magically appears ex nihilo from the quantum soup in the transmit buffer of a switching module?

    Kind of puts the Infinite Improbability Drive to shame.

    Or it could be a bug... nah.

    1. DougS Silver badge

      Re: ...they were generated by a switching module in a node ... for reasons still yet unknown...

      This was my thought. This was obviously some type of DoS attack, the only question is whether it was somehow introduced externally, or if it was done by an insider.

      1. Doctor Syntax Silver badge

        Re: ...they were generated by a switching module in a node ... for reasons still yet unknown...

        Obviously a Huawei box must have been involved. No good ole American kit would ever do such a thing.

    2. Kernel Silver badge

      Re: ...they were generated by a switching module in a node ... for reasons still yet unknown...

      "A broadcast packet greater than 64 bytes with a valid header and checksum and no TTL magically appears ex nihilo from the quantum soup in the transmit buffer of a switching module?"

      I've never worked with kit from Infinera, but if it's anything like the kit I am familiar with, the "switching module" is likely to have been an optical wavelength router with its own on-board CPU running a carrier-grade Linux - more than capable of generating a defective packet or four without any magic other than a program glitch.

  9. Mark 85 Silver badge

    This has a bit of "intentional" to it. Do systems sometimes generate packets? Did an engineer decide to do some "testing"? Or was it a test of an attack method? I'm shocked though that there are no fines involved or scapegoats called out.

  10. jake Silver badge

    There is a reason ...

    ... that packets are SUPPOSED to evaporate when the TTL field counts down to zero. It's kind of an important part of packet switching. Did nobody bother to tell the engineers involved, or did they skip that day of school?

    1. Kernel Silver badge

      Re: There is a reason ...

      Assuming the diagram and article are correct, the packets would be carried on the inter-node supervisory wavelength - which is terminated at the end of each fibre span, processed by the node's controller card, and then a new supervisory signal generated for transmission on the next span.

      The TTL would be set anew each time they left a node as it is a new packet being sent, not the received packet being merely repeated in the way a router might do.
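To make that distinction concrete, here is a minimal sketch using a toy model, not anything Infinera actually ships. `forward_like_router` passes on the received packet with its TTL reduced, while `regenerate_like_node` treats the packet as delivered and builds a fresh one. The names, the packet layout and the `INITIAL_TTL` value are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

INITIAL_TTL = 4  # hypothetical starting value, chosen only for illustration


@dataclass(frozen=True)
class Packet:
    payload: str
    ttl: int


def forward_like_router(pkt: Packet) -> Optional[Packet]:
    """Router-style forwarding: same packet, TTL reduced by one, dropped at zero."""
    remaining = pkt.ttl - 1
    if remaining <= 0:
        return None
    return Packet(pkt.payload, remaining)


def regenerate_like_node(pkt: Packet) -> Optional[Packet]:
    """Span-terminating node: the packet has reached its destination, so a
    brand-new packet with a fresh TTL is generated for the next span."""
    return Packet(pkt.payload, INITIAL_TTL)


def hops_survived(step: Callable[[Packet], Optional[Packet]], max_hops: int = 100) -> int:
    """Count how many hops a packet survives under the given per-hop rule."""
    pkt: Optional[Packet] = Packet("bad", INITIAL_TTL)
    hops = 0
    while pkt is not None and hops < max_hops:
        pkt = step(pkt)
        if pkt is not None:
            hops += 1
    return hops


print(hops_survived(forward_like_router))   # dies after a few hops
print(hops_survived(regenerate_like_node))  # only stops at the max_hops cap
```

In the second case the TTL never gets a chance to count down, so it offers no protection however it is set.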

  11. nojobhopes

    Going Postal?

    Reminiscent of parts of Terry Pratchett's 'Going Postal', where a Clacks message is kept 'moving in the Overhead' using the tags G, N, and U.

    1. jake Silver badge

      Re: Going Postal?

      More to the point, Jon Postel was probably spinning fast enough to power MAE-West as this was going down. (See 1980's RFC 760 if you don't know why this is relevant ...).

      1. Kernel Silver badge

        Re: Going Postal?

        Relevant, but deceptive - from RFC760:

        "The Time to Live is an indication of the lifetime of an internet datagram. It is set by the sender of the datagram and reduced at the points along the route where it is processed. If the time to live reaches zero before the internet datagram reaches its destination, the internet datagram is destroyed."

        But in this case the datagrams were reaching their destination - which is the node at the other end of the fibre span for an optical supervisory channel.

      2. Warm Braw Silver badge

        Re: Going Postal?

        Actually, there was just such a problem with the then ARPANET on 27th October 1980.

        The problem resulted from a node with a memory fault causing some bits to be corrupted. The routing vectors contained a circular sequence number space controlling their lifetime, but the bit corruption resulted in packets with lifetimes A, B and C such that A < B < C < A in the circular space. Routing information was therefore constantly being replaced and retransmitted, because each sequence number looked like a valid replacement for the one before. The network was recovered by turning all the routers off and back on again - not something that would be easily possible today.

        This is why sequence numbers in the OSI IS-IS protocol have a "lollipop-shaped" number space.
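The A < B < C < A situation is easy to reproduce in a toy model. The sketch below assumes a circular space of 64 values and the usual "ahead by less than half the circle" rule for deciding which number is newer; the size, the rule and the three values are chosen for illustration and are not taken from the ARPANET code.

```python
SPACE = 64  # hypothetical size of the circular sequence space


def is_newer(a: int, b: int) -> bool:
    """True if a counts as newer than b: a is ahead of b by less than half the circle."""
    ahead = (a - b) % SPACE
    return 0 < ahead < SPACE // 2


def replacements(updates, rounds: int) -> int:
    """Offer the same updates to one node over and over and count how often
    the stored update gets replaced by a 'newer' one."""
    stored = updates[0]
    count = 0
    for _ in range(rounds):
        for seq in updates:
            if is_newer(seq, stored):
                stored = seq
                count += 1
    return count


A, B, C = 0, 21, 42  # each is "newer" than the one before, including A over C

print(is_newer(B, A), is_newer(C, B), is_newer(A, C))
print(replacements([0, 1, 2], rounds=10))   # settles after the first pass
print(replacements([A, B, C], rounds=10))   # never settles
```

With well-behaved numbers the node settles on the newest update and ignores the rest; with the three-way cycle every update it sees replaces the stored one, so nothing ever ages out.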

        1. Doctor Syntax Silver badge

          Re: Going Postal?

          "The network was recovered by turning all the routers off and back on again - not something that would be easily possible today."

          My immediate reaction to the story was that turning it all off and on again is just what should have been done. Assuming, of course, that turning it all off doesn't preserve state.

        2. Fred Goldstein

          Re: Going Postal?

          No, the lollipop-shaped number space was a mistake. As was putting the MAC address into the NSAPA. Both ideas came from DECnet Phase V. Radia Perlman, a brilliant woman, came up with the lollipop-shaped space, but later told me (I was at DEC) that it was wrong, and didn't really fix anything. But sometimes a mistake is taken to be gospel, and even its inventors can't correct it. TCP/IP is full of them. And they're obvious if you don't (incorrectly) assume that their creators were smarter than you. Radia was smarter and she admitted a mistake. Lesser minds pretend to be infallible. And sometimes get elected.

          1. Warm Braw Silver badge

            Re: Going Postal?

            > the lollipop-shaped number space was a mistake

            It's amazing now the contortions that were once employed to try to work around limitations on memory and the presumed unavailability of stable storage and an accurate time source...

    2. Ochib

      Re: Going Postal?

      No, it was a group called "The Smoking GNU". They created a packet of code called the Woodpecker which would have destroyed many of the chain of towers, but Moist von Lipwig persuaded them to send a different message: a "neutron bomb" which would destroy the corrupt company but leave the towers standing.

  12. DryBones

    So, step 1 would be to reject out of hand any packets with a TTL above a certain value, and slap a default on any without one?

    1. John G Imrie

      Err no

      If I'm reading this correctly then each packet was sent only one hop. The device it was sent to processed the packet and created a new packet for everything that was connected to it, including the originating device.

      TTL = 1

      1. TechnicalBen Silver badge

        Re: Err no

        So a fork bomb?

        Accidental or nefarious. Both happen at times.

      2. You what now!

        Re: Err no

        If this were the case surely even a valid broadcast packet would by definition create a packet storm since it would be bounced back to the originating node and out again for ever?

      3. diodesign (Written by Reg staff) Silver badge

        "each packet was sent only one hop"

        According to the FCC, the equivalent TTL was infinity. These packets are not standard TCP packets, I believe, they are proprietary Infinera packets, at least over the management channel, anyway.

        Each time one of the bad packets hit a node, the node spammed *all* neighboring nodes with the same packets due to the broadcast address. That's the main problem, not the TTL, IMHO.

        C.
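A rough way to see why the broadcast address is the real culprit: the sketch below assumes every node that receives a packet sends one copy to each of its neighbours, including the one it came from, with nothing ever expiring. The topologies and the function name are made up for illustration and say nothing about CenturyLink's actual network layout.

```python
def packets_per_tick(neighbours, start, ticks):
    """Return the number of packets in flight after each tick.

    neighbours maps each node to the list of nodes it is connected to.
    One packet starts at `start`; every packet that arrives at a node is
    re-sent to all of that node's neighbours on the next tick.
    """
    in_flight = {start: 1}  # node -> packets that node will send this tick
    history = []
    for _ in range(ticks):
        arriving = {}
        for node, count in in_flight.items():
            for peer in neighbours[node]:
                arriving[peer] = arriving.get(peer, 0) + count
        in_flight = arriving
        history.append(sum(in_flight.values()))
    return history


pair = {"a": ["b"], "b": ["a"]}
triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}

print(packets_per_tick(pair, "a", 4))      # bounces forever, but never grows
print(packets_per_tick(triangle, "a", 4))  # doubles every tick
```

Two nodes on a single link just play ping-pong. As soon as nodes have more than one neighbour, the count multiplies on every hop.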

        1. Doctor Syntax Silver badge

          Re: "each packet was sent only one hop"

          Infinera: maybe the clue's in the name.

        2. Jellied Eel Silver badge

          Re: "each packet was sent only one hop"

          According to the FCC, the equivalent TTL was infinity. These packets are not standard TCP packets, I believe, they are proprietary Infinera packets, at least over the management channel, anyway.

          Looks like they're 'standard' Ethernet frames, ie-

          Currently, CenturyLink is in the process of updating its nodes’ ethernet policer to reduce the chance of the transmission of a malformed packet in the future. The improved ethernet policer quickly identifies and terminates invalid packets, preventing propagation into the network.

          Shame the FCC report doesn't include the 'malformed' frame, because presumably it was 'valid' as far as being a correctly structured Ethernet frame. There's no TTL in Ethernet, which is why broadcast storms happen. I'm guessing having a length of 64 bytes, it might have been a zero length frame, ie all header and no payload, so nodes <shrug> and pass them on.

          Each time one of the bad packets hit a node, the node spammed *all* neighboring nodes with the same packets due to the broadcast address. That's the main problem, not the TTL, IMHO.

          Yup. Ethernet working as intended. Just not the way Infinera/CenturyLink intended. Issue also seems to be-

          As the supplier of these nodes, Infinera provides its customers – including CenturyLink in this case – with the proprietary management channel enabled by default. CenturyLink was aware of the channel but neither configured nor used it

          And then why resource depletion on the switching card led to LOS on customer circuits. Fix for that would I guess be to set process quotas and rate limit broadcast frames, so standard methods to reduce the impact of broadcast storms.
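One standard way to rate limit broadcast frames is a token bucket. This is a minimal sketch of that general technique only; it is not a description of CenturyLink's "ethernet policer", and the class name, rate and burst figures are hypothetical.

```python
class BroadcastPolicer:
    """Token bucket: each forwarded broadcast frame costs one token, and
    tokens refill at a fixed rate up to a maximum burst size."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        """Return True if a broadcast frame arriving at time `now` may be forwarded."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


policer = BroadcastPolicer(rate_per_sec=10, burst=5)
storm = [policer.allow(now=0.0) for _ in range(8)]
print(storm)                   # the burst gets through, the rest are dropped
print(policer.allow(now=1.0))  # tokens have refilled a second later
```

The storm still exists, but each node only passes on a bounded trickle of it, which leaves CPU for management access.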

          Also OOB access can be a real PITA for modern networks. So in the good ol' days, a modem hooked up to a craft/console port.. But that assumes you have a working phone line, and this outage affected phone calls because they were carried across CenturyLink's network. And same happens if you try using leased lines/xDSL if that supplier is also your wholesale customer. In an ideal world, have a good ol' fashioned order wire, or run dual networks so if one goes down, you should be able to access via the other. Biggest challenge is your OOB may not work anyway if the switching card's running flat out & out of resources.

          (anyone who's ever done 'debug all' on a big-ish, busy Cisco knows what happens then...)

  13. Nano nano

    Logic ...

    "Have I seen this packet before? If so, discard,"

    "Does this node allow no-expiry packets ? If not, discard "
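The first of those two rules can be sketched as a bounded "seen before" cache. This is only an illustration of the idea, with a hypothetical class name and capacity; nothing in the FCC report says the nodes work this way.

```python
import hashlib
from collections import OrderedDict


class SeenFilter:
    """Remember a digest of recently seen frames and refuse to forward repeats."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self._seen = OrderedDict()  # digest -> None, oldest entry first

    def should_forward(self, frame: bytes) -> bool:
        digest = hashlib.sha256(frame).digest()
        if digest in self._seen:
            self._seen.move_to_end(digest)  # keep frequently seen frames cached
            return False
        self._seen[digest] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict the oldest digest
        return True


f = SeenFilter(capacity=2)
print(f.should_forward(b"bad packet"))  # first sighting
print(f.should_forward(b"bad packet"))  # repeat
```

The catch is that plenty of legitimate broadcasts are byte-for-byte identical each time they are sent, so a real filter would need an expiry time or a sequence number in the frame, and the cache has to be big enough to outlast the loop.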

  14. The Original Steve

    How to fix?

    I'm what my MD has called an "expert generalist" in that I've spent the best part of 2 decades doing ethernet networking, storage (FC and iSCSI SAN's) and compute (VMWare and Hyper-V) as well as applications (SfB, Exchange, SharePoint etc.)

    Whilst I'm familiar with network loops and have sufficient networking knowledge around TCP/IP for things like QoS/DSCP, STP, VSRP etc, I'm curious as to what the engineers in this case needed to do once they "fixed" the offending kit that sent out the bad packets. Article says that the packets already generated were still bouncing around and continuing to broadcast around the nodes and it took them a further 3 hours to get onto these nodes to remove the bad packets already generated.

    My question is how..? A case of just rebooting the kit as the packets and sessions shouldn't be persistent? Or is there some dark art I'm unfamiliar with where you can go onto a node (I'm reading a node as a switch / router. Afraid fiber isn't one of my skills) and almost select packets based on a filter and then remove them?

    Just curious.

    1. Jellied Eel Silver badge

      Re: How to fix?

      My question is how..? A case of just rebooting the kit as the packets and sessions shouldn't be persistent? Or is there some dark art I'm unfamiliar with where you can go onto a node (I'm reading a node as a switch / router. Afraid fiber isn't one of my skills) and almost select packets based on a filter and then remove them?

      Yes and no.. But usually once a broadcast storm's started, and amping up, then jumping on the device to add filters is too late, because the device is busily broadcasting. The dark art* often becomes a seek & destroy exercise to figure out where the broadcast traffic originates, and isolating that. Once a storm's started though, it can be like playing whack-a-mole given broadcast frames already in flight, or buffered/queued.

      *That art often being waving the rubber chicken and rebooting the device. Then hoping you can reconnect to that device. My first experience of this was on a large Cascade network, where much the same thing happened.. Except the switch control cards also decided they'd faulted and switched out/to backup cards. So then playing hunt the master switch & trying to get control back of the network. That day, I learned why it's a really good idea to have OOB access to all potential masters.

      1. Flightmode

        Re: How to fix?

        Part of the problem with this particular situation is that these broadcasts were regenerated at each hop, so it wasn't simply a case of rebooting the devices one by one. All the devices not currently being rebooted would still be busy happily regenerating the broadcasts and sending them to all their neighbours - including those devices that had already been rebooted...

        The only solution here - if the CLI or other management tools aren't able to access devices to add filters on the fly (if it's even possible on this equipment type) - would be to take down ALL THE DEVICES that take part in the broadcast mayhem AT THE SAME TIME to ensure that the bad packets are gone.
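The difference between the two approaches shows up even in a four-node toy ring. The sketch below assumes a reboot wipes a node clean and that any node still holding bad packets re-floods its neighbours on the next tick; the topology and function names are invented for illustration.

```python
def spread(neighbours, infected):
    """One tick: every node holding bad packets re-sends them to all its neighbours."""
    result = set(infected)
    for node in infected:
        result.update(neighbours[node])
    return result


def rolling_reboot(neighbours):
    """Reboot the nodes one at a time while the rest keep broadcasting."""
    infected = set(neighbours)       # the storm is already everywhere
    for node in neighbours:
        infected.discard(node)       # the reboot clears this node...
        infected = spread(neighbours, infected)  # ...and its neighbours refill it
    return infected


def simultaneous_reboot(neighbours):
    """Take every node down at the same moment, then let the network run."""
    infected = set()                 # nothing survives to be re-sent
    return spread(neighbours, infected)


ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(rolling_reboot(ring))       # every node is carrying bad packets again
print(simultaneous_reboot(ring))  # empty: the storm is gone
```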

        1. Jellied Eel Silver badge

          Re: How to fix?

          The only solution here - if the CLI or other management tools aren't able to access devices to add filters on the fly (if it's even possible on this equipment type) - would be to take down ALL THE DEVICES that take part in the broadcast mayhem AT THE SAME TIME to ensure that the bad packets are gone.

          Yup.. Which even if one can convince management, can be easier said than done. So lots of nodes, many of which would probably be in transmission huts & unmanned sites. So just use your remote power management system to do an emergency shutdown.. But that assumes you've got working OOB access to the power systems, and enough field engineers to deal with nodes that don't power back up. Which should be possible on a good network, but then there's pressure to cut costs..

          But it's an inherent risk with using Ethernet for anything, especially when it's deployed on systems that would naturally be physical or virtualised rings. So it's going to loop, and it will result in broadcast storms, unless you have a way to mitigate those built-in, or contingency plans for when they happen.

  15. jj_0
    Devil

    "...high-speed optic fiber..."

    I wonder how fast the fibre was travelling.

  16. Anonymous Coward

    NetBEUI, is that you?

    Someone may have been plugging in a Win95 box with NetBEUI enabled, there :)

    Seriously, am I the only one to think *any* packet with no TTL should be promptly discarded ?

    1. Jellied Eel Silver badge

      Re: NetBEUI, is that you?

      Seriously, am I the only one to think *any* packet with no TTL should be promptly discarded ?

      Ethernet doesn't include TTL, so although that may be wishful thinking to keep those frames safely on a LAN, implementing this would cause a.. certain amount of network disruption :)

  17. tony72

    Oops

    I guess this was the router equivalent of a "reply all" email storm.

  18. Claptrap314 Silver badge

    Parsing passing?

    What I don't get is why in the world one would EVER forward a malformed packet to one's peers? The only reason to forward a packet is if you believe that the other end might be able to process it. If the packet is malformed, it gets dropped--because your peers are also going to see it as malformed.

    It's not like their parser is supposed to be different.

    1. Jellied Eel Silver badge

      Re: Parsing passing?

      What I don't get is why in the world one would EVER forward a malformed packet to one's peers?

      It's a broadcast frame, so it's what you're meant to do. Ethernet's 'fast' because it's simple/dumb, so problem would be figuring out if it's 'malformed'. Valid header & FCS, pass it on and if the malformation's in the payload, that gets into DPI-type inspection with more delay & computational expense.
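A rough sketch of what "valid header & FCS" amounts to, using a simplified frame layout (destination, source, EtherType, payload, 4-byte FCS). It assumes the FCS is the standard CRC-32 as computed by `zlib.crc32`, stored little-endian; that matches my understanding of Ethernet but should be checked against the spec. The helper names and the sample payload are hypothetical.

```python
import struct
import zlib

BROADCAST = b"\xff" * 6
MIN_FRAME = 64  # minimum Ethernet frame size in bytes, including the FCS


def build_frame(dst: bytes, src: bytes, ethertype: int, payload: bytes) -> bytes:
    """Assemble header and payload, pad to the minimum size, append the FCS."""
    body = dst + src + struct.pack("!H", ethertype) + payload
    body = body.ljust(MIN_FRAME - 4, b"\x00")
    return body + struct.pack("<I", zlib.crc32(body))


def looks_valid(frame: bytes) -> bool:
    """The cheap check: long enough, and the FCS matches the rest of the frame."""
    if len(frame) < MIN_FRAME:
        return False
    body, fcs = frame[:-4], frame[-4:]
    return struct.pack("<I", zlib.crc32(body)) == fcs


# 0x88B5 is an EtherType set aside for local experimental use.
frame = build_frame(BROADCAST, b"\x02\x00\x00\x00\x00\x01", 0x88B5,
                    b"nonsense no receiver can parse")
print(len(frame), looks_valid(frame))
```

The check passes whatever rubbish is in the payload, which is the point: telling that the contents are nonsense means parsing them, and that is the slower, costlier step.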

  19. gal5

    Why would a node agree to forward a broadcast packet? Are there legitimate internet (vs intrAnet) broadcast packets?

  20. fredesmite Bronze badge
    Mushroom

    I'll try this at work on Monday

    Using my packet injection program.

    Should be a hoot.


Biting the hand that feeds IT © 1998–2019