Google cloud outage caused by failure that saw admins run it manually ... and fail

A mistaken peering advertisement from a European network took Google Cloud's europe-west1 region offline last week for around 70 minutes. The slip-up happened when an unnamed network owner connected a new peering link to Google, and in the process, it advertised reachability for far more traffic than it could handle. As a …

  1. Anonymous Coward

    Oops!

    To err is human, but to really cock it up takes both a human and a machine?

    1. Trevor_Pott Gold badge

      Re: Oops!

      To err is human. To really cock up requires a committee.

      See: the design of pretty much every internet protocol since the beforetime. Anything that relies on trust is automatically a failure. Too bad technocrats never seem to understand that.

      1. Doctor Syntax Silver badge

        Re: Oops!

        "See: the design of pretty much every internet protocol since the beforetime."

        And they used to say "the internet routes round damage".

        1. Sam Liddicott

          Re: Oops!

          but not instantaneously. The damage must be observed first

        2. Trevor_Pott Gold badge

          Re: Oops!

          > And they used to say "the internet routes round damage".

          It has. Unfortunately the world's governments have been working hard to ensure that any overlay networks that ensure privacy and security not only are eventually compromised, but are illegal or outright blocked.

          Trust is fundamental to the current internet. Unfortunately, trust is damage. Thus the fundamentals of the internet must be replaced, or overlaid with network and protocol designs that don't require trust. The result, however, won't be anything like the internet of yore. The internet can't route around itself without becoming something altogether different.

          And, quite frankly, that's a good thing. The internet should be a bastion of free speech and anonymity. A place where people can communicate without fear of surveillance. Only then can new ideas truly be explored and - ultimately - flourish.

          Until then, the Internet is merely a means to give everything you are and have over to those who have proven repeatedly they will use all of it against you.

          We really need an "International Industrial Espionage Day" where we educate people about how the internet is where governments laugh and play and surveil the innocent. Usually for economic benefit.

      2. SImon Hobson Bronze badge

        Re: Oops!

        > Anything that relies on trust is automatically a failure.

        There are historical reasons for most of the protocols, most of which originate from "better times" when trust was mostly trustworthy. The problem is that most of the answers begin with "well to get to there, I wouldn't start from here" - the problem being that to "fix" the problem usually means "breaking stuff".

        I'll leave out my rant about SPF and what it breaks, but for BGP there really is no answer other than a global update to a new standard protocol. But that means updating every single one of millions of routers globally - and in the meantime running a "dual stack" routing table. And as long as there is just one "old" router there, then the dual stack has to remain, and even stuff running the new (secure) protocol has to trust (either directly or indirectly) the old BGP advertisements.

        But I'm not sure whether it's completely fixable without causing the patient more illness. Think about what the problem is: each router maintains a list of routes it knows about, and works out for each destination which "next hop" it needs to send traffic via. To know this, it must trust information received from its neighbours, which must include information received from their neighbours, and the neighbours of those neighbours, and so on. So Bob tells Alice that he has a route to Frank. Bob knows this because Charlie told Bob that Charlie has a route. Dave told Charlie that he has a route, Edward told Dave, Frank told Edward, and so on. So Alice needs to trust not just Bob, but that Bob is honest when he says that he trusts Charlie, and Charlie is honest when he says he trusts Dave, and so on down the line.
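The chain of hearsay described above can be made concrete with a toy sketch. It is not real BGP: the `propagate` function, the `CHAIN` list and the dictionary layout are all made up for illustration, and the only thing it models is that each hop stores whatever path the previous hop hands it.

```python
# Toy path-vector propagation. Names follow the comment above; nothing here is real BGP.
CHAIN = ["Frank", "Edward", "Dave", "Charlie", "Bob", "Alice"]


def propagate(claim, chain):
    """Pass an advertisement hop by hop towards the end of the chain.

    Each router stores the path it was given, then prepends itself before
    passing the advertisement on. No router can check the path it receives.
    """
    tables = {}
    path = [claim["claimed_by"]]
    start = chain.index(claim["claimed_by"])
    for router in chain[start + 1:]:
        tables[router] = {"destination": claim["destination"], "path": list(path)}
        path = [router] + path
    return tables


# Frank really does own the destination and says so.
honest = propagate({"claimed_by": "Frank", "destination": "Frank's network"}, CHAIN)

# Edward lies and claims the destination is his own.
forged = propagate({"claimed_by": "Edward", "destination": "Frank's network"}, CHAIN)

print(honest["Alice"]["path"])
print(forged["Alice"]["path"])
```

In this toy model the forged advertisement reaches Alice with a shorter path than the honest one, and nothing in what Alice receives distinguishes the two.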

        I suppose it might come down to some sort of PKI setup where Zebedee "signs" his route. But what if Edward is dishonest? What if he says he's checked the keys for the route he got from Frank? He can still sign the list of routes he gives to Dave, who duly checks and then signs his list before sending it to Charlie.

        > Too bad technocrats never seem to understand that.

        Too bad commentards don't understand history ;-)

        As an aside, these articles are worth a read:

        https://www.lightbluetouchpaper.org/2015/10/02/badness-in-the-ripe-database/

        https://www.lightbluetouchpaper.org/2015/11/02/ongoing-badness-in-the-ripe-database/

        It's a slightly different problem, but does show what a difficult task it is when there is no choice but to trust 3rd, 4th, 5th parties.

        1. Trevor_Pott Gold badge

          Re: Oops!

          I understand my history just fine, thanks. I even understand the issues in transitioning from the old to the new.

          Problem is, even new protocols being developed today rely on trust. The internet still relies utterly on people to behave with honour. People don't. Governments especially don't.

          It is a pain to transition to a new architecture. It will take decades and billions of dollars. Tough. It needs done. Best to start down the path and get it over with.

          Unfortunately, we're in the process instead of having the technocrats try to transition us to shit like IPv6. This doesn't benefit the individual in any way, but instead makes them even more vulnerable, traceable and exposed. Yes, I understand IPv6 is from the beforetime when chowderheads still believed in trust. But any attempts to actually solve the problems in IPv6 such that individual privacy is made paramount (or start a post-IPv6 transition that will move us to such a protocol) are simply shouted down.

          The technocrats are obsessed with making life easy for developers. (See: end-to-end model obsession, amongst many other things.) Anything that requires a poor developer to load a few extra libraries and understand a little bit about networking when designing an application is apparently such a cosmic problem that everyone else should be rendered traceable all the time.

          And IPv6 is just one example.

          The BGP issue can be solved by making two routing tables on the net. One secure and one insecure. BGP, of course, being insecure. Systems advertising along secure channels would have a multi-point reputation system. Some central registrar (preferably multiple, in different jurisdictions) would ensure that A) yes, the organization in question has the right to post routes and has agreed to play nice and B) owns X routes, and can advertise them as they wish.

          If an organization tries to advertise routes on the secure channel that belong to someone else (which should be fairly easily traceable with the reputation system above) then those routes aren't accepted, and the reputation of the sender is demoted.

          If there ever appear to be two "legitimate" owners of a route - which shouldn't happen, but does from time to time due to administrative screw-ups - then the provider with the highest reputation wins, until the issue is resolved.

          BGP routes would then be considered as the lowest reputation routes. They will be accepted, but only if they are not overridden by a more reputable and verified source using the secure channel.

          Oh look, we now have a transition mechanism. That was hard.

          Yes, the reputation managers in this system would have many of the same flaws as certification authorities. This can be partially mitigated by having multiple reputation managers in multiple jurisdictions, making it hard (though admittedly not impossible) to compromise all of them.

          We could also look at some sort of distributed reputation system (blockchain-based? It's all the rage!) that supplements the "canonical" reputation systems, but is based more on "number of times an advertiser has caused route problems".

          Essentially the transition mechanism could be handled as something along the lines of a more advanced SPAM blacklist/greylist system, incorporating lessons learned from those attempts and giving ultimate priority to those advertisers who have done the leg work to get properly verified and whose ownership of a route can be confirmed through multiple sources.

          Clearly, however, this is completely unworkable and impossible. Because reasons.

          Trust is anathema to privacy and to security. Relying on it for anything is ridiculous.
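A rough sketch of the selection rule proposed in the comment above, under loud assumptions: the `Advert` record, the `registry` of verified owners, the numeric reputation scores and the AS names are all hypothetical, and the sketch only models who may originate a prefix, not who may carry traffic towards it.

```python
from dataclasses import dataclass

LEGACY_BGP_REPUTATION = 0  # assumption: plain BGP routes always rank lowest


@dataclass(frozen=True)
class Advert:
    prefix: str
    advertiser: str
    secure: bool      # True if it arrived over the hypothetical secure channel
    reputation: int   # hypothetical score from the reputation managers


def effective_reputation(ad):
    """Secure adverts keep their score; legacy BGP adverts are pinned to the floor."""
    return ad.reputation if ad.secure else LEGACY_BGP_REPUTATION


def select_route(adverts, registry):
    """Pick one advert for a prefix.

    registry maps prefix -> set of verified owners. A secure-channel advert
    for a registered prefix from a non-owner is rejected outright.
    Returns (winner_or_None, rejected_adverts).
    """
    candidates, rejected = [], []
    for ad in adverts:
        owners = registry.get(ad.prefix, set())
        if ad.secure and owners and ad.advertiser not in owners:
            rejected.append(ad)  # the sender's reputation would be demoted here
            continue
        candidates.append(ad)
    winner = max(candidates, key=effective_reputation) if candidates else None
    return winner, rejected


registry = {"203.0.113.0/24": {"CharlieNet"}}
adverts = [
    Advert("203.0.113.0/24", "MalloryNet", secure=True, reputation=95),
    Advert("203.0.113.0/24", "LegacyNet", secure=False, reputation=50),
    Advert("203.0.113.0/24", "CharlieNet", secure=True, reputation=80),
]
winner, rejected = select_route(adverts, registry)
print(winner.advertiser, [ad.advertiser for ad in rejected])
```

With only the legacy advert present, `select_route` still returns it, which is the transition behaviour the comment describes: plain BGP routes are accepted unless a verified secure advert overrides them.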

          1. SImon Hobson Bronze badge

            Re: Oops!

            > If an organization tries to advertise routes on the secure channel that belong to someone else ...

            And that's where your whole argument falls over - because the whole point of the routing tables is that Alice has a connection to Bob (amongst others), Bob has a connection to Charlie (amongst others). So for Alice to send a packet to Charlie she needs to decide whether to send it via Bob or via another connection. She can only make that choice based on comparing the routes that are offered and (in this case) concluding that Bob has the shorter/better route.

            So Bob MUST advertise a route to Charlie. Bob has no a priori relationship with Charlie so the question is: given that Bob must advertise a route to an address block belonging to Charlie, how do you secure that?

            You cannot block Bob for advertising Charlie's address blocks - because if you stop Bob (and all the others) doing that, then no-one can actually send packets to Charlie without being directly connected.

            There's your challenge - come up with a practical method that allows arbitrary organisations to re-transmit the routing tables securely. Other than the endpoints who just squirt stuff to their ISP and forget about it, everyone needs to know the best routes to every other destination on the internet. So you need a method to stop Zebedee, who is down on the end of a long line of connections, getting a route advertisement which includes Charlie's addresses, and retransmitting that claiming to be a good route for it.

            1. Trevor_Pott Gold badge

              Re: Oops!

              Well, for one, you check to see if Charlie is advertising a route back to Bob. If both networks are willing to register connectivity to one another and have high reputation, you trust them. If, however, you have a record of someone else owning Charlie's block who not only does advertise their connectivity but participates in the reputation system and then Bob starts advertising about connectivity to Charlie that Charlie is in turn not also advertising about, you either fail to accept the route or you squash it all the way down the reputation system so that the slightest glitch means they get dropped.
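The two-way check suggested in the comment above might look something like the following sketch. Everything in it is an assumption made for illustration: the `adverts` map of who claims to connect to whom, the reputation scores, the threshold of 50 and the two return values.

```python
def adjacency_confirmed(adverts, a, b):
    """True only if a says it connects to b AND b says it connects to a."""
    return b in adverts.get(a, set()) and a in adverts.get(b, set())


def judge_route(adverts, reputation, via, owner, threshold=50):
    """Decide whether to trust `via` as a path towards `owner`.

    Returns "accept" when both sides confirm the link and both have a
    reputation at or above the (hypothetical) threshold, else "distrust",
    meaning the route is either refused or pushed to the bottom of the pile.
    """
    if not adjacency_confirmed(adverts, via, owner):
        return "distrust"
    if min(reputation.get(via, 0), reputation.get(owner, 0)) < threshold:
        return "distrust"
    return "accept"


# Who each network says it is connected to (hypothetical data).
adverts = {
    "Bob": {"Charlie"},
    "Charlie": {"Bob"},
    "Zebedee": {"Charlie"},  # Charlie does not advertise a link back to Zebedee
}
reputation = {"Bob": 90, "Charlie": 85, "Zebedee": 90}

print(judge_route(adverts, reputation, "Bob", "Charlie"))
print(judge_route(adverts, reputation, "Zebedee", "Charlie"))
```

Note that this only checks a single hop next to the owner. Whether it extends to the long chains described earlier in the thread is exactly the point being argued.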

    2. Anonymous Coward

      Re: Oops!

      I'm sure the tumbleweeds rolling across Google's largely empty cloud were a tad annoyed at all the extra peace and quiet...

  2. -v(o.o)v-

    Get deploying that RPKI.

  3. Mark 85

    Automation is a good thing...

    Or so they tell us, and then we get something like this, since everyone is relying on the automation to do the mundane, watchful things. If one thinks about it, self-driving cars, self-flying airplanes, and just about any robotic manufacturing process that we've come to rely on will fail. The question is not whether it will fail, but when. I think we're dumping too much trust and faith into a tech that doesn't deserve this kind of blind faith. Might as well believe that a sky fairy will protect you.

    1. Anonymous Coward

      Re: Automation is a good thing...

      Wonder if there will be a BGP for self driving cars, or for self-flying airplanes?

      It would be kind of funny/ironic to route a bunch of traffic from (say) France to Spain via Ireland due to screw ups like this. ;)

      1. A Non e-mouse Silver badge

        @A/C Re: Automation is a good thing...

        I thought SatNav did this already?

  4. channel extended

    BGP

    So that's how the norks handled Sony

    1. -v(o.o)v-

      Re: BGP

      Route leaks have nothing to do with Sony

  5. Anonymous Coward

    So, what I read from this is that Google are not doing IRR filtering, RPKI or any other similar route filtering/validation mechanism.

    That's somewhat frightening.
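For readers who have not met RPKI, here is a minimal sketch of the route origin validation decision as described in RFC 6811, using only Python's standard `ipaddress` module. The function name, the tuple layout for ROA data and the example prefixes and AS numbers (documentation ranges) are assumptions for illustration, not any router's actual API.

```python
import ipaddress


def validate_origin(prefix, origin_as, roas):
    """Classify a route as "valid", "invalid" or "not-found".

    roas is an iterable of (roa_prefix, max_length, asn) tuples.
    A route is valid if some covering ROA matches its origin AS and its
    prefix length does not exceed that ROA's max length. It is invalid if
    it is covered by at least one ROA but matched by none, and not-found
    if no ROA covers it at all.
    """
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, asn in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        if route.version != roa_net.version:
            continue  # never compare IPv4 against IPv6
        if not route.subnet_of(roa_net):
            continue  # this ROA does not cover the route
        covered = True
        if asn == origin_as and route.prefixlen <= max_length:
            return "valid"
    return "invalid" if covered else "not-found"


roas = [("192.0.2.0/24", 24, 64500)]

print(validate_origin("192.0.2.0/24", 64500, roas))     # right origin, right length
print(validate_origin("192.0.2.0/25", 64500, roas))     # more specific than allowed
print(validate_origin("192.0.2.0/24", 64501, roas))     # wrong origin AS
print(validate_origin("198.51.100.0/24", 64500, roas))  # no covering ROA
```

Origin validation only checks the first AS in the path. A leak that re-advertises somebody else's routes with the correct origin would still come out as "valid", so this is one filter among several rather than a complete answer.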

  6. theOtherJT Silver badge

    We've all done it...

    Sounds like typical "Fuckit, I'll do it live!" syndrome to me. I'm sure we've all been guilty of that at some point, although I imagine for most of us when we DO cock it up, our failures are less embarrassingly public.

    1. phuzz Silver badge

      Re: We've all done it...

      Yeah, I'm feeling quite forgiving about other people's mistakes today, after having to deal with the aftermath of my ill-advised GPO change on Friday.

      If you're going to make a change on Friday afternoon, just make sure it's not in production.

      1. Anonymous Coward

        Re: We've all done it...

        That's what Mondays are for - you know the day is already going to be filled with grief, so what's a little more?

  7. Stevie

    Bah!

    BGP? All those ARPANET-era protocols were built with an optimistic "trustworthy" outlook. They should have been scrapped and replaced years ago when the internet stopped being the sole province of students and the US armed forces.

    The internet: a model for how politics works in a democracy. You only replace something or fix it if it breaks catastrophically. Otherwise, you soldier on with a brave face.

    1. Nate Amsden

      Re: Bah!

      Seems constantly I am reminded of people or products that get "fixes" which in fact make things worse than before. So be careful what you wish for. I'm not responsible for anything BGP related but my experience with ISPs in general tells me for the most part BGP works fine, there are hiccups here and there but as a customer I'll happily have those than the unknowns that come with a new protocol or significantly different architecture. My main upstream ISP has a 100% uptime SLA ( http://www.internap.com/support/sla/ ).

      I can recall one time where a company I worked for was severely impacted by a BGP issue; that was literally 11 years ago (a traceroute for what should have been ~8 hops ended up being 32 hops and 98% packet loss - http://elreg.nateamsden.com/funkyroute.txt ). The routing issue was exposed due to a fiber cut in another part of the country. It took about 7 hours for AT&T to work around the issue; they were unaware of the severity of the problem for the customers that were impacted.

      There was one more time when performance was degraded due to a BGP reverse path issue; that was about 6 years ago. Ironically enough, it was AT&T in that situation too. Internap shut off peering with AT&T at that site for several weeks until AT&T fixed it (they never disclosed to Internap what the fix was or what happened). Until that time I had never heard of the term or concept of BGP reverse path.

      Other than that just small hiccups here and there that tend to get fixed pretty quickly without me having to contact anyone about them.

  8. virginiajgarza
  9. Anonymous Coward

    web scale, but not enterprise ready

    99.9999%, where is it today
