Google promises proper patch preparation after new cloud outage

Google Compute Engine (GCE) users experienced a brownout over the weekend, after an incident that bears plenty of likeness to a worse outage that took down the service in February. The February FAIL came about when “The internal software system which programs GCE’s virtual network for VM egress traffic stopped issuing updated …

  1. Anonymous Coward
    Trollface

    I know...

“investigating why the prior testing of the change did not accurately predict the performance of the isolation mechanism in production.”

    Maybe you gave them 90 days to test when they needed 93?

  2. Bob Vistakin
    Unhappy

    Sigh. No-one can get the cloud right.

    Yet we're all being pushed inexorably towards it.

  3. Peter Gathercole Silver badge

    I hope that...

    ... the outage did not affect the whole of the GCN, just parts of it.

If it did affect it all, then it would appear that they've got at least one too many single points of failure.

Whilst I know that certain parts of the core infrastructure are difficult to make completely redundant, multiple network fabrics that can be run in isolation for resilience is not a new idea, nor is working on only one fabric at a time a particularly taxing notion.

    I feel that all cloud providers should effectively have the best High Availability features for their infrastructure. Don't rely on the MTBF figures provided by your equipment suppliers to be a realistic figure for the availability of the service run on the kit, because unexpected 'stuff' happens.
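    The MTBF point can be made with a little arithmetic. A minimal sketch (numbers are hypothetical, not from the article) of steady-state availability and why redundant fabrics only help if their failures are independent:

    ```python
    # Illustrative only: supplier MTBF figures vs. service availability.
    # All figures below are made-up examples, not GCE data.

    def availability(mtbf_hours: float, mttr_hours: float) -> float:
        """Steady-state availability of a single component."""
        return mtbf_hours / (mtbf_hours + mttr_hours)

    def parallel(*avails: float) -> float:
        """Availability of N redundant components where any one suffices,
        assuming their failures are statistically independent."""
        unavail = 1.0
        for a in avails:
            unavail *= (1.0 - a)
        return 1.0 - unavail

    # One fabric: 50,000 h MTBF, 4 h repair time.
    single = availability(50_000, 4)

    # Two isolated fabrics multiply the unavailability down dramatically,
    # but ONLY if nothing couples their failures. A shared software control
    # plane (the kind implicated in the GCE incident) breaks independence,
    # so the real figure can be much closer to `single` than to `dual`.
    dual = parallel(single, single)
    ```

    The point of the sketch: the redundancy win comes entirely from the independence assumption, which is exactly the "unexpected 'stuff'" that vendor MTBF figures don't capture.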

    1. Aitor 1

      Re: I hope that...

In order to be profitable, it has to have many single points of failure; otherwise it wouldn't use all the capacity and would be expensive.

I have seen this kind of thing happen many times: you do have redundancy while using all the hardware you can, but performance is so poor when degraded that in practice you have downtime.

      1. Peter Gathercole Silver badge

        Re: I hope that... @Aitor

        All good points... provided that the SLA with their customers reflects a realistic amount of downtime for the service. I've been involved in implementing services that are designed to be able to cope with failures, and this is what I feel the cloud providers should be aiming at.

The problem as I see it is that cloud providers sell their service as resilient ("just look at all the places we can move your services to"), giving customers an expectation, without actually investing in the infrastructure that allows this expectation to be met.

        Cost is an issue, of course. Cloud providers must match a realistic expectation of total system availability with the cost, with different tiers of pricing. Otherwise, providing cloud services will become, in a currently popular phrase, a race to the bottom, with price being the ultimate factor in the choice.

        Unfortunately, the people buying the service may not actually really understand what they are being sold, and if someone with real experience of service continuity tries to point out the deficiencies, they will be branded alarmist, or protectionist (of their own jobs), and be sidelined.

  4. choleric

    In other news:

    Google engineers have actual bowel movements.

    1. Dr Who

      Re: In other news:

Indeed. Which could be messy if their internal software system fails to correctly route the egress traffic, resulting in the shit hitting the fan instead of the correct target destination (the pan).

  5. Anonymous Coward
    Anonymous Coward

    google eh?

    el reg always seems to have much more of a fanfare when microsoft have an outage.
