Kubernetes bug ate my banking app! How code flaw crashed Brit upstart

Monzo, a UK online banking startup, suffered an outage on Friday for over an hour due to a four-month-old Kubernetes bug. The Fatal Flaw, as the event might be titled by author Lemony Snicket, took down a complete production cluster, according to Oliver Beattie, head of engineering for Monzo, "through a very unfortunate series …

  1. Walter Bishop Silver badge
    Facepalm

    Rolling update causes outage

    "apiserver timeouts after rolling-update of etcd cluster"

    Back in the day of steam driven computers, we were taught to never ever update a live system. First test on a test rig before rolling out to the live system or at least have a working roll-back procedure in place. One that won't fail because it can't find its configuration data, because the rolling update borked communications to the server.

    "Beattie posted an analysis of the incident and lay the blame on Kubernetes"

    NO NO NO, the blame lies with whoever at Monzo rolled out the update without first verifying it, or at least having a working roll-back procedure in place.

    "To restore service, they turned to an updated version of linkerd being tested in the company's staging environment."

    Is this the same 'staging environment' you didn't bother to test the rolling update on in the first place?

    1. This post has been deleted by its author

      1. Jim Mitchell

        Re: Rolling update causes outage

        @Oliver Jones This isn't "agile", this is DevOps/Continuous Integration/Continuous Deployment. They must have attended that Reg lecture series...

        1. Anonymous Coward

          Re: Rolling update causes outage

          Nice buzzwords for untested/disaster.

          Anon, as we do exactly the same.

    2. I said the red button Igor!

      Re: Rolling update causes outage

      I see you attracted a down-vote ... must have been from a DevOps devotee!

      Have an up-vote on me.

      1. This post has been deleted by its author

        1. Bronek Kozicki Silver badge
          Megaphone

          Re: Rolling update causes outage

          If I had to make a choice, I very much would prefer a few-hours long outage where no data is actually lost (only the processing is delayed) as opposed to whole day, or even whole week, outages where unfinished transactions are dropped on the floor. Something that more traditional institutions "excel" at - banks and airlines alike.

          If anything, this outage has proved that Monzo knows how to deal with an outage and how to communicate with customers. Again, older banking institutions put to shame in comparison.

          EDIT: in an actual agile environment "lessons will be learned" actually means what it says on the tin. Especially in a new institution, which almost by definition is in "learning mode" all the time. Here's hoping that with this relatively small outage Monzo will have learned to appreciate more thorough integration testing.

          1. This post has been deleted by its author

            1. Bronek Kozicki Silver badge

              Re: Rolling update causes outage

              "Agile ... led by managers" You seem to have confused Agile with "Agile" label, which anyone can pin to their chest/team/process/mess, without actually bothering to understand what it means.

    3. Anonymous Coward

      Re: Rolling update causes outage

      "They realized the failure to parse empty responses was due to an incompatibility between the versions of Kubernetes and linkerd being run"

      Welcome to the world of Open Source. You too can run a zoo of crap that breaks all the time because of dependency failures.... Good luck pinning it on a specific vendor and getting it fixed anytime soon. Once they even get round to reading your forum post!

      1. sabba

        Re: Rolling update causes outage

        Yeah, cos' everyone knows you never get similar problems with proprietary systems/tech.

        Get real, the fact it is open-source bears no relevance here. Even in a homogenised environment (single source for all solutions) there are often incompatibilities between different versions of inter-connected applications (since few organisations feel the desire to uplift their entire enterprise landscape every time a single application is updated).

    4. Mark 65 Silver badge

      Re: Rolling update causes outage

      You beat me to it. I was about to say "so they tested this in a staging environment first" before realising that I'm clearly not AgileDevOpsContainer compatible and just stick to doing the mundane shit of test before release. Newer practices just seem to yield better fuckups.

  2. Adam 52 Silver badge

    Fundamentally Kubernetes isn't ready for production use; it is full of little gotchas and bugs like this. It might be the new cool toy but it just isn't stable.

    Give them their due, Gartner have been warning their customers: experiment, but don't rush in.

    Hopefully the arrogant "no test", deploy to live and redeploy on failure crowd will learn from this, but I doubt it.

    1. Mark 65 Silver badge

      I'll be fair, deploy to live and redeploy on failure can be used in certain cases depending upon what it is your system does. Clearly real-time payment processing and banking isn't one of those cases.

    2. skies2006

      No, pure Kubernetes is ready for production use.

      This is an issue with the integration of third-party tools with Kubernetes; there one has to be very careful and always research and test that versions are compatible and work together.

      One cannot just grab any version of Linkerd, Envoy, Istio, Spinnaker, et al. and just expect things to work. That is foolish.
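
To make the point about compatible versions concrete, here is a minimal sketch of a pre-deploy check against a table of combinations that have actually been tested together. Everything in it is an assumption for illustration: the `TESTED_TOGETHER` table, the `incompatible_components` helper, and all version numbers are hypothetical and do not come from Kubernetes, Linkerd, or Monzo's write-up.

```python
# Hypothetical compatibility table. The project names are real, but every
# version number below is made up purely for illustration.
TESTED_TOGETHER = {
    ("kubernetes", "1.0"): {"linkerd": {"1.0"}},
    ("kubernetes", "2.0"): {"linkerd": {"1.1", "2.0"}},
}


def incompatible_components(platform, platform_version, deployed):
    """Return the deployed components whose versions were never tested
    against the given platform version, sorted by name."""
    tested = TESTED_TOGETHER.get((platform, platform_version), {})
    return sorted(
        name
        for name, version in deployed.items()
        if version not in tested.get(name, set())
    )


problems = incompatible_components("kubernetes", "1.0", {"linkerd": "2.0"})
if problems:
    print("Untested combination, do not deploy:", problems)
```

A platform version that is missing from the table flags every component, which is the cautious default: an unknown combination is treated as untested.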

  3. I said the red button Igor!
    Facepalm

    Oh dear ...

    Back to the playground with you, you poor, naive DevOps gurus!

    There is no reasonable place for DevOps in any 'mission' (life/business) critical systems where customers depend on the services provided to complete essential tasks involving others such as checking bank balances, checking patient records and prescriptions, paying for goods etc.

    Please feel free to play DevOps guru with Netflix, Twitter, Facebook and all those non-essential services that far too many think represent 'real life', but leave the serious systems to serious operators who respect proper dev/test/approve/release cycles applying tried and tested deployment and contingency plans for when the *inevitable* snafus do rear their ugly heads.

    1. Rob D.

      Re: Oh dear ...

      I wouldn't confuse DevOps with running a weak release cycle. I've too many grey hairs to play at being a DevOps guru, but I don't think that DevOps is about being cavalier with a release cycle. In my book, DevOps should be (not necessarily is) about the exact opposite - efficiently linking development activities with operational activities in a manner that reliably supports the business with working IT systems. That includes all the contingency support to mitigate risks during releases the same as any other release approach.

      Unfortunately, in addition to being a bit hip and trendy (and so subject to the whims of incapable muppets), DevOps practice suffers the same fate that other release management procedures eventually run into - given procedures/experiences that work, someone decides to make things more efficient by stripping out necessary planning, contingency and testing.

      1. Anonymous Coward

        Re: Oh dear ...

        Spot on.

        My own experience is that there is too much emphasis on the dev of devops - what is sometimes lost is the Ops experience that keeping the system stable is the most important thing (new features are great but unstable systems lose customers/business support). Ideally it's melding both sides. And that means making sure all the testing (including implementation dummy runs, reversion etc) is done before shit lands in prod, and that devs understand what works in prod and what doesn't, because they need to support it with Ops and not just throw it over the wall.

        As for the rose tinted past - most places don't do DevOps and IT is crap - heroic efforts to keep systems up, Change Boards where you can do any dangerous crap as long as you tick all the necessary boxes, you can't make change x to the network in anything under 6 months, only god knows how the system was configured because we once wrote a manual but now it's out of date because we never had time to update it. Sure some won't have that experience but I'd guess it's a minority (at least at big IT places). If DevOps can be used to address some of those problems why wouldn't I use it?

        1. Mark 65 Silver badge

          Re: Oh dear ...

          To me DevOps is click to deploy whereby the action is fully automated and reproducible such that if it works in Dev and Test/Staging then you can have a high degree of confidence that your production release will go smoothly. It should never be about rapid release without adequate testing.

          1. Mark 110 Silver badge

            Re: Oh dear ...

            Well said. There are lots of comments further up assuming it hadn't been tested properly. It doesn't matter whether you go waterfall, agile, devops - there are always weird behaviours that sneak through your testing every now and then. Doesn't matter who you are or how good you think you are.

            All DevOps says is if you automate as much as you can you should reduce the risks and save time/money.

      2. Aristotles slow and dimwitted horse Silver badge

        Re: Oh dear ...

        Absolutely this.

        1. Bronek Kozicki Silver badge

          Re: Oh dear ...

          "It should never be about rapid release without adequate testing." - that's right, the problem is (of course) the scope of automated testing in any particular implementation. You need to have more than just unit tests (test each individual functionality/use case in separation), but adding automated regression (test whole binary end-to-end), integration (test data flows within the system) and performance tests takes more effort.

  4. Ilsa Loving

    I wonder...

    I wonder if they're also using NoSQL databases for their back end? ACID compliance? We don't need no steenking ACID compliance!

    1. Korev Silver badge

      Re: I wonder...

      Especially not for webscale databases...

  5. The Original Steve

    Where's the HA and DR?

    Sure, shit can happen - although as nearly everyone else has said above - proper testing would have prevented this.

    But what struck me is they have a single cluster. Is there not a mirrored version elsewhere using different infrastructure, where the changes get applied later? Sounds nuts to me that a BANK is depending on a single tech stack that they don't fully understand without a different stack running in a different environment in a different datacentre.

    Like their debrief, but I wouldn't let them hold my money.

  6. Anonymous Coward
    Boffin

    Another agile win in an agile way for Agile.

    Yo! Here's the thing: in this agile, cloudy world we live in, backup/standby and test environments are a waste of machines and hence a waste of money that could be better spent on making your shared encounter space look like Google's or a really good chai machine.

    Why pay for multiple environments when you can pay far less for a single one that nearly works, quite a lot of the time? Internet influencers are very forgiving of outages anyway, so you can try software upgrades (because you want to be an agile upgrader!) any time as soon as they are available. Just download the old one if things don't quite gel.

    It's been proven time and time again that with an active twitter feed, a team of keen, confident millennials (paid agile, millennial money too), and the latest versions of everything going you can solve pretty much any problem anywhere at any time. And for those tricky 1% cases, there's Blockchain.

    My latest book, iAgileBank v2.1: An agile realisation of under the digital mattress discusses, with a general absence of verifiable facts, but an awful lot of "quotes" and unscaled projections, how anyone can start up a bank with little more than Javascript and Social media. Cost to Register readers: 5 Bitcoins or 6 feet of Blockchain. But I'd prefer a BACS transfer to my Barclays account please.

    1. Anonymous Coward

      Re: Another agile win in an agile way for Agile.

      "It's been proven time and time again that with an active twitter feed, a team of keen, confident millennials (paid agile, millennial money too), and the latest versions of everything going you can solve pretty much any problem anywhere at any time."

      Course it has, everybody who's anybody knows it. When's your next IPO, and can you pre-announce it here on the QT (meaning: on the quiet) so your early adopters can be in on the joke this time? Why should TwitBook etc be the ones having all the fun?

      ps I've a couple of bottles of snake oil and some DIY bridge components if anyone's got a few spare Bitcoins.

      pps back in the late 1960s, the UK's state-founded GiroBank and its use of modern technology and radical approach to reaching customers (telephone banking!) really put the wind up the dinosaur high street banks, to the extent that GiroBank eventually had to be privatised and shut down, to prevent further damage to the dinosaurs. Sensible countries still have a GiroBank or equivalent, serving the needs of the general public. Meanwhile most of the UK's High Streets no longer have a bank or a proper Post Office.

      https://en.wikipedia.org/wiki/Girobank

    2. This post has been deleted by a moderator

  7. John Smith 19 Gold badge
    IT Angle

    If they are an "internet bank" shouldn't they be regulated like a bank?

    Or is this the "We're on the internet, your laws don't apply to us" business model?

    Yes, it does sound like they actually did take this seriously. Hopefully "Check/synchronize release versions of different S/W packages for known incompatibilities" will be on the "lessons learned" list.

    I'd never heard of them before.

    Let's see what happens next.

    1. Bronek Kozicki Silver badge

      Re: If they are an "internet bank" shouldn't they be regulated like a bank?

      They are regulated like a bank.

      1. John Smith 19 Gold badge
        Unhappy

        "They are regulated like a bank."

        So not the "We are in the cloud. Your laws do not apply to us" BS of some other companies then.

        Good to know.

        But it has to be asked after every other f**king chancer (Uber et al) has played that card.

    2. fords42

      Re: If they are an "internet bank" shouldn't they be regulated like a bank?

      They're governed by the FCA, just like any other bank. To be fair, they did handle the outage well and got everything up and running again within the 48 hour deadline.

  8. Anonymous Coward

    It's an infection

    I am acquainted with the internal workings of the IT department of a major high street bank who have swallowed the Agile Kool-aid and I frequently hear stories about how there are some who regard the brave new world of Agile as an excuse for avoiding testing. Just stick the thing into production and everything will be fine. Except, of course, when it isn't. So far, they've been lucky and there haven't been any major production outages, merely handfuls of customers with borked accounts. But it's only a matter of time.

    Anonymous to protect the innocent.

  9. DJ Smiley

    And there's a lesson on why your testing platform runs the _same versions_ as live


    If it had, they'd have noticed that it breaks during this upgrade but, like so many places, they seem to have decided to upgrade the test version at some point without actually testing the whole setup.
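
The "same versions as live" rule is easy to check mechanically. The following is a minimal sketch, assuming each environment can report its component versions as a plain name-to-version mapping; the `version_drift` helper, the component names, and the version strings are all hypothetical.

```python
def version_drift(staging, production):
    """Map each component to its (staging, production) version pair
    wherever the two environments disagree. A component that is missing
    from one environment is reported with None on that side."""
    drift = {}
    for name in sorted(set(staging) | set(production)):
        pair = (staging.get(name), production.get(name))
        if pair[0] != pair[1]:
            drift[name] = pair
    return drift


# Made-up inventories: staging has been moved ahead of live for one component.
staging = {"orchestrator": "2.0", "proxy": "1.1", "kv-store": "3.0"}
production = {"orchestrator": "2.0", "proxy": "1.0", "kv-store": "3.0"}

for name, (stg, prod) in version_drift(staging, production).items():
    print(f"{name}: staging={stg} production={prod}")
```

Any non-empty result means a test run in staging says nothing reliable about how the same change will behave in production.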

  10. sitta_europea

    NullPointerException.
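
For readers who want the one-word diagnosis spelled out: the failure class is a parser that assumes a response always carries data and dereferences a missing value when it does not. Below is a hypothetical sketch of the defensive version. It is not linkerd's or Kubernetes' actual code; the `parse_endpoints` function and the `"endpoints"` response shape are assumptions made up for illustration.

```python
import json


def parse_endpoints(body):
    """Parse a (hypothetical) service-discovery response into a list of
    addresses. An empty body, a JSON null, a non-object payload, or a
    missing or null "endpoints" key all mean "no endpoints", not an error."""
    if not body or not body.strip():
        return []
    payload = json.loads(body)
    if not isinstance(payload, dict):
        return []
    return list(payload.get("endpoints") or [])


print(parse_endpoints(""))
print(parse_endpoints('{"endpoints": ["10.0.0.1:80"]}'))
```

Returning an empty list keeps "the service currently has no endpoints" distinct from "the response could not be handled", so callers can decide how to react instead of crashing.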

  11. Anonymous Coward

    hm, Discovery broke... weren't we supposed to do ourselves a favor and use static routing?

Biting the hand that feeds IT © 1998–2019