Fallover Friday: NatWest, RBS and Ulster Bank go TITSUP*

Online and mobile banking services from RBS Group subsidiaries NatWest, Royal Bank of Scotland, and Ulster Bank crashed at around 5am this morning and remain down. Reports of the digital account blackout began surfacing on Downdetector early this morning (see here, here and here) and a mouthpiece at NatWest ‘fessed up to …

  1. malle-herbert Silver badge
    FAIL

    This happens so often...

    You should make a macro for this article...

    Just automatically fill in the date and their canned twitter response...

    1. Anonymous Coward
      Anonymous Coward

      Re: This happens so often...

      More like a ticker that says when it's up.

    2. john.jones.name
      WTF?

      no DNS security or client-initiated renegotiation protection either

      for a start the web server allows for client-initiated renegotiation, which is NOT good at all..

      Although the option does not bear a risk for confidentiality, it does make a web server vulnerable to DoS attacks within the same TLS connection. Therefore you should not support it.

      they have not enabled DNSSEC... so you can trivially spoof it even if you're using the latest and greatest security !

      maybe they should look at the top level domain .bank which requires security...

      http://go.ftld.com/dnssec-implementation-guide
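      As a purely illustrative sketch (nothing to do with NatWest's actual stack): when a domain does have DNSSEC, a validating resolver signals successful validation by setting the AD (Authenticated Data) bit in the DNS header flags — the protection the poster says customers aren't getting. Checking that bit is trivial:

```python
# Sketch: test the AD (Authenticated Data) bit in a DNS response's 16-bit
# header flags word. A validating resolver sets AD when DNSSEC validation
# succeeded; without DNSSEC the bit is never set and the client gets no
# integrity guarantee. Bit position per RFC 4035.
AD_BIT = 0x0020  # bit 5 of the flags field

def dnssec_validated(flags: int) -> bool:
    """Return True if the AD bit is set in a DNS response's flags word."""
    return bool(flags & AD_BIT)

# Typical flags for a validated response: QR|RD|RA|AD = 0x81A0
print(dnssec_validated(0x81A0))  # True: resolver validated the answer
print(dnssec_validated(0x8180))  # False: no DNSSEC validation happened
```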

  2. Anonymous Coward
    Anonymous Coward

    Back now

    It was a botched firewall change which has now been backed out. Things (as of 10AM) are back now. This is actually an example of things working pretty well: a mistake was made in a configuration, but the change had a proper backout process which worked and recovered service. Of course, it's RBS, so everyone (understandably, their crimes were historically very great) likes to be nasty about them. But compare this to what happened to TSB (or to RBS in the big incident a few years ago): it's never OK to lose service, but realistically that does happen sometimes. The real test is whether service can be recovered quickly without losing transactions or leaking information, which it was in this case.

    Disclaimer: I don't work for RBS any more but I do know people who do hence AC post.

    1. Dr Who

      Re: Back now

      Agreed that it's good to have an effective rollback procedure. That said, five hours seems like a long time to roll back a firewall change. And in terms of PR, the status page seems to have been behind the borked firewall and therefore unavailable, and the customer support bods didn't seem to have a clue what was happening either.

      1. toffer99

        Re: Back now

        Have they lied about the cause yet? That's the next step.

      2. TkH11

        Re: Back now

        It is not hard to write an effective rollback procedure, it really isn't.

      3. TkH11

        Re: Back now

        The 5 hours probably wasn't for the reversion of the firewall configuration. It was most likely down to tracking down why the system wasn't working.

        Remember that the guys who make firewall changes rarely understand the system and how it works.

        Somebody would have reported that some functionality wasn't working, but if you've just carried out a large deployment, the firewall change is just one small part of that.

    2. Anonymous Coward
      Anonymous Coward

      Re: Back now

      I guess people are rightly annoyed because:

      - RBS / NW / UB are closing branches in many towns, telling people to use internet banking instead, people are relying on the system and the system goes down.

      - It has happened before, and was a massive failure.

      I left the group in my late teens when I went to (what was then) a branch asking for a proper bank account (they had me on a basic account since my schooldays) and got a "computer says no" response. Went to a competitor bank and they were happy to throw all sorts of banking features at me (plus £100!). So I have no love for them.

      That said, why was the firewall change not tested in a sandbox / production mirror first?

      1. Anonymous Coward
        Anonymous Coward

        Re: Back now

        I don't know the answer to these questions: chances are it was tested but someone made a mistake in the final implementation I suppose. Note I'm not claiming that outages are acceptable, just that they will happen occasionally because people do make mistakes, and it's important to have working backout mechanisms, which it seems like they did.

        (Same AC as original comment, I won't followup further as I don't want to speculate on things I don't know about.)

        1. Lee D Silver badge

          Re: Back now

          So you make a firewall change.

          The alarms and monitors all go off that your outside connectivity is now non-functional since the change.

          You wait 30 seconds to see if it's just the config taking effect.

          The alarms are still going off.

          You go to your change management log, see that the change in question is the cause of the problems in question (and it's not just a lucky time correlation), and back out the change made.

          That should *not* take five hours. On a multi-million pound banking system. With a competent team and proper processes. Where it's literally *costing you money* each second it's down.
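          The flow above could in principle even be automated. A hypothetical sketch (the function and monitor names are mine, not any bank's actual tooling): apply the change, give it a settle period, and revert automatically if the alarms are still firing:

```python
import time

def apply_with_auto_backout(apply_change, back_out, alarms_firing,
                            settle_seconds=30, poll_seconds=1, max_polls=5):
    """Apply a change; if monitoring alarms are still firing after a settle
    period, back the change out automatically. Returns True if the change
    was kept, False if it was reverted."""
    apply_change()
    time.sleep(settle_seconds)           # wait for the new config to take effect
    for _ in range(max_polls):
        if not alarms_firing():
            return True                  # monitors are quiet: keep the change
        time.sleep(poll_seconds)
    back_out()                           # still alarming: revert the change
    return False

# Demo with a fake firewall and monitor: the "change" breaks connectivity,
# so the guard reverts it within the change window.
state = {"connectivity": True}
kept = apply_with_auto_backout(
    apply_change=lambda: state.update(connectivity=False),
    back_out=lambda: state.update(connectivity=True),
    alarms_firing=lambda: not state["connectivity"],
    settle_seconds=0, poll_seconds=0)
print(kept, state["connectivity"])  # False True: change reverted, service back
```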

          1. tfb Silver badge
            Boffin

            Re: Back now

            Well, assume that the monitors do go off, and that they go off promptly. If they do, then you don't just reverse whatever change you made: you have to fill in a great mass of forms which describe what you're going to do, apply for the access which lets you do what you're going to do, get approval from a bunch of very cautious people many of whom don't understand what you did to break it, how your proposed fix is going to fix it, or indeed any of the technical details of the thing at all, but who have burned into their brains the memory of a previous instance where someone 'backed out a change' which then took the bank down for days and are really concerned that this should not happen again. This takes time. It would be much quicker if you could just apply the fix you know will solve the problem, but the entire system is designed to make that hard.

            Yes, it would be easier, and much quicker, if all this laborious process did not get in the way. But no-one really knows how to do that: the laborious process makes it hard to do bad things as it's intended to do, but it also makes it hard to do anything at all as a side effect. It's like chemotherapy: it kills the bad stuff, but it nearly kills the good stuff too. I think this is an open problem in running large systems: people like Googlebook claim to have solved it ('we can deploy and run millions of machines') but they do that by accepting levels of technical risk which a financial institution would find terrifying (and there's a big implied comment here about people (like banks) moving services into the clown and hence implicitly accepting these much higher levels of technical risk which this margin is not large enough to contain).

            1. Anonymous Coward
              Anonymous Coward

              Re: Back now

              "Well, assume that the monitors do go off, and that they go off promptly. If they do, then you don't just reverse whatever change you made: you have to fill in a great mass of forms which describe what you're going to do, apply for the access which lets you do what you're going to do, get approval from a bunch of very cautious people many of whom don't understand what you did to break it, how your proposed fix is going to fix it, or indeed any of the technical details of the thing at all, but who have burned into their brains the memory of a previous instance where someone 'backed out a change' which then took the bank down for days and are really concerned that this should not happen again."

              Um....no.

              When I put in a request to make a network change, the last page is the backout plan, that gets approved along with the change, in fact it's part of the process:

              1) reason for change

              2) when will the change occur

              3) customer impact, including will the change cause an outage

              4) if this will cause an outage, how long will the outage be

              5) where the change will be made

              6) steps to effect the change

              7) steps to verify the change was successful

              8) steps to undo the change

              9) steps to verify that it works again
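              That nine-step record can even be modelled as data, so that a request with no pre-agreed backout plan is simply unapprovable. A hypothetical sketch (field and function names are mine, not any real change-management system):

```python
from dataclasses import dataclass, field

@dataclass
class ChangeRequest:
    # Fields mirror the nine-step process above; the backout plan
    # (steps 8 and 9) is part of the record, so it gets approved
    # along with the change itself rather than after the fact.
    reason: str                  # 1) reason for change
    window: str                  # 2) when the change will occur
    customer_impact: str         # 3) impact, incl. whether it causes an outage
    outage_duration: str         # 4) if so, how long the outage will be
    location: str                # 5) where the change will be made
    change_steps: list = field(default_factory=list)    # 6) steps to effect it
    verify_steps: list = field(default_factory=list)    # 7) verify success
    backout_steps: list = field(default_factory=list)   # 8) steps to undo it
    backout_verify: list = field(default_factory=list)  # 9) verify recovery

    def approvable(self) -> bool:
        """Refuse any change that arrives without a filed backout plan."""
        return bool(self.change_steps and self.backout_steps
                    and self.backout_verify)

cr = ChangeRequest("tighten rule", "Sat 02:00", "none expected", "n/a",
                   "edge firewall",
                   change_steps=["push new ruleset"],
                   verify_steps=["probe external endpoints"])
print(cr.approvable())  # False: no backout plan filed yet
cr.backout_steps = ["restore previous ruleset"]
cr.backout_verify = ["probe external endpoints again"]
print(cr.approvable())  # True
```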

              1. tfb Silver badge

                Re: Back now

                Between (7) and (8) there is 'get approval to back it out': that's what takes the time, especially when (7) passed (but the tests were not adequate).

                1. tin 2

                  Re: Back now

                  Approval to back out? Is this some kind of new change control hell I haven't been exposed to? I back out if I think it needs backing out, end of. That's why there's a backout plan and why it's filed in advance.

                  1. DuchessofDukeStreet

                    Re: Back now

                    Clearly not - and be grateful. It involves spending a lump of time (usually in the middle of the night) talking to generic managers who don't know or understand what your specific area of technology is about, never mind your change, so that they can make the decision to do the thing you knew about several hours ago.

                  2. tfb Silver badge

                    Re: Back now

                    Yes, approval to back out. The technical person or people implementing the change are absolutely not in a position to make decisions which could influence the functioning of the organisation, especially where the functioning or otherwise of the organisation is going to be in the papers. That's why there are elaborate governance structures in banks.

                    1. Lee D Silver badge

                      Re: Back now

                      And back-out plans mean the worst that happens is the upgrade doesn't go through tonight; try again tomorrow.

                      In case you didn't notice - ABSENCE of a rapid, pre-approved back-out plan... got them into the papers.

                      I'd be much more worried about a place that requires approval of a back-out (rather than taking care to only approve plans with a safe back-out) - when the change is slowly churning through the entire database causing widespread corruption, affecting more and more records, and you have to wait for "approval" from someone to back it out.

                      Hey... maybe that explains TSB, eh?

                      1. Anonymous Coward
                        Anonymous Coward

                        Re: Back now

                        Don't want to say too much, but I hear the reason for backout being delayed somewhat was because the botched firewall change resulted in remote access being locked out.

            2. Allan George Dyer Silver badge
              Coat

              Re: Back now

              @tfb - "moving services into the clown"

              I know that sketch, it's the one with the ladder, bucket of whitewash, and the hilariously large syringe. This way is a lot more fun than running your services on someone-else's computer.

              Yes, the large coat with the flower in the buttonhole. Would you like to smell the flower?

            3. tony2heads

              Re: Back now

              Movie services into the clown

              Was that a typo or not??

            4. TkH11

              Re: Back now

              Re:

              Quote:If they do, then you don't just reverse whatever change you made: you have to fill in a great mass of forms which describe what you're going to do, apply for the access which lets you do what you're going to do, get approval from a bunch of very cautious people many of whom don't understand what you did to break it, how your proposed fix

              End Quote

              Not really. Because when you book in a change window, that change window should allow for reversion of the system. And the post deployment testing should be conducted within that change window too, so the failure should have been detected within the change window.

              There's probably a little huddle of people that occurs to make the decision to revert, but it won't be any lengthier than that.

          2. john.jones.name

            no DNS security

            well they have no DNSSEC so changes are pretty instant...

          3. TkH11

            Re: Back now

            That's rather naive. I can see you clearly have not worked on large and complex production systems.

        2. TkH11

          Re: Back now

          >chances are it was tested but someone made a mistake in the final implementation I suppose.

          Chances are it was not tested. Why? Because in my experience, test environments usually contain the applications and the logical solution architecture, not the real physical hardware with the firewalls.

          The network infrastructure element of the production system is rarely duplicated in a test environment, or at least not duplicated with sufficient fidelity to reality.

      2. Anonymous Coward
        Anonymous Coward

        Re: Back now re: sandbox

        Because firewalls cost lots of money. VMs are relatively free (even Windows Server, if you have enough of them), but firewalls, switches, load balancers and HSMs are expensive.

    3. StephenR1973

      Re: Back now

      Clearly a fluff piece response by 'anon' former employee. 'Look TSB were worse' is not an acceptable response, neither is a non-redundant major firewall change with a 5 hour rollback.

      Most likely, as with others: outsourced IT for critical business functions, pushed by lazy bonus-sucking senior managers.

      You pay less, you get less - that is outsourcing.

    4. TkH11

      Re: Back now

      Credit for detecting what was wrong and reverting, but the mistake should not have occurred in the first place.

      Far too often I have worked on systems which are business critical where there has been an inadequate test environment, because managers wanted to save a few bucks.

  3. Dwarf Silver badge

    Its really not that difficult.

    Resilience models are well known, understood and documented.

    Monitoring tools are well known, understood and documented.

    So why is this so hard for people to get right?

    Or is it the quick change to fix X that ends up breaking Y because insufficient testing was performed?

    If we're going to have to rely on Internet based services to run our lives, then at least the companies making mega profits can do the right thing and build them in a manner where they are rock solid.

    Oh and give us a workable Plan-B for when you screw up, local branches, cash machines, you know that sort of complicated stuff.

    Yes your profits might be a bit lower, but your customers will be able to get on with their lives when you screw up again.

    1. tfb Silver badge

      Re: Its really not that difficult.

      It's hard to get right because we're dealing with systems which are at or beyond the ability of humans to understand them.

      And 'the companies making mega profits' are companies like Google, Intel, Facebook & Apple: not, for instance, RBS. Of course, those highly profitable companies never ever make mistakes. No company ever shipped several generations of processors with catastrophic security flaws, for instance.

      1. Dwarf Silver badge

        Re: Its really not that difficult.

        "It's hard to get right because we're dealing with systems which are at or beyond the ability of humans to understand them."

        I disagree.

        Any given system, however complex, can be broken down into a number of sub-components / sub-systems of lower complexity whose expected functionality can be documented and understood.

        This process can be repeated multiple times until it's obvious that the system is just a big pile of little systems all working in harmony. Understanding what each one does and how they interact makes it easy to get to the right bit when something goes wrong; similarly, it makes it easier to assess the impact if you need to change something on a component for whatever reason (patch, upgrade, new functionality, etc.)

        Take some time to read up on architectural frameworks.
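        As a toy illustration of that decomposition (entirely hypothetical, not any real banking stack): model the system as a nested tree of sub-systems with leaf components, and fault-finding becomes a walk down to the failing leaf:

```python
def find_faults(component, healthy, path=""):
    """Walk a nested component tree (dicts for sub-systems, lists of leaf
    component names at the bottom) and return the paths of failing leaves."""
    faults = []
    if isinstance(component, dict):
        for name, child in component.items():
            faults += find_faults(child, healthy, f"{path}/{name}")
    elif isinstance(component, list):
        for leaf in component:
            faults += find_faults(leaf, healthy, path)
    elif not healthy(component):         # leaf component failed its check
        faults.append(f"{path}/{component}")
    return faults

# A system is just a big pile of little systems all working in harmony...
system = {"online-banking": {"web-tier": ["web-server", "app-server"],
                             "network": ["firewall", "load-balancer"]}}
broken = {"firewall"}                    # pretend the firewall change went bad
print(find_faults(system, lambda c: c not in broken))
# ['/online-banking/network/firewall']
```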

        1. tfb Silver badge

          Re: Its really not that difficult.

          Quite so. I spend my entire life taking large systems and carefully dividing them down into smaller parts with controlled interactions. Yet still my programs have unexpected bugs: how strange!

          1. Dwarf Silver badge

            Re: Its really not that difficult.

            @tfb

            Your original point was about being able to understand complex systems, not that code will contain bugs or that people will misinterpret the specification.

            The good thing about having a functional diagram is that when these things happen, it's clear where the defective component is and what needs to be done to fix it. Comprehensive testing can also help.

            Failing to test adequately, particularly after a change, is just inexcusable and lazy.

        2. Gordon 10 Silver badge
          FAIL

          Re: Its really not that difficult.

          @Dwarf. Thus spake the theoretician; take some time to look at architecture in the real world.

          I disagree with your disagreeing.

          Firstly any given system in any big corporate is a hodgepodge of 20-30 years of tech, many of which are complete black boxes with the techies responsible for them long gone.

          Secondly - nearly every IT professional I know has had a WTF? moment when some extreme edge case has hit them, causing unexpected results.

          Thirdly - do some research on emergent behaviour.

          1. Dwarf Silver badge

            Re: Its really not that difficult.

            @Gordon10

            I take your point about corporate systems and the way they evolve; however, even breaking down the black box systems into what each black box does, with its inputs and outputs, will help in understanding the functionality of a system.

            However, failing to document a system or being unable to properly support it when it goes wrong, particularly when that subsystem is business critical is completely inexcusable.

            Often, when there is "that thing that nobody understands any more", grabbing a hot coffee and having a good poke around in the system will yield a lot of information, and arguably that forms the next level of decomposition of what that black box does - it's actually a collection of 6 smaller black boxes or functions.

            One thing's for certain, ignoring the problem isn't going to make it go away.

            If it is sensitive to some form of edge case, then ensure that the upstream interfaces are specified, coded and documented accordingly, to prevent that edge case making it to the black box.

            None of these are complex engineering challenges, and often the problems come from a predecessor taking a short cut and leaving it for the next guy to sort out the mess. I'd prefer to be on the other side of the fence, doing it right and breaking down some of the perceived barriers.

            I guess that's what makes me a bad fit for the new fangled "agile" make it up as you go along project delivery approach, but that fad will soon pass.

            1. tfb Silver badge

              Re: Its really not that difficult.

              I'm entranced by your naivety: I keep expecting you to start talking about formal proofs (although that's probably a different branch of the cult).

            2. Caff

              Re: Its really not that difficult.

              Almost certainly will not have access to "having a good poke around in the system" in a bank to figure out what is going on in a system. Especially if no-one there already understands how it works.

              You would first have to build a business case to allocate you time to documenting how it works then get access to the test system - hard as the test system will already be in use for production releases and patching. Likely you will have to apply to share the resources of the test system with devs who will not be happy if your poking around breaks it and delays releases.

          2. Ken Hagan Gold badge

            Re: Its really not that difficult.

            "Firstly any given system in any big corporate is a hodgepodge of 20-30 years of tech, many of which are complete black boxes with the techies responsible for them long gone."

            I hate to get all theoretical, but if a big corporate really finds itself in a position where it does not know how its systems work then it is no longer in control of whether they do actually continue to work, precisely because of your second and third points. The entire company could cease trading tomorrow and never be able to restart. Are the management and shareholders OK with that?

          3. vtcodger Silver badge

            Re: Its really not that difficult.

            Sadly, even very simple, clearly defined systems can be quite difficult to understand. Check out this example: https://en.wikipedia.org/wiki/Feynman_sprinkler Large collections of digital logic are rarely either simple or clearly defined.

        3. TkH11

          Re: Its really not that difficult.

          Oh, Dwarf clearly knows all the theory but has little practical experience on large production systems of high complexity.

          Often, documentation is missing. It shouldn't be, but that's the real world. And even when the documentation is present, it's not the whole answer: at the end of the day, it's down to people, what they know about the system, and keeping information in their heads for fast recall. Understanding doesn't always come from reading a document; it comes from real-world, hands-on practical experience of a system.

          On one system on which I work, it has literally taken me several years of daily use to build up knowledge and understanding, such is its complexity.

    2. Anonymous Coward
      Anonymous Coward

      Re: Its really not that difficult.

      I always have a Plan B. It's called another bank. If customers gave their business to the reliable delivery platforms this nonsense would stop.

  4. Anonymous South African Coward Silver badge

    Yay for outsourcing! Let's outsource some more!

    And when all the IT techs have done a Brexit and left for greener pastures somewhere else, where will they get local IT support from?

  5. Keith Oborn

    Barclays outage was not just "online"

    Not only Barclays' "online" stuff was out yesterday. My wife had to make an urgent payment on a property purchase and ended up going to the local main branch. They were dead as well - Barclays branches have, of course, been largely automated in recent years.

    So there was no way to use any Barclays retail banking services yesterday. Luckily they fixed it in the late afternoon.

  6. Anonymous Coward
    Anonymous Coward

    It's getting to the point where the first T in TITSUP should be "typical".

  7. taxman

    Banks, banks and banks

    Today the RBS group of banks (do they all use the same firewall, with such a single point of failure?), Barclays yesterday, Lloyds not so long ago along with Halifax. And so the list of names goes on. It seems to be becoming more prevalent - and at a time when King Cash is being threatened. It does make you wonder if somewhere in the world there is a rubbing of hands.

  8. Anonymous Coward
    Anonymous Coward

    Not much to show for our £550 billion bank bailout. Which shows that electing politicians that are clearly in the pockets of the big banks is daft.

  9. Ian Johnston Silver badge

    FlyBe's online check-in system is dead as well. Lotsa cross people queuing at airports.

  10. Hans 1 Silver badge

    We’re aware of some issues on our Anytime [...] Banking services

    Well then, why do they not call it Sometimes banking services ?

  11. chronicdashedgehog

    When I worked for HBOS, such critical changes were always done at 3am on a Sunday morning.

    They were fairly good at IT risk mitigation but obviously fairly poor at mitigating financial risk.

  12. Oh Matron!

    Why would Elle....

    Have her date of birth in her twitter handle?

    1. Chris Evans

      Re: Why would Elle....

      Elle18910782

      I have seen an email address with the person's DOB obviously included, on the side of their builder's van: something like johnsmith29051990@gmail. But I see what you mean: Elle was born on the 28th July 1981!

  13. 10forcash Bronze badge

    "Pregnant and nearest branch is 3 towns away. Bloody brilliant customer service"

    I can't work out if that comment is criticism or praise...

    1. Jonathon Green
      Coat

      Presumably somebody couldn’t make a withdrawal...

      1. Korev Silver badge
        Coffee/keyboard

        Brilliant -->

  14. David Roberts Silver badge
    Unhappy

    Another nail in the coffin

    Since NatWest closed their local branch I've been looking at moving my account.

    This is more encouragement.

    Test driving the Nationwide to see if they are any good before deciding if I should switch.

    As you get older, though, banks that you have...errr....loved and lost get to be the majority.

    Barclays, Halifax, Santander, all screwed me over to a greater or lesser extent in the past.

    Not sure how long to hold a grudge, but I haven't run out of banks yet. Probably best to have funds in at least two banks if you can afford it. I have more than one credit card via different suppliers (and a mix of Visa and Mastercard) so in theory the only choke point is when they are paid off at the end of the month.

  15. Anonymous Coward
    Anonymous Coward

    What? No Test Network?

    This is not hard to avoid entirely if you have an identical test network on which to check what firewall rule changes might do.


Biting the hand that feeds IT © 1998–2019