Finance CIOs sweat as regulators prepare to probe aging mainframes

Could the watchful eyes of regulators soon come to rest on the old and often creaking IT systems that run the back offices of the UK’s leading banks? Among CIOs in the sector, there’s a palpable concern that they will. It’s no secret, after all, that most retail banks rely on decades-old technology for their core banking …

COMMENTS

This topic is closed for new posts.
  1. Anonymous Coward
    Anonymous Coward

    This shows a lack of technical knowledge. Mainframes are by far the most reliable systems, both in terms of hardware and system software.

    Some companies that I have worked for have used a single mainframe and it provided much higher uptime than clustered Unix and far more than Windows clusters (down for patching every few weeks!).

    Critical sites will use clustered mainframes which have the highest possible uptime in the industry.

    Any problems will be due to poor application software (which is probably buggier on newer apps) or human error compounded by remoteness and/or inexperience.

    1. TrishaD

      Lack of Technical Knowledge?

      Well ... yes, up to a point. I think that the reliability of mainframes is unquestionable, but it's mostly about the age of operating systems and applications, and about a declining skill base capable of keeping them running.

      As I recall, the RBS fiasco concerned the CA7 scheduling package. The last time I worked on CA7 must have been at least 15 years ago and it was a very good product as long as you were prepared to get to know it inside out. Otherwise, it was a complete camel. You could probably say the same thing for a number of legacy apps that run on, or control, mainframes.

      When I first started working on RACF, I was considered a bit of a youngster. I'm now 60 years old and haven't touched a mainframe product in donkey's years. Most of the real mainframe hotshots are now long retired and those skill bases aren't being renewed. Maybe it's time I dusted down my JCL ....

    2. Steve Todd
      Stop

      You don't seem to understand

      the difference between reliability and resiliency. The mainframe may be able to achieve higher uptime than minis (mostly because of mature and well-tested software stacks), but if it goes down for some reason (and that may be nothing to do with the machine; a problem at the data centre can take it down just as easily as a mini) then the ease with which you can transfer the load to another system is far more important to your customers.

      The other problem is that requirements change. If, through regulatory requirements, you have to upgrade your mainframe software you have much more of a problem getting qualified staff who can make the change.

      1. Gordon 10
        FAIL

        Re: You don't seem to understand

        "then the ease with which you can transfer the load to another system is far more important to your customers"

        This is fundamentally irrelevant for anything other than hardware issues. Fact is, if it's software at fault, you can't even know if it's SAFE to transfer to another system without hours/days of analysis.

        1. Steve Todd

          Re: You don't seem to understand

          In my experience most big outages are down to hardware or environment issues (I've been on the end of a flooded machine room for example). Software changes normally can be rolled back and should only happen outside of core hours.

    3. Peter Simpson 1
      Linux

      Mainframe reliability

      What OS are these mainframes running that makes them more reliable than clustered UNIX?

      I can't argue that mainframes are probably more reliable than individual rack servers (better cooling if nothing else), but UNIX is pretty damn reliable. I would expect mainframes running AIX to be right up there in reliability.

      1. KierO

        Re: Mainframe reliability

        IBM System i OS - often cited as the most reliable operating system in the world.

      2. Anonymous Coward
        Anonymous Coward

        Re: Mainframe reliability

        z/OS, which has been around in one form or another for a hell of a lot longer than UNIX.

        It makes even the best UNIX look like a flakey piece of cack by comparison.

        Of course, when I worked at RBS, the Tandem guys would lord it over the z guys about "proper" reliability.

        1. keithpeter Silver badge
          Windows

          Re: Mainframe reliability

          "IBM support for z/OS 1.5 ended on March 31, 2007. Now z/OS is only supported on z/Architecture mainframes and only runs in 64-bit mode."

          Above from relevant wikipedia page. If it is accurate, can I conclude that most of the major banks have refreshed their mainframe operating systems in the last 8 years or so? If that is the case are they running legacy applications using the various modes that z/OS appears to allow?

          1. OzMainframer

            Re: Mainframe reliability

            Really, Wiki is your source of technical knowledge?

            z/OS 1.13 is what we're currently working on, and on the latest hardware (both of which crap on the rest of the toy computers in the ridiculous server farm we have running to support the GUI).

    4. Dominic Connor, Quant Headhunter

      Are mainframe *people* more reliable?

      The RBS clusterfuck seems to have been mostly human error. If your hardware is solid and your software holds up, then human error is going to be your biggest source of problems.

  2. This post has been deleted by its author

  3. KierO

    Coming from someone who has worked in a bank (well, two banks actually), it is not so much the mainframe vs Windows vs Unix/Linux argument that is the problem. There is no argument about the uptime of enterprise mainframe systems; you can't really beat 99.9999999%, can you?!
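
    (As a back-of-envelope check on what those nines actually buy, the short Python sketch below converts an availability figure into an annual downtime budget; the availability values are illustrative, not vendor claims.)

```python
# Convert an availability figure into the downtime it allows per year.
# The availability values used below are illustrative, not vendor claims.
SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_per_year(availability: float) -> str:
    """Return the annual downtime allowance as a human-readable string."""
    seconds = SECONDS_PER_YEAR * (1 - availability)
    if seconds >= 3600:
        return f"{seconds / 3600:.1f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.1f} minutes"
    if seconds >= 1:
        return f"{seconds:.1f} seconds"
    return f"{seconds * 1000:.0f} milliseconds"

for label, a in [("three nines (99.9%)", 0.999),
                 ("five nines (99.999%)", 0.99999),
                 ("nine nines (99.9999999%)", 0.999999999)]:
    print(f"{label:26s} -> {downtime_per_year(a)} of downtime per year")
```

    (Nine nines works out at roughly 30 milliseconds of downtime a year, which is why published availability claims tend to stop well short of it.)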

    The problem is a lack of business interest in keeping up to date with hardware (yes, even an AS/400 needs replacement from time to time), software (particularly software updates to core banking systems, which cost a king's ransom) and disaster recovery.

    The first bank I worked for refused to spend a single penny on replacement of core network switches, despite an email from myself (the Network Administrator) entitled "THE NETWORK IS GOING TO FAIL!". One month later a core switch in the server room died... and it took ONE. WHOLE. MONTH for the CEO to sign off the purchase order for new switches.

    Hell, even my own department manager came out with such priceless sayings as "The network switches aren't important; if we have to, we'll go out and buy some old IBM dumb terminals".

    Any business unit that doesn't make money is seen as a "drain on resources". I frequently struggled to get essential, core-infrastructure updates approved despite obvious technical evidence (explained to management in very easy, non-technical terms) that important systems were on the verge of collapse.

    The phrase most often used in banks is "if it ain't broke, why should we pay to fix it?"

    Until the management teams and board members of banks understand that you need to constantly invest in IT in order to maintain it, major outages such as the one at RBS will continue to happen.

    1. RW

      > Any business unit that doesn't make money is seen as a "drain on resources".

      The situation is analogous, perhaps very closely so, to the problem of deteriorating infrastructure in the US. The people who control the purse strings don't acknowledge the importance of the highway network, for example, in keeping business running throughout the country. Hence highways, bridges, and all else are allowed to gradually deteriorate to the point of uselessness. [Actually, this isn't entirely true, but it approximates the situation usefully.] Maybe the IT systems of a bank don't make money that's visible as a separate line item, but if you take them away by virtue of flood, earthquake, tsunami, fire, riot, revolution, or even simple human error, the entire bank will suddenly stop making money.

      A better motto for the world at large is "if you build it, you have to maintain it, and you will have to replace it entirely by a certain age, even if it's still usable." This applies to web pages as well as physical infrastructure, and it certainly applies to software systems.

    2. Mark 65

      Whilst I agree with you and have seen this in action, I also believe that we in IT must shoulder some of the blame. I have been on both sides of this fence. Often the message from those on the ground and in the know is misrepresented or poorly translated to those in the business, so that it often seems like a demanding child asking for the latest toy. It would work a lot better if we considered our position to be somewhat similar to an old arcade game: you only have so much health/credits to utilise, so you need to use them well, as when they're all gone the game is over. That's the way it often is from the business perspective - you only have so much credit, don't waste it. In your case I'm guessing your credit got used up by some middle-management carbon waste touting something worthless he'd read in some buzzword-riddled middle-management IT rag.

      1. KierO
        Facepalm

        Oh, if only that were true. I wish I could say that money was wasted by someone else; truth is, it never was.

        We would start the year with an approved budget (with an approved list of projects), but once each project was started it AGAIN had to be validated and approved, with the answer being (more often than not) "There's no money available" or "We don't think we need this right now".

        The planning was always good, but when push came to shove the managers always preferred to rely on our ability to keep things running on a shoestring whilst selfishly hoarding the company's resources so that the balance sheet looked better at the end of the year. After all, an underspent budget was always considered to be good, whilst a fully spent budget was always seen as bad.

        The ultimate form of short-term gain for long-term loss.

    3. Tom 13

      Never been a bank employee

      But one of my very first service calls as a wet behind the ears tech riding with the boss during my first week of work was to a bank.

      It seemed their Fed Funds computer had a drive crash late in the afternoon the previous day. So they'd missed settling with the Fed once already and if it wasn't fixed by CoB that day they'd be looking at the sorts of violations and fines that get CEO/CFO full attention.

      So the boss arranged for a vendor to drop ship an IDE drive to the bank. They called and we headed over. We were escorted to the cubby hole where the computer sat. It had an amber screen. I think it was a Compaq, complete with metal case and those annoying Torx screws. As I pulled the heavy lid off I said "Boss, I think this is a 386. Might even be older." After we peeled the compressed layer of dust off the motherboard (it came away in one contiguous sheet) we determined it was in fact a 286. With an MFM hard drive. Which meant even though we had a PCI adapter for the IDE drive, there weren't any slots for it. So we decided to try copying off the data so we could fdisk the MFM drive to see if that fixed the problem. At which point we discovered the drives were low-density 5.25s. With the old-style connector, so that even though we had a high-density 3.5, we couldn't use that one either. Luckily the Boss still had some 5.25s in his case and we were able to copy everything onto the disks he had. The fdisk fixed the drive problem and we were able to reinstall DOS and get the system working again.

      We told the bank manager that we couldn't guarantee how long it would stay up, so he needed to arrange to replace the system soon, even if it did mean spending more on the new hardware encrypted communications card than they would for the new PC.

      If they let a system as important as that go for so long without an upgrade, I can't imagine how difficult it is to get really expensive ones upgraded.

      1. KierO

        Re: Never been a bank employee

        Yep, sounds about right.

  4. Pete 2 Silver badge

    It's the redundancy, stoopid!

    > there will be “no tolerance” for service outages

    The key to not having service outages is good system design and a proper DR plan. Some companies have one, but few have both.

    Almost every outage of significance is caused by someone changing something. It may not be the central server itself that causes the inversion of tits. You could have the world's most reliable hardware running your central servers (and a lot of companies do). However, if the peripheral systems (the firewalls, networking infrastructure, web servers and directory services) are either flaky, too complicated to ever understand, or managed by idiots, then the reliability of the core systems is irrelevant. They could still be up while your business is off the air for days because someone missed a "." off a network configuration or didn't properly test a config change.

    In those circumstances, the "plan" calls for switching over to the backup system. Great in theory; hardly ever works in practice. Either the same change has been applied to the DR systems, or they were really just ornamental (there to satisfy regulatory requirements and never intended to be used), or the switch-over process is so involved that it's never been tried, either in a controlled fashion or in anger.

    For most companies, the simplest way they could improve the reliability of their IT systems is to stop anyone from touching them. Sure, there is a need for provision of "consumables" such as adding terabytes, but apart from that any additional activity adds risk. The quandary is whether the risk of your own people screwing the system is greater than the risk of an external agent hacking their way through an unpatched security hole. While companies get vilified for security breaches, none get criticised for low-quality staff futzing about in software that is well beyond their competency.

    1. KierO
      Meh

      Re: It's the redundancy, stoopid!

      "For most companies, the simplest way they could improve the reliability of their IT systems is to stop anyone from touching them" - Please don't say that!!!!

      Yes, most companies will not have a good enough DR plan, or their systems are too complex to be able to implement it easily.

      But to say that "change" is what causes most of the issues? ... I think that will only compound the attitudes management already has. "It works at the moment, so don't change it" is a mentality that almost every non-technical person (and some technical people) in banks already holds dear; please don't add fuel to that fire!

      Aside from the technical side, I had staff in my operations department who would "override" important system messages because "I was told to X years ago and that's the way it has always been done, I don't know why"... a very poor attitude to have.

    2. Anonymous Coward
      Anonymous Coward

      Re: It's the redundancy, stoopid!

      I hate to be picky, but you can have the best DR plan in the world; if you are using it, you've still got an outage.

      Outages are inevitable; you will never be able to design them out of a system. I used to design financial services storage and would often find myself in meetings explaining that "no, I can't guarantee that if you spend the {very large amount of money} that I've told you it will cost, there won't be any outages. In fact the only thing I can be certain of is that there will eventually be an outage." It's how you design the system that affects the outage. There are two ways to go: a simple system which is easy to troubleshoot but tends to be taken down more easily, or a more complex system which will have greater uptime but tends to be more difficult to troubleshoot.
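
      (A rough way to see why: components you depend on in series multiply their availabilities down, and redundancy claws some of that back but never reaches 100%. A minimal sketch with made-up component figures, assuming independent failures:)

```python
# Toy availability model. Components in series multiply their availabilities;
# a redundant pair is only down when both halves are down. The figures are
# made up, and the independence assumption is optimistic (shared power,
# shared routes and shared humans correlate failures in real life).
from functools import reduce

def series(*avail: float) -> float:
    """Combined availability of components that are all required."""
    return reduce(lambda a, b: a * b, avail, 1.0)

def redundant_pair(avail: float) -> float:
    """Availability of two independent, identical components in parallel."""
    return 1 - (1 - avail) ** 2

storage, server, network, firewall = 0.9999, 0.999, 0.999, 0.999  # invented

single = series(storage, server, network, firewall)
duplicated = series(redundant_pair(storage), redundant_pair(server),
                    redundant_pair(network), redundant_pair(firewall))

print(f"single string of kit : {single:.5%} available")
print(f"everything duplicated: {duplicated:.7%} available (still not 100%)")
```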

      The only option is to (as banks are required) test your DR again and again and again. Make sure your documentation is up to spec and never allow yourself to think "it's probably going to be ok." Of course this turns you into a paranoid who has a UPS for his fish tank and multiple offsite backups, but there you have it!

      1. Pete 2 Silver badge

        Re: It's the redundancy, stoopid!

        > Outages are inevitable

        Actually good systems design does allow for hardware faults. It will mandate hot-swappable kit. Have a fault in a router panel and you can switch it for a replacement without any downtime (though possibly with reduced performance, depending on your load balancing or the amount of redundant capacity you built in).

        Likewise with servers. These days virtualisation means that a (again: load balanced) server can be bounced in a very short time and it's not outside the realms of possibility to migrate live environments to different hardware if you need to physically get your hands dirty. What no design can account for is human error. Making the right change to the wrong system, restoring the backup onto the wrong instance or having holes in the testing strategy (or changing inherently untestable systems, such as payments to third parties).

        Also, a properly designed system won't have any single points of failure - either physical SPOFs or in the software / services.

        So far as DR goes, invoking it is terribly disruptive. Even if the DR site can be brought online, all you've done is delay the problem. Few organisations actually have everything on "hot" standby, and if they do, then it's likely that software config faults have already been propagated through to it, making the whole concept worthless. However, the biggest issue with invoking DR is when the emergency is over and you try to reconcile all the work that's taken place on the DR system during the primary outage with the "primary" production systems. While some companies simply have an A-B toggle, with neither system being inherently a primary, testing the fallback procedures always involves risk and disruption and is therefore rarely done in full.

        1. Anonymous Coward
          Anonymous Coward

          Re: It's the redundancy, stoopid!

          I'll state my case again: The only thing that you can be sure of with a system is that the system will fail at some point.

          It doesn't matter how much redundancy you've got, it doesn't matter if you're running live/live datacentres with seamless uptake of the load and multiple redundant network links. It doesn't matter if you've got UPSes and are using multiple power supplies from different grid sections, the system at some point will still fail.

          Now you can make sure that the system won't fail at the first hard disk ceasing to work, or if a memory module goes bad, or if a fibre fails, but some time that system will fail and it'll be something you didn't think about or couldn't mitigate against. It's all about how you deal with the outage, making sure that you don't lose data and getting it back online quickly.

          I remember working with a triple-redundant, high-bandwidth network link which ran from Edinburgh to London, with three different suppliers supplying each redundant circuit. They all failed at exactly the same time, because two of the companies didn't bother to adhere to the requirement to make sure they weren't using the same tier 1 routes as the others. As I recall, it came down to a floor tile being dropped somewhere in Leeds.

          1. Tom 13

            Re: is that the system will fail at some point.

            Yes indeed. Usually multiply compounded by incompetent management. Case in point:

            A former employer had moved into a new building. As part of the move-in and DR plan, a generator was in place in case of power failure. One day we came into work and found there was a leak in the restrooms on the floor above the one with the servers. At the time, nobody was particularly concerned that the switchbox for the server room was in the path of the leak. And at some point in the morning the inevitable finally happened: all the magic smoke escaped from the switch panel. At which point we think everything switched over to the UPSes. As we exited the building, they made sure to turn on the generator. After management finally determined we wouldn't be allowed to return for the rest of the day, the Sr. Sys Admin made the long trek home to remote in and manually shut down the servers. There shouldn't have been a problem, as the generator was rated for 24 hours' service and had a full tank. But when he got home, all the servers had already gone down hard. It seems that with the fire department finally on the scene, the building maintenance guys decided it was too dangerous to have juice flowing to the server floor and manually threw the switch for the generator to OFF. Upon return we discovered yet another problem: when the specialist had been in to set up the SAN array, he didn't write the configuration to the BIOS, so nobody knew what the actual configuration was supposed to be. So everything had to be restored from backup.

            I've never understood why the water wasn't a critical issue first thing in the morning.

      2. ChrisM

        Re: It's the redundancy, stoopid!

        Offsite backup for the fish tank... You mean a pond?

  5. Gordon 10
    Gimp

    Falling for the old "bait and switch"

    Mr Regulator : What caused your outage?

    Mr CIO : Legacy technology blah blah blah (whilst doodling on the latest offshoring framework agreement).

    Mr Regulator : What's your plan?

    Mr CIO : Funding new kit *cough* via more offshoring. *cough*

    Mr Regulator : Ooooo SHINEY

    1. BoldMan
      FAIL

      Re: Falling for the old "bait and switch"

      Sadly that really is what this sounds like

  6. Anonymous Coward
    Anonymous Coward

    Monolithic systems; I hate them!

    This sounds like a problem with systems which have been layered on monolithic, dated old cruft with no thought given to modularisation, progress or risk; effectively Ponzi systems development. Similar crises occurred in software development with Waterfall development and verbose methodologies: because they took so long and were so inflexible, development feasibility discoveries and changing requirements caused cracks to form, sometimes enough for a system to become too fragile to salvage and have to be abandoned.

    I see this in an old major software system at work: too many damned incestuous interdependencies, and disregard of best practice (which is not static), because people did not critically look at the whole system and dare to design and make the necessary incremental changes to modularise and refactor it, so that it did not become fragile and start to crack. They will have to soon, or the system is likely to be phased out, with the loss of a lot of jobs!

    Computer systems should not be looked at as a cost, but as a lubricant for business which, if not renewed and improved, will cause you to fall behind the competition and eventually die. The idea that you can go back to older technology and survive for long is a conceit which inflexible old hands should be ashamed of, and be phased out for. You can't fight evolution; it's adapt or die!

    1. Anonymous Coward
      Anonymous Coward

      Re: Monolithic systems; I hate them!

        It wasn't a monolithic system; that was the whole problem. For each newly required bit of functionality there was an added system, with a dataflow and scheduling/batch requirements. It's the opposite of monolithic, but you have to have proper enterprise scheduling to make it work, and the team running your enterprise scheduling needs to know how it works and not arse up the database, as happened at RBS.
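
        (For anyone who hasn't met enterprise batch scheduling: at its core it is dependency ordering, where each job runs only once the jobs it feeds from have completed. A toy sketch of that idea in Python follows; the job names are invented, and CA-7 itself layers calendars, triggers and restart handling on top of this.)

```python
# Toy dependency-ordered batch run: each job runs only after the jobs it
# depends on have finished. Job names are invented for illustration; a real
# enterprise scheduler such as CA-7 adds calendars, triggers, restart points
# and cross-system dependencies on top of this basic ordering.
from graphlib import TopologicalSorter  # Python 3.9+

# job -> set of jobs that must complete first
schedule = {
    "EXTRACT_TXNS":     set(),
    "UPDATE_LEDGER":    {"EXTRACT_TXNS"},
    "CALC_INTEREST":    {"UPDATE_LEDGER"},
    "PRINT_STATEMENTS": {"UPDATE_LEDGER", "CALC_INTEREST"},
}

def run_batch(jobs: dict[str, set[str]]) -> None:
    order = TopologicalSorter(jobs)
    for job in order.static_order():  # raises CycleError if the schedule loops
        print(f"running {job} ...")   # a real scheduler would submit the job here

run_batch(schedule)
```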

  7. Shagbag

    So for the past 20+ years the UK clearers have not had much of an issue.

    Along comes RBS after a 'successful' round of offshoring and BAM, the Regulator's knee jerks and all of a sudden the sky's falling in because the IT equipment could be faulty - despite it working for the last 20+ years.

    Yes, the personnel have been replaced by clowns in India, the Philippines or whatever IT backwater (where they like to watch EastEnders and catch the company bus at 5:00:00pm), but those strategic decisions are why senior management get paid the big bucks. It's called capitalism. Survival of the fittest. Who cares if some patsy in Bangalore wants to get paid 1/3 of what they could get paid? Fuck 'em, I say. Idiots.

    1. Anonymous Coward
      Anonymous Coward

      This is not an issue of "IT backwater"; it's an issue of training. RBS' CA-7 guys could have been in Newcastle or Cornwall, but if they had the same experience as the ones in India, the same would have happened.

      The regulators are throwing their weight around because banks like RBS are classed as critical national infrastructure.

      The tone of the last paragraph in your mail is rather unpleasant as well, in my opinion.

    2. Gordon 10
      Thumb Down

      Nasty

      Oh, nicely borderline racist, sir!

      If you had actually worked with offshore teams you would be aware that whilst there is a lot to criticise about the offshored environment, their work ethic isn't one of those things. In fact 99% of those I have known have busted their asses making do with poor training and poor handovers of badly documented processes and systems.

      1. OzMainframer
        FAIL

        Re: Nasty

        It's nothing to do with where they are; it's ALL to do with the fact that they have next to zero experience in these systems and they DO NOT WANT to be working on these systems.

        Those people working in the latest "outsourcing backwater", as you call it, are all very smart people, but the fact is I would NEVER hand over support of my critical legacy apps to a bunch of IT graduates from ANYWHERE - it takes a special kind of dumb f#ck MBA somewhere, looking at costs on a spreadsheet and looking for his next bonus, to make a stoooooopid decision like that. And then another one to pay the bonus!!

  8. Velv
    IT Angle

    Make It So

    What is your business? Are you a bank, supermarket, manufacturer, or IT Business?

    In reality there is no longer such a business as a Bank. You are now an IT business. Don't believe me? Look at your personal money. How much each month do you take out in cash? I bet you move at least ten times as much around every month electronically. Welcome to running an IT BUSINESS.

    Now scale it up to an Enterprise with millions of customers who expect it to be available 24/7/365. It's not about legacy systems. It's not about resilience patterns. Or DR, or software currency, or agility, business continuity, process, policy, RBAC, security, or testing the plans or offshoring or outsourcing or follow the sun or any one thing.

    It is about EVERYTHING in IT working as one. Once your board of directors grasps the concept that they are no longer a bank but an IT enterprise, then you stand a chance of survival.

    Make it so.

    1. OzMainframer

      Re: Make It So

      Agree 100% (I work for an Oz bank) - we are an IT organisation with branches - but try convincing management.

      Their only interest is chasing cheaper development resources wherever they are, regardless of quality.

    2. Tom 13

      Re: Make It So

      Part of the problem there is getting geeks like us to talk bean counter. We tend to focus on "the router is no longer supported" or "the core switch was outdated when we bought it" and can't/don't convert it to "if you don't upgrade this we run the risk that when this fails, which it will, it will cost the business _______ dollars per hour/day/week. In addition we can also enhance support for _______ (fill in an appropriate current business objective) by ____________." Unfortunately, it's up to us to cross that divide, because the bean counters can't.
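
      (Something like the sketch below is often enough to make that case in a budget meeting; every figure in it is a placeholder, to be replaced with your own failure-rate, downtime and revenue estimates.)

```python
# Back-of-envelope expected cost of NOT replacing an ageing core switch.
# All figures are placeholders that illustrate the shape of the argument;
# substitute your own failure-rate, recovery-time and revenue estimates.
annual_failure_probability = 0.30    # chance the old switch dies this year
outage_hours_if_it_dies    = 8.0     # time to source and fit a replacement
revenue_lost_per_hour      = 50_000  # cost to the business of being off the air
replacement_cost           = 20_000  # price of the new switch, fitted

expected_outage_cost = (annual_failure_probability
                        * outage_hours_if_it_dies
                        * revenue_lost_per_hour)

print(f"Expected outage cost this year : £{expected_outage_cost:,.0f}")
print(f"Cost of replacing it now       : £{replacement_cost:,.0f}")
if expected_outage_cost > replacement_cost:
    print("Replacing the switch now is cheaper than the expected outage.")
```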

  9. Anonymous Coward
    Alert

    FAO Jessica Twentyman clarification requested

    Re Tony Prestedge, now chief operating officer at UK bank Nationwide

    Please can you confirm whether you mean Nationwide Bank in the US or Nationwide Building Society (not a bank, but it does look like one these days) in the UK?

    Google thinks Prestedge works for the UK Nationwide Building Society

  10. Anonymous Coward
    Anonymous Coward

    IT experience has real value

    Yet I see many execs treat IT engineers as 'commodity resources'. Then, for cost saving, they fire the people who built, ran, and UNDERSTOOD the systems in favour of offshored cheap labour.

    It wasn't old IT that killed the banks; it was the lack of skilled staff to run them.

    Whilst transformational IT projects to replace hardware and software are essential, and I'm the first in the queue to push for funding for this, you can't do transformation and operational support with untrained, unknowledgeable offshore staff.

This topic is closed for new posts.
