Re: Roland6 Probable key factor in the outage - "IBM mainframes don't fail!"
Matt, upvoted because I agree with you.
A hardware fault in one of the Royal Bank of Scotland Group's mainframes prevented millions of customers from accessing their accounts last night. A spokesman said an unspecified system failure was to blame after folks were unable to log into online banking, use cash machines or make payments at the tills for three hours on …
"Ignoring my own experience, this story goes to show you are completely and wilfully blind!"
Look at a System z data sheet. Every critical component is triple redundant. That certainly doesn't mean, in and of itself, that the system can't go down. It just means a hardware component failure is less likely to take down a system than a hardware failure in a system which is not fault tolerant. It is not an HA solution at all... which is why IBM created parallel sysplex, the HA solution.
"So the event never happened, it was all just a fairy tale, right? You know I mentioned stupid people earlier that said stuff like "forget DR, it's on an IBM mainframe", well please take a bow, Mr Stupid."
I didn't say this event didn't happen. I said that if parallel sysplex had been implemented correctly, it would be impossible for a hardware failure in a single mainframe to take down the cluster. It is possible that RBS did not have parallel sysplex on this application or that it was not implemented correctly. No individual system failure *should* ever take down a mainframe environment is what I wrote, and that is assuming you have IBM's HA solution in place. If you don't have the HA solution in place, sure, mainframes can go down like any other system... less likely than an x86 or lower end Unix server due to its fault tolerance, but it certainly can happen if it is stand alone. My point was, as this was clearly ultra mission critical, why wasn't parallel sysplex implemented? It should have been done as a matter of course. Every other major bank that I know of runs their ATM apps in parallel sysplex, most in geographically dispersed parallel sysplex.
"And we all know that RAC clusters never, ever, go down. It's amazing that Oracle even bothers to sell support for them isn't it?"
Yes, it is possible to have some error in the clustering software which takes down the entire cluster, be the clustering software RAC, Hadoop, or Parallel Sysplex. *But that is not where RBS said the issue occurred.* If they had said "a parallel sysplex issue" and not a "hardware failure", then it is possible that they had the right architecture but the software bugged out or was improperly implemented. My point is: This application clearly should have been running in parallel sysplex as an ultra mission critical app. That is the architecture for nearly all mainframe apps. Therefore, saying a "hardware failure" caused their entire ATM network and all other transactional systems to go down makes no sense. They were either not running this in sysplex, in which case... why not, or they did not report the issue correctly and it was much more than a "hardware failure."
"Look at a System z data sheet......" AC, the data sheet is just part of the IBM sales smoke and mirrors routine - "it can't fail, it's an IBM mainframe and the data sheet says it is triple redundant". You're just proving the point about people that cannot move forward because they're still unable to deal with the simple fact IBM mainframes can and do break. The data sheet is just a piece of paper, the RBS event is reality, you need to understand the difference. Fail!
"The data sheet is just a piece of paper, the RBS event is reality, you need to understand the difference. "
You need to understand the difference between hardware level fault tolerance and high availability. Two different concepts.
IBM mainframes run most of the world's truly mission critical systems, e.g. banks, airlines, governments, etc. To my knowledge, these all run in parallel sysplex without exception. If anyone thought that a mainframe didn't go down just because the hardware was built so well/redundant, there would be no point in all of these organizations implementing parallel sysplex. Even if you have a 132 way redundant system, it will still likely need to be taken down for OS upgrades or another software layer upgrade that requires an IPL. Not having a hardware issue because of hardware layer redundancy is only one small part of HA.
".....You need to understand the difference between hardware level fault tolerance and high availability. Two different concepts....." No, what YOU need to understand is both mean SFA to the business, what matters to them is that they keep serving customers and making money. The board don't give two hoots how I keep the services running, be it by highly available systems or winged monkeys, they really don't give a toss as long as the money keeps rolling in. RBS had a service outage, reputedly because of a mainframe hardware issue, and it cost them directly in lost service to customers and indirectly in lost reputation, simple as that. You can quote IBM sales schpiel until you're blue in the face, it doesn't mean jack compared to the headlines. Get out of the mainframe bubble and try looking at how the business works.
NW online banking is down again this morning
got in then it told me I'd pressed the back page key and kicked me out
let's see what today's excuse will be . . . let's get going with that break up HMG eh?
So they didn't have:
Some kind of HDR (high availability server) to seamlessly swap over to?
An SDS secondary shared disk server to failover to?
An Enterprise replication machine somewhere?
An RSS remote standby machine, in another location or even the cloud?
Probably only Informix does this well (and would give me some work), Oracle tries but it's problematic.
I wonder if it was the database, the application or what. Doesn't sound like a network issue. SANs are bulletproof as well. I would assume there is a more sinister reason given the state of banking.
Mainframe - no wonder it fails, like almost nobody is alive who knows how to maintain the OS. I've tried and it's super command line unfriendly.
Banks are sadly expertless, I know loads with terrible DBAs running Mickey Mouse systems with no failover. The more you know the more you wonder how it works AT ALL.
The operators were probably unwilling to make any failover call following the almighty bollocking they will have received after last year's fubar.
They will (quite rightly) have kicked the decision up the chain to those earning the salary for having the responsibility.
Fair amount of tosh being written - such as ALL failovers are not immediate. I ran some of the world's largest realtime systems (banking, airlines) for 15yrs and it's imperative an immediate seamless failover is there the second you need it. Realtime loads were switched from one mainframe complex to another on a different continent in less than five seconds - with zero downtime.
See "TPF" on Wikipedia. (aka Transaction Processing Facility)
I am a broadcast engineer, not an IT guy and I've seen some IT guys spectacularly fail under pressure. Accidentally cut off services to an entire country? Fine, don't run around like a headless chicken, get it working and then you can stress, not the other way around. Also, sometimes the answer isn't to fix the problem but to just get the system working, you are providing a critical service to the public, you can fix the problem later. Sometimes getting it working does involve fixing the problem, sometimes you just need to patch around it and schedule the fix. It isn't amateur bodging, it is maintaining a critical service at all costs.
I previously worked for a major broadcaster's technology division, the broadcaster wanting to reduce its headcount, talk of "leveraging" etc. and we were sold to a major IT outsourcing company. Now, although the sale was supposed to buy the IT and phones they saw "broadcast communications" and someone wet themselves with excitement. Massive connectivity infrastructure, lots of racks of equipment, 24x7 operation with flashy consoles and most importantly of all high margin contracts, an IT director's wet dream (it was cool). So they asked the broadcaster if they could also take that department in the same purchase, "Are you sure?... Okay." What the IT outsourcing people didn't realise was that with valuable contracts came great responsibility. We never had *any* measurable outages, changeovers happened in a flash. Hardware resilience: n+1? No thanks, we'll have 2n or at least 3a+2b. Resilient power? Grid, gas turbine, diesel & UPS, plus manual bypass changeover switches!
The thing was, some of this isn't unfamiliar to IT people who do real DR, but what created the biggest fuss? They refused to acknowledge that the IT response time for some users (the 24x7x365 ops team) had to be less than 4 hours. Surely you can wait 4 hours to get your email back? Surely you can do without your login for a few hours? Your Exchange account has zero size and can't send mail? Can you send us a mail to report the fault?
If the people supporting you don't understand you then how can you be effective?
"Also, sometimes the answer isn't to fix the problem but to just get the system working, you are providing a critical service to the public, you can fix the problem later"
Having been at the sharp edge (in a certain mega large organisation) I can tell you that 99% of teccies would like to take this course of action but 99% of the time they are stopped by [glory hunting] managers.
We have a name for them - "Visibility Managers".
They didn't want anything happening until very senior managers had seen them involved so they could take all the credit. Once the very senior managers had disappeared (fault worked around or fixed etc.) the "Visibility Manager" would very quickly become the "Invisibility Manager" and f**ked off.
I worked at a place that ran their databases from a tier 2 storage array. This had redundant everything, dual controllers, power supplies, paths to disk, paths to the SAN etc.
We had disk failures that the system notified us about, and we hot replaced the disks while the array re-laid out the data dynamically. We had a controller failure that we were notified about and the engineer came to replace, again without an outage.
We then had two separate incidents that caused complete outages. The first was a disk that failed in a way that for some reason took out both controllers. It shouldn't happen but did. The second was down to a firmware issue in the controllers that under a particular combination of actions on the array caused a controller failure. With both controllers running the same firmware the failure cascaded from one to the other and took out the array.
So, whilst it's trendy to be cynical, these complex redundant systems aren't infallible and when they do fail it can take a while to work out what has happened and what needs to be done to get things operational again.
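To put a rough number on why a shared firmware bug matters so much, here is a back-of-envelope sketch using the beta-factor approximation from reliability engineering. Everything in it is an assumption for illustration: the function name is made up, and the values of p and beta are invented, not measured on any real array.

```python
def pair_failure_probability(p: float, beta: float) -> float:
    """Approximate chance that both units of a redundant pair are down.

    p    -- probability that a single unit is down (illustrative value only)
    beta -- assumed fraction of failures with a common cause that hits
            both units at once, e.g. identical firmware on both controllers
    """
    independent = ((1 - beta) * p) ** 2  # each unit fails for its own reason
    common_cause = beta * p              # one cause takes out both units
    return independent + common_cause


p = 0.001  # hypothetical per-controller failure probability

# Fully independent controllers versus 10% common-cause failures.
print(pair_failure_probability(p, beta=0.0))
print(pair_failure_probability(p, beta=0.1))
```

With these made-up inputs the common-cause term dominates, which matches the experience above: once both controllers share a failure mode, the second controller adds far less protection than the "redundant everything" label suggests.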
Definitely, I think people are confusing fault tolerance with high availability. There is overlap, but they are different concepts.
Fault tolerance just means a bunch of extra hardware is in place so if a NIC, or whatever, fails, another will pick up for it. It says nothing about down time other than you have added protection in the single category of hardware failures. If you need to upgrade the OS, even in the most fault tolerant system known to man, it will likely require an outage. That is why you need an HA solution in place, if no downtime is a requirement. A high availability solution will be running a parallel system with real time data integrity so that it can immediately pick up I/O if another system in the HA environment goes down, either scheduled or unscheduled. For instance, Tandem NonStop was supremely fault tolerant, but not necessarily highly available. Google's home brew 1U x86 servers have zero fault tolerance, but their Hadoop cluster makes it a highly available environment. IBM mainframe has both. It is fault tolerant hardware, but you can also add parallel sysplex which provides high availability.
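A toy sketch of that distinction, for anyone who wants it spelled out. This is not how parallel sysplex or any real clustering product works internally; the class names, node names and the "withdraw" request are all hypothetical. It only shows that a standalone box, however fault tolerant, is an outage while it is down, whereas a group of systems keeps serving as long as one member is up.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One system; 'up' is False during a hardware fault or a planned IPL."""
    name: str
    up: bool = True


@dataclass
class Cluster:
    """Toy HA group: a request succeeds if any member is up."""
    nodes: list = field(default_factory=list)

    def serve(self, request: str) -> str:
        for node in self.nodes:
            if node.up:
                return f"{request} handled by {node.name}"
        raise RuntimeError("service outage: no node available")


# Hypothetical setups: one standalone system versus a two-member group.
standalone = Cluster([Node("sys-a")])
sysplex_like = Cluster([Node("sys-a"), Node("sys-b")])

# Take sys-a down in both, for an OS upgrade or because of a fault.
standalone.nodes[0].up = False
sysplex_like.nodes[0].up = False

print(sysplex_like.serve("withdraw"))  # the surviving member picks up the work
try:
    standalone.serve("withdraw")
except RuntimeError as err:
    print(err)  # the standalone system has nothing to fail over to
```

The sketch leaves out the hard part mentioned above, which is keeping data consistent in real time across the members so the survivor can actually pick up the I/O.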
"I work somewhere that has a much smaller IT department and a much smaller IT budget than RBS, but it would take the failure of multiple hardware devices to knock a key service out. What kind of Mickey mouse setup do they have that a hardware failure can take down their core services for hours?"
From reading the comments, I'm concerned about the total lack of any real knowledge of real world enterprise computing demonstrated by many and hence the above comment would seem to be the sub-text to many comments.
Setting up and running an IBM Parallel Sysplex with only 6 zSeries in it, distributed across 3 sites, was complex, let alone 14+. Plus I suspect that not all systems were running at capacity, mainly due to the hardware and software licensing costs (believe it or not, for some software you pay not for the CPU it actually runs on, but for the TOTAL active CPU in the Sysplex), hence it would have taken time to call the engineers out, bring additional capacity on-line, move load within the Sysplex and confirm all is well before re-opening the system to customers; that is assuming the fault really was on a mainframe and not on a supporting system. Also it should not be assumed that the mainframe that failed was only running the customer accounts application, hence other (potentially more critical) applications could also have failed. From companies I've worked with, 2~3 hours to restore the mainframe environment to 'normal' operation, out-of-hours, would be within SLA.
Yes, with smaller systems with significantly lower loads, operating costs and support system requirements, different styles of operation are possible to achieve high availability and low failover times.
While it is costly and complex, you can definitely have a real time fail over with parallel sysplex even with extreme I/O volumes. PS was built for that purpose. Do you mean 2-3 hours to restore sysplex equilibrium while the apps stay online or 2-3 hours to take the system completely down anytime after hours?
Biting the hand that feeds IT © 1998–2018