Visa fingers 'very rare' data centre switch glitch for payment meltdown

Visa has said a “very rare” partial network switch failure in one of its two data centres led to the fiasco earlier this month that caused millions of transactions in Europe to be declined. The outage, which lasted for about ten hours on Friday, June 1, spread panic among European pub-goers, as apparently about 10 per cent of 51 …

  1. Anonymous Coward
    Anonymous Coward

    Failure rate

    The numbers they quote are nonsense, since a number of large chain retailers gave up and simply put signs up saying "cash only". The true failure rate would be much higher because it would need to include the "unattempted but originally intended" transactions. Like my Friday night shopping.

    1. Annihilator

      Re: Failure rate

      By the same token they won’t include the successful transactions that would have been in the “unattempted but originally intended”, so it is probably safe to assume similar failure rates.

  2. Anonymous Coward
    Anonymous Coward

    Unreserved Apology

    Does anyone else remember the good old days when the only place one found an unreserved apology was in a resignation letter?

    1. Anonymous Coward
      Anonymous Coward

      Re: Unreserved Apology

      "Does anyone else remember the good old days when the only place one found an unreserved apology was in a resignation letter?"

      Not me, no. Can you put a specific date on those "good old days" and show actual real-life examples, rather than vague, nostalgic misremembrances?

  3. Anonymous Coward
    Anonymous Coward

    How's that Cashless-society looking now Big-Tech?

    ~~~

    Sales Pitch:

    ~~~

    https://www.rte.ie/news/business/2018/0612/970015-moneyconf-dublin/

    ~~~

    Versus Reality:

    ~~~

    https://www.bbc.co.uk/news/business-43645676

    https://www.bloomberg.com/news/articles/2018-02-18/-no-cash-signs-everywhere-has-sweden-worried-it-s-gone-too-far

    https://www.bloomberg.com/view/articles/2018-06-11/maybe-dollars-should-be-digital

    https://www.irishtimes.com/opinion/letters/why-cashless-society-is-a-dangerous-idea-1.2869212

  4. Doctor Syntax Silver badge

    If it was the backup switch then presumably the primary has already failed unless the backup was firing out packets that interfered with the rest of the network.

    1. This post has been deleted by a moderator

      1. Doctor Syntax Silver badge

        "In this instance, a component within a switch in our primary data centre suffered a very rare partial failure which prevented the backup switch from activating."

        So the primary switch failed and took out the backup rather than t'other way about.

        1. tfewster Silver badge
          Facepalm

          So the primary switch "hung" - Yes, it can happen. The failover switch didn't have a fence mechanism to power off the primary completely, so failover could work as designed. And no one had the balls to pull the plug for hours - presumably they didn't have any faith in the failover mechanism either.

          It actually sounds worse when they explain it ;-)

          1. Voland's right hand Silver badge

            The failover switch didn't have a fence mechanism to power off the primary completely,

            What f*** fence mechanism? Routing protocols and L2 failover do not have any of that.

            Partial failure, especially on high bandwidth optical interfaces, is NOT rare. It has been years since I have dealt with large SP operations, but I have seen tens of those. There is really f*** all you can do in such a case except having a well trained ops team which can deduce what is going on from the stats (as you may not see this as a normal fault) and go in and KILL the erroneous interface (and later the whole card, escalating to the switch or router if need be). It also needs to have the authority to do so. Which is what I suspect is the issue here. The ops team did not have the authority to go in with kill orders, and by the time it was authorized it was too late - there was a gigantic backlog.

            1. demonwarcat

              It has been a few years since I retired, so I don't know if carrier grade ethernet has actually appeared yet. I worked in transmission, and one of the things that was very noticeable as ethernet bearers replaced TDM was the limited nature of ethernet failover. We actually had situations where optical protection switches were specified rather than end point protection because of their guaranteed 50ms switchover. This despite their introducing a single point of failure (the protection switch), and despite optical protection switches only switching on loss of signal, not on signal degrade, which is standard in TDM. I suspect that the intermittent nature of ethernet compared to the continuous nature of TDM will prevent the implementation of the type of protection schemes used in TDM environments. Unfortunately ethernet is cheaper, and even when I retired we were providing customers with 100Gb/s ethernet circuits. So far as I am aware the highest specified TDM client rate is 40Gb/s, and this is rarely implemented.

            2. Aitor 1

              Well trained staff

              These days it is rare to have well trained staff who can take the decision to take out a router. Expensive, and risky for them in case they make a mistake.

              And of course, partial failures are quite common.

            3. SImon Hobson Silver badge

              Partial failure... is NOT rare

              ... having a well trained ops team ... It also needs to have the authority to do so.

              So much truth, and only one upvote allowed.

              I've only been at a very low level of networking - even I've seen more than one instance of such partial failures, where a switch has failed to switch packets properly but still looks to be OK. There's a limit to how much you can automate for such situations, but as you say - a well trained ops team with the right monitoring and the authority could have dealt with this in a timescale that would have made it a "Visa had a blip yesterday, nothing to see here" on the next day's back pages instead of the major incident it was.

              As I wrote in a comment on one of the earlier reports on the problems - the problem users saw was not due to a hardware failure, it was due to an organisational failure to properly plan for foreseeable problems and put the right measures in place.

        2. Just a geek

          I've seen something like this myself. I had a Juniper switch crash in such a way that it stopped forwarding traffic but kept the heartbeat up, which stopped the secondary switch taking on the load. I feel for Visa here - it's just one of those things; no tech is infallible.
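
          The heartbeat-versus-data-plane gap described above is easy to state in code. A purely hypothetical sketch - no real HA framework is being quoted here:

```python
# If the health check only watches the peer's heartbeat, a switch that is
# alive but not forwarding looks healthy, so the secondary never takes over.
# Adding an end-to-end data-path probe catches the hung-but-alive case.
# Both function arguments are hypothetical monitoring inputs.

def should_fail_over(heartbeat_ok, data_path_probe_ok):
    """Return (heartbeat-only decision, heartbeat-plus-probe decision)."""
    heartbeat_only = not heartbeat_ok                    # misses hung-but-alive
    with_probe = not (heartbeat_ok and data_path_probe_ok)
    return heartbeat_only, with_probe

# The scenario described: control plane answers, forwarding is dead.
hb_only, probed = should_fail_over(heartbeat_ok=True, data_path_probe_ok=False)
# hb_only is False (secondary stays idle), probed is True (failover fires)
```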

          1. Steve 53

            Re: New???

            I would argue that "good" design would mean you don't have HA pairs of switches and consider that a redundant solution. This stuff can and does break, hence you're much better off with DCs which aren't attached at L2 (which I presume is the case here). Better to use L3 or DNS - but of course this is an old design, and there may well have been good reasons to follow this model at the time.
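
            For what it's worth, the L3/DNS point can be illustrated with client-side endpoint iteration: if each data centre is reachable at its own address, the client routes around a dead one without any L2 failover magic. The hostnames below are made up for illustration.

```python
# Illustrative sketch: instead of one virtual address whose backup must
# detect the primary's failure, the client tries each data centre's own
# endpoint in turn. Hostnames here are hypothetical placeholders.
import socket

ENDPOINTS = ["dc1.example.test", "dc2.example.test"]  # one per data centre

def connect_any(endpoints, port=443, timeout=2.0):
    """Return a socket to the first reachable endpoint, or raise."""
    last_err = None
    for host in endpoints:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as err:
            last_err = err  # this DC unreachable; fall through to the next
    raise ConnectionError(f"all endpoints failed: {last_err}")
```

            The trade-off is that failover moves into every client (or a DNS TTL), which is exactly why an older design might have preferred a transparent HA pair.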

        3. tfb Silver badge

          Presumably the primary failed but said that it was working, thus preventing the secondary from picking things up until someone realised the primary was confused about its state and killed it. This is a failure mode I've seen with SAN switches (so a different case) and it can be hard to debug, especially given the terror associated with killing the thing and possibly finding out you killed the working one.

    2. diodesign (Written by Reg staff) Silver badge

      "If it was the backup switch then presumably the primary has already failed unless the backup was firing out packets that interfered with the rest of the network."

      It was a backup switch within the primary centre that failed to activate due to a component fault in another switch.

      C.

      1. Brewster's Angle Grinder Silver badge

        "It was a backup switch within the primary center that failed to activate due to a component fault in another switch."

        So I guess it was the backup switch's psychic circuit that "failed". It should have ignored the status signals telling it the main switch was okay and deduced it was having a bit of a turn.

      2. earlyjester

        I remember from days gone by a certain company's switch that had a memory issue: if it failed, it could be left in a state where it would still answer a poll saying it was functioning when it wasn't. This stopped the backup from taking over, and all traffic went unserviced.

        The actual details fail me. It would be good to know the manufacturer and age of the switch.

    3. Nick Stallman

      Partial failures like that typically mean the connection can no longer reliably carry traffic, but it still thinks the link is online so it never enacts the fail over procedure.

      So no prior failure is required, just the monitoring being told that something is up when it's actually down.

      These extremely rare failures actually happen all the time. Earlier this year servers I manage were also knocked offline by a partial failure which prevented automatic fail over.

      1. John Riddoch

        Yup, partial failures suck. I've seen a fibre path fail just enough to bugger up service but not quite enough for the OS to figure it needed to fail over to the 2nd path. Once we'd figured that out, it was just a matter of disabling the primary path and everything started working normally.

  5. Anonymous Coward
    Anonymous Coward

    VISA Crimes

    "Visa is migrating its European processing onto its global system, VisaNet - a process that is due to complete by the end of 2018."

    Ouch, will be traveling then... Better have a viable backup plan. Anyone notice Visa credit card charges going up in the past 2-3 years? Foreign exchange spreads etc... Or the difference between what we have to pay and what the mid-market commercial FX rate is (as found on sites like XE / Bloomberg etc). The spread seems to have doubled / trebled. Anyone know why that is? Withdrawing cash overseas got a lot more expensive!

    1. TheVogon Silver badge

      Re: VISA Crimes

      "Anyone notice Visa credit card charges going up in the past 2-3 years? Foreign exchange spreads etc..."

      If you mean interest rates as well as FX rates, then those are controlled by the issuing institution, not Visa. And yes, with interest rates at historic lows, those charges have presumably risen to compensate.

    2. katrinab Silver badge

      Re: VISA Crimes

      The actual interchange fee that Visa charge your retailer's bank is the same as Mastercard, and has gone down in recent years. What your bank charges you, that has nothing to do with Visa.

      The foreign exchange spread is about 0.1% higher than Mastercard, and Mastercard charge very close to the interbank rate. Your bank might charge another 3% or so on top of that, but that is down to them, not Visa. If you are paying in a foreign currency, use a card that doesn't add a margin on top of the network rate - and yes, Mastercard is better than Visa in this particular case.

    3. Anonymous Coward
      Anonymous Coward

      How do you get a menu or breakdown of all the actual charges?

      The thing is, my bank blames Visa. Visa blames intermediaries, and so on. It's a vicious cycle. All I know is, I've been keeping a list of charges for about a decade. You have to, if you travel a lot, because with currency fluctuations it's impossible to figure all of this out when you're back home.

      However, in the past 2-3 years especially I see a definite increase, an added 2-3% extra hit. But how do I find out who's got their hand in the cookie jar? It's like SWIFT banking... Do a few of those with reversals and try to actually figure out who got what. There's no documentation!

      1. katrinab Silver badge

        Re: How do you get a menu or breakdown of all the actual charges?

        Here is an example:

        €100 converted into £ on 15th June 2018 using the following exchange rates:

        Bank of England - £87.39

        Mastercard - £87.55

        Visa - £88.22

        Your bank will take either the Mastercard or Visa rate depending on what your card is, and they might add additional charges on top of that. A 3% margin, plus a non-sterling transaction fee, plus a cash withdrawal fee if relevant, is quite common.

        Visa are taking 83p (Mastercard take 16p). Your bank may well take another £5 or so, but that money does not go to Visa.
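
        Checking the arithmetic behind those figures (the rates are the ones quoted above; the percentages are simply derived from them):

```python
# Working through the quoted figures for EUR 100 on 15 June 2018:
# GBP amounts billed under each reference rate, per the comment above.
boe, mastercard, visa = 87.39, 87.55, 88.22

visa_margin = visa - boe        # 0.83 -> the "83p" Visa take
mc_margin = mastercard - boe    # 0.16 -> the "16p" Mastercard take

# As percentages of the Bank of England reference rate:
visa_pct = visa_margin / boe * 100   # roughly 0.95%
mc_pct = mc_margin / boe * 100       # roughly 0.18%
```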

      2. Uberior

        Re: How do you get a menu or breakdown of all the actual charges?

        Surely if you do travel a lot you'd have an FX-free Mastercard or an Amex International Currency Card?

        I certainly wouldn't be using a Visa, unless either Mastercard or Amex had limited acceptance.

    4. Stuart Moore

      Re: VISA Crimes

      I recently got a Metro Bank debit card for a trip abroad, and it made life a lot easier. No fee for transactions abroad, and it's a Mastercard debit. I like having one each of Visa and Mastercard, with different banks, just in case this happens.

    5. monty75

      Re: VISA Crimes

      Try Revolut - mid market exchange rates and no fees for most casual users.

  6. John Robson Silver badge

    Global next

    "The firm has launched a number of reviews and is also in the process of migrating its European systems to a more resilient global processing system, VisaNet."

    Great - now we can stop processing transactions all over the world instead of just over here...

  7. Anonymous Coward
    Anonymous Coward

    VisaNet

    The machines rose from the ashes of the economic fire. Their war to exterminate cash has raged for decades, but the final battle would not be fought in the future. It would be fought here, in our present. Tonight.

    1. StuntMisanthrope Bronze badge

      Re: VisaNet

      Sounds exciting, bah dum bum dum! However, I've got this strange feeling that they've blown it and cash is king after all. #medicischool

    2. Sgt_Oddball Silver badge
      Terminator

      Re: VisaNet

      It can't be reasoned with, it can't be bargained with... it doesn't feel pity or remorse or fear... and it absolutely will not stop. Ever...

      Unless it has a wonky switch as it turns out....

      (Still waiting for my Phased plasma rifle in the 40 Watt range.... wonder if they'll take cheques?)

  8. David Neil
    Mushroom

    Oh they are asking EY to have a look at the root cause

    The same place that is dumping all its own IT off to TATA as fast as is humanly possible.

  9. cantankerous swineherd

    istr a Charlotte Hogg leaving the Bank of England under a cloud? nice to see she's managed to get another gig if so.

  10. Anonymous Coward
    Anonymous Coward

    So if I've read correctly the fix was to turn it off and on again?

    1. phuzz Silver badge

      Yes, but the important part was knowing exactly which component to switch off.

      1. Ken 16 Silver badge

        and having the balls to approve doing it - that probably took hours of buck passing

  11. Keith Oborn

    Another case where regulators should require a detailed public report

    As per TSB and the BA failures. Where is the regulatory requirement that a detailed analysis and report be made available to all relevant bodies (all equipment and component suppliers, all their customers, all end users of the service, and relevant regulators)?

    Contrast with a major aviation accident. The entire industry gets told the full details, is required to make recommended changes, and the details are available for scrutiny by any interested party.

    Until the finance and it/networking industries are held to these standards, we will continue to suffer this sort of failure.

    One positive mark to Visa though, for at least offering a superficial but reasonable explanation with little delay.

    1. Herring` Silver badge

      Re: Another case where regulators should require a detailed public report

      Wouldn't that be a cool job? A sort of Quincy M.E. but for systems. Diagnosing what's wrong with other people's processes and practices would be much more fun than being trapped in your own.

  12. David Roberts Silver badge
    WTF?

    Still not understanding

    Why it took so long to disable the failing switch once it was identified.

    Assuming that if the switch had completely crashed the backup would have taken over, then why not just turn the damn thing off?

    Unless assumptions were made about the maximum size of the backlog/queues which could build up during failover, and the system just wasn't sized to recover from a massive backlog due to an undetected partial failure.

    This does sound quite likely, as the report talks about clearing out queues before switching to the backup switch. Perhaps the system couldn't recover if transactions were more than a certain age? Although you would expect that old transactions could be assumed to have failed (as was the case here) and been automatically recorded then purged.
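
    The record-then-purge idea can be sketched as follows - a purely hypothetical illustration, since Visa's actual queueing internals aren't public, and the age cutoff is an assumption:

```python
# Hypothetical sketch: before cutting over to the backup, drain the backlog,
# record-then-purge any transaction older than a maximum age (treating it as
# declined), and hand the backup only what is still worth processing.
from collections import deque

MAX_AGE_SECONDS = 300  # assumed cutoff; anything older is declared failed

def drain_for_failover(queue, now):
    """Split a backlog into still-processable and expired transactions."""
    keep, expired = deque(), []
    while queue:
        txn = queue.popleft()
        if now - txn["submitted_at"] > MAX_AGE_SECONDS:
            expired.append(txn)   # record as declined, then purge
        else:
            keep.append(txn)      # small enough backlog for the backup
    return keep, expired
```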

    1. Joe Harrison

      Re: Still not understanding

      Why not just turn the damn thing off? The guy who knew how it worked and would have turned it off and on again has been made redundant, unfortunately. His function has been right-shored to another time zone, and the change control procedure for such a drastic action takes many hours to escalate through 25 levels of management in four countries.

      1. Korev Silver badge

        Re: Still not understanding

        I've no idea why you were downvoted (apart from "rightshore"), that sounds depressingly plausible.

    2. SImon Hobson Silver badge

      Re: Still not understanding

      Why it took so long to disable the failing switch once it was identified

      As already said, the guys that would have been able to diagnose this AND do something about it have all gone. The people running it now will probably be junior techs on a different continent with a) manglement-imposed limits on authority and b) culture-imposed limits.

      The latter is important. For many of us in northern Europe it's seen as a good trait to be able to sit down, look at the evidence, and formulate a theory as to what is wrong - and formulate a plan for how to fix it. So as already said further up the comments, a good ops team would probably have had it fixed before many people realised there was a problem.

      But AIUI, in many of the places such functions are offshored to, there is a different culture - where individualism is frowned upon, and the techs are supposed to "just follow the flowcharts". In such a culture, to get the offending switch powered off would require the problem passing up many manglement levels, endless meetings, and above all - discussion of who takes the blame.

      A secondary factor is the modern disease of not supporting people to make decisions. So even if a techie did realise that "all it needs is to power cycle this switch" - it's a very secure person who can take on that decision and expect his manglement chain to support him in doing so. More normally, the "safe" option is to do nothing - it's not your fault the system failed. But go and do something that should fix it, but for some reason doesn't - well your head is on the block for doing it.

      Go and read some of the "the day I ..." stories in ElReg - and in particular the comments. Some of the best ones involve the person "doing something" but being supported by their managers on the basis that "the only person who never made a mistake was the one who never did anything".

  13. 36bells

    Cisco Cisco Cisco

    This is the same rogue packet that has been travelling the world taking down BlackBerry, Heathrow, and now Visa. It only appears on Cisco switches.

    1. nowster

      Re: Cisco Cisco Cisco

      "GNU Terry Pratchett"

  14. RobertsonCR7

    One in a million

    This sounds like a good scenario for a movie.
