Australian Tax Office's HPE SAN failed twice in slightly different ways

The Australian Taxation Office's HPE SAN failed twice, in different ways, causing the infamous December and February outages that brought down online tax services on which citizens and accountants rely. And despite the Tax Office telling the world on its blog that it had “commissioned a new Storage Area Network (SAN) which …

  1. J. Cook Silver badge
    Boffin

    I could potentially understand if a disk shelf's IO controller went out and failed in a way that corrupted the data being written to the disks. The likelihood of that happening is.... Well, it's a large-ish number to one against. I can dig the IO controller rolling over, I've seen that a few times; the backup took over straight away, and the machine had an alarm light on it until the blown controller was swapped for a good one.

    (In the ~20 years I've been in the industry, I've seen two RAID controllers blow themselves up; one did in fact take the array with it, the other failed in a manner that rendered the battery-backed cache unusable, which only crippled system performance.)

    1. mr. deadlift

      surely it's net-raided.

      so one shelf's gone and taken another one or they don't have another one in play?

      i can't believe it's been faffed so badly, would be great to read how this actually played out.

      either way someone at HPE is getting a bonus for selling all that kit.

    2. Anonymous Coward

      I guess it depends how much we believe the official line. Once upon a time we had a cluster and at some point (without us realising) we had an HBA failure on one of the servers. And then sometime later on there was a controller failure on the storage. Oops. Admittedly some human error/oversight on that one. Unfortunately this was made worse by the HBAs and the storage being on their last legs and out of support. By the time this happened we'd already sunk to the depths of eBay to find suitable replacement HBAs!

    3. Nate Amsden Silver badge

      About 7 years ago (before HP acquired 3PAR) I had a big outage on one of my 3PAR arrays. Actual array downtime was about 5 hrs, but it took about a week to recover everything, as the bulk of the data was in a NAS platform from a vendor that went bust earlier in the year and we had not had time to migrate off of it. In short, the incident report said:

      "Root cause has been identified as a single disk drive (PD94) having a very rare read inconsistency issue. As the first node read the invalid data, it caused the node to panic and invoke the powerfail process. During the node down recovery process another node panicked as it encountered the same invalid data causing a multi-node failure scenario that lead to the InServ invoking the powerfail process.

      [..]

      After PD94 was returned, 3PAR’s drive failure analysis team re-read the data in the special area where ‘pd diag’ wrote specific data, and again verified that what was written to the media is what 3PAR expected (was written by 3PAR tool) confirming the failure analysis that the data inconsistency developed during READ operations. In addition, 3PAR extracted the ‘internal HDD’ log from this drive and had Seagate review it for anomalies. Seagate could not find any issues with this drive based on log analysis."

      Since then the Gen4 and Gen5 platforms have added a lot of internal integrity checking (Gen5 extends that to host communications as well). The platform that had the issue above was Gen3, the last of which went totally end of support in November 2016; I have one such system currently on 3rd party support.

      The outage above did not affect the company's end user transactions, just back end reporting (which was the bulk of the business, so people weren't getting updated data, but consumer requests were fine since they were isolated).

      I was on a support call with 3PAR for about 5 hours that night until the array was declared fully operational again (I gave them plenty of time for diagnostics). It was the best support experience I have ever had (even to this day).

      I learned that day that while striping your data across every resource in an array can give great performance and scalability, it also has its downsides when data goes bad.

      At another company back in 2004 we had an EMC Clariion CX600 suffer a double controller failure which resulted in 36 hrs of downtime for our Oracle systems. I wasn't in charge of storage back then and I don't know what the cause of the failure was, though the guy who was in charge of storage later told me he believes it was his fault for misconfiguring something that allowed the 2nd controller to go down after the first had failed. I don't know how that can happen as I have never configured such a system before.

      3PAR by default will distribute data across shelves so you can lose an entire disk shelf and not have any loss of data availability (unless that shelf takes out enough I/O capacity that it hurts you).

      That was by far the biggest issue I have had on 3PAR arrays as a customer for the past 11 years now, but they handled it well and have done things to address it going forward. I am still a (loyal) customer today. I have had other issues over the years, nothing remotely resembling that though.

      I realized over the past decade that storage is really complicated, and have come to understand (years ago of course) why people invest so much in it.

      I certainly don't like to know there are still issues out there, but at the same time, if such issues exist in such a widely deployed and tested platform, it makes me even more wary of considering a system that has had less deployment or testing (naturally I would expect this of smaller scale vendors).

      At that same company we had another outage on our earlier storage system provided by BlueArc (long before HDS bought them). Fortunately that was a scheduled outage and we took all of our systems offline so they could do the offline upgrade. However, where BlueArc failed is that they hit a problem which blocked the upgrade (and could not roll back) and they had no escalation policy at their company. So we sat for about 6 hrs while the on site support guy could not get anybody to help him back at BlueArc. My co-worker who was responsible for it finally got tired of waiting (I think he wasn't aware on site support couldn't get help) and started raising hell at BlueArc. They fixed the issue. A couple of months later the CEO sent us a letter apologizing and said they had implemented an escalation policy at that time.

  2. Anonymous Coward

    Web based training....

    Perhaps the HP engineer fell asleep during his web-based installation course the night before the install :)

    I've seen cases where incorrect back-end cabling can take down the entire system....

    But why speculate - when PwC with their vast storage expertise can get to the bottom of all this ?

    It'll probably come down to Sun Spots ?

    1. Rob Isrob

      Re: Web based training....

      Maybe my all time favorite "Zinc Whiskers!"

      Remember kids when eBay would go titsup on occasion and Sun tried to blame it on Zinc Whiskers when it wasn't? http://www.forbes.com/forbes/2000/1113/6613068a.html

      "When the crashes began over a year ago, Sun believed the problem was caused not by its boxes but by some flaw in customer data centers." [Zinc Whiskers]

      You'll have to dig elsewhere to see Sun peddling Zinc Whiskers. All part of ancient history now, isn't it?

  3. Anonymous Coward

    And also not entirely forthcoming on why the SAN failed in the first place, while it waits for a review by PwC

    Why do people insist on paying money to these jumped up little accountant practices full of junior low-paid prats that dream of one day being a partner? I have never encountered anyone from one of these firms performing any consulting role apart from accountancy that has the faintest fuck of a clue what they are on about. They just write meaningless, barely surface-scraping reams of fluff and then recommend a system from "large vendor X" or "client's preferred choice". They are consultant arse-coverers, that is all. Merely there so that someone can show a signed-off report that says they made a reasonable choice.

    1. mr. deadlift

      you pay it to get the result you want. i thought everyone knew that.

    2. Tom Samplonius

      "Why do people insist on paying money to these jumped up little accountant practices full of junior low-paid prats that dream of one day being a partner?"

      PwC has 223,000 employees, so small they are not. PwC is who you call when you need advice that is beyond question, but their bill will be beyond belief as well. In fact, typically when a gov't agency brings in a high powered consultant to investigate some fiasco, the consultant's bill will be higher than the cost of the damages. But it's the only way to be sure.

      1. Anonymous Coward

        PwC has 223,000 employees.... which love watching Suits.

      2. Aristotles slow and dimwitted horse Silver badge

        Really? REALLY REALLY?

        "PwC is who you call when you need advice that is beyond question."

        Stop it please... you'll give me a pair of collapsed lungs from my laughing so hard at your comment. You're either (A) pitching for a job with them, or (B) have never really been at the coalface of one of their "managed" technical implementations or "consulting" exercises with regard to enterprise programmes and/or combined technical delivery.

        Despite what their marketing materials may say, or how many "consultants" they have speaking at meaningless IT seminars or the like, PwC in my considerable experience are totally and utterly f**king useless at providing solutions or providing deeper level root cause analysis for anything other than "power off, reboot, try the same thing again, and again.. and again..." in the hope that the problem is identified or fixed.

        Madness is sometimes defined as doing the same thing over and over and over but expecting a different result each time. This is (as I said), in my considerable experience of PwC, the way they work at both the Project/Programme (management) consulting level and the implementation (technical) level.

  4. TRT Silver badge

    Bloody hell...

    TWO unprecedented failures not seen by any HPE client internationally. Or is it three? Or four? I've heard the KCL one was down to a faulty IO controller wrecking data combined with a disk failure, does that count as two?

    Now, I could understand a common mode failure from which there's a learning outcome leading to a product improvement, that's almost expected, but they make it sound like there are a number of vulnerabilities in the design.

    1. Anonymous Coward

      Re: Bloody hell...

      "TWO unprecedented failures not seen by any HPE client internationally" Either there are some deep hidden flaws with the system or they are configuring it in a way that nobody else is. If it's configured in a way no one else is, it could either be a misconfiguration or a config that HPE never tested.

      1. TRT Silver badge

        Re: Bloody hell...

        These guys offer an end-to-end service. Including some platinum service, I think it's called, which includes pre-emptive maintenance.

  5. Calleb III

    > Silly us: it turns out that when the ATO wrote “commissioned” and then announced it had restored services, it had actually started work on the new SAN and left the old one in place.

    Silly you, for thinking that migration of TBs of data from a mission critical old storage array that has just been patched up with "spit and duct tape" to new kit can happen in 2-3 months...

  6. acheron

    Each time you call a storage array a SAN, a fairy dies somewhere. I guess...

    Honestly, I don't get it. I never heard anybody call a fileserver a LAN. So why do people insist on this SAN-thing?? I know that FibreChannel, SAN and all that stuff scare a lot of people, but wtf?!?

    We try to educate our server-guys here, and as soon as we see a change for the better, BOOOM, they read stuff like this and we have to start all over...

  7. Nate Amsden Silver badge

    quite possible software bugs

    HPE sent me this this morning about an urgent patch required on one of my 3PAR arrays (Gen5) that addresses problems involving controller restarts and downtime

    http://h20564.www2.hpe.com/hpsc/doc/public/display?docId=c05366405

    "This array is vulnerable to unexpected restarts and data unavailability without this critical patch installed. This critical patch includes quality improvements to 3.2.2 MU2 and 3.2.2 EMU2 that prevent unexpected array or controller node restarts during upgrade, service operations and normal operation."

    Looks like the release notes were written in December, not sure if the patch is that old and only recently got escalated to urgent or if the patch is new and just completed testing.

    My other arrays (Gen4) run older software so I guess they are not affected by the issue, though I have been planning on upgrading them so they are running the same code across the board where possible.

  8. HA guy

    HA anyone

    Keep what's useful and change what's needed.

    I reckon a couple of DataCore nodes in front of the HP storage would have prevented this and still could fix it or stop it from happening again, coz it sounds like there's the start of a track record.

    I should mention that I work for DataCore.

    1. marky_boi
      Facepalm

      Re: HA anyone

      I wouldn't trust HPE to configure my microwave. Had supposed SAN experts contracted to vet a firmware update procedure for a very complex solution for a very large sum of money. They forgot to mention that the SANs would take a moment to 'reconverge'. Took out our network for 3 minutes..... well, dealing with those clowns we now know to ask 10,000 questions about the config.. Hope the ATO has good IT staff to ask the same type of 10,000 questions.......

    2. Anonymous Coward

      Re: HA anyone

      What exactly would DataCore add that HPE doesn't already support at scale in many environments? If you're thinking replication or storage clustering, they can do that natively, but both need to be part of the needs assessment and so require the buy-in of the customer.

      3PAR has always had the ability to survive entire enclosure meltdowns; it's just part of an availability policy. Same for RAID: multi-parity protection has been there forever. Similarly, pre-emptive / proactive support will always be offered, but many customers will still opt for basic warranty or break-fix.

      I'm not suggesting any of this was ignored here, as the real info just isn't available. But I'd hazard a guess the failure was probably multifaceted and much more complex than most are assuming.

  9. storageer

    Ever heard of Virtual Instruments

    Storage and SAN/NAS performance monitoring products are the focus of Virtual Instruments and their VirtualWisdom products. It's what major players like AT&T, Sprint, T-Mobile, PayPal, eTrade, MetLife, Nationwide, Salesforce, Expedia and many US Government agencies (including the IRS) use to proactively avoid outages by being alerted to problems before they become outages. Yes, I do work for VI, but 90% of SAN/NAS-storage related issues can easily be avoided, as nearly 400 of our customers know well.


Biting the hand that feeds IT © 1998–2019