Twice-crashed HPE SANs at Oz Tax Office built for speed, not strength, and turned off error reporting

Oz taxation commissioner Chris Jordan has revealed that the Australian Taxation Office (ATO) has reached a commercial settlement with HPE over the two outages to its online services caused by 3PAR storage arrays. In remarks made to a Senate Committee today, Jordan said: "The turnkey service of data storage as per the 3PAR SAN …

  1. mr. deadlift

    So whoever set it up, or tampered with it, really fcsked it up.

    Correct me here, but in the past someone from Texas/3Par/HPE usually called about the monitoring and such - asking to get it optimized, or because an alert was triggered or couldn't be seen, and so on. Or at least I recall they used to do this.

    Or has the service taken such a slide?

    1. Anonymous Coward

      No, it's still there. They call and ask about the arrays that aren't reporting. If you tell them it's in a secure location and can't report, they will mark it as such and not ask about it.

      If it is reporting in and you have a problem, they will notify you and ask to fix it. If you have remote management enabled they will do it remotely; if you don't, they will do it via your desktops.

      If it's a hardware fault they will send an 'engineer'. This is where it doesn't go too well in my experience, as where I am it is outsourced and the company handling it doesn't know much about 3PARs.

    2. CJames

      Built for speed but performance not monitored

      Never underestimate the importance of monitoring data centre performance. In a millisecond world, why is monitoring every few minutes, or not monitoring at all, acceptable? The world has changed: application performance rules any online activity, so application and infrastructure owners need to continuously monitor performance in real time, not just availability, or agree a performance-based SLA with their supplier. Modern Infrastructure Performance Monitoring (IPM) platforms work at an application-centric level and can proactively prevent slowdowns or outages by alerting before end users are impacted.
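      To make "alerting before end users are impacted" concrete, here is a minimal sketch of the kind of rolling-percentile latency check such platforms run continuously. Everything in it - the simulated probe, the 20 ms threshold, the window size - is a made-up illustration, not any particular IPM product's API:

      ```python
      import random
      import statistics
      from collections import deque

      window = deque(maxlen=300)   # rolling window: the last 300 one-second samples
      P95_THRESHOLD_MS = 20.0      # made-up SLA figure: page before users notice

      def get_latency_ms() -> float:
          # Stand-in for a real application-level I/O probe; simulated here.
          return random.gauss(8.0, 3.0)

      def check_once() -> None:
          window.append(get_latency_ms())
          if len(window) < 30:
              return                                     # not enough samples yet
          p95 = statistics.quantiles(window, n=20)[-1]   # ~95th percentile
          if p95 > P95_THRESHOLD_MS:
              # Stand-in for a real paging/on-call integration.
              print(f"ALERT: latency p95 {p95:.1f} ms > {P95_THRESHOLD_MS} ms")

      for _ in range(600):         # ten simulated minutes of 1 Hz polling
          check_once()
      ```

      The point is the sampling interval: a per-second rolling window catches a brewing slowdown minutes before a five-minute availability ping would even notice.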

    3. cswilson1976

      They still do that. I have several 3PAR arrays I'm responsible for at my employer.

  2. Anonymous Coward

    TWELVE DRIVES failed?

    OK, I accept that the 'call home' functionality was apparently broken so the drive failures would pile up, but who was responsible for administering this array? Doesn't it show alerts in the GUI if even one drive has failed, let alone twelve???

    I wonder if the GUI has some little red triangle to indicate a fault condition, but the fault condition for "drive failure" looks the same as something stupid like "AC voltage dip < 220 volts" or whatever, so there's ALWAYS a fault/warning indicator - which is why the drive faults were (apparently) ignored?

    1. Anonymous Coward

      Re: TWELVE DRIVES failed?

      Based on the comments about "drive software", my guess is this problem was caused by a drive firmware issue - likely a large number of early-life drive failures which may not have been actual drive failures at all, but were determined to be such by the drive firmware. If this is the case, the drives may have failed fast enough that multiple drive failures occurred in a single RAID set before spare drives could be allocated and RAID rebuilds could complete.
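      A rough Monte Carlo sketch of that failure mode: drives dying at an elevated early-life rate versus a fixed rebuild window, where a second failure mid-rebuild loses the RAID-5 set. Every rate and size here is invented for illustration, not taken from the ATO incident:

      ```python
      import random

      DRIVES_PER_SET = 8           # one RAID-5 set: tolerates a single failure
      FAIL_RATE_PER_HOUR = 0.0005  # assumed elevated "early life" failure rate
      REBUILD_HOURS = 12           # assumed time to spare out and rebuild
      SIM_HOURS = 24 * 30          # simulate one month
      TRIALS = 20_000

      def set_loses_data() -> bool:
          rebuilding_until = -1.0
          for hour in range(SIM_HOURS):
              for _ in range(DRIVES_PER_SET):
                  if random.random() < FAIL_RATE_PER_HOUR:
                      if hour < rebuilding_until:
                          return True   # second failure during a rebuild: set lost
                      rebuilding_until = hour + REBUILD_HOURS
          return False

      losses = sum(set_loses_data() for _ in range(TRIALS))
      print(f"P(overlapping failures in one set) ~ {losses / TRIALS:.2%}")
      ```

      Even at failure rates that sound survivable one drive at a time, compressing many failures into a short burst means two of them landing inside one rebuild window stops being a tail event.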

  3. Anonymous Coward

    Shelf failure?

    12 drives + FC cable issues....could that be an entire shelf failure?

    1. Nate Amsden

      Re: Shelf failure?

      At one point there was a report of someone trying to move shelves around while the array was online (stretching cables, etc.). So maybe no actual shelf failure, but perhaps an induced failure from doing stupid things.

      Good that HP accepted responsibility for it. Also, obvious lesson: don't store your backups on the primary storage array - budget issues obviously caused that. The original report made it sound like multiple full array failures (not impossible, just unlikely).

      I got the real story as to what happened (as well as what happened in the UK) but will stick to what is public. Long story short, I have complete confidence in my 3PAR arrays.

      (3par customer since 2006)

      1. Griffo

        Re: Shelf failure?

        I think most 3Par shelves take 24 Drives. So they lost half a shelf.

        Perhaps someone slid out a shelf for some reason and poor cable management meant that the FC cables fell out of the rear on one side.

        Why that resulted in a situation that wasn't easily recoverable is the real question.

        1. thondwe

          Re: Shelf failure?

          Can take up to 24 drives (in the new HPE models), so it could well just be a part-populated shelf going down. If the fibre cables are messed up, then taking a controller offline (there are either 2 or 4 controllers in these) could easily lose contact with the shelf...

          Sounds rather like a "cheap" 3PAR config not set up properly - not enough shelves/spindles, etc. No monitoring, seriously!? Configured for performance - so RAID 1, not RAID 5/6.

          Killing both SANs would I suspect be easy if they replicate...

        2. Phil Kingston

          Re: Shelf failure?

          The accidental unplugging due to manual shelf movement has been a rumour.

          Coupled with the comments about FC cable management, I reckon we have a winner - it sounds very much like (aside from the other failures and poor data management) a shelf was moved. And I think I remember reading that remote hands were attempting to move kit from one rack to another. Oh dear.

      2. Scuby

        Re: Shelf failure?

        I've been running 3PAR arrays for 10 years, through multiple generations of the product.

        Something is definitely amiss, and like Nate, I also have complete confidence in my 3PAR arrays. (All of which are still in production: E200s/F400s/T400/7200s and 8440s.)

  4. Your alien overlord - fear me

    This wasn't set up/administered by Plutus contractors by any chance?

  5. Tim99 Silver badge
    Alert

    Settlement?

    “We have reached a commercial settlement with HPE, the detailed terms of which are subject to contractual confidentiality. The settlement recoups key costs incurred by the ATO, and provides additional and higher grade IT equipment giving the ATO a world-class storage network.”

    So does that mean that HPE just discounted their service visit bill, replaced the disks and the connectors, and gave them another shelf so that proper fault tolerance could be used? Or did they replace the whole steaming pile with kit that was fit for purpose, at no cost, and pay for downtime etc?

    1. Adam 1

      Re: Settlement?

      Er, you mustn't be familiar with the recent history of Oz government IT systems.

      Exclusive to El Reg: we have the transcripts from the confidential negotiations....

      Gov: You stuffed up big time. We will sue for $500 million in losses.

      HPE: counter offer. We will pay you $1 and you can say it was our fault.

      Gov: Even better. I'm hungry, who wants lunch?

  6. Anonymous Coward

    Problems?

    1. My fiber cables are probably not "optimal" either, but they are secure, visible and untangled.

    2. What? I can read/write all my disks fine. They may not be on the latest firmware, but that does not mean they're unreadable. What does this actually mean?

    3. I've never enabled the back-to-base reporting. It wouldn't make it through our firewall anyway. Everything - and I mean Informational and above - is reported to the people responsible for the SAN 24 hours a day (see the sketch below).

    Either the storage was treated like a cesspit and run by morons, or this is a stitch-up to shift the blame onto the maintainers. I too could have a gold-standard install with perfect compliance and reporting, as long as I was given unlimited space and budget for the physical system, unlimited and unfettered downtime to keep it all at the latest firmware, and carte blanche on security to allow for remote management.
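    A minimal sketch of that kind of in-house, severity-filtered forwarding - syslog-style severities, with a print standing in for the real paging hook, and an event format invented for the example:

    ```python
    # Syslog-style severities: lower number = more severe.
    SEVERITIES = {"emergency": 0, "alert": 1, "critical": 2, "error": 3,
                  "warning": 4, "notice": 5, "informational": 6, "debug": 7}

    # "Informational and above": forward everything except debug chatter.
    FORWARD_AT_OR_ABOVE = SEVERITIES["informational"]

    def notify_oncall(event: dict) -> None:
        # Stand-in for a real pager/ticketing integration.
        print(f"[{event['severity'].upper()}] {event['message']}")

    def handle_event(event: dict) -> None:
        if SEVERITIES[event["severity"]] <= FORWARD_AT_OR_ABOVE:
            notify_oncall(event)

    handle_event({"severity": "error", "message": "disk 3:2:1 failed"})     # forwarded
    handle_event({"severity": "debug", "message": "cache scrub complete"})  # filtered out
    ```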

  7. Anonymous Coward

    More detailed at least...

    Than KCL's 3PAR failure analysis.

  8. James Pond
    Happy

    Common guy interpretation?

    Let me see if I can interpret this for the common person.

    1- The fibre optic cables feeding the SAN were not optimally fitted -

    How is this possible? There should be a "click" sound when the LC connector is fitted. It's always been "inserted" or "not inserted" - I don't recall any instance where an FC cable can be halfway inserted. Maybe the cables were "bent".

    2- Disk drives on the SAN had software bugs that made the stored data on the drives inaccessible or unable to be read -

    If I can't read data on a disk, blame it on a firmware bug instead of a software bug.....

    3- Some monitoring features were not activated, including a "back-to-base" tool to report operating errors -

    When setting up - "ATO is a very very very secure site, I DON'T WANT CALLBACK! Disable callback! My peons will monitor this!"

    Post mortem - "How did 12 disks fail without anyone knowing??? Didn't you promise callback??? Why is it not working????"

    4- SAN design has overemphasis on performance features rather than stability or resilience -

    When setting up - "I am on a tight very budget. Can you give me X TB usable by configuring RAID 5?"

    Post mortem - "WHY ARE THE DISKS ON RAID 5??? These data are critical! I WANT THEM ON RAID 6 OR RAID 1!!!"

    1. mr_souter_Working

      Re: Common guy interpretation?

      "1- The fibre optic cables feeding the SAN were not optimally fitted -

      How is this possible? There should be a "click" sound when the LC connector is fitted. It's always been "inserted" or "not inserted" - I don't recall any instance where an FC cable can be halfway inserted. Maybe the cables were "bent"."

      At a place I worked previously, we had an issue with some servers with FC-attached storage arrays that took forever to start back up. Eventually I went to each of the DCs and discovered that, years earlier (when they were installed), someone had attached the first FC card in the server to the input on the storage array's primary controller, and then the output of the same controller to the output of the second card in the same server. It took me a while of head-scratching before I finally figured out where the cables were supposed to go - they were all properly seated, and to a casual glance they appeared fine, and the servers started and worked, just VERY slow to boot - but that miscabling caused repeated array issues for years that nobody had ever really bothered with. We found that all of the servers were connected the same way to their external storage (6 servers in all). Luckily it was only the Exchange system, and it was fully redundant (active/passive nodes in the primary DC with an offsite passive node in the DR location).

    2. Anonymous Coward

      WHY ARE THE DISKS ON RAID 5???

      RAID-5 is not a typical high-performance configuration (especially if the system is write-intensive). A 12-disk failure may easily take out both disks of a RAID-1 array (or a RAID-10 one, if you want performance) - see the quick check below.

      I've seen SAN software and disk firmware updates marked CRITICAL because there were bugs that could lead to disks losing data or becoming inoperative. You just usually need to dig through the support site to find them.
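      A quick Monte Carlo check of how plausible the "both halves of a mirror" scenario is. The pool geometry and failure count are invented for illustration:

      ```python
      import random

      def p_mirror_pair_lost(total_drives: int, pairs: int, failures: int,
                             trials: int = 100_000) -> float:
          """Probability that `failures` random drive losses in a pool of
          `total_drives` (the first 2*pairs forming RAID-1 pairs) kill a pair."""
          hits = 0
          for _ in range(trials):
              failed = set(random.sample(range(total_drives), failures))
              if any({2 * p, 2 * p + 1} <= failed for p in range(pairs)):
                  hits += 1
          return hits / trials

      # Made-up example: 96-drive pool, 40 mirrored pairs, 12 failed drives.
      print(f"{p_mirror_pair_lost(total_drives=96, pairs=40, failures=12):.1%}")
      ```

      Under these made-up numbers, losing at least one complete mirror comes out around 40-50% - closer to a coin flip than a freak event.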

      1. Anonymous Coward

        Re: WHY ARE THE DISKS ON RAID 5???

        My guess is the performance vs. availability trade-off was RAID-5 vs. RAID-6.

        RAID-5 has been good enough for performance workloads for at least 15 years now. Dedicated hardware-based XOR engines (ASICs) and large write caches to minimize read-modify-write cycles pretty much solved the earlier performance challenges with RAID-5. For those without hardware-based XOR engines, general CPU performance has increased to the point that storage controllers can easily handle the parity calculations.

        However, there are some legacy storage systems out there which perform well with RAID-5 but poorly with RAID-6. It is usually systems with hardware-based XOR engines optimized for RAID-5. On these systems, RAID-6 either requires two cycles of the XOR engine, or it is done on the main controller processor instead - and those controller processors are often not sized for parity calculations. 3Par is one of those which calculates RAID-5 parity in an ASIC, not on the main controller processor. I do not know how 3Par calculates RAID-6 double parities.

        There are a lot of other systems which do not use hardware XOR engines (NetApp FAS, Nimble, Tintri, XtremIO, Pure, Tegile, and others) and which can do RAID-6 with little performance difference from RAID-5.
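        For the curious, here is the textbook math those XOR engines implement: RAID-5's single parity is a plain byte-wise XOR across the stripe, while RAID-6's second (Q) parity needs Galois-field multiplication, which is exactly why it costs more without dedicated hardware. This is the generic RAID-6 scheme, not a claim about 3Par's internals:

        ```python
        def gf_mul(a: int, b: int) -> int:
            """Multiply in GF(2^8) with polynomial 0x11d (the classic RAID-6 field)."""
            result = 0
            while b:
                if b & 1:
                    result ^= a
                a <<= 1
                if a & 0x100:
                    a ^= 0x11d
                b >>= 1
            return result

        def p_parity(blocks: list[bytes]) -> bytes:
            """RAID-5 style P parity: byte-wise XOR across all blocks."""
            out = bytearray(len(blocks[0]))
            for block in blocks:
                for i, byte in enumerate(block):
                    out[i] ^= byte
            return bytes(out)

        def q_parity(blocks: list[bytes]) -> bytes:
            """RAID-6 Q parity: Q = sum of g^i * D_i over GF(2^8), generator g = 2."""
            out = bytearray(len(blocks[0]))
            coeff = 1                      # g^0
            for block in blocks:
                for i, byte in enumerate(block):
                    out[i] ^= gf_mul(coeff, byte)
                coeff = gf_mul(coeff, 2)   # advance to the next power of g
            return bytes(out)

        data = [b"ATO!", b"3PAR", b"SAN."]
        p = p_parity(data)
        # RAID-5 recovery of one lost block: XOR the parity with the survivors.
        assert p_parity([p, data[1], data[2]]) == data[0]
        q = q_parity(data)   # with both P and Q, any two lost blocks are recoverable
        ```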

        1. Scuby

          Re: WHY ARE THE DISKS ON RAID 5???

          HPE now state that all critical volumes on SAS should be either R1 or R6 due to larger disk capacities (it has always been the case for NL 7K drives).

          All RAID calculations, regardless of level, are performed using the ASICs.

          The RAID-6 calculation uses the XOR engine in the ASIC, but it must compute two distinct parities (R5 only needs one) over more data, since the RAID sets are larger.

          Most writes require updating only two parities. However, a fraction (1/3 for step size 8) of the data blocks are used to compute 3 parities, so updating those blocks requires reading/updating 3 parities - hence the odd figure of 6 2/3 back-end IOs per write for RAID-6.
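          Spelling out the arithmetic behind that 6 2/3, using the standard read-modify-write accounting (the data block costs one read plus one write, and so does each parity touched):

          ```python
          def backend_ios(parities_touched: int) -> int:
              # read old data + write new data, plus a read + write per parity
              return 2 + 2 * parities_touched

          raid5 = backend_ios(1)                                    # 4 back-end IOs
          raid6 = (2 / 3) * backend_ios(2) + (1 / 3) * backend_ios(3)
          print(raid5, raid6)                                       # 4 and 6.666...
          ```

          Two thirds of small writes cost 6 back-end IOs, one third cost 8, averaging exactly 6 2/3.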

        2. Anonymous Coward

          Re: WHY ARE THE DISKS ON RAID 5???

          With the larger sizes of current hard drives, recovering from a failed drive is a pretty risky prospect now. The chance of hitting a URE during the rebuild greatly increases as drive capacities grow. You need to be able to read every sector of every remaining drive in the array for it to rebuild successfully; if a single sector experiences a URE, you are dead in the water (see the numbers below).

          I can't think of a single enterprise use case where RAID 5 or any other single-parity solution should be considered acceptable.
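          To put numbers on that, a rough calculation assuming the commonly quoted unrecoverable-read-error rates of one per 10^14 bits (desktop-class) and one per 10^15 bits (enterprise-class), with an invented array geometry:

          ```python
          import math

          def p_ure_during_rebuild(bytes_to_read: float, ber: float) -> float:
              """P(at least one URE) while reading bytes_to_read, at ber errors/bit."""
              bits = bytes_to_read * 8
              # 1 - (1 - ber)^bits, computed stably for tiny per-bit rates
              return -math.expm1(bits * math.log1p(-ber))

          # Rebuilding one member of an 8-drive RAID 5 of 8 TB drives means
          # reading all 7 survivors end to end (made-up geometry).
          to_read = 7 * 8e12
          for ber in (1e-14, 1e-15):
              print(f"BER {ber:.0e}: P(URE) ~ {p_ure_during_rebuild(to_read, ber):.0%}")
          ```

          That works out to roughly 99% at desktop-class rates, and still about a one-in-three failure at enterprise-class rates - which is exactly the argument for dual parity.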

    3. Down not across

      Re: Common guy interpretation?

      1- The fibre optic cables feeding the SAN were not optimally fitted -

      How is this possible? There should be a "click" sound when the LC connector is fitted. It's always been "inserted" or "not inserted" - I don't recall any instance where an FC cable can be halfway inserted. Maybe the cables were "bent".

      One situation I have seen is when someone with OCD had zip-tied fibres (and other cables) to the rack posts. Yes, rather tight. That explained why we had some storage and network issues.

      Granted, it looked very neat and tidy; too bad it didn't work too well.

      1. Fortycoats

        Re: Common guy interpretation?

        Zip-Ties + Fibre Cables = Bad Idea

        Maybe the cable might have been slightly damaged - not enough to cause the port to go offline, just enough for a weak signal. Sometimes that's hard to pick up without appropriate monitoring parameters on the SAN switch (the default thresholds might not have caught it).

        Also, the disk firmware issue can affect multiple disks at once. I saw one advisory that said that after a certain amount of running time (about 3 years), some flash disks could shut down and restart themselves. Since all the disks in a SAN array were started at the same time, it could have caused the affected pool to go offline. It was prevented by installing new disk firmware before the 3 years were up.
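        The defensive check is simple once such an advisory is known about. A sketch with invented drive data - in practice the power-on hours would come from the array or from SMART:

        ```python
        # Hypothetical inventory: power-on hours per drive slot (e.g. from SMART data).
        drives = {"cage0:slot0": 25_900, "cage0:slot1": 25_898, "cage0:slot2": 25_901,
                  "cage1:slot0": 4_210}

        BUG_THRESHOLD_HOURS = 3 * 365 * 24   # ~3 years, per the advisory described above
        WARN_MARGIN_HOURS = 90 * 24          # start nagging 90 days ahead of the window

        for slot, hours in sorted(drives.items()):
            remaining = BUG_THRESHOLD_HOURS - hours
            if remaining <= WARN_MARGIN_HOURS:
                print(f"{slot}: {remaining} h until the bug window - update firmware now")
        ```

        Note the clustered values in the example: drives powered on together age together, which is exactly how a whole pool can drop offline at once.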

  9. ThePhantom

    A long time ago...

    Many years ago, a major bank was trying to move off of IBM to Tandem NonStop (now HPE NonStop). The disks kept failing in the middle of the night, so yours truly was dispatched to work the night shift to figure out what was going on.

    In a nutshell, I caught the IBM night operator opening the backs of the washing machine sized 30MB drive cabinets and loosening the cables. When the nightly close fired up at 2AM, the shaking of the drives caused the connections to become intermittent, crashing the systems.

    Screws tightened, operator fired, problem solved.

    1. Hazmoid
      FAIL

      Re: A long time ago...

      Was he fired because he was caught, or because he couldn't come up with sabotage that was less obvious? ;) If he hadn't been caught, he probably would have ended up as the head of IBM.

  10. This post has been deleted by its author

  11. pmitham

    Operator error....

    This seems very much like a setup issue on top of an operational one. The vendor (or installer, whoever) needs to take responsibility for the fiber cables being poorly routed, but poor routing should not cause fiber to fail by itself. More importantly, how the h$ll do 12 drives fail without anyone noticing? It's extremely unlikely that all 12 failed at the same time. Even without "call home" features turned on, the storage admin should have seen the component failures in the management interface - unless they were in the habit of not managing their environment (lazy). I've never worked in an environment where the storage admin just let drive failures rack up before dealing with them! It was always repaired ASAP, regardless of how many more drives could fail before there was an issue.
