HPE's Australian tax failures may have been user error

Users of HPE's 3Par kit say the company's private explanation of the twin outages at the Australian Taxation Office suggests the hardware problems may have been caused by user error. One 3Par user, who requested anonymity, told The Register his HPE account manager told a story of serious physical damage to the array. A second …

  1. Nate Amsden

    by default

    3PAR systems will protect against an entire shelf failing (they call it "cage level availability"), but it does restrict what types of RAID you can use. E.g. if you have only 2 shelves of disks you can only use RAID 10, not RAID 5 or RAID 6. If you have 8 shelves you could use RAID 6 6+2 if you really wanted RAID 6; if you wanted RAID 5, the minimum would be 3 shelves (for RAID 5 2+1). There are also minimum numbers of disks required on each shelf, as well as minimums when adding drives to shelves (e.g. if you have a 4-shelf 3PAR system and you want to add more disks/SSDs, the minimum number of SSDs you can add is 8).

    Otherwise the admin of the array can change the default behavior so it does not protect against a shelf failing (3PAR calls it 'magazine level availability'; although the concept of magazines is no longer in the Gen5 8k/20k hardware systems, they have kept the term for now). Changing this behavior has no effect on the minimum number of drives per shelf or the minimums on upgrades per shelf.

    You can also give some volumes cage level availability and others magazine level, just like you can have some volumes on RAID 5, some on RAID 6 and some on RAID 10, all while sharing the same spindles/SSDs (or you can isolate workloads on different spindles/SSDs if you prefer, and you can of course move stuff around between any tier or media type without app impact).
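    To make the cage-level rule of thumb concrete, here is a rough sketch in plain Python (this is not the 3PAR CLI or its real placement logic, just the arithmetic, using the example set sizes above): with cage level availability every member of a RAID set has to land on a different shelf, so the set size can't exceed the number of cages you have.

        # Toy illustration of the cage-level availability rule of thumb
        # described above -- not 3PAR's actual placement logic.

        def min_shelves(raid_set_size: int, ha_level: str) -> int:
            """Minimum shelf (cage) count for a RAID set under each HA policy."""
            if ha_level == "cage":
                # Every member of the set goes on its own cage.
                return raid_set_size
            # Magazine-level HA only spreads members across magazines/slots,
            # so a single shelf is nominally enough.
            return 1

        examples = {
            "RAID 10 (1+1)": 2,
            "RAID 5 (2+1)": 3,
            "RAID 6 (6+2)": 8,
        }

        for name, set_size in examples.items():
            print(f"{name}: needs >= {min_shelves(set_size, 'cage')} shelves for "
                  f"cage level HA, >= {min_shelves(set_size, 'mag')} for magazine level")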

    Back in the early days of 3PAR, for eval systems they would encourage the customer to unplug a shelf or yank a live controller to demonstrate the resiliency (provided it was configured for the default/best practice of cage level availability).

    In the 11 years I have had 3PAR I have yet to have a shelf fail, though I do stick to cage level availability for all data wherever possible. The only time I have moved a 3PAR array was at a data center that had weight restrictions per cabinet, so we had installed steel plates to distribute the weight and needed to get the array up on the plates. My 3PAR SE plus one or two other professional services people came on site, we shut the system down, removed all of the drive magazines, moved the cabinets up onto the steel-plated platform, re-inserted everything, re-cabled everything and turned it back on. Those 3PAR systems could get up to 2,000 pounds per rack fully loaded, and I think the data center had a limit of 800 pounds per cabinet or something (in a high-rise, 10+ floors up).

  2. CanadianMacFan

    Certainly was user error

    And that was to go with HPE in the first place. I haven't used any of their 3PAR hardware, but if it's anything like their first generation of blade computers, which I have first-hand working knowledge of, then I wouldn't want to touch their hardware.

    The only good thing I ever had to say about those blades was that if you ever needed to warm up and you were in the data centre, you could stand behind them. We had to put in special cooling to handle the heat coming off of them. And having to shut down the whole rack to change a power supply was just nuts.

    1. a_yank_lurker

      Re: Certainly was user error

      There seems to be a strong possibility of some very elementary design errors as well. Depending on who actually did the design and what the paper trail shows, the fickle finger of infamy will find a juicy target. I tend to think some outside insultant did an incompetent design, so it becomes a question of who hired the insultant.

    2. Nate Amsden

      Re: Certainly was user error

      How can you compare the 1st generation of a blade system with a 5th-generation (ASIC-wise; system-wise it could be 6th or 7th or more) system that has been maturing for at least 14 years now? It would be like saying don't deploy the current HP blade system because the original ones many years ago were bad.

    3. Anonymous Coward

      Re: Certainly was user error

      Certainly your certainty cannot be so certain. I too have "first hand experience using gen1 blades", and I wonder if you're talking about the PDUs, not the actual PSUs, as those were hot-swappable even in the very first Gen 1 chassis... let alone ProLiants from many years prior.

      Extra cooling on blades/chassis, on top of the Boeing-level fans they include? The only user error there would be a poorly designed server room/air-conditioning setup, I would have thought...

  3. Adam 1

    > The Register has filed a freedom of information request with the ATO, seeking documents explaining the nature of the outages

    HAHAHAHAHA. You must be new here. When even the Attorney General's appointments over a specific date range are "too much effort" to provide in response to an FOI request, I really don't like your chances...

  4. Pompous Git Silver badge

    "edcision"

    Is that where 'eds will roll?

  5. Richard 26

    In related news

    HPE engineers on Scarif warn that attempting to service their tape autoloader whilst the equipment is live can lead to a catastrophic failure.

  6. druck Silver badge

    KCL too

    What with the King's College London failure/fiasco, 3PAR is in the news for all the wrong reasons.

    1. David Roberts

      Re: KCL too

      Didn't KCL have the same problem of backup resources on the live drives?

      Once may be accidental, but.....

  7. Anonymous Coward

    Let me guess...

    Some ATO beancounter didn't like that the HPE engineer left a gap and wasted a Rack Unit.

    The decision was made to shift the kit live so they could put a blank spacer in the bottom of the rack, which is aesthetically more pleasing.

    1. Anonymous Coward

      Re: Let me guess...

      "Some ATO beancounter didn't like that the HPE engineer left a gap and wasted a Rack Unit."

      I hate it when idiots (telco people in my experience) leave 2- or 3-unit gaps, or install the patch panels starting from rack hole #2 or something similar. Or a rackable 1U router/switch is put on a 4U shelf instead of using rack mounts. Or short distances are cabled "temporarily" with 5m cables because the short cables weren't available. Or cables are run over other equipment without even trying to route them. Or servers are installed without cable management so they can't be pulled out because the cables are stuck and way too short.

      Worst case was at a client where a server was rack mounted and several servers were piled on top of it without rack rails. And the bottom server of course needed some physical service.

      1. John Brown (no body) Silver badge

        Re: Let me guess...

        "Worst case was at a client where a server was rack mounted and several servers were piled on top of it without rack rails. And the bottom server of course needed some physical service."

        I once had a similar experience. The customer had a 4-hour response time contract for engineer visits. I went on site and found the server almost totally hidden behind patch cables draped all over the place, back and front. A complete mess. I was told in no uncertain terms that cables must not be unplugged or I'd be in breach of contract. I took a couple of photos and emailed them to my boss. He told me to leave site, emailed their boss and basically told him they were in breach of contract by not providing access. He demanded that they remove the server and install it on a bench before we would attend again. It took two weeks for them to schedule enough downtime to get that bloody server out of the rack. Apparently it had to be done in stages and took about eight 1-hour windows over those two weeks.

        The chain of director level emails I got copied in on was quite fun to read too :-)

      2. Nate Amsden

        Re: Let me guess...

        Kind of related here, but with my first all-flash 3PAR system the installer wanted to install it in the middle of what was left of the rack for expansion purposes. I have been a 3PAR customer for a long time and knew what I wanted; I needed/wanted it right where I told him to put it. He racked it wrong and I had him unrack it and fix it. He said that was a bad idea for upgrades; I told him that with the size of SSDs and the number of available slots in the system it is extremely unlikely we will ever add another shelf to that array in its lifetime (almost two and a half years in, the system is 30% full today, it started at 16% populated, and maybe it gets to 50% in the next 18 months).

      3. Vic

        Re: Let me guess...

        "I hate it when idiots (telco people in my experience) leave 2- or 3-unit gaps"

        I had to do that a while back.

        We had 1U rack-mounted kit, but the airflow was side-to-side, not front-to-back. Putting units in every rack space meant that they overheated.

        The recommended way to mount them was to leave a 1U gap between each unit - but if you did that, some monkey would come along and slot his unit in the gap, and you'd only find out when the overnight tests failed because the kit had shut down...

        Vic.

        1. Anonymous Crowbar

          Re: Let me guess...

          Turn the rack?

  8. JohnMartin

    Was a failure domain analysis done before implementation?

    I thought it was meant to be replicated to another array... none of the explanations cover why the replica at the DR site wasn't available. Either way, having your backup infrastructure in the same failure domain as your production system is unforgivable; it's a fundamental design principle of data availability.

    I've produced long, documented lists of failure domains, had to analyse them using things like reliability block diagrams, and detailed the impact of every possible failure scenario, including RPO and RTO, for government departments who were reputedly less paranoid about these things than the ATO... and that was just for an RFI, not a production environment.
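    To give a flavour of what that reliability block diagram arithmetic looks like, here is a toy sketch in Python (the availability figures are invented purely for illustration; the point is only that putting both copies behind the same failure domain places that domain in series with everything else):

        # Toy reliability-block-diagram arithmetic. Availability numbers are
        # made up for illustration only.

        def series(*avail):
            """All components must work (shared dependency / same failure domain)."""
            a = 1.0
            for x in avail:
                a *= x
            return a

        def parallel(*avail):
            """Redundant components in independent failure domains (any one suffices)."""
            u = 1.0
            for x in avail:
                u *= (1.0 - x)
            return 1.0 - u

        array = 0.999   # hypothetical availability of one array
        site = 0.9995   # hypothetical availability of one site/failure domain

        # Production and DR copies in independent failure domains:
        independent = parallel(series(array, site), series(array, site))

        # Both copies behind the same site -- the site becomes a single point
        # of failure in series with the redundant pair:
        shared = series(site, parallel(array, array))

        print(f"independent failure domains: {independent:.6f}")
        print(f"shared failure domain:       {shared:.6f}")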

    You have to assume that people will do stupid things; operator error, including stuff like "oops, that wasn't the test instance I just dropped", is the leading cause of data loss and downtime, so if you don't factor that into the design you're failing in your duty of care.

    Rather than blaming "the dumb users", maybe someone should be asking who designed the system in this way, who wrote the operational procedures, and why they weren't tested.
