Twice-crashed HPE SANs at Oz Tax Office built for speed, not strength, and turned off error reporting

Oz taxation commissioner Chris Jordan has revealed that the Australian Taxation Office (ATO) has reached a commercial settlement with HPE over the two outages to its online services caused by 3PAR storage arrays. In remarks made to a Senate Committee today, Jordan said: "The turnkey service of data storage as per the 3PAR SAN …

  1. mr. deadlift

    So whoever set it up, or tampered with it, really fcsked it up.

    Correct me here, but in the past someone from Texas/3Par/HPE usually called about the monitoring and such - asking to get it optimized, or because an alert was triggered or couldn't be seen, and so on. Or at least I recall they used to do this.

    Or has the service taken such a slide?

    1. Anonymous Coward

      No, it's still there. They call and ask about the arrays that aren't reporting. If you tell them it's in a secure location and can't report, they will mark it as such and not ask about it.

      If it is reporting in and you have a problem, they will notify you and ask to fix it. If you have remote management enabled they will do it remotely; if you don't, they will do it via your desktops.

      If it's a hardware fault they will send an 'engineer'. This is where it doesn't go too well in my experience, as where I am it is outsourced and the company handling it doesn't know much about 3PARs.

    2. CJames

      Built for speed but performance not monitored

      Never underestimate the importance of monitoring data centre performance. In a millisecond world, why is monitoring every few minutes, or not monitoring at all, acceptable? The world has changed: application performance rules any online activity, so application and infrastructure owners need to continuously monitor performance in real time, not just availability, or agree a performance-based SLA with their supplier. Modern Infrastructure Performance Monitoring (IPM) platforms work at an application-centric level and can proactively prevent slowdowns or outages by alerting before end users are impacted.
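      To make "alerting before end users are impacted" concrete, here is a minimal sketch of the kind of rolling-percentile latency check such platforms run continuously. Everything in it - the simulated probe, the 20 ms threshold, the window size - is a made-up illustration, not any particular IPM product's API:

      ```python
      import random
      import statistics
      from collections import deque

      window = deque(maxlen=300)   # rolling window: the last 300 one-second samples
      P95_THRESHOLD_MS = 20.0      # made-up SLA figure: page before users notice

      def get_latency_ms() -> float:
          # Stand-in for a real application-level I/O probe; simulated here.
          return random.gauss(8.0, 3.0)

      def check_once() -> None:
          window.append(get_latency_ms())
          if len(window) < 30:
              return                                     # not enough samples yet
          p95 = statistics.quantiles(window, n=20)[-1]   # ~95th percentile
          if p95 > P95_THRESHOLD_MS:
              # Stand-in for a real paging/on-call integration.
              print(f"ALERT: latency p95 {p95:.1f} ms > {P95_THRESHOLD_MS} ms")

      for _ in range(600):         # ten simulated minutes of 1 Hz polling
          check_once()
      ```

      The point is the sampling interval: a per-second rolling window catches a brewing slowdown minutes before a five-minute availability ping would even notice.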

    3. cswilson1976

      They still do that. I have several 3PAR arrays I'm responsible for at my employer.

  2. Anonymous Coward

    TWELVE DRIVES failed?

    OK, I accept that the 'call home' functionality was apparently broken so the drive failures would pile up, but who was responsible for administering this array? Doesn't it show alerts in the GUI if even one drive has failed, let alone twelve???

    I wonder if the GUI has some little red triangle to indicate a fault condition, but the fault condition for "drive failure" looks the same as something stupid like "AC voltage dip < 220 volts" or whatever, so there's ALWAYS a fault/warning indicator - which is why the drive faults were (apparently) ignored?

    1. Anonymous Coward

      Re: TWELVE DRIVES failed?

      Based on the comments about "drive software", my guess is this problem was caused by a drive firmware issue - likely a large number of early-life drive failures which may not have been actual drive failures at all, but were determined to be such by the drive firmware. If this is the case, the drives may have failed fast enough that multiple drive failures occurred in a single RAID set before spare drives could be allocated and RAID rebuilds could complete.
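      A rough Monte Carlo sketch of that failure mode: drives dying at an elevated early-life rate versus a fixed rebuild window, where a second failure mid-rebuild loses the RAID-5 set. Every rate and size here is invented for illustration, not taken from the ATO incident:

      ```python
      import random

      DRIVES_PER_SET = 8           # one RAID-5 set: tolerates a single failure
      FAIL_RATE_PER_HOUR = 0.0005  # assumed elevated "early life" failure rate
      REBUILD_HOURS = 12           # assumed time to spare out and rebuild
      SIM_HOURS = 24 * 30          # simulate one month
      TRIALS = 20_000

      def set_loses_data() -> bool:
          rebuilding_until = -1.0
          for hour in range(SIM_HOURS):
              for _ in range(DRIVES_PER_SET):
                  if random.random() < FAIL_RATE_PER_HOUR:
                      if hour < rebuilding_until:
                          return True   # second failure during a rebuild: set lost
                      rebuilding_until = hour + REBUILD_HOURS
          return False

      losses = sum(set_loses_data() for _ in range(TRIALS))
      print(f"P(overlapping failures in one set) ~ {losses / TRIALS:.2%}")
      ```

      Even at failure rates that sound survivable one drive at a time, compressing many failures into a short burst means two of them landing inside one rebuild window stops being a tail event.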

  3. Anonymous Coward

    Shelf failure?

    12 drives + FC cable issues....could that be an entire shelf failure?

    1. Nate Amsden

      Re: Shelf failure?

      At one point there was a report of someone trying to move shelves around while the array was online (stretching cables, etc.). So maybe no actual shelf failure, but perhaps an induced failure from doing stupid things.

      Good that HP accepted responsibility for it. Also, obvious lesson: don't store your backups on the primary storage array - budget issues obviously caused that. The original report made it sound like multiple full array failures (not impossible, just unlikely).

      I got the real story as to what happened (as well as what happened in the UK) but will stick to what is public. Long story short, I have complete confidence in my 3PAR arrays.

      (3par customer since 2006)

      1. Griffo

        Re: Shelf failure?

        I think most 3Par shelves take 24 Drives. So they lost half a shelf.

        Perhaps someone slid out a shelf for some reason and poor cable management meant that the FC cables fell out of the rear on one side.

        Why that resulted in a situation that wasn't easily recoverable is the real question.

        1. thondwe

          Re: Shelf failure?

          Can take up to 24 drives (in the new HPE models), so it could well just be a part-populated shelf going down. If the fibre cables are messed up, then taking a controller offline (there are either 2 or 4 controllers in these) could easily lose contact with the shelf...

          Sounds rather like a "cheap" 3PAR config not set up properly - not enough shelves/spindles, etc. No monitoring, seriously!? Configured for performance - so RAID 1, not RAID 5/6.

          Killing both SANs would I suspect be easy if they replicate...

        2. Phil Kingston

          Re: Shelf failure?

          The accidental unplugging due to manual shelf movement has been a rumour.

          Coupled with the comments about FC cable management, I reckon we have a winner - it sounds very much like (aside from the other failures and poor data management) a shelf was moved. And I think I remember reading that remote hands were attempting to move kit from one rack to another. Oh dear.

      2. Scuby

        Re: Shelf failure?

        I've been running 3PAR arrays for 10 years, through multiple generations of the product.

        Something is definitely amiss, and like Nate, I also have complete confidence in my 3PAR arrays. (All of which are still in production: E200s/F400s/T400/7200s and 8440s.)

  4. Your alien overlord - fear me

    This wasn't set up/administered by Plutus contractors by any chance?

  5. Tim99 Silver badge
    Alert

    Settlement?

    “We have reached a commercial settlement with HPE, the detailed terms of which are subject to contractual confidentiality. The settlement recoups key costs incurred by the ATO, and provides additional and higher grade IT equipment giving the ATO a world-class storage network.”

    So does that mean that HPE just discounted their service visit bill, replaced the disks and the connectors, and gave them another shelf so that proper fault tolerance could be used? Or did they replace the whole steaming pile with kit that was fit for purpose, at no cost, and pay for downtime etc?

    1. Adam 1

      Re: Settlement?

      Er, you mustn't be familiar with the recent history of Oz government IT systems.

      Exclusive to El Reg: we have the transcripts from the confidential negotiations....

      Gov: You stuffed up big time. We will sue for $500 million in losses.

      HPE: counter offer. We will pay you $1 and you can say it was our fault.

      Gov: Even better. I'm hungry, who wants lunch?

  6. Anonymous Coward

    Problems?

    1. My fiber cables are probably not "optimal" either, but they are secure, visible and untangled.

    2. What? I can read/write all my disks fine. They may not be on the latest firmware, but that does not mean they're unreadable. What does this actually mean?

    3. I've never enabled the back-to-base reporting. It wouldn't make it through our firewall anyway. Everything - and I mean Informational and above - is reported to the people responsible for the SAN 24 hours a day (see the sketch below).

    Either the storage was treated like a cesspit and run by morons, or this is a stitch-up to shift the blame onto the maintainers. I too could have a gold-standard install with perfect compliance and reporting, as long as I was given unlimited space and budget for the physical system, unlimited and unfettered downtime to keep it all at the latest firmware, and carte blanche on security to allow for remote management.
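    A minimal sketch of that kind of in-house, severity-filtered forwarding - syslog-style severities, with a print standing in for the real paging hook, and an event format invented for the example:

    ```python
    # Syslog-style severities: lower number = more severe.
    SEVERITIES = {"emergency": 0, "alert": 1, "critical": 2, "error": 3,
                  "warning": 4, "notice": 5, "informational": 6, "debug": 7}

    # "Informational and above": forward everything except debug chatter.
    FORWARD_AT_OR_ABOVE = SEVERITIES["informational"]

    def notify_oncall(event: dict) -> None:
        # Stand-in for a real pager/ticketing integration.
        print(f"[{event['severity'].upper()}] {event['message']}")

    def handle_event(event: dict) -> None:
        if SEVERITIES[event["severity"]] <= FORWARD_AT_OR_ABOVE:
            notify_oncall(event)

    handle_event({"severity": "error", "message": "disk 3:2:1 failed"})     # forwarded
    handle_event({"severity": "debug", "message": "cache scrub complete"})  # filtered out
    ```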

  7. Anonymous Coward

    More detailed at least...

    Than KCL's 3PAR failure analysis.

  8. James Pond
    Happy

    Common guy interpretation?

    Let me see if I can interpret this for the common person.

    1- The fibre optic cables feeding the SAN were not optimally fitted -

    How is this possible? There should be a "click" sound when the LC connector is fitted. It's always been "inserted" or "not inserted" - I don't recall any instance where an FC cable can be halfway inserted. Maybe the cables were "bent".

    2- Disk drives on the SAN had software bugs that made the stored data on the drives inaccessible or unable to be read -

    If I can't read data on a disk, blame it on a firmware bug instead of a software bug.....

    3- Some monitoring features were not activated, including a "back-to-base" tool to report operating errors -

    When setting up - "ATO is a very very very secure site, I DON'T WANT CALLBACK! Disable callback! My peons will monitor this!"

    Post mortem - "How did 12 disks fail without anyone knowing??? Didn't you promise callback??? Why is it not working????"

    4- SAN design has overemphasis on performance features rather than stability or resilience -

    When setting up - "I am on a tight very budget. Can you give me X TB usable by configuring RAID 5?"

    Post mortem - "WHY ARE THE DISKS ON RAID 5??? These data are critical! I WANT THEM ON RAID 6 OR RAID 1!!!"

    1. mr_souter_Working

      Re: Common guy interpretation?

      "1- The fibre optic cables feeding the SAN were not optimally fitted -

      How is this possible? There should be a "click" sound when the LC connector is fitted. It's always been "inserted" or "not inserted" - I don't recall any instance where an FC cable can be halfway inserted. Maybe the cables were "bent"."

      At a place I worked previously, we had an issue with some servers with FC-attached storage arrays that took forever to start back up. Eventually I went to each of the DCs and discovered that, years earlier (when they were installed), someone had attached the first FC card in the server to the input on the storage array's primary controller, and then the output of the same controller to the output of the second card in the same server. It took me a while of head-scratching before I finally figured out where the cables were supposed to go - they were all properly seated, and to a casual glance they appeared fine, and the servers started and worked, just VERY slow to boot - but that miscabling caused repeated array issues for years that nobody had ever really bothered with. We found that all of the servers were connected the same way to their external storage (6 servers in all). Luckily it was only the Exchange system, and it was fully redundant (active/passive nodes in the primary DC with an offsite passive node in the DR location).

    2. Anonymous Coward

      WHY ARE THE DISKS ON RAID 5???

      RAID-5 is not a typical high-performance configuration (especially if the system is write-intensive). A 12-disk failure may easily take out both disks of a RAID-1 array (or a RAID-10 one, if you want performance) - see the quick check below.

      I've seen SAN software and disk firmware updates marked CRITICAL because there were bugs that could lead to disks losing data or becoming inoperative. You just usually need to dig through the support site to find them.
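      A quick Monte Carlo check of how plausible the "both halves of a mirror" scenario is. The pool geometry and failure count are invented for illustration:

      ```python
      import random

      def p_mirror_pair_lost(total_drives: int, pairs: int, failures: int,
                             trials: int = 100_000) -> float:
          """Probability that `failures` random drive losses in a pool of
          `total_drives` (the first 2*pairs forming RAID-1 pairs) kill a pair."""
          hits = 0
          for _ in range(trials):
              failed = set(random.sample(range(total_drives), failures))
              if any({2 * p, 2 * p + 1} <= failed for p in range(pairs)):
                  hits += 1
          return hits / trials

      # Made-up example: 96-drive pool, 40 mirrored pairs, 12 failed drives.
      print(f"{p_mirror_pair_lost(total_drives=96, pairs=40, failures=12):.1%}")
      ```

      Under these made-up numbers, losing at least one complete mirror comes out around 40-50% - closer to a coin flip than a freak event.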

      1. Anonymous Coward

        Re: WHY ARE THE DISKS ON RAID 5???

        My guess is the performance vs. availability trade-off was RAID-5 vs. RAID-6.

        RAID-5 has been good enough for performance workloads for at least 15 years now. Dedicated hardware-based XOR engines (ASICs) and large write caches to minimize read-modify-write cycles pretty much solved the earlier performance challenges with RAID-5. For those without hardware-based XOR engines, general CPU performance has increased to the point that storage controllers can easily handle the parity calculations.

        However, there are some legacy storage systems out there which perform well with RAID-5 but poorly with RAID-6. It is usually systems with hardware-based XOR engines optimized for RAID-5. On these systems, RAID-6 either requires two cycles of the XOR engine, or it is done on the main controller processor instead - and those controller processors are often not sized for parity calculations. 3Par is one of those which calculates RAID-5 parity in an ASIC, not on the main controller processor. I do not know how 3Par calculates RAID-6 double parities.

        There are a lot of other systems which do not use hardware XOR engines (NetApp FAS, Nimble, Tintri, XtremIO, Pure, Tegile, and others) and which can do RAID-6 with little performance difference from RAID-5.
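        For the curious, here is the textbook math those XOR engines implement: RAID-5's single parity is a plain byte-wise XOR across the stripe, while RAID-6's second (Q) parity needs Galois-field multiplication, which is exactly why it costs more without dedicated hardware. This is the generic RAID-6 scheme, not a claim about 3Par's internals:

        ```python
        def gf_mul(a: int, b: int) -> int:
            """Multiply in GF(2^8) with polynomial 0x11d (the classic RAID-6 field)."""
            result = 0
            while b:
                if b & 1:
                    result ^= a
                a <<= 1
                if a & 0x100:
                    a ^= 0x11d
                b >>= 1
            return result

        def p_parity(blocks: list[bytes]) -> bytes:
            """RAID-5 style P parity: byte-wise XOR across all blocks."""
            out = bytearray(len(blocks[0]))
            for block in blocks:
                for i, byte in enumerate(block):
                    out[i] ^= byte
            return bytes(out)

        def q_parity(blocks: list[bytes]) -> bytes:
            """RAID-6 Q parity: Q = sum of g^i * D_i over GF(2^8), generator g = 2."""
            out = bytearray(len(blocks[0]))
            coeff = 1                      # g^0
            for block in blocks:
                for i, byte in enumerate(block):
                    out[i] ^= gf_mul(coeff, byte)
                coeff = gf_mul(coeff, 2)   # advance to the next power of g
            return bytes(out)

        data = [b"ATO!", b"3PAR", b"SAN."]
        p = p_parity(data)
        # RAID-5 recovery of one lost block: XOR the parity with the survivors.
        assert p_parity([p, data[1], data[2]]) == data[0]
        q = q_parity(data)   # with both P and Q, any two lost blocks are recoverable
        ```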

        1. Scuby

          Re: WHY ARE THE DISKS ON RAID 5???

          HPE now state that all critical volumes on SAS should be either R1 or R6 due to larger disk capacities (it has always been the case for NL 7K drives).

          All RAID calculations, regardless of level, are performed using the ASICs.

          The RAID-6 calculation uses the XOR engine in the ASIC, but it must compute two distinct parities (R5 only needs one) over more data, since the RAID sets are larger.

          Most writes require updating only two parities. However, a fraction (1/3 for step size 8) of the data blocks are used to compute 3 parities, so updating those blocks requires reading/updating 3 parities - hence the odd figure of 6 2/3 back-end IOs per write for RAID-6.
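          Spelling out the arithmetic behind that 6 2/3, using the standard read-modify-write accounting (the data block costs one read plus one write, and so does each parity touched):

          ```python
          def backend_ios(parities_touched: int) -> int:
              # read old data + write new data, plus a read + write per parity
              return 2 + 2 * parities_touched

          raid5 = backend_ios(1)                                    # 4 back-end IOs
          raid6 = (2 / 3) * backend_ios(2) + (1 / 3) * backend_ios(3)
          print(raid5, raid6)                                       # 4 and 6.666...
          ```

          Two thirds of small writes cost 6 back-end IOs, one third cost 8, averaging exactly 6 2/3.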

        2. Anonymous Coward

          Re: WHY ARE THE DISKS ON RAID 5???

          With the larger sizes of current hard drives, recovering from a failed drive is a pretty risky prospect now. The chance of hitting a URE during the rebuild greatly increases as drive capacities grow. You need to be able to read every sector of every remaining drive in the array for it to rebuild successfully; if a single sector experiences a URE, you are dead in the water (see the numbers below).

          I can't think of a single enterprise use case where RAID 5 or any other single-parity solution should be considered acceptable.
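          To put numbers on that, a rough calculation assuming the commonly quoted unrecoverable-read-error rates of one per 10^14 bits (desktop-class) and one per 10^15 bits (enterprise-class), with an invented array geometry:

          ```python
          import math

          def p_ure_during_rebuild(bytes_to_read: float, ber: float) -> float:
              """P(at least one URE) while reading bytes_to_read, at ber errors/bit."""
              bits = bytes_to_read * 8
              # 1 - (1 - ber)^bits, computed stably for tiny per-bit rates
              return -math.expm1(bits * math.log1p(-ber))

          # Rebuilding one member of an 8-drive RAID 5 of 8 TB drives means
          # reading all 7 survivors end to end (made-up geometry).
          to_read = 7 * 8e12
          for ber in (1e-14, 1e-15):
              print(f"BER {ber:.0e}: P(URE) ~ {p_ure_during_rebuild(to_read, ber):.0%}")
          ```

          That works out to roughly 99% at desktop-class rates, and still about a one-in-three failure at enterprise-class rates - which is exactly the argument for dual parity.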

    3. Down not across

      Re: Common guy interpretation?

      1- The fibre optic cables feeding the SAN were not optimally fitted -

      How is this possible? There should be a "click" sound when the LC connector is fitted. It's always been "inserted" or "not inserted" - I don't recall any instance where an FC cable can be halfway inserted. Maybe the cables were "bent".

      One situation I have seen is when someone with OCD had zip-tied fibres (and other cables) to the rack posts. Yes, rather tight. That explained why we had some storage and network issues.

      Granted, it looked very neat and tidy; too bad it didn't work too well.

      1. Fortycoats

        Re: Common guy interpretation?

        Zip-Ties + Fibre Cables = Bad Idea

        Maybe the cable might have been slightly damaged - not enough to cause the port to go offline, just enough for a weak signal. Sometimes that's hard to pick up without appropriate monitoring parameters on the SAN switch (the default thresholds might not have caught it).

        Also, the disk firmware issue can affect multiple disks at once. I saw one advisory that said that after a certain amount of running time (about 3 years), some flash disks could shut down and restart themselves. Since all the disks in a SAN array were started at the same time, it could have caused the affected pool to go offline. It was prevented by installing new disk firmware before the 3 years were up.
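        The defensive check is simple once such an advisory is known about. A sketch with invented drive data - in practice the power-on hours would come from the array or from SMART:

        ```python
        # Hypothetical inventory: power-on hours per drive slot (e.g. from SMART data).
        drives = {"cage0:slot0": 25_900, "cage0:slot1": 25_898, "cage0:slot2": 25_901,
                  "cage1:slot0": 4_210}

        BUG_THRESHOLD_HOURS = 3 * 365 * 24   # ~3 years, per the advisory described above
        WARN_MARGIN_HOURS = 90 * 24          # start nagging 90 days ahead of the window

        for slot, hours in sorted(drives.items()):
            remaining = BUG_THRESHOLD_HOURS - hours
            if remaining <= WARN_MARGIN_HOURS:
                print(f"{slot}: {remaining} h until the bug window - update firmware now")
        ```

        Note the clustered values in the example: drives powered on together age together, which is exactly how a whole pool can drop offline at once.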

  9. ThePhantom

    A long time ago...

    Many years ago, a major bank was trying to move off of IBM to Tandem NonStop (now HPE NonStop). The disks kept failing in the middle of the night, so yours truly was dispatched to work the night shift to figure out what was going on.

    In a nutshell, I caught the IBM night operator opening the backs of the washing machine sized 30MB drive cabinets and loosening the cables. When the nightly close fired up at 2AM, the shaking of the drives caused the connections to become intermittent, crashing the systems.

    Screws tightened, operator fired, problem solved.

    1. Hazmoid
      FAIL

      Re: A long time ago...

      Was he fired because he was caught, or because he couldn't come up with sabotage that was less obvious? ;) If he hadn't been caught, he probably would have ended up as the head of IBM.

  10. This post has been deleted by its author

  11. pmitham

    Operator error....

    This seems very much like a setup issue on top of an operational one. The vendor (or installer, whoever) needs to take responsibility for the fiber cables being poorly routed, but poor routing should not cause fiber to fail by itself. More importantly, how the h$ll do 12 drives fail without anyone noticing? It's extremely unlikely that all 12 failed at the same time. Even without "call home" features turned on, the storage admin should have seen the component failures in the management interface - unless they were in the habit of not managing their environment (lazy). I've never worked in an environment where the storage admin just let drive failures rack up before dealing with them! It was always repaired ASAP, regardless of how many more drives could fail before there was an issue.
