Dodgy disk excuse invalidates use of HPE storage arrays.
HPE has blamed a problem with solid state drives for its dual and very disruptive outages at the Australian Taxation Office (ATO). A spokesperson for the company told The Register that “We believe the disruption started when a solid state drive used by major storage vendors failed. HPE and the drive vendor have determined that …
SAN design added a layer of flash drives for "speedy access", didn't mirror it due to cost considerations (hence HPE's dig about redundancy), and then the DBA said "Ooh, look at all that fast disk, I'll just optimise the database layout to put my hardest-hit and most critical tables into that flash layer"... Cue data loss when that hard-hit flash disk rapidly exceeds its wear limit and can't find a good sector of flash to write to.
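To put rough numbers on that wear-out scenario - all figures below are illustrative assumptions, not the ATO array's actual specs - a quick endurance calculation:

```python
# Rough wear-out estimate for an unmirrored flash tier.
# All numbers are illustrative assumptions, not the real ATO drive specs.

def days_to_wearout(capacity_tb, dwpd_rating, warranty_years, daily_writes_tb):
    """Days until the drive's rated write endurance (TBW) is exhausted."""
    tbw = capacity_tb * dwpd_rating * warranty_years * 365  # total rated TB written
    return tbw / daily_writes_tb

# A hypothetical 1.6 TB drive rated for 3 drive-writes-per-day over a
# 5-year warranty gives ~8,760 TBW. A hot database tier writing 50 TB/day
# burns through that in well under a year:
print(round(days_to_wearout(1.6, 3, 5, 50)), "days")
```

Concentrating the hottest tables on the flash tier is exactly what drives `daily_writes_tb` up and the wear-out date forward.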
Unless HP's wear levelling across the drives is REALLY bad on 3PAR (and I don't think it is), the chances of a drive wearing out within a few years are vanishingly small. More likely is a firmware-related problem, either within the 3PAR or in the drives themselves, reporting false drive failures across two drives simultaneously in a RAID-5 config. Actual failure is again very unlikely; solid state drives are spectacularly reliable from an electronic component point of view.
More likely is a firmware-related problem within the drives themselves - bingo. No amount of RAID 5, RAID 6 or even triple parity is going to guarantee survival in that situation; in fact, the longer you hang on trying to correct the situation, the bigger the hole you're probably digging.
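The "two simultaneous failures" point is easy to sanity-check: under an assumed independent failure rate the odds are astronomically small, which is exactly why a correlated firmware bug is the better explanation. The annual failure rate below is an assumption for illustration:

```python
# Why two drives dying at once points at firmware, not bad luck.
# The 0.5% annual failure rate (AFR) is an illustrative assumption.
afr = 0.005                        # assumed per-SSD annual failure rate
p_hour = afr / (365 * 24)          # per-drive failure probability in any given hour
p_both = p_hour ** 2               # two specific drives failing independently in the same hour
print(f"{p_both:.2e}")             # vanishingly small

# A shared firmware bug (same trigger, same uptime counter) collapses that
# independence: a whole batch fails together, and no RAID-5/6 parity level
# survives losing several members of the same set at once.
```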
Recently saw an urgent advisory from EMC for SSD firmware updates - apparently the drives themselves (Samsung I believe) had an internal memory leak in firmware that caused them to die after 700 days of continuous uptime. No amount of redundancy in a storage array can protect against shelves of SSDs going at once.
Wouldn't be surprised if this is related...
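A sketch of the kind of uptime check that advisory would prompt, with fabricated sample data; in practice you'd pull SMART attribute 9 (Power_On_Hours) from each drive:

```python
# Flag drives approaching a hypothetical 700-day uptime firmware bug.
# The drive list is fabricated sample data for illustration; real values
# would come from SMART attribute 9 (Power_On_Hours) on each drive.
BUG_DAYS = 700

def at_risk(power_on_hours, margin_days=30):
    """True if the drive is within margin_days of the 700-day trigger."""
    return power_on_hours / 24 >= BUG_DAYS - margin_days

drives = {"bay0": 16500, "bay1": 16510, "bay2": 2000}  # hours, fabricated
for bay, hours in sorted(drives.items()):
    if at_risk(hours):
        print(f"{bay}: {hours / 24:.0f} days uptime - update firmware NOW")
```

Note how bay0 and bay1, installed together, cross the threshold together - which is why whole shelves can go at once.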
I remember the old WD Raptors, which had a glitch every 57.6 days of being powered on. Mirror sets failed synchronously. History repeats itself. At the time, WD would not acknowledge the issue. They did later on, on a private basis, and never made it public. But you could find the relevant information by googling.
I was thinking the same thing... Here's the old Reg article.
It could be some specific config scenario on the ATO array that triggered a firmware bug in several drives at once. Assuming it happened on the flash tier where some DBA probably stored their indexes, you can see how things would rapidly go downhill - mirrored arrays or not.
But I wouldn't be surprised if they were Samsung drives. Out of 300+ Crucial SSDs in our organization, we've had 1 failure. Out of the 40 or 50 Samsung SSDs we have, we've had 5 documented failures and a few we're keeping an eye on due to occasionally getting "no boot device found" and other oddities.
I'll second that.
We have a system that has a PCI-E SSD on the Mobo. Made by Samsung. Split into two partitions and the second one formatted as NTFS.
50% of the time when the system boots off the same physical device, the second partition does not appear to get mounted.
Then another 30% of the time it appears in Windows Explorer and then disappears. This happens when two USB-3 devices are plugged in.
The same OS runs fine on another PC with a non Samsung PCI-E device. The system is cloned from the other one. I have had 4 USB-3 drives plugged in and everything is ok.
go figure eh?
Rings some bells of a similar issue (SSD media error) ~2 years ago with one of the big SSD vendors.
The issue is that when you have a failure you need to rebuild the RAID. That usually means more reading from cold areas on the remaining SSDs, which increases the probability of hitting another media error, causing another failure... causing more reading... at the most undesired time. I think you get the point :(
So ensure your storage vendor has a background task that health-checks your entire media once in a while; it's their duty to guarantee your data safety no matter what!
Otherwise you will end up like a certain Tax Office I heard about.
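That rebuild spiral can be put in numbers: rebuilding RAID-5 means reading every remaining sector, which multiplies the chance of tripping another media error. The bit error rate and array size below are illustrative spec-sheet-style assumptions:

```python
# Probability of hitting another unrecoverable read error (URE) during a
# RAID-5 rebuild. BER and array size are illustrative assumptions.
import math

ber = 1e-16                  # assumed unrecoverable bit error rate, per bit read
drive_tb = 3.84              # assumed drive capacity
n_survivors = 7              # 8-drive RAID-5 set, one drive failed
bits_read = n_survivors * drive_tb * 1e12 * 8   # every surviving bit gets read

# 1 - (1 - ber)**bits_read, computed stably with log1p/expm1
p_hit = -math.expm1(bits_read * math.log1p(-ber))
print(f"{p_hit:.1%}")        # prints "2.1%" - for a single rebuild
```

A couple of percent per rebuild sounds small until the array keeps failing and rebuilding, compounding the odds each time.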
What a bunch of b..... Disks have been failing since they were invented.
I cannot remember where, but I did read a glorious story by HPE marketing about how they squeezed more storage juice from SSDs than other vendors. And they claimed they did it in cooperation with SSD vendors.
What a shame; but I doubt anyone cares. I also doubt anyone will get fired, except some poor contract SSD bit miner.
It never helps that the way the drives are enumerated in the system is not the way they're presented in the drive-change GUI. They swap Cs and Ds for 0s and 1s, so you don't know if your bad drive was 16d10 or 16dd0, depending on the low-paid India help desk dispatching the ticket. Real fun when it's the opposite, and the wrong drive type was shipped.
"... fault injection." I remember the time when engineers would receive proper training (which included fault injection) as well as proper system QA before shipment.
These days "fault injection" is the responsibility of the third-party contractor installing the system and the customer who has a Synology SAN at home.
From the article:
Or it could be HPE trying to ~~throw a vendor under a bus~~ apportion blame for the outages.
Doesn't matter who the MSD (Mass Storage Device - using it to cover HDD, SSD, or even tape devices) manufacturer was. It's the same as the ABS/Census debacle, where IBM (the prime contractor) is trying to blame one of its subcontractors.
When an enterprise purchases a solution like HPE 3PAR, it isn't Samsung, WD, Seagate, Fujitsu, Toshiba (or whoever is going to buy the HDD division from that rotting carcass) that the enterprise is buying MSDs from - it is HPE. It is HPE's responsibility, and no one else's, to qualify, verify, and supply the device.
They are another company that lost the plot after the MBA clan got to them. My BB doesn't leak data to anyone (other than my local gov't). My Apple and Android devices do so much side channel stuff that I can't keep track of it all. I would love to have the option of "this thing wants $FOO, we can lie to it".
My biggest problem is there isn't a green and red button on the phone. The damn thing has buttons, stop making me use the stupid touch screen.
Oh my, I remember discussing the potential flaws of using solid state disks in mission-critical roles with a very senior storage engineer, back in... 1986.
We don't learn, do we...
The irony is the almost parallel discussion in the Reg of the potential demise of StorageTek under Oracle's watch. For goodness sake, can someone put the grownups back in charge...
Er, how are the potential failures of MODERN SSD storage potentially any more risky than other storage devices? Bad firmware = bad firmware, no matter what the storage substrate.
I remember having to replace 300+ Hitachi drives in the early 2000s - good old spinning rust, manufacturing fault with the actuator or actuator arm, I can't remember which. Ok, most of that was preventative replacement, once Dell finally admitted the problem, but a 1/5 failure rate was pretty noticeable prior to that.
This is just dumb. No one is buying a 20 year old system. Calling a 3PAR "legacy" is like saying a 2017 Mercedes is 70 years old. Literally zero logic to the statement. What do you think HPE has been doing since buying 3PAR? Nothing? And people are buying it why? Nostalgia? Give us all a break.
I cannot believe such a critical piece of infrastructure did not have active-active DR attached to it. Why does the Government throw money at incompetent multinationals when there are many competent national providers around? When it comes to critical infrastructure, money should not be the deciding factor.
Wasn't there an article a couple of days ago about the Kings College London 3PAR implosion that also said "device fail mumble firmware mumble triggered cascade failure mumble", or something like it?
I'll be interested in the PwC report because we also use 3PAR, also with multiple tiers including SSDs. How much of a timebomb am I sitting on here?!
Replication and mirroring is NOT a full data protection strategy - the purpose is "quick and dirty" (ie during the day) recovery. It's not a protected copy.
With mirror and replicate:
Viruses can corrupt both copies
Ransom-ware can corrupt both copies
Programming errors can corrupt both copies
Firmware or controller upgrades (H/W and S/W) can corrupt both copies
Accidental deletion or DB admin hiccup can corrupt both copies
Disgruntled/incompetent IT staff (surely not in Australian Public Service!!) can also corrupt data
Since 1950 the Data Protection industry has had an extremely robust methodology:
3 copies of Data (if you don't have 3 copies, it's not worth keeping)
2 different mediums (ie do NOT rely on just disk - for example, disk and tape) - cloud is mostly disk that's on someone else's premises (with a fancy name that gets CIOs excited!)
1 copy offline and offsite (None of the list above can attack or corrupt offline, offsite data)
Once the data is gone - it's gone - accept that all technology will have errors, just as humans and machines will make mistakes.
For example, there are only TWO (maybe 3 if you're pedantic) spinning disk manufacturers left - what happens if there is a firmware bug or security hole (and there have been!) and half the spinning drives in the world start corrupting data? The chips in the flash and SSD drives in all manufacturers' equipment come from very few fabrication facilities (regardless of whose name is on the tin).
Data corruption or loss is an inevitability - Plan for it - 3-2-1 !!!
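As a toy illustration of the 3-2-1 rule above (the inventory format here is invented):

```python
# Toy sanity check of the 3-2-1 rule: 3 copies, 2 different mediums,
# 1 copy offline and offsite. The inventory format is invented for
# illustration, not any real backup tool's schema.

def satisfies_321(copies):
    """copies: list of dicts with 'medium' and 'offline_offsite' keys."""
    return (len(copies) >= 3
            and len({c["medium"] for c in copies}) >= 2
            and any(c["offline_offsite"] for c in copies))

inventory = [
    {"medium": "disk", "offline_offsite": False},   # production SAN
    {"medium": "disk", "offline_offsite": False},   # replicated array
    {"medium": "tape", "offline_offsite": True},    # vaulted tapes
]
print(satisfies_321(inventory))  # True
```

Drop the vaulted tapes and the check fails on two counts at once: only one medium, and nothing offline for ransomware or a firmware bug to miss.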
Biting the hand that feeds IT © 1998–2019