HPE blames solid state drive failure for outages at Australian Tax Office

HPE has blamed a problem with solid state drives for its dual and very disruptive outages at the Australian Taxation Office (ATO). A spokesperson for the company told The Register that “We believe the disruption started when a solid state drive used by major storage vendors failed. HPE and the drive vendor have determined that …

  1. mr. deadlift

    Or

    Dodgy disk excuse invalidates use of HPE storage arrays.

  2. TRT Silver badge

    What could destroy flash?

    Ming the merciless?

    1. Anonymous Coward

      Re: What could destroy flash?

      He'll save every one of us!

    2. eldakka Silver badge

      Re: What could destroy flash?

      Ming the Merciless has been provably incapable of destroying Flash.

      The other way around, however....

  3. Anonymous Coward

    Guinea Pig

    "....HPE and the drive vendor have determined that the condition was triggered by a rare issue under a set of circumstances that have never previously been encountered.”

    Dear ATO, you knew you'd be the guinea pig. That was implied in the heavy discounts we gave you.

    1. EarthDog

      Re: Guinea Pig

      Testing? We've heard of it.

      1. Anonymous Coward

        Re: Guinea Pig

        "... a unique set of circumstances ...."

        Translation: "...no other customer is running the same config and code..."

  4. Matt Bryant Silver badge
    Facepalm

    I'm guessing....

    SAN design added a layer of flash drives for "speedy access", didn't mirror it due to cost considerations (hence HPE's dig about redundancy), and then the DBA said "Ooh, look at all that fast disk, I'll just optimise the database layout to put my hardest hit and most critical tables into that flash layer"..... Cue data loss when that hard hit flash disk rapidly exceeds the wear limit and can't find a good sector of flash to make a write to.

    1. JohnMartin

      Re: I'm guessing....

      Unless HP's wear levelling across the drives is REALLY bad on 3PAR (and I don't think it is), the chances of a drive wearing out within a few years are vanishingly small. More likely is a firmware-related problem, either within the 3PAR or the drives themselves, reporting false drive failures across two drives simultaneously in a RAID-5 config. Actual failure is again very unlikely; solid state drives are spectacularly reliable from an electronic component point of view.

      1. EarthDog

        Re: I'm guessing....

        FYI, last I heard 3Par US was being sent overseas.

      2. seven of five

        Re: I'm guessing....

        Certain high capacity batches from another, well known vendor have been observed to fail within days. End of last year, right here in Germany.

      3. Anonymous Coward

        Re: I'm guessing....

        "More likely is a firmware related problem within the drives themselves" - Bingo. And no amount of RAID 5, RAID 6 or even triple parity is going to guarantee survival in that situation. In fact, the longer you hang on trying to correct the situation, the bigger the hole you're probably digging.

  5. Tim99 Silver badge

    "We'll know more in March, when the PwC report into the incident emerges"

    Really? Did you forget the sarcasm flag Simon?

    1. Mark 65

      Re: "We'll know more in March, when the PwC report into the incident emerges"

      I'd have to think that, for this level of technical investigation, PWC would be on the letterhead but they'd have drafted in an outside SME to consult on the issue and write the contents.

    2. Simon Sharwood, Reg APAC Editor (Written by Reg staff)

      Re: "We'll know more in March, when the PwC report into the incident emerges"

      The sales team are in town and we went to ... erm ... dinner last night. That's a better excuse than 'alt.right trolls hacked my Twitter', I hope.

    3. CrazyOldCatMan Silver badge

      Re: "We'll know more in March, when the PwC report into the incident emerges"

      PWC will report whatever HPE/OzGov paid them to..

    4. Adam 1 Silver badge

      Re: "We'll know more in March, when the PwC report into the incident emerges"

      I've seen an early copy of the PWC report. Turns out the real cause of the issues is the wind farms in South Australia.

  6. Steve Knox
    Coat

    "a set of circumstances that have never previously been encountered.”

    Clearly they didn't realize that SSDs spin the opposite direction Down Under.

  7. Anonymous Coward

    SINGLE POINT OF FAILURE!

    Why don't they say a dog chewed on the power cable?

    The thing they're trying to avoid saying is SINGLE POINT OF FAILURE!!

    1. eldakka Silver badge
      Coat

      Re: SINGLE POINT OF FAILURE!

      At least an IBM solution could have blamed the ATO for not jiggling the cable right, no such excuse for HPE tho.

  8. Loud Speaker

    Spanish inquisition?

    Probably a live demonstration of "mirroring is not backup".

    This is the 21st century - no one expects tape backup! (or indeed a major country to actually employ sysadmins who know their job).

  9. Black Rat
    Devil

    Fake News Alert

    Hackers not getting the blame for a government computer failure? pull the other one!

    1. eldakka Silver badge
      Joke

      Re: Fake News Alert

      That could get too messy.

  10. ntevanza
  11. Anonymous Coward

    Recently saw an urgent advisory from EMC for SSD firmware updates - apparently the drives themselves (Samsung I believe) had an internal memory leak in firmware that caused them to die after 700 days of continuous uptime. No amount of redundancy in a storage array can protect against shelves of SSDs going at once.

    Wouldn't be surprised if this is related...

    1. Kurgan

      Like the old WD Raptors

      I remember the old WD Raptors, that had a glitch every 57,6 days of being powered on. Mirror sets failed synchronously. History repeats itself. At that time, WD would not acknowledge the issue. They did later on, on a private basis, and never made it public. But then you could find the relevant information by googling.

      1. Griffo

        Re: Like the old WD Raptors

        I was thinking the same thing... Here's the old Reg article.

        https://www.theregister.co.uk/2009/09/28/velociraptor_49day_bug/

        It could be some specific config scenario on the ATO array that triggered a firmware bug in several drives at once. Assuming it happened on the Flash tier, where some DBA probably stored their indexes, you can see how things would rapidly go downhill - mirrored arrays or not.
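For what it's worth, the 49-day figure in that URL is close to the point where an unsigned 32-bit millisecond counter wraps around. Whether that is what actually bit those drives is an assumption here, not something confirmed in this thread. The sketch below only shows the arithmetic, and `uptime_ms_32bit` is a made-up name rather than anything from real firmware.

```python
# Hypothetical illustration: an uptime counter kept in an unsigned
# 32-bit integer and incremented once per millisecond.
MS_PER_DAY = 24 * 60 * 60 * 1000
COUNTER_MODULUS = 2 ** 32  # number of distinct values in 32 bits


def uptime_ms_32bit(true_uptime_ms: int) -> int:
    """Return what a 32-bit counter would show for a given real uptime."""
    return true_uptime_ms % COUNTER_MODULUS


def days_until_wrap() -> float:
    """Days of continuous power-on before the counter returns to zero."""
    return COUNTER_MODULUS / MS_PER_DAY


if __name__ == "__main__":
    print(f"counter wraps after about {days_until_wrap():.1f} days")
    last_ms = COUNTER_MODULUS - 1
    # One millisecond later the counter reads zero again.
    print(uptime_ms_32bit(last_ms), uptime_ms_32bit(last_ms + 1))
```

Drives that were powered on together would hit such a boundary together, which is one way mirror sets could fail at the same moment.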

    2. Anonymous Coward

      I know with Spinning rust, you'd mix up your drive batches to avoid total array loss, but I guess it's impossible if the whole lot are screwed.

      Only way I can think of is mixed vendor drives, but not sure how well they would play together.
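A toy sketch of the batch-mixing idea, assuming a firmware bug takes out every drive of one model at the same time. The model names and the `mirror_pairs_lost` helper are hypothetical, and whether a real array would accept mixed-vendor drives is exactly the open question above.

```python
from typing import Iterable, Tuple


def mirror_pairs_lost(pairs: Iterable[Tuple[str, str]], faulty_model: str) -> int:
    """Count mirror pairs where both drives are the faulty model.

    A pair loses data only if both of its members fail together.
    """
    return sum(1 for a, b in pairs if a == faulty_model and b == faulty_model)


if __name__ == "__main__":
    # Hypothetical layouts: four mirror pairs each.
    same_model = [("model-A", "model-A")] * 4
    mixed_model = [("model-A", "model-B")] * 4

    print(mirror_pairs_lost(same_model, "model-A"))   # every pair is exposed
    print(mirror_pairs_lost(mixed_model, "model-A"))  # each pair keeps one good drive
```

This ignores everything else that can go wrong, such as the rebuild load on the surviving half of every pair.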

  12. David Roberts Silver badge
    Paris Hilton

    I'm sorry darling

    That has never happened before.....

  13. Unicornpiss Silver badge
    Meh

    Heresy...

    But I wouldn't be surprised if they were Samsung drives. Out of 300+ Crucial SSDs in our organization, we've had 1 failure. Out of the 40 or 50 Samsung SSDs we have, we've had 5 documented failures and a few we're keeping an eye on due to occasionally getting "no boot device found" and other oddities.

    1. Steve Davies 3 Silver badge

      Re: Heresy...

      I'll second that.

      We have a system that has a PCI-E SSD on the Mobo. Made by Samsung. Split into two partitions and the second one formatted as NTFS.

      50% of the time when the system boots off the same physical device, the second partition does not appear to get mounted.

      Then another 30% of the time it appears in Windows Explorer and then disappears. This happens when two USB-3 devices are plugged in.

      The same OS runs fine on another PC with a non Samsung PCI-E device. The system is cloned from the other one. I have had 4 USB-3 drives plugged in and everything is ok.

      go figure eh?

  14. Anonymous Coward

    Media errors

    Rings some bells: a similar issue (SSD media errors) about 2 years ago with one of the big SSD vendors.

    The issue is that when you have a failure you need to rebuild the RAID. That usually means more reading from cold areas on the remaining SSDs, which increases the probability of hitting another media error, causing another failure... causing more reading... at the most undesired time. Think you got the point :(

    So ensure your storage vendor has a background task that does a health check on your entire media once in a while. It's their duty to guarantee your data safety no matter what!

    Otherwise you will end up like a nice Tax Office I heard about.
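A rough sketch of the arithmetic behind that rebuild risk, under a deliberately simple assumption: every bit read fails independently with a fixed unrecoverable-error probability. The function name and the example figures (7 surviving drives, 2 TB each, 1e-16 errors per bit) are illustrative guesses, not numbers from HPE, the ATO or any drive datasheet.

```python
import math


def rebuild_failure_probability(remaining_drives: int,
                                capacity_bytes: float,
                                ure_per_bit: float) -> float:
    """Probability of at least one unrecoverable read error during a rebuild.

    Assumes the rebuild reads every bit of every remaining drive and that
    errors are independent, so P = 1 - (1 - p) ** bits_read.
    """
    bits_to_read = remaining_drives * capacity_bytes * 8
    # log1p/expm1 keep precision when ure_per_bit is tiny.
    return -math.expm1(bits_to_read * math.log1p(-ure_per_bit))


if __name__ == "__main__":
    # Hypothetical array: 7 surviving drives of 2 TB each.
    p = rebuild_failure_probability(7, 2e12, 1e-16)
    print(f"chance of a second media error during rebuild: {p:.2%}")
```

Independence is the optimistic case. A shared firmware fault or rarely-read cold areas would make errors cluster, which is the argument for background scrubbing before a rebuild is ever needed.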

  15. ultrix
    Facepalm

    First disk failed ever

    What a bunch of b..... Disks have been failing since they were invented.

    I did read a glorious story by HPE marketing about how they squeezed more storage juice from SSDs than other vendors, but I cannot remember where I saw that document. And they claimed they did it in cooperation with SSD vendors.

    What a shame, but I doubt anyone cares. I also doubt anyone will get fired except the poor contract SSD bits miner.

  16. ultrix
    Paris Hilton

    Failed disk. Never heard of that.

    I thought it was an EMC DMX feature. Those beasts liked to kill themselves when replacing failed disks and pushing microcode updates.

    1. Anonymous Coward

      Re: Failed disk. Never heard of that.

      It never helps that the way the drives are enumerated in the system is not the way they are presented in the drive change GUI. They swap Cs and Ds for 0s and 1s, so you don't know if your bad drive was 16d10 or 16dd0, depending on the low-paid India help desk dispatching the ticket. Real fun when it's the opposite and the wrong drive type was shipped.

  17. IanMoore33

    No system is fully fault tolerant

    Unless they have done extensive hardware and software fault injection on every subsystem, any system can crash.

    1. Anonymous Coward

      Re: No system is fully fault tolerant

      "... fault injection." I remember the time when engineers would receive proper training (which included fault injection) as well as proper system QA before shipment.

      These days "fault injection" is the responsibility of the third party contractor installing the system and the customer who has a Synology SAN at home.

  18. eldakka Silver badge
    Flame

    The drive supplier to HPE is irrelevant

    From the article:

    Or it could be HPE trying to throw a vendor under a bus apportion blame for the outages.

    Doesn't matter who the MSD (Mass Storage Device, using it to indicate either HDD or SSD - or even tape - devices) manufacturer was, in the same way as with the ABS/Census debacle, where IBM (the prime contractor) is trying to blame one of its subcontractors.

    When an Enterprise purchases a solution like HPE 3PAR, it isn't Samsung, WD, Seagate, Fujitsu, Toshiba (or whoever is going to buy the HDD division from that rotting carcass) that the Enterprise is buying MSDs from; it is HPE. It is HPE's responsibility, and no one else's, to qualify, verify, and supply the device.

    1. mark 177

      Re: The drive supplier to HPE is irrelevant

      Yup! That's why prime contractors were invented, and why they charge such eye-watering prices!

  19. -tim
    Flame

    Was it a deep hack?

    Did anyone look into magic firmware that was made just for the ATO?

    The Ruxcon and Breakpoint security conferences have been showing these sorts of hacks for years. I would think having that type of magic in the ATO's disk system would be worth a fair amount of coin.

  20. -tim
    Big Brother

    They had something until their MBAs went feral

    They are another company that lost the plot after the MBA clan got to them. My BB doesn't leak data to anyone (other than my local gov't). My Apple and Android devices do so much side channel stuff that I can't keep track of it all. I would love to have the option of "this thing wants $FOO, we can lie to it".

    My biggest problem is there isn't a green and red button on the phone. The damn thing has buttons, stop making me use the stupid touch screen.

  21. Peter Quodling

    Oh My, I remember discussing the potential flaws of using Solid State Disks in mission critical roles, with a very senior storage engineer, back in.... 1986.

    We don't learn, do we...

    The irony is the almost parallel discussion in the reg of the potential demise of Storagetek under Oracle's watch. For goodness sake, can someone let the grownups be back in charge...

    1. Trixr Bronze badge

      Er, how are the potential failures of MODERN SSD storage potentially any more risky than other storage devices? Bad firmware = bad firmware, no matter what the storage substrate.

      I remember having to replace 300+ Hitachi drives in the early 2000s - good old spinning rust, manufacturing fault with the actuator or actuator arm, I can't remember which. Ok, most of that was preventative replacement, once Dell finally admitted the problem, but a 1/5 failure rate was pretty noticeable prior to that.

  22. Adam 1 Silver badge

    not Samsung!

    Samsung failures would have been notable by the presence of 100 fire engines at the data centre.

  23. Flammi

    Bad software!

    True reliability comes not from reliable hardware, but from reliable software that makes up for unreliable hardware. Obviously 3PAR has big problems with their software. Maybe because it's almost two decades old... #legacysystem

    1. Anonymous Coward

      Re: Bad software!

      I'm sure you've heard the saying: "hardware eventually breaks and software eventually works".

      Legacy vs who? Certainly not EMC or NetApp, who have 25+ year old code out in the wild. If you're talking about the startups, then I'd refer you back to the above statement.

    2. Anonymous Coward

      Re: Bad software!

      This is just dumb. No one is buying a 20 year old system. Calling a 3PAR "legacy" is like saying a 2017 Mercedes is 70 years old. Literally zero logic to the statement. What do you think HPE has been doing since buying 3PAR? Nothing? And people are buying it why? Nostalgia? Give us all a break.

      1. Anonymous Coward

        Re: Bad software!

        Legacy you say ?

        Coming Clean: The Lies That Flash Storage Companies Tell

        https://vimeo.com/159280013

  24. sujayv

    Disaster Recovery

    I cannot believe such a critical piece of infrastructure did not have an active-active DR attached to it. Why does the Government throw money at incompetent multinationals when there are many competent nationals around? When it comes to critical infrastructure, money should not be the deciding factor.

  25. Anonymous Coward

    Singular Drive eh?!

    It does say drive (not array, RAID set or otherwise). That's enough to provoke havoc and set pulses racing (well, for me anyway)!

  26. Anonymous Coward

    Wasn't there an article a couple of days ago about the King's College London 3PAR implosion that also said "device fail mumble firmware mumble triggered cascade failure mumble" or something like it?

    I'll be interested in the PwC report because we also use 3PAR, also with multiple tiers including SSDs. How much of a timebomb am I sitting on here?!

  27. Anonymous Coward

    Replication and mirroring is NOT a full data protection strategy - the purpose is for "quick and dirty" (i.e. during the day) recovery - it's not a protected copy.

    With mirror and replicate:

    Viruses can corrupt both copies

    Ransom-ware can corrupt both copies

    Programming errors can corrupt both copies

    Firmware or controller upgrades (H/W and S/W) can corrupt both copies

    Accidental deletion or DB admin hiccup can corrupt both copies

    Disgruntled/incompetent IT staff (surely not in Australian Public Service!!) can also corrupt data

    Since 1950 the Data Protection industry has had an extremely robust methodology:

    3-2-1

    3 copies of Data (if you don't have 3 copies, it's not worth keeping)

    2 different mediums (i.e. do NOT rely on just disk - example: disk and tape) - cloud is mostly disk that is on someone else's premises (with a fancy name that gets CIOs excited!)

    1 copy offline and offsite (None of the list above can attack or corrupt offline, offsite data)

    Once the data is gone - it's gone - accept that all technology will have errors, just as humans and machines will make mistakes.

    For example, there are only TWO (maybe 3 if you're pedantic) spinning disk manufacturers left - what happens if there is a firmware bug or security hole (and there have been!) and half the spinning drives in the world start corrupting data? The chips in the flash and SSD drives in all manufacturers' equipment come from very few fabrication facilities (regardless of whose name is on the tin).

    Data corruption or loss is an inevitability - Plan for it - 3-2-1 !!!
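The 3-2-1 rule above is simple enough to check mechanically. A minimal sketch, assuming each copy is described by its medium and whether it is offsite and offline. The `BackupCopy` type, its field names and the example inventory are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BackupCopy:
    name: str
    medium: str    # e.g. "disk" or "tape"
    offsite: bool  # stored away from the primary site
    offline: bool  # not reachable from the production systems


def satisfies_3_2_1(copies: Sequence[BackupCopy]) -> bool:
    """Check 3 copies, 2 different mediums, 1 copy offline and offsite."""
    three_copies = len(copies) >= 3
    two_mediums = len({c.medium for c in copies}) >= 2
    one_safe_copy = any(c.offsite and c.offline for c in copies)
    return three_copies and two_mediums and one_safe_copy


if __name__ == "__main__":
    primary = BackupCopy("primary array", "disk", offsite=False, offline=False)
    mirror = BackupCopy("replicated array", "disk", offsite=True, offline=False)
    tape = BackupCopy("vaulted tape", "tape", offsite=True, offline=True)

    print(satisfies_3_2_1([primary, mirror]))        # mirroring alone
    print(satisfies_3_2_1([primary, mirror, tape]))  # with an offline tape copy
```

Note that the replicated array counts as a copy but never as the offline one, which is the point of the list above: anything that can corrupt the primary can reach it.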


Biting the hand that feeds IT © 1998–2019