Some high-end NetApp FAS6000 arrays are suffering failures that cause them to halt and restart. NetApp is fixing the problem. El Reg understands that several FAS6000 customers in Europe have discovered that their arrays stop working while under heavy load and abruptly restart. …
"....Suspicion has fallen on this chip...."
So, if this is a problem where one chip on one flash card can take down the whole array, regardless of the number of flash cards in use, that implies there is a single point of failure in any FAS6000 using the flash cache cards. Major whoops! If it restarts, do you lose write data in the cache? It would be interesting to know whether the chance of the problem increases roughly linearly with the number of cards, which would seem reasonable if it's down to a random scattering of dodgy chips, or whether the problem is software/firmware related and becomes much more likely the more flash cards you have in operation at once.
What are the legal implications if you have a FAS on support but don't agree to the confidentiality clause? Can NetApp really refuse to supply a fix? I know Sun tried this with us in the past and we told them to take a hike. Any brave NetApp users out there not toeing the line?
Re: "....Suspicion has fallen on this chip...."
Since the chip is an integral part of the FlashCache card, and FlashCache cards are PCIe cards directly connected to the storage controller's PCIe bus, I would assume that a failure would cause controller failure (and failover to its peer). That's assuming that NetApp's hardware doesn't have an ability to isolate and power down a PCI slot non-disruptively, something I'm pretty sure they can't do.
The massive slowdown from a failure of the FPGA is most likely caused by the controller failover, doubling the load on the surviving controller and requiring a re-warming of the FlashCache.
Assuming all this is true, the more FlashCache cards you have the higher the likelihood of encountering the problem. But I'm speculating.
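To put rough numbers on that speculation, here is a minimal sketch. It assumes each card has the same independent chance of carrying a faulty chip; the value of `p` below is hypothetical and not a NetApp figure. Under that assumption the chance of at least one bad card is 1 - (1 - p)^n, which grows almost linearly in n while p is small.

```python
def p_any_card_faulty(p_card: float, n_cards: int) -> float:
    """Probability that at least one of n_cards is faulty.

    Assumes every card fails independently with the same probability p_card.
    """
    return 1.0 - (1.0 - p_card) ** n_cards


if __name__ == "__main__":
    p = 0.01  # hypothetical per-card fault probability, purely illustrative
    for n in (1, 2, 4, 8):
        # Compare the exact value with the linear approximation n * p.
        print(n, round(p_any_card_faulty(p, n), 4), round(n * p, 4))
```

If the real cause were firmware or load related rather than random bad chips, the independence assumption would not hold and the curve could look quite different.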
Incorrect info
As a co-worker pointed out, "solid-state chips that form a read-write cache" is not correct. Flash Cache is read-only; it has no write cache capability.
Re: Incorrect info
Um, even if a Flash Cache is used as a "read-only" device, how do the blocks get into the cache to be read in the first place?
Seems that there has to be a write taking place in flash somewhere at some time...
Re: Incorrect info
In the case of NetApp Flash Cache, the data is "copied" into flash from disk (or possibly from DRAM cache in some cases, I believe). The "read-only" moniker applies to FlashCache because 1) it accelerates reads, but not writes, and 2) since FlashCache does not directly receive incoming host writes, it never contains dirty pages. The second point is why a card failure does not cause data loss. Incoming writes and dirty pages are handled in NetApp's DRAM cache and NVRAM.
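A toy model may make the "no dirty pages" point concrete. This is only a sketch, not NetApp's implementation: the class and method names are hypothetical, a plain dict stands in for NVRAM plus disk, and dropping the cached copy on write is an assumption of the model.

```python
class ToyFiler:
    """Toy model of a durable store fronted by a read cache."""

    def __init__(self):
        self.durable = {}      # stands in for NVRAM + disk: the real copy
        self.flash_cache = {}  # clean copies only, never the sole copy

    def write(self, block, data):
        # Host writes go to the durable path and bypass the flash cache.
        self.durable[block] = data
        # Assumption of this model: drop any stale cached copy.
        self.flash_cache.pop(block, None)

    def read(self, block):
        if block in self.flash_cache:
            return self.flash_cache[block]  # cache hit
        data = self.durable[block]          # miss: fetch the durable copy
        self.flash_cache[block] = data      # populate with a clean copy
        return data

    def lose_flash_cache(self):
        # Simulates a card failure: only clean copies disappear.
        self.flash_cache.clear()


if __name__ == "__main__":
    filer = ToyFiler()
    filer.write("blk0", b"hello")
    filer.read("blk0")        # warms the cache
    filer.lose_flash_cache()  # card dies
    print(filer.read("blk0")) # data is still served from the durable store
```

The flash is physically written when it is populated, but every block in it is a copy of something that already exists elsewhere, so losing the card costs performance (the cache has to warm up again) rather than data.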
There is nothing perfect in this world.
I think it's fair to say, there is no such thing as the perfect storage array in this world, or any product for that matter.
To this day, I've not yet seen a product which is free from bugs and faults. Thankfully, most of these faults are isolated to very few environments.
Now, find me an array which has not had a fault at some time.
Given the number of NetApp FAS arrays deployed, this may be isolated to a few select cases, with only a few customers affected. It is most likely a batch fault with the PAM card in high-density configurations (read: hotter chassis), possibly combined with wear-leveling problems on the card under high change rates.
Hopefully they'll have it sorted soon.
As for the gag orders, well, that's a bit off, but they all do it.
Not only a NetApp problem
This is a chipset issue affecting many I/O-bound Intel systems (i.e. storage arrays). It is not restricted to NetApp arrays. EMC's various flavors, HP's, IBM's and others all suffer from this underlying issue. NetApp's high concentration of Unified Systems simply makes the issue more visible.
Re: Not only a NetApp problem
EMC does not use PCIe-based flash technologies for extended cache; that is provided by SSD/EFD drives. That being the case, their arrays would not be susceptible to this specific issue caused by a faulty chip on the PCIe card found in the NetApp filers.