You've got to be careful with high automation.
Even when you're trying to build the most efficient and automated system possible, there will always be gaps, and sometimes big ones that people don't expect.
If you want a "real" backup then the backup cannot be connected to the primary (e.g. via real-time replication or clustering). A tightly integrated backup protects against many failure scenarios, but obviously cannot protect against all of them.
I endured a similar event on a 3PAR system close to 7 years ago now, and I learned a lot during the process. The support at the time was outstanding (since HP took over it has been closer to adequate than outstanding), and it made me a more loyal customer as a result. 3PAR determined that case was a one-off as well (at least at the time). The backups I had at that company were limited to small-scope tape backups due to a limited budget. Fortunately I was able to pull some miracles out of my ass and bring everything back online in a few days (the storage array itself was back online in a few hours). After all of that, the company axed the disaster recovery budget I had worked on for a month in order to give the funds to another project that they had massively under-budgeted. I left a couple of weeks after that.
I was part of another full-array-failure data loss event more than a decade ago on an EMC system; that was an interesting experience as well. I wasn't responsible for that system at the time (I supported the front-end apps). It meant maybe 35 hours of downtime, and we were recovering from the occasional bit of corrupted data in Oracle for the next year or two that I was at the company.
The key is of course to realize no system is invincible. There are bugs, there are edge cases, and in highly complex environments those can be nasty. It's certainly very unfortunate that this customer got hit by one of those, but it wasn't the first, and it won't be the last.
The biggest outages I have been a part of have been application-stack related.
Some of the more recent management I work with freak out when shit is down for an hour or two; they have no idea how bad things can get.
This kind of thing has also kept me in HP/3PAR's court (a customer now for almost 11 years), because if this kind of thing can happen to a storage platform that is roughly 10 years old, I can only imagine the issues that can happen with the startups. These big 3PAR boxes get a lot more testing, more deployments, etc.
But it's also probably an indication that HP won't ditch Hitachi for the ultra high end just yet (where they have 100% guarantees).
In general, perhaps I am lucky, or maybe just lazy, that I don't encounter more issues, because I tend not to leverage much of the functionality of the systems I use. Take 3PAR, for example: some people are surprised that I haven't used the majority of the software available for the system (e.g. I have never used replication). Part of that is budget, and part of that is knowing there are more bugs in the more complex features (on any platform).
Same with VMware: I have filed on average one ticket with HP/VMware support per year over the past 4 years, while currently running almost 1,000 VMs. It runs smooth as hell with very few issues, and again much of the more advanced stuff goes unused even though we run Enterprise Plus (we do use the distributed virtual switches and host profiles that come with it). I have seen lots of complaints over the years about VMware bugs that I honestly have never hit, I guess because I just don't have a need for those features. The only crashes I have had were caused by hardware failures (maybe 6 in the past 5 years, and none in the 6 years before that, at least while I was at those companies). And no, no plans for vSphere 6 anytime soon.
Same goes for my Ethernet switches; the feature set I need on those hasn't changed in a decade. The list goes on...
At the end of the day you have to realize what you are protecting against. Right now I am trying to get a tape system approved (with LTFS over NFS) for offline backups. What I am protecting against there is someone breaking into our systems and deleting our data AND our backups. Offline tape (stored off site) is a good tried-and-true method of protecting data. I don't expect to ever use it; we use HP StoreOnce for backups and off-site backups, but someone could still delete data from those just as they could delete data from an API-based cloud system.
Coordinating someone to return all of our tapes and delete them is a far bigger task.
Dealing with tape directly isn't fun, but I am hoping that LTFS over NFS will make it pretty easy, since all of our backups already write to NFS (on StoreOnce), so adapting them to LTFS should not be difficult. I am certainly aiming to avoid working directly with fancy tape backup software, at least.
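Since LTFS presents the tape as an ordinary mounted filesystem, the "adapting" step could be little more than a file copy from the NFS share to the LTFS mount point. A minimal sketch of that idea in Python, with hypothetical mount paths (the function and paths are illustrations, not anything StoreOnce or LTFS ships):

```python
#!/usr/bin/env python3
"""Sketch: copy backup files from an NFS-mounted backup share to an
LTFS-mounted tape. Paths below are assumptions for illustration only."""
import os
import shutil

# Hypothetical mount points, not real product paths.
NFS_BACKUP_DIR = "/mnt/storeonce/backups"
LTFS_MOUNT_DIR = "/mnt/ltfs"


def sync_to_ltfs(src_dir, dst_dir):
    """Copy files from src_dir into dst_dir, preserving the directory
    layout. Files already present at the destination with the same size
    are skipped (a cheap resume check). Returns the list of copied paths."""
    copied = []
    for root, _dirs, files in os.walk(src_dir):
        rel = os.path.relpath(root, src_dir)
        target_root = os.path.join(dst_dir, rel)
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(target_root, name)
            # Skip files that appear to be on tape already.
            if os.path.exists(dst) and os.path.getsize(dst) == os.path.getsize(src):
                continue
            # LTFS exposes the tape as a filesystem, so a plain copy works;
            # copy2 also preserves timestamps.
            shutil.copy2(src, dst)
            copied.append(dst)
    return copied


if __name__ == "__main__":
    for path in sync_to_ltfs(NFS_BACKUP_DIR, LTFS_MOUNT_DIR):
        print("copied", path)
```

In practice you would want to copy in large sequential chunks and avoid rewriting files, since tape hates random access, but the point is that with LTFS the whole job reduces to filesystem operations rather than driving tape backup software.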
It would be really cool if StoreOnce could automatically integrate with tape, so I could write to StoreOnce over NFS and have it write to tape on the backend. That would remove some steps I will otherwise have to do myself. I know there is 3PAR-to-tape automation, but that is too low level and relies on use cases that mostly don't cover what I do.