Storage array firmware bug caused Salesforce data loss

Salesforce.com has revealed that a bug in the firmware of its storage arrays was behind last week's data loss incident. The mess started in the company's Washington data centre on May 9th, when admins noticed “a circuit breaker responsible for controlling power into the data center had failed.” “The team engaged the circuit …

  1. Anonymous Coward

    Bugs, eh?

    Formal methods, son! FORMAL METHODS. Allied to CORRECT ARCHITECTURE.

    Sadly, even in 2016, most developers would try to keep as far as possible from any modal first-order logic formula.

    I recently came across this lecture, btw. Pretty fun:

    SPIN 2016 - On Verification Challenges at the Large Hadron Collider - Tim Willemse.

    mCRL2 (an example of a process algebra) is used to check the LHC control system. It seems to be slowly emerging from the hacker cave into the light of engineering. Sadly, I missed most of this.

    1. Anonymous Coward

      Re: Bugs, eh?

      > FORMAL METHODS. Allied to CORRECT ARCHITECTURE

      Or at least use provably correct algorithms, like the Raft protocol for replication of state.

      It's amazing how many vendors stick two head ends on a system and say "this is redundant", on the basis that if A fails then B is there to take over. But every possible scenario of node A and node B running slow or going offline and coming back online *will* eventually happen; and usually the result is split-brain data corruption.
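
      (A toy sketch, just to illustrate the majority rule that Raft and its relatives are built on: two head ends can never safely decide on their own who is primary, whereas three can.)

        # Toy illustration of the quorum rule behind consensus protocols such
        # as Raft: a node may only act as primary if it can reach a strict
        # majority of the cluster. With two nodes, a partition leaves each
        # side with exactly half, so naive "A failed, B takes over" logic ends
        # up with both sides writing; that is where split-brain comes from.

        def may_act_as_primary(reachable_nodes: int, cluster_size: int) -> bool:
            """True only if this node can see a strict majority of the cluster."""
            return reachable_nodes > cluster_size // 2

        # Two head ends, network partition: each node sees only itself.
        print(may_act_as_primary(reachable_nodes=1, cluster_size=2))  # False on both sides
        # Three nodes, same partition: one side sees two nodes, the other sees one.
        print(may_act_as_primary(reachable_nodes=2, cluster_size=3))  # True: majority side serves writes
        print(may_act_as_primary(reachable_nodes=1, cluster_size=3))  # False: minority side stays quiet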

      Here they were replicating a block store. Fine, except a filesystem is a data structure which extends over multiple blocks; if you have replicated some blocks but not others at the time you cut over, your data structure is toast. And even if the filesystem itself is in a consistent state, the individual files your application reads and writes may be in an inconsistent state that is useless if the application has to restart at this point (think: VM image files, database files).

      So you need all your applications *and* your OS to generate consistent checkpoints, and replicate only at a checkpoint state. Alternatively, snapshot the entire running system state including RAM (which contains the application state and VFS block cache) and replicate that, which needs to integrate VM and storage layers.
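
      (Rough sketch of the crash-consistent variant on Linux, assuming the data lives on LVM; the volume group, logical volume and mount point names below are made up, and any application-level checkpoint still has to happen before the freeze.)

        # Rough sketch of a crash-consistent checkpoint: quiesce, freeze the
        # filesystem so no half-written blocks are in flight, snapshot, thaw,
        # and replicate the snapshot rather than the live volume.
        # All names (vg0, data, /srv/data) are invented for illustration.
        import subprocess

        MOUNTPOINT = "/srv/data"
        ORIGIN_LV = "/dev/vg0/data"

        def take_consistent_snapshot(snap_name: str) -> None:
            # 1. Application checkpoint goes here (flush/commit, pause writes).
            subprocess.run(["fsfreeze", "--freeze", MOUNTPOINT], check=True)
            try:
                # 2. The snapshot is taken while the filesystem is quiescent,
                #    so its on-disk structures are consistent as a set.
                subprocess.run(
                    ["lvcreate", "--snapshot", "--name", snap_name,
                     "--size", "10G", ORIGIN_LV],
                    check=True,
                )
            finally:
                # 3. Always thaw, even if the snapshot failed.
                subprocess.run(["fsfreeze", "--unfreeze", MOUNTPOINT], check=True)
            # 4. Replicate /dev/vg0/<snap_name> to the remote site, not the live LV.

        take_consistent_snapshot("checkpoint-001")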

      This is much harder than people think; and of course if you get it wrong everything appears to be working tickety-boo, until the day it fails and you really needed it to be right.

      If Salesforce are using Oracle under the hood, I'd expect them to switch from block replication to RMAN replication. Each database transaction is either replicated, or it isn't.

      1. Down not across

        Re: Bugs, eh?

        > If Salesforce are using Oracle under the hood, I'd expect them to switch from block replication to RMAN replication. Each database transaction is either replicated, or it isn't.

        That does not make sense. Recovery Manager (RMAN) is used for backups. Perhaps you meant Data Guard, where you have the option of a physical or logical standby. In any case the article didn't clarify whether the blocks were disk blocks or Oracle (or other database) data blocks. If they were disk blocks, then that does open up a window for possible corruption. At least with transactional replication (from the database's perspective) you should in theory only lose uncommitted transactions.
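
        (A toy sketch of why transaction-level replication can only ever lose uncommitted work, if it helps; this is the general idea only, not Data Guard or GoldenGate themselves.)

          # Toy standby that buffers change records per transaction and applies
          # them only once it has seen the COMMIT. Anything still buffered when
          # the link dies is thrown away, so the standby ends up consistent but
          # slightly behind, with only uncommitted transactions lost.
          from collections import defaultdict

          class Standby:
              def __init__(self):
                  self.applied = []                   # committed changes, in order
                  self.in_flight = defaultdict(list)  # txid -> changes awaiting commit

              def receive(self, record):
                  kind, txid, payload = record
                  if kind == "change":
                      self.in_flight[txid].append(payload)
                  elif kind == "commit":
                      # The whole transaction becomes visible atomically, or not at all.
                      self.applied.extend(self.in_flight.pop(txid, []))

              def link_lost(self):
                  # Partial transactions are discarded; the standby stays consistent.
                  self.in_flight.clear()

          standby = Standby()
          standby.receive(("change", 1, "UPDATE accounts ..."))
          standby.receive(("commit", 1, None))
          standby.receive(("change", 2, "DELETE FROM orders ..."))  # never committed
          standby.link_lost()
          print(standby.applied)  # only transaction 1 made it across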

        You could also use a third-party product (well, it was third party until Oracle bought them) like GoldenGate for transactional replication, if you feel so inclined.

        Disclaimer: Freudian or not, I initially typed GoldenFate ... make of that what you will.

    2. TheVogon

      Re: Bugs, eh?

      "Circuit breakers broke bad"

      Is that meant to read "badly" or is this written in American?

      1. Solmyr ibn Wali Barad

        Re: Circuit breakers broke bad

        Maybe they started running a meth lab alongside their day job?

      2. Simon Sharwood, Reg APAC Editor (Written by Reg staff)

        Re: Re: Bugs, eh?

        It's a little homage to the TV show Breaking Bad. We like to slip in the occasional pop culture reference to show we're down with the kids.

  2. Ken Moorhouse Silver badge

    Predicting failure modes

    ...is incredibly difficult to achieve. There will always be some combination of events that screws up the calculations. That one-in-a-million possibility is the chink in the armour which then causes an unconstrained domino effect, because it was down at the bottom end of the risk assessment.

    I used to work for an organisation that had two of everything, so that if one system failed the other could be brought on-line, minimising downtime. One day the switching mechanism between the two systems failed.

    At the end of the day you have to balance the cost of call-out teams to cope with one-in-a-million failures against day-to-day operations. Most companies would prefer the asterisk disclaimer at the end of their Up-Time Promise.

    1. Doctor Syntax Silver badge

      Re: Predicting failure modes

      "Most companies would prefer the asterisk disclaimer at the end of their Up-Time Promise."

      If you run your own services it's your data and ultimately your business at risk and you can decide what it's worth paying to protect it. If you decide to put the services on someone else's computer then from that someone else's point of view it's not their data and only the penalties in the SLA are at risk.

  3. The Islander

    What are the chances?

    In my experience, I have observed two cases where an external event caused a storage array to stop processing. The first, in the mid-nineties, related to a power failure when the banking organisation used a 3-phase rather than a 1-phase supply; eventually the array went TITSUP, but gracefully, which maintained the integrity of the data.

    In a second case, a few years later and again in a banking organisation, an overly zealous security team targeted the processors in the array for vulnerability scanning. The array respectfully declined the attempts to poke it and again shut down while maintaining integrity.

    Each occurred with a different tier 1 vendor of the day, and each provoked an equally anxious reaction from the client bank, as it was several hours before the data could be validated. Interestingly, in both cases the pressure to "just move to D/R" was intense, but it would have achieved nothing except possibly caused a catastrophe. In each case the vendor was able to diagnose the problem and enable a return to normal service within a few hours.

    Picking up on the earlier commentard: the overhead of formally analysing increasingly extended supply chains for weakness - indeed even for the much-vaunted availability predictions of service organisations - is too great for many client organisations. They turn to the domino effect or the "one in a million" chance as their rationale for TD;JD (too difficult; just disregard).

  4. Anonymous Coward

    Cloud or local mirror

    A lot of execs seem to have forgotten that consolidating data and services into a massive datacenter also means consolidating the risks.

    In this case they clearly didn't test or rehearse failover procedures properly, since it took them hours to realize that not all data had been replicated to the other site, as well as to identify and fix the firmware bug. They were doomed the moment they failed over, since the data was inconsistent and they had a classic split-brain scenario in terms of data consistency.

    So to summarize the situation:

    - flawed design

    - bad test procedures

    - bad engineering

    - bad maintenance

  5. Anonymous Coward

    Failing over to DR

    Once upon a time one of our datacentres had some power problems. Up top, someone decided that we should go into DR. At the time I was looking after a cluster of Sun boxes that used Veritas Cluster Server and Veritas Volume Replicator (amongst other things) to not only keep local clusters running but also their brethren in another datacentre.

    I pointed out to the top IT management bod that failing over to DR was fine... but you had to be sure, because failing back again was pretty nasty: you had to effectively reverse the volume replication, and getting that wrong would mess up the databases at both sites. Because of that we usually only did such work (and DR testing etc.) at the weekend. This was on a Friday, so I pointed out we'd be running in reverse for at least a week.

    Management insisted we fail over, so we did, and it worked fine. Which was a relief, as it was the first time I'd had to do such a failover myself and my boss was on holiday. My problems then began when no other application failed over correctly, so we'd have to fail back. I then got to spend a rather fraught weekend failing everything back and hoping I hadn't messed up reversing the volume replication. Once again it all turned out OK in the end, but I was crossing my fingers quite a bit!

  6. Anonymous Coward

    Fragile database structure?

    Sounds like Oracle database structures on disk are pretty fragile, if the recovery tools can't fix things even when provided with a redo log.

  7. Anonymous Coward

    Only Java skills needed for "architects"

    I went through two rounds of being recruited by Salesforce before I gave up on them.

    In both cases (and in numerous other cases I have heard about secondhand), the main skill they look for in their "architects" is Java development, plus presentation skills. So it appears that SFDC are another organization that uses the title Architect to mean senior Java developer (in common with a number of large clients I have worked with).

    Of course this leads to situations like the one described when everyone involved in the design is a code jockey and nobody has a hardware or network background.

  8. Nate Amsden

    zfs to blame?

    My experiences with ZFS and corruption were very frustrating. This was on OpenSolaris. I was expecting a simple tool that would just clean the corrupt blocks and mount the FS, like any other fsck. But no, it went straight into a kernel panic and a reboot loop.

    With Oracle I would expect it to roll back the transactions that were bad.

    I was at another company a long time ago where we had Oracle corruption caused by both controllers in the SAN failing at the same time. (The SAN admin blamed himself for misconfiguring the controllers to allow that to happen; to this day I can't imagine why a system would allow you to configure it in such a way.) About 20 hours to recover, from what I recall, though we were still seeing the occasional ORA-600 or similar error that indicated a corrupt part of the DB more than a year later. They had no really good backups either. They did get budget for real standby servers after the incident, though.

    The tools to deal with ZFS corruption were very immature at the time (maybe four years ago). Repairing ZFS corruption seemed to be generally regarded as voodoo. There were things I tried at the time, though I don't recall what; it was a while ago. Nothing helped; all data was lost.
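
    (For what it's worth, the blunt instruments that do exist look roughly like this; the pool name "tank" is made up, these are separate steps rather than a script, and none of it is what would have saved that pool.)

      # The usual order of escalation for a sick pool. Run as root; "tank" is
      # an invented pool name. These are independent steps, not a pipeline.
      import subprocess

      POOL = "tank"

      def check_status():
          # Lists devices and any files ZFS already knows are damaged.
          subprocess.run(["zpool", "status", "-v", POOL], check=True)

      def start_scrub():
          # Re-reads every block, verifies checksums, and repairs from
          # redundancy where any exists.
          subprocess.run(["zpool", "scrub", POOL], check=True)

      def rewind_import():
          # Last resort for a pool that refuses to import: roll back to an
          # earlier transaction group, deliberately discarding the newest writes.
          subprocess.run(["zpool", "import", "-F", POOL], check=True)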

    Fortunately the data was a collection of backups, so nothing was really lost other than the downtime from the panics and the time spent tracing down their cause.

    The system was a Nexenta HA cluster that went split-brain. Both nodes tried to write to the same volume and, bam, it imploded. Nexenta support immediately said we were fucked and to restore from backup. They refused to offer advice on the ZFS tools used to try to repair the system.

    After 2 or 3 incidents of this we disabled HA until we could find a replacement solution.

  9. STZ

    Firmware bug - or cost pressure?

    The question remains why moving a processing instance to a backup data center also required moving the affected data. That data should have been there already, and there are data replication products from various vendors serving exactly that purpose. This would have prevented the database overload.

    However, such a strategy comes at a higher cost, as you need to allocate additional resources in preparation for failure. Optimizing resource utilization and the SLA fine print is cheaper ...
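
    (A toy sketch of the "data is already there" approach: changes are shipped to the standby site continuously, so a failover only has to redirect traffic. Everything below is generic stand-in code, not any particular vendor's product.)

      # Continuous asynchronous shipping: the standby trails the primary by at
      # most a few in-flight changes, so there is nothing to bulk-copy when a
      # failover happens. The queue stands in for the replication link.
      import queue
      import threading

      change_log = queue.Queue()
      standby_copy = []

      def ship_continuously():
          # Runs all the time, not just during a failover.
          while True:
              change = change_log.get()
              standby_copy.append(change)
              change_log.task_done()

      threading.Thread(target=ship_continuously, daemon=True).start()

      for i in range(5):
          change_log.put(f"change {i}")  # ordinary production writes

      change_log.join()     # standby has caught up
      print(standby_copy)   # failing over now is just pointing clients here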
