Forget snowmageddon, it's dropageddon in Azure SQL world: Microsoft accidentally deletes customer DBs

The Azure outage of January 29 claimed some unexpected victims in the form of surprise database deletions for unlucky customers. The issue afflicted a number of Azure SQL databases that utilize custom KeyVault keys for Transparent Data Encryption (TDE), according to a message sent to users seen by The Register. Some internal …

  1. Marketing Hack Silver badge
    Windows

    Holy crap, Microsoft....

    A) Shouldn't this interaction of the restoration script and custom keys have been caught in testing at some point? If you are going to support KeyVault keys, shouldn't you have tested to see what happens to those customers in a failover/DR event?

    B) So you restore the database, missing 5 minutes of data, and you rename the database file? There goes any automation your affected clients might have in place. And of course that might be 5 minutes of important DB transactions that are missing.

    C) And it seems you took your sweet time letting customers know that this was going on. Meanwhile, they are freaking out because their databases are disappearing/reappearing and their own automation and various systems monitoring is breaking down.

    1. Richocet

      Re: Holy crap, Microsoft....

      C) Presumably this took MS a little while to figure out what had happened (as it seems to me to be an obscure issue), and then wake up someone senior enough to sign off the communication to customers + maybe get a lawyer to check the wording.

      B) 5 minutes isn't much for some customers - it is a lot for others. For perspective all the banks I worked in over the years decided that 24 hours was the maximum acceptable loss of data in the case of a database being restored from backup.

      A) Yes, this should have been tested. I don't know a lot about sysadmin work, but my thinking would be that an automated script that deletes databases after such a short time would be risky. Maybe flag the database for deletion and give a person a chance to see this on a report before it gets actioned. Hundreds of these suddenly appearing on a report unexpectedly would have a good chance of triggering an intervention before the deletions.
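The "flag it, report it, then action it" idea above can be sketched in a few lines. This is a minimal illustration only, not how Azure works: the `DeletionQueue` class, the 24-hour grace period and the batch threshold of 10 are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class DeletionQueue:
    """Hypothetical holding area: databases are flagged, never dropped directly."""
    grace: timedelta = timedelta(hours=24)        # assumed review window
    max_batch: int = 10                           # assumed "this looks wrong" threshold
    pending: dict = field(default_factory=dict)   # db name -> time it was flagged

    def flag(self, db_name: str, now: datetime) -> None:
        # Record the request; keep the earliest flag time if flagged twice.
        self.pending.setdefault(db_name, now)

    def due(self, now: datetime) -> list:
        # Databases whose grace period has fully elapsed.
        return sorted(n for n, t in self.pending.items() if now - t >= self.grace)

    def approve_drops(self, now: datetime) -> list:
        """Return the names that may be dropped, or raise if the batch is suspicious."""
        batch = self.due(now)
        if len(batch) > self.max_batch:
            # A sudden flood of deletions is more likely a fault than intent.
            raise RuntimeError(f"{len(batch)} drops pending; human sign-off required")
        for name in batch:
            del self.pending[name]
        return batch
```

The point of the threshold is that a handful of expiries goes through unattended, while hundreds at once stop the script and land on someone's desk.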

      1. Mark 110 Silver badge

        Re: Holy crap, Microsoft....

        24 hours!!!! For A BANK!!!!!!

        Grief. It's not brain surgery to put 15 mins in place, and trivial to reduce that to 30 secs if you have the bandwidth / processor / storage to back up the T-logs.

        24 HOURS!!!! SERIOUSLY!!!

        1. Crypto Monad

          Re: Holy crap, Microsoft....

          "Transaction logs" is the key point here.

          It doesn't matter if the backup was from 5 minutes ago or 24 hours ago - as long as you have all the transaction logs for the intervening period.

          Hence does this mean MS do not write transaction logs for their cloud SQL service? Or they were discarded along with the affected databases?

          It would be wise to keep transaction logs for a bit longer, methinks.
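Why the age of the backup matters less than the completeness of the logs can be shown with a toy replay. This is a sketch under invented assumptions: the `(timestamp, key, value)` record format and the `restore` helper are hypothetical and bear no relation to SQL Server's actual log format.

```python
from typing import Dict, List, Tuple

# Each log record: (timestamp in seconds, key, new value). Invented for this sketch.
LogRecord = Tuple[int, str, int]


def restore(backup: Dict[str, int], backup_time: int,
            log: List[LogRecord], target_time: int) -> Dict[str, int]:
    """Rebuild the state at target_time from a backup plus the log written since."""
    state = dict(backup)  # never mutate the backup itself
    for ts, key, value in sorted(log):
        # Apply only changes made after the backup and up to the target.
        if backup_time < ts <= target_time:
            state[key] = value
    return state


backup = {"alice": 100, "bob": 50}   # taken at t=0
log = [(60, "alice", 80), (200, "bob", 70), (290, "alice", 75)]

print(restore(backup, 0, log, 300))  # logs kept: {'alice': 75, 'bob': 70}
print(restore(backup, 0, [], 300))   # logs discarded: {'alice': 100, 'bob': 50}
```

With the log intact, the restore lands on the latest state however old the backup is; without it, everything after the backup is gone.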

    2. Anonymous Coward

      Re: Holy crap, Microsoft....

      "A) Shouldn't this interaction of the restoration script and custom keys have been caught in testing at some point? If you are going to support KeyVault keys, shouldn't you have tested to see what happens to those customers in a failover/DR event?"

      You expect M$ to test?

      1. Nick Sticks

        Re: Holy crap, Microsoft....

        "You expect M$ to test?"

        This was the test.......

    3. error 13

      Re: Holy crap, Microsoft....

      B) You think it would have been less bad if they restored a 5-minute-old DB with the original name and let systems connect to it automatically and carry on processing - without knowing that the data is FUBAR?

  2. Nate Amsden Silver badge

    don't understand

    How is a DNS issue related to CenturyLink (a telecom provider, and I guess colo too)? Probably will never find out.

    (not a customer of either, just confused what kind of DNS setup MS would have that would have their internal services reliant upon an external DNS provider).

    If my external DNS (Dynect) went down completely or got corrupted or whatever, the worst thing that happens is users can't resolve the names, or resolve to the wrong place, and end up not being able to use the services. Internal DNS has dedicated zones (even duplicates of a dozen or more external zones, to override external IPs with internal ones in some cases), so nothing would be affected internally. It certainly wouldn't cascade into database failures or data loss or anything remotely like that.

    1. Anonymous Coward

      Re: don't understand

      "(not a customer of either, just confused what kind of DNS setup MS would have that would have their internal services reliant upon an external DNS provider)."

      Considering they sell themselves as a cloud provider, that would tend to mean they've got oodles both of servers and internet connectivity.

      You would have thought they could spare a few servers to use for DNS.

    2. diodesign (Written by Reg staff) Silver badge

      Re: CenturyLink

      "How is a DNS issue related to CenturyLink (a telecom provider, and I guess colo too)? Probably will never find out"

      Microsoft uses CenturyLink as an internal DNS provider. It went down. From what I can gather, that meant internal systems relying on CL DNS couldn't work, which brought down various services.

      How that triggered a script that deleted DBs is still beyond me.

      C.

    3. phuzz Silver badge

      Re: don't understand

      It's simple, 90% of the time when you have a problem in a Windows network, it's a DNS issue.

      1. Hans 1 Silver badge
        Windows

        Re: don't understand

        "It's simple, 90% of the time when you have a problem in a Windows network, it's a DNS issue."

        No, that is what the Windows Cleaner and Surface brigade think the problem is because IE tells them so ...

        1. Danny 14 Silver badge

          Re: don't understand

          Even MS don't rely on their own DNS product. Yikes.

  3. Anonymous Coward

    The fabulous cloud strikes again

    Yes, doing it in-house carries risks.

    But at least you don't have a whole litany of unknown (and seemingly un-necessary) interdependencies rearing their ugly head.

    1. ma1010 Silver badge
      Alert

      Re: The fabulous cloud strikes again

      THIS^^^^^^^^^^^^^^^

      After reading about these kinds of incidents (plus all the TITSUP incidents), I can't imagine why I would put any of my critical information in the "cloud" other than, possibly, a backup of data I had also backed up locally.

    2. Anonymous Coward

      Re: The fabulous cloud strikes again

      Errr, don't tar AWS with this brush. Azure is around 20x less reliable and 2019 appears to be much more of the same...

    3. LDS Silver badge

      Re: The fabulous cloud strikes again

      Also, one size doesn't fit all. Those who decided a DB should be deleted if its keys were not available for X hours may have had good reasons in some scenarios, but evidently not in every one. Yet every customer is then bound to those designs.

    4. Pirate Dave
      Pirate

      Re: The fabulous cloud strikes again

      What makes this a whole magnitude more funny is that Microsoft apparently (according to diodesign's comment above) outsourced their _internal_ DNS to CenturyLink. So even Microsoft doesn't fully rely on Microsoft to provide Microsoft services to Microsoft systems that support Microsoft users. If MS doesn't even eat their own dog food, why should we?

      1. Danny 14 Silver badge

        Re: The fabulous cloud strikes again

        I was arguing with some other admins over o365. I run our own Exchange and SQL Server. I was laughed at. I tried to explain that downtime is negligible, as we cluster both services over VMs, so in the intervening years downtime has been very low. They were arguing o365 is far more reliable - just as o365 went down at prime time on a working day.

        Uh huh. I'll stick to my clustered servers, thanks.

  4. W.S.Gosset Bronze badge

    ACID

    Me, I'm thinking more about the horrific spaghetti that clients' DATA has turned into, if they have sequential/chronologically-dependent transactions (e.g., account balance-authorised changes such as bank withdrawals or credit sales) coming in on the as-of-5mins-ago DB with 5 mins missing.

  5. Herby Silver badge
    Joke

    Now where was.....

    That multi-million transfer from the Nigerian prince that was supposed to come through....

    Kinda a 'duh' for the icon...

  6. json

    Paranoia will never go out of style..

    That's why I'm hesitant to go with any cloud-hosted, high-availability, high-performance database (or at least that's what the marketing blurb says)... perfectly happy configuring and clustering our own in cloud-hosted instances.

  7. Nolveys Silver badge
    Windows

    Redmond is offering months of database service for free as compensation...

    That sounds like a good deal.

    1. seven of five

      Certainly. They even gave you an empty DB to start with.

    2. Ken Moorhouse Silver badge

      Re: Redmond is offering months of database service for free as compensation...

      To hijack a well-known phrase:-

      "If you've got nothing important to save, you've got nothing to lose."

    3. Anonymous Coward

      "That sounds like a good deal."

      Only if they're offering it in AWS...

  8. Dwarf Silver badge

    What a bunch of cowboys

    .... this sort of event gives cowboys a bad name.

  9. Ken Moorhouse Silver badge

    5 minutes data lost...

    This time round.

    Next time round it could be more.

    You have been warned.

    1. Anonymous Coward

      Re: 5 minutes data lost...

      "5 minutes data lost...This time round.

      Next time round it could be more.

      You have been warned"

      Ah yes, but "lessons will be learned", in line with the Company's policy of continuous product and service de-improvement.

  10. Anonymous Coward

    Access has been re-established ...

    "Full access has been re-established for most of those customers already."

    Good, thanks ! And what about the data in my DB ???

  11. 2+2=5 Silver badge
    Joke

    Rapid response from Microsoft

    > the biz is at pains to explain “if TDE encrypted SQL databases lose access to the key vault because they cannot bypass the firewall, the databases are dropped within 24 hours.”

    The one time, the one time Microsoft actually does something quickly and efficiently - it's to delete your data. D'oh

    1. Anonymous Coward

      Re: Rapid response from Microsoft

      At least it goes to show they are not holding onto it forever and doing who knows what with all the information.

      I wonder how long it takes AWS or Google to drop the data? I would expect Google to hold onto it, use it, sell it, and re-purpose it, and reuse it and then resell it.

      1. Anonymous Coward

        Re: Rapid response from Microsoft

        If Trump can "keep more promises than he made", then it must also be a good thing that Microsoft can delete more data than you intended.

  12. Zippy´s Sausage Factory
    Trollface

    Good news...

    ...for Amazon web services and the minnows. (Oracle have a cloud service or something like that, right?)

  13. bwright72

    Little Bobby Tables...

    ... has graduated from school and now works at Microsfot?

    1. solv

      Re: Little Bobby Tables...

      Genius post...XKCD is like the Simpsons used to be....so suitable to quote in so many scenarios

  14. steviebuk Silver badge

    But...

    ...the cloud is never wrong

    1. Danny 14 Silver badge

      Re: But...

      I'm sorry Dave, I can't let you commit that transaction.

  15. Patched Out
    Mushroom

    C.L.O.U.D.

    Customer Loses Own Use of Data

  16. phuzz Silver badge
    Unhappy

    This is terrible and all, but I have to admit that ten minutes of downtime, and restoration of data from only five minutes ago is better than any backup system I've been responsible for...

    1. TimR

      roll forward logs

      Back in the dim & distant past when I had some contact with database administration, I seem to recall something about roll forward logs. Are these no longer a thing?

      1. error 13

        Re: roll forward logs

        or - top tip - before you drop a customer database, make damned sure you have a snapshot of it first? It's not that hard is it... single user / drop connections / snapshot / drop
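The "snapshot / drop" ordering above can be sketched with Python's built-in `sqlite3` module. This is a simplification under stated assumptions: SQLite has no single-user mode or connection-dropping step, so those are omitted; the in-memory snapshot stands in for durable storage; and `snapshot_then_drop` is a hypothetical helper, not anything Microsoft ships.

```python
import sqlite3


def snapshot_then_drop(live: sqlite3.Connection, table: str) -> sqlite3.Connection:
    """Copy the whole database, verify the copy, and only then drop the table."""
    snap = sqlite3.connect(":memory:")   # stand-in for durable snapshot storage
    live.backup(snap)                    # sqlite3's online backup API (Python 3.7+)

    # Verify the snapshot before destroying anything.
    # The table name is interpolated for brevity; a real tool should validate it.
    count = f'SELECT COUNT(*) FROM "{table}"'
    live_rows = live.execute(count).fetchone()[0]
    snap_rows = snap.execute(count).fetchone()[0]
    if live_rows != snap_rows:
        snap.close()
        raise RuntimeError("snapshot does not match live data; refusing to drop")

    live.execute(f'DROP TABLE "{table}"')
    live.commit()
    return snap


live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
live.executemany("INSERT INTO orders (amount) VALUES (?)", [(10,), (20,), (30,)])
live.commit()

snap = snapshot_then_drop(live, "orders")
print(snap.execute("SELECT SUM(amount) FROM orders").fetchone()[0])  # 60
```

The drop is the last statement in the function, so any failure while taking or checking the snapshot leaves the live data untouched.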

        1. Deltics
          Boffin

          Re: roll forward logs

          From the information about the incident it seems this is more-or-less what they DID have, albeit the snapshot was at most 5 minutes older than the moment of droppage.

          But even if the snapshot were taken instantly before the DB was dropped, there would be an elapsed period between when the DB was not available and when the screw up was realised and the database restored from that snapshot.

          The statement that "5 minutes of data was lost" comes from El Reg and appears to naively extrapolate from the claimed oldest age of backup and ignore the question of just how long it was between the database ceasing to exist and the realisation of same and restoration of backup.
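The distinction drawn above, between the age of the snapshot and the time the database simply did not exist, is plain arithmetic. In this sketch the timestamps are hypothetical, picked only to match the "five minutes" and "ten minutes" figures mentioned in this thread, and `loss_window` is an illustrative helper.

```python
from datetime import datetime, timedelta


def loss_window(snapshot_at: datetime, dropped_at: datetime, restored_at: datetime):
    """Split the gap into writes that were lost and time the database was absent."""
    committed_lost = dropped_at - snapshot_at   # written after the snapshot, then dropped
    unavailable = restored_at - dropped_at      # nothing could be written at all
    return committed_lost, unavailable


# Hypothetical timeline, not taken from any Microsoft statement.
lost, down = loss_window(
    snapshot_at=datetime(2019, 1, 29, 12, 0),
    dropped_at=datetime(2019, 1, 29, 12, 5),
    restored_at=datetime(2019, 1, 29, 12, 15),
)
print(lost, down, lost + down)  # 0:05:00 0:10:00 0:15:00
```

On these made-up numbers the restored database is 15 minutes behind the clock, of which only 5 are transactions that were actually committed and then lost.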

    2. Phil Endecott Silver badge

      > data from only five minutes ago is better than any backup system I've been responsible for..

      Not heard of live database replication?

      1. phuzz Silver badge
        Happy

        Replication to what? Another server? Well la di dah look at mr fancy over here with his multiple servers!

        We didn't have the budget for one database server, let alone two. It was all running on the same box as the application, and user files, and was probably a DC as well (in a cardboard box in the middle of the road, uphill through the snow both ways etc. etc.)

        1. Danny 14 Silver badge

          Don't laugh. I inherited that scenario: two servers, both DCs, one with IIS and SQL, the other with Exchange (and KMS server, my god). First port of call was to virtualise and migrate to individual servers before something bad....

  17. TheBorg

    oops .... duff code written by people with no real sense of the real IT world. What happened to transaction logging and decent DR/BCP?

  18. Ken Moorhouse Silver badge
    Coat

    Re: oops .... duff code written by people with no real sense of the real IT world

    I'd be careful with language like that.

    OOP'ers may wish to deny you your inheritance, abstract your limbs, encapsulate them in concrete and polymorph them into a piece of architecture.

  19. hoola

    All is well

    Like all these US based mega corporations, all that is needed is a quick "we are very sorry blah, blah" All is now sorted and everyone has a warm fuzzy feeling.

    No data is lost (from Microsoft's view) and if it is, then the customer can go back and find it.

    It happens every time, and appears to be happening with ever more frequency, yet businesses still keep throwing more money at them.

    Sigh!

  20. braegel

    How do they do a restore from encrypted data without the private key?

    The question for me is: if the keys are gone from the KeyVault, how was MS able to restore the backup? The backup and transaction log are also encrypted, using a cert from the master DB and the private key from the KeyVault, via the encrypted decryption key saved in the DB or backup file.

    1. braegel

      Re: How do they do a restore from encrypted data without the private key?

      It is now confirmed that the keys were never deleted; they were 'only' unavailable for 24h because of the known DNS problem. So the DB was dropped after this time by design. And the backup/restore must come from time -24,05h

  21. Martijn Otto
    Joke

    Must be a misunderstanding

    Instead of the more common Ctrl-Alt-Delete to restart a server with a BSOD, somebody accidentally hit only the Delete key, resulting in customer data deletion.

  22. Anonymous Coward

    Lack of a DR plan

    As usual, a total lack of a DR plan.

    A database with binary logs would allow you to restore the database and apply the logs (better still, as others have mentioned, also replicated to another location). Whilst this should not have happened, the lack of any sort of DR plan reflects badly on those affected. Even on premise you could have had a major failure such that you had to restore to the last backups and recover.

    It continues to amaze me how many companies don't just sit down and think "what happens if...?" It doesn't take much. Bandwidth is so cheap and VMs are so cheap that there's really no reason not to replicate and have a plan to recover if the cloud/premise is nuked.

Biting the hand that feeds IT © 1998–2019