Revealed: How Microsoft DNS went titsup globally on Xbox One launch day

Microsoft's major outage last week was caused by a policy rollout that derailed its own DNS servers – a blunder that also downed some of the tech giant's internal services. The outage hit on Thursday, during which key websites such as Xbox.com and Outlook.com were knocked over, connectivity to the Office 365 online software …

COMMENTS

This topic is closed for new posts.
  1. Sil

    Learning from mistakes

    Let's hope Microsoft learns quickly from its mistakes, as many people depend on its cloud services one way or another.

    I would bet that more than a few of us got burned by a certificate a second too old or a small DNS mistake. There's got to be a better way: more resilient, more fault tolerant, and easier to simulate and test. Research in this field wouldn't be a waste of money.

    1. Anonymous Coward
      Trollface

      Re: Learning from mistakes

      Nope, MS haven't learnt from their mistakes. They release alpha software and charge a fortune for the privilege of you testing it.

      1. Anonymous Coward
        Anonymous Coward

        Re: Learning from mistakes

        They release alpha software and charge a fortune for the privilege of you testing it.

        What is impressive is that they get away with it - many people have already forgotten about Vista...

        1. MrDamage Silver badge

          Re: Learning from mistakes

          Not to mention those who bought Windows Fister had completely forgotten about Windows ME, Win95a, Windows 3.0 and DOS 5.0.

          Microsoft's cloud services really do seem to be just vapourware at the moment given their current reliability.

          With more and more "services" being moved to the cloud instead of letting people run local copies of software they purchased, it won't be long before they start getting their arse handed to them for breach of various consumer laws.

          Don't start on about licence agreements. If it isn't presented prior to the point of sale, it's null and void under Australian Consumer Law.

          1. AppleGuyTom
            Happy

            Re: Learning from mistakes

            windows "Fister"

            HaHaHah! : )

      2. sabroni Silver badge
        Thumb Up

        Re: They release alpha software and charge a fortune for the privilege of you testing it.

        Yeah! They should do a Google and release all their alpha software free. It might not work properly but at least you don't end up out of pocket as well!

        1. Anonymous Coward
          Anonymous Coward

          Re: They release alpha software and charge a fortune for the privilege of you testing it.

          "Yeah! They should do a Google and release all their alpha software free."

          They already do - just Bing 'CTP'

          1. Anonymous Coward
            Anonymous Coward

            Re: They release alpha software and charge a fortune for the privilege of you testing it.

            Bing ?

            WTF is Bing ?

            1. b166er

              Re: They release alpha software and charge a fortune for the privilege of you testing it.

              Google it

      3. Tom 13

        @ rm -rf /

        I beg to differ. They have learned, but not the lesson we techies would like. They've learned the same lesson the spammers have: there are enough foolish people out there that you can get away with it.

    2. Voland's right hand Silver badge

      Re: Learning from mistakes

      No they have not.

      DNS is one of those infrastructural technologies that should not have dependencies on other stuff, because other stuff (like Active Directory) actually depends on it for its own discovery. Having Active Directory control an SME DNS server is an acceptable risk.

      Having Active Directory do anything with the production systems of a multi-billion corporation providing online services is stark raving lunacy.
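
      By way of a rough illustration (made-up domain name, assuming the DnsClient module on a reasonably recent Windows box): locating a domain controller is itself a DNS query against SRV records, so once DNS is broken, anything that needs AD can't even find a DC.

      # Hypothetical domain: AD clients locate domain controllers via DNS SRV records,
      # so this lookup has to succeed before any AD-dependent service can find a DC.
      Resolve-DnsName -Name '_ldap._tcp.dc._msdcs.corp.example.com' -Type SRV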

      Disclaimer - in one of my past lives I used to manage DNS deployments bigger than MS's, so I am probably a bit biased here.

      1. Matt 21

        Re: Learning from mistakes

        I would tend to agree with Voland (right or left hand): DNS should be run separately and independently. I'd even consider splitting it out amongst major functions or customers.

        Doing this has meant zero problems in my experience.

        Mind you, I'd also have a change freeze around "hot spots" like Xbox One go-lives...

      2. Anonymous Coward
        Anonymous Coward

        Re: Learning from mistakes

        "Having active directory to do anything with the production systems of a multi-billion corporation providing online services is a stark raving lunacy."

        That's exactly what 95%+ of the FTSE 500 do... perhaps it is just you that is raving?

    3. Anonymous Coward
      Anonymous Coward

      Re: Learning from mistakes

      "Microsoft made DNS changes a single point of failure – and this needs to be dealt with."

      That's the design of DNS - single master - not much Microsoft can do about that....

      1. Mike Pellatt

        Re: Learning from mistakes

        "That's the design of DNS - single master - not much Microsoft can do about that...."

        Errr, no. How the database is set up is entirely separate from the client view of the domain. Single-master is the original design of the most commonly used DNS server - ISC BIND - which is where you may have got that idea.

        To give them a little credit, the MS DNS server gets its data from AD, so the data is mirrored at each AD/DNS server. However, its tight integration with AD makes the validity of the AD structure a single point of failure, and reduces the speed at which rollback can take place - as this outage would seem to show. It's also a bit of a security risk to run DNS and AD on the same server, but I presume MS have separate caching servers facing the outside world to manage that risk (Ahem...).
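
        You can see that coupling directly on a Windows DNS box - a rough sketch, assuming the DnsServer PowerShell module (Server 2012 or later) and a made-up zone name:

        # Sketch only, zone name made up: an AD-integrated zone lives in the directory
        # and replicates with it, rather than sitting in a standalone zone file on disk.
        Get-DnsServerZone -Name 'corp.example.com' |
            Select-Object ZoneName, ZoneType, IsDsIntegrated, ReplicationScope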

        PowerDNS can use a range of backend databases and, using PostgreSQL for instance, can have a multi-master backend with sub-minute rollback if set up right.

  2. Fatman
    FAIL

    The "Cloud"

    From the article:

    The outage hit on Thursday, and knocked over key websites such as Xbox.com and Outlook.com, axed connectivity to the Office 365 online software suite, and cut off multiple Azure cloud services from the outside world.

    I wonder how many corporate bean counters will have to explain why their employer's staff are just sitting there twiddling their thumbs while Mickeysoft gets its shit together.

    "We can save $$$ by going to Office 365!!!" was probably the rationale for jumping into that cesspool.

    And so the lemmings jumped.

    Why do I feel no compassion for bean counters who don't fully appreciate the stupidity of their decisions?

    Someone find me a cattleprod, I have some PHB's that need educating!!!!!!

    1. John Riddoch
      Thumb Down

      Re: The "Cloud"

      Because, of course, internal systems never, ever fail, do they? Provided your cloud solution is as reliable as an internal system, there are no issues. It's just that because all companies now have their outages at the same time, the failure becomes public knowledge and looks more widespread.

      1. Pascal Monett Silver badge
        Flame

        Re: Because, of course, internal systems never, ever fail

        That sort of stupid response needs to stop right now.

        An internal system can fail just as well as any other, obviously. The key difference is that an internal system failure affects only you, and not hundreds of thousands, if not millions, of other people and businesses.

        On top of that, when the failure is internal, you have monkeys to whip until it gets back in shape. When you're on the cloud, you only have a telephone number, which may or may not respond, and you still have to wait until THEY correct the issue - which may take days - after which you are left contemplating the pieces and wondering about restores (as Azure users have already learnt, much to their chagrin).

        In an internal failure, if you don't have proper backups (that have been tested) or don't know how to rollback cleanly, you only have yourself to blame.

        So please cut it out with this internal failure nonsense. It has absolutely no bearing on, and is in no way comparable to, the nature of this issue.

        Thank you.

        1. Anonymous Coward
          Anonymous Coward

          Re: Because, of course, internal systems never, ever fail

          @Pascal Monett - If 50% of your company is out because of an internal failure, it's exactly the same as 50% of your company and a bazillion others being out because of an external cloud provider's failure. The advantage you do have is that the people at the cloud company will tend to be more skilled at fixing the problem, as they can afford better engineers. You can also still beat up an external supplier to make things work - I've been doing it for years with network providers who are external to the companies I work for.

          1. A K Stiles

            Re: Because, of course, internal systems never, ever fail

            The other reason to do things internally and not rely on the cloudy services is that these sorts of changes can, and probably should, be timed to roll out when any unforeseen issues will cause the least disruption to your organisation. Try persuading Microsoft to co-ordinate with all their clients so the timings aren't potentially painful...

          2. Anonymous Coward
            Anonymous Coward

            Re: Because, of course, internal systems never, ever fail

            "The advantage you do have is that the people at the cloud company will tend to be more skilled in fixing the problem as they can afford better engineers."

            That would be the more expensive, highly skilled engineers who borked the system in this case then?

  3. oldtaku Silver badge
    Facepalm

    Looks like they got bit by their own byzantine circus of GPO settings which interact in (often) mysterious ways and make life hellish on the Windows servers.

    I'm sure they even tested before pushing, but it only takes one obscure setting on the production servers, or something that only shows up with real requests coming through the real load balancers.
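
    If nothing else, a resultant-set-of-policy dump from a canary production box would show what it will actually apply before a change goes estate-wide - a rough sketch, assuming the GroupPolicy module and a made-up server name:

    # Sketch, server name made up: capture the resultant policy a production box would
    # actually apply, so it can be diffed against what was tested in the lab.
    Get-GPResultantSetOfPolicy -Computer 'dns-edge-01' -ReportType Html -Path '.\dns-edge-01-rsop.html'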

    Now that's eating your own dogfood.

    1. Anonymous Coward
      Anonymous Coward

      "GPO settings which interact in (often) mysterious ways and make life hellish on the Windows servers."

      They are not 'mysterious' if you are competent - and are far less hellish than managing the millions of text config files on a Linux estate...

  4. raving angry loony

    Ouch.

    I'm guessing that they'll probably blame the sysadmins. Again. Rather than blaming poorly designed software, or cheap-arse management, or any of the other dozens of real causes. Because sysadmins are easy targets. Just ask El Reg, they've done so recently themselves.

  5. Peter Brooks 1

    Maybe they should use LDAP instead...

    Would you get on an aeroplane if you knew its fly-by-wire software was written by Microsoft?

    But then, would you prefer a company that makes money from selling licences or a company that makes money from selling things people want to buy?

    1. Trixr

      Re: Maybe they should use LDAP instead...

      What does DNS - name resolution for internetworked hosts, if you've forgotten - have to do with LDAP?

      If you are using LDAP for authorisation, authentication or directory services and your name resolution isn't working, you're not going to reach your LDAP hosts either with DNS down. Unless you've hard-coded the IPs.
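
      A rough sketch of that point, with made-up names and addresses (assuming the DnsClient and NetTCPIP cmdlets on a recent Windows box): once resolution is gone, only something you've hard-coded still gets you to an LDAP host.

      # Sketch with made-up names/IPs: if resolving the LDAP host fails, the only thing
      # left is poking a hard-coded address directly on the LDAP port.
      try {
          Resolve-DnsName -Name 'ldap01.corp.example.com' -ErrorAction Stop
      } catch {
          Test-NetConnection -ComputerName '10.0.0.10' -Port 389
      }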

      1. Anonymous Coward
        Anonymous Coward

        Re: Maybe they should use LDAP instead...

        Some knucklehead decided that a database is the place to store DNS records because BIND configuration files are just oh-so-complicated-and-tricky... but I guess it's better than storing DNS data in the registry...

        I used to manage a VitalQIP farm of 14 or so DNS servers, and while the data in QIP was stored in a database, the DNS servers used machine-generated, industry-standard BIND config files, so that in the event of the inevitable OHSHIT moment you could manually log into the DNS servers and fix things.

      2. Anonymous Coward
        Anonymous Coward

        Re: Maybe they should use LDAP instead...

        As someone has pointed out, MS uses LDAP (AD) to store DNS records.

        Now, they also use DNS as the definitive way to "find" AD, so you can get a chicken-and-egg problem whereby, because DNS is buggered, your DNS servers can't find AD to load their zone files.

        Most people won't have this problem in general, because they will have their AD DCs doing the DNS server role and hence will be able to at least find their zones locally.

        In a really big deployment you really want to avoid this sort of thing ...

        1. Anonymous Coward
          Anonymous Coward

          Re: Maybe they should use LDAP instead...

          "so you can get a chicken and egg problem whereby because DNS is buggered, your DNS servers can't find AD to load their zone files."

          Liar - DNS can ONLY run on a Domain Controller when in AD integrated mode, so it doesn't have to 'find AD'...

          1. Gorbachov

            Re: Maybe they should use LDAP instead...

            But the AD instance running on the DNS server (not a master AD in a big setup) needs to be able to reach other AD servers. So, theoretically, if you push a borked DNS entry to the DNS servers you might lose connectivity and thus functionality. It's been a while since I had to deal with the AD monster but I do remember DNS being a pain to set up properly.

            The fact that the best AD engineers MS had, on a critical system, took 80 min to solve a configuration issue tells us something. If it was a big hardware event, fine, but this?

            OTOH kudos for letting us know what the problem was.

    2. Anonymous Coward
      Anonymous Coward

      Re: Maybe they should use LDAP instead...

      "Would you get on an aeroplane if you knew it's fly-by-wire software was written by Microsoft?"

      Yes - probably a lower risk than anything Open Source....

  6. Anonymous Coward
    Anonymous Coward

    Translation from MS BOFH speak to English...

    "At first, we're told, engineers tried to revert the Group Policy Object change, and started a forced refresh of group policy across DNS server infrastructure. "

    Translation: They rebooted.

    "No improvement was observed, and so at 11pm UTC they rebalanced their DNS server infrastructure."

    Translation: They rebooted more stuff.

    "This helped, and at 11.15pm they executed a script to reboot the balancing of DNS servers."

    Translation: They rebooted the rest of their DNS servers.

    "As this propagated, things got better."

    Translation: The servers finished rebooting.

    1. Snapper
      Happy

      Re: Translation from MS BOFH speak to English...

      Call me old-fashioned, but isn't that..........

      Switching it off and switching it back on again?

    2. Michael H.F. Wilkinson Silver badge
      Joke

      Re: Translation from MS BOFH speak to English...

      A BOFH would have issued some non-maskable interrupts to the groinal area of those responsible. Next time (and I do not doubt there will be a next time) Azure and Office 365 fail, be on the look-out for heads of IT or beancounters showing signs of discomfort in said area.

    3. Anonymous Coward
      Coffee/keyboard

      Re: Translation from MS BOFH speak to English...

      I think they probably manually changed the GPO back and started issuing gpupdate /force on one server at a time.
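
      If so, it might have looked something like this rough sketch, using Invoke-GPUpdate rather than logging on to run gpupdate /force by hand (GroupPolicy module, made-up list of server names):

      # Sketch only, server names made up: force a policy refresh on each DNS server in
      # turn, pausing so each box can settle before the next one is touched.
      $dnsServers = Get-Content '.\dns-servers.txt'
      foreach ($server in $dnsServers) {
          Invoke-GPUpdate -Computer $server -Force -RandomDelayInMinutes 0
          Start-Sleep -Seconds 30
      }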

      BTW, don't use the escape key in Windows, because it either does nothing or deletes everything you've typed.

    4. vonRat

      Re: Translation from MS BOFH speak to English...

      gpupdate /force

      gpupdate /force

      gpupdate /force

      ah booger it

      alt-f4...restart

  7. Anonymous Coward
    Mushroom

    I lol...

    I had, admittedly, performed a similar cockup many years ago.

    While moving our firewall policies from one GPO to another, I had forgotten to disable the destination GPO before applying the new firewall rules. And because the base policy was to block all inbound and outbound connections by default (and also because I was rather slow in setting the new firewall rules up), quite a number of systems inherited the new GPO before it was fully configured and were thus left in a state where traffic in both directions was being blocked.

    Fixing that wasn't fun as I couldn't automatically force all the affected workstations to acquire a new policy since I had successfully bricked their network connections.

    The lesson learned, though, is that I now set up new GPOs as very-freaking-disabled, configure them, export to a test environment, test, and then enable them if all goes well in test.
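
    In GroupPolicy-module terms that habit looks roughly like this (GPO name and OU path made up):

    # Sketch, names/OU made up: create the GPO, link it with the link disabled,
    # configure and test it, and only then switch the link on.
    New-GPO -Name 'Firewall-Baseline-v2'
    New-GPLink -Name 'Firewall-Baseline-v2' -Target 'OU=Workstations,DC=corp,DC=example,DC=com' -LinkEnabled No
    # ... configure the firewall settings, export, test in the lab ...
    Set-GPLink -Name 'Firewall-Baseline-v2' -Target 'OU=Workstations,DC=corp,DC=example,DC=com' -LinkEnabled Yes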

    For Microsoft to have made a similar cockup though... hahaha.

    1. jjd90

      Re: I lol...

      Entrope,

      You hit the nail on the head. They pulled the proverbial ladder away.

      Things got better when they cleaned up a server at a time manually.

      At first the servers were probably overloaded ("things got better"), until such time as they got enough resources back online.

      Sorta a self-DoS.

      :-)

      J.

  8. Anonymous Coward
    Anonymous Coward

    Change control?

    Do Microsoft have a change control system that outlines what changes are going to be made, the associated risk analysis, and a rollback procedure in the event of a problem? Is this signed off by anybody? I would love to see it.

    1. Anonymous Coward
      Anonymous Coward

      Re: Change control?

      More to the point, why haven't they got a freeze on infrastructure updates on the same day as a major hardware launch? They know there are going to be hundreds of thousands of devices requesting updates. However technically stupid this DNS fuck-up is (and it sounds pretty fucking stupid), I can't help thinking management should probably take some of the blame as well...

      1. Pascal Monett Silver badge

        Re: management should probably take some of the blame as well

        Hah.

        It'll be a cold day in Hell indeed before anything like that happens.

  9. Anonymous Coward
    Anonymous Coward

    Remind me,

    How many root servers run MS DNS and are controlled by AD?

    Oh, I remember now: none. Maybe MS can learn from that.

    1. Pascal Monett Silver badge

      MS can learn from a number of things.

      Trouble is, MS never does learn. It just redefines the paradigm and calls it a feature.

  10. batfastad

    Ffffuuuuuuuuuu

    Their global infrastructure has to run (predominantly) on their software. It's a pretty big admission of defeat if it's not possible.

    But take a second and imagine that scale of global DNS and anything else, all being managed by AD/group policy. The only word that springs to my mind is... fffuuuuuuuuuuurrrrrggggghhhhhhhh!

    Still planning that Azure/Office365 migration?

  11. Catweazle666

    "Single point of failure"? Dear me.

    It seems "software engineers" have a lot to learn about the concept of 'graceful degradation'.

  12. jobst

    Why do you need Active Directory for DNS servers?

    I have absolutely no clue why you would employ Active Directory on DNS servers - it over-complicates the service. Do they really have to have such fine-grained setups for a service like DNS? Do they really need to provide DNS lookups for some clients while not providing them to other customers? Wouldn't it be just as efficient to have internal and external lookups (like BIND)?
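
    Just to illustrate the internal/external idea generically (made-up name and resolver addresses): the same name answering differently depending on which server you ask is all split-horizon really is, whether BIND views or something else provides it.

    # Sketch, name and addresses made up: compare the answer from an internal resolver
    # with the answer the public internet sees for the same record.
    Resolve-DnsName -Name 'mail.corp.example.com' -Server '10.0.0.53'   # internal view
    Resolve-DnsName -Name 'mail.corp.example.com' -Server '8.8.8.8'     # public view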

This topic is closed for new posts.
