You're flowing it wrong: Bad network route between Microsoft, Apple blamed for Azure, O365 MFA outage

Microsoft says last week's multi-factor authentication (MFA) partial outage, which hit its cloud-based services, was due to a dodgy network route between its servers and Apple's backend. According to a postmortem penned by the Azure team on Thursday this week, the whole thing kicked off at around 1330 UTC (0630 PDT) on Friday …

  1. Dan 55 Silver badge

    Design, we've heard of it

    Why aren't these elementary questions like "what do we do if APNs are down" being asked at design stage?

    1. Pascal Monett Silver badge

      I'm guessing because, at design stage, they're not supposed to go down, so nobody bothered to build a case for testing that.

      In truth, this cloud thingy is still pretty new and we're all learning the ropes. In a decade or two, when most of the bad situations have been encountered and resolved, we will then have a manual for proper design and rollout of a cloud infrastructure.

      Right now I think we're still feeling our way.

      1. Tilda Rice

        Pascal, I'm not used to reading fair, even-handed, reasonable assessments in this comment section. You've thrown me off a bit ;)

        They probably had 2 or 3 links that seemed resilient on paper, but real life has taught them they might need 5 or 10 <shrug> Fair play to them for being transparent about it.

      2. Dan 55 Silver badge

        First there was:

        Writing to disk -> What happens if I suddenly don't have permission/it's full/it disappears.

        Then there was:

        Connecting to the database -> What happens if it's down or won't let me connect?

        So really, this is not beyond the bounds of imagination:

        Connecting to an online service -> What happens if it's down or won't let me connect?
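
        The same question can be asked in code. What follows is only a minimal sketch, not how any real MFA service works: `send_challenge`, `push_notification` and `one_time_code` are hypothetical names made up for illustration. The point is that a failed remote call is treated as an expected outcome with a planned alternative.

```python
from typing import Callable, Sequence


class DeliveryError(Exception):
    """Raised when a challenge could not be delivered by any channel."""


def send_challenge(user: str, channels: Sequence[Callable[[str], str]]) -> str:
    """Try each delivery channel in order and return the first success.

    Each channel takes a user name and returns a short description of how
    the challenge was delivered, or raises an error if it could not.
    """
    failures = []
    for channel in channels:
        try:
            return channel(user)
        except (DeliveryError, TimeoutError, ConnectionError) as exc:
            # Record the failure and move on instead of blocking the login.
            failures.append(f"{channel.__name__}: {exc}")
    raise DeliveryError("all channels failed: " + "; ".join(failures))


# Hypothetical channels, for illustration only.
def push_notification(user: str) -> str:
    # Simulates the remote push service being unreachable.
    raise ConnectionError("push service unreachable")


def one_time_code(user: str) -> str:
    return f"one-time code offered to {user}"


result = send_challenge("alice", [push_notification, one_time_code])
print(result)
```

        Whether a fallback factor is acceptable is a security decision as much as a design one; the sketch only shows that the "what if it's down" branch has to exist somewhere.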

      3. Maximum Delfango

        "this cloud thingy is still pretty new" - not really. I believe that networks, and the general ability to connect computers together, have been around for some decades now. The new bit about the cloud is the poor standard of developers who bash all this stuff together.

      4. Martin M

        Not really, distributed computing has been around for a while. Deutsch et al’s eight fallacies of distributed computing were written in 1994-1997 and are all applicable to a cloud world - https://en.m.wikipedia.org/wiki/Fallacies_of_distributed_computing

        “The network is reliable” is fallacy #1. Not considering failure cases properly for a service run by a different organisation at the other end of a WAN is pretty ridiculous. Experienced designers/architects don’t trust the network between two of their own services in neighbouring racks in a physical datacentre.

        (Side note: Anyone running a service in a cloud should assume they’re going to eventually see some truly weird failure conditions given the multiple levels of compute and network virtualisation stacked atop each other and run on heterogeneous no-name hardware. If your application’s designed and monitored correctly, it shouldn’t matter that much.)

        I can understand why problems connecting to APNs would cause problems messaging and hence authenticating iOS users. What is harder to understand is why a backlog formed and caused further problems. Keeping a long backlog of remote API requests (or doing unbounded retries etc.) which are irrelevant after a few tens of seconds because they are feeding into an interactive system is not a desirable property...
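
        One way to avoid that kind of backlog, sketched under assumptions: bound the queue and give every request a time-to-live, so anything the user has long since given up on is dropped rather than retried. `ExpiringQueue`, the 30-second TTL and the injectable `clock` are all hypothetical choices for illustration, not anything described in the postmortem.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, Tuple


class ExpiringQueue:
    """A bounded FIFO queue that discards requests older than a TTL."""

    def __init__(self, ttl_seconds: float, max_len: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # A deque with maxlen silently drops the oldest entry when full,
        # so the backlog can never grow without bound.
        self._items: Deque[Tuple[float, str]] = deque(maxlen=max_len)

    def put(self, request: str) -> None:
        self._items.append((self._clock(), request))

    def get(self) -> Optional[str]:
        """Return the oldest request that is still fresh, or None."""
        now = self._clock()
        while self._items:
            enqueued_at, request = self._items.popleft()
            if now - enqueued_at <= self._ttl:
                return request
            # Otherwise the request is stale: drop it and keep looking.
        return None


# Demo with a fake clock so the behaviour does not depend on real time.
now = [0.0]
q = ExpiringQueue(ttl_seconds=30, max_len=2, clock=lambda: now[0])
q.put("login-1")      # enqueued at t=0
now[0] = 25.0
q.put("login-2")      # enqueued at t=25
now[0] = 40.0
first = q.get()       # login-1 is 40s old and is dropped; login-2 is 15s old
print(first)
```

        The same idea applies to retries: a retry budget tied to the interactive timeout means the worker stops working on prompts nobody is waiting for.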

      5. o p

        This is not cloud in general, this is Microsoft. I have never had an MFA failure with Google or AWS in the last 5 years, and it's enforced on all our accounts. And every time I read this kind of article I know Azure and 358 are not even worth considering.

        1. Oh Matron!

          Indeed. APNS has been around longer than most push notification services and has been quite robust. Apple are PITAs whenever you ask for IP address ranges for such things, and they just tell you to allow either 17.0.0.0/8 or the relevant domain. For good reason. A shame Microsoft can't get its act together...

    2. Maximum Delfango

      Re: Design, we've heard of it

      Why aren't these elementary questions like "what do we do if APNs are down" being asked at design stage?

      They don't even bother to ask ... "hey, just how would this work somewhere where they don't have the world's best fibre"?

    3. Anonymous Coward

      Re: Design, we've heard of it

      I think that Microsoft ignored Cloud for too long. AWS had a 7-year headstart before Microsoft realised Cloud was a direction they wanted to go. At that point they had to build a cloud platform super quickly and rushed to try to expand to keep up.

      Case in point, Availability Zones are a new Azure thing and they only cover a subset of Azure services and are only available in some regions. Amazon had the luxury to design correctly from the start without competition, and all their services are built on Availability Zones.

      AWS was 7x more reliable than Azure last year. I am sure Microsoft will catch up, but they will have hiccups along the way while they sort it out and fix the gaps they have.

      1. swm Bronze badge

        Re: Design, we've heard of it

        AWS had Amazon as a (sole) customer until they spun it off. This made AWS robust for the needs of their customer so when they went public most of the edge cases had already been experienced and addressed.

      2. The Dark Side Of The Mind (TDSOTM)

        Re: Design, we've heard of it

        Microsoft is pretty adept at ignoring trends. MSFT ignored the commercial Internet and spun out their own dubious and vague Microsoft Network that almost nobody used, only to rush to patch things up a lot later in the game, making a mess in the process. MSFT also ignored the clear and open standards for the web and birthed the most despised lineage of browsers ever. They also missed the start on mobile connected devices (even though their WinCE worked surprisingly well in many cases) only to shoot themselves in the foot with Windows 8/RT and enslavement of Nokia to the Evil Empire. They mocked the Open Source movement and achievements for decades only to embrace them towards the end of the second decade of this century...

        They are changing their coats these years, but the nasty habits are still there. They have enough glamour and business prowess to attract some brilliant people from time to time, though (Sysinternals comes to mind first, but for sure there are many unsung heroes in the pot).

  2. Steve Davies 3 Silver badge

    I would imagine

    That episodes like this will make Apple reconsider using 3rd parties to host their cloud services from now on.

    These pretty regular outages are nothing for MS to be proud of.

    1. Giovani Tapini Silver badge

      Re: I would imagine

      Although this is an Apple problem that took Microsoft down, it relates to Apple users consuming Microsoft services, not a corporate link at all.

      However, MS should still not be proud...

  3. viscount

    Not clear to me why a link to Apple being down stops MFA for other Office folks.

    1. Giovani Tapini Silver badge

      I refer you to the article itself

      A service loss at Apple caused too much buffering of traffic leading to packet loss and service degradation - more or less...

  4. OGShakes

    100% utilised redundant links

    This sounds like an outage I had. We had 2 links to our hosted 'cloud' telephony system provider, one from each building, with a route between the buildings so that if one link dropped, all the traffic would route over the other. What we did not realize (as the provider said they monitored all this) was that both links were running at around 80% utilization, so when one dropped (thanks, British Gas, for digging up the cable) we started dropping calls all over the place in both buildings.

    It took them 2 days to admit they had not noticed the high utilization and this was the cause, after making us go through every single switch and check the QoS was as it should be...
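
    The arithmetic behind this is worth spelling out. Below is a small sketch with hypothetical function names, assuming equal-capacity links and that all traffic from a failed link lands on the survivors. Capacities are in arbitrary units because the link speeds were not stated.

```python
from typing import List


def failover_load(utilisation: List[float], capacities: List[float],
                  failed: int) -> float:
    """Utilisation of the surviving links if link `failed` drops.

    Assumes all traffic carried by the failed link is rerouted onto the
    remaining links. A result above 1.0 means demand exceeds capacity.
    """
    total_traffic = sum(u * c for u, c in zip(utilisation, capacities))
    surviving = sum(c for i, c in enumerate(capacities) if i != failed)
    if surviving == 0:
        return float("inf")
    return total_traffic / surviving


def max_safe_utilisation(link_count: int) -> float:
    """Highest per-link utilisation that survives one failure,
    for `link_count` equal links sharing load evenly."""
    return (link_count - 1) / link_count


# Two equal links, each running at 80%, and link 0 fails.
load_after_failure = failover_load([0.8, 0.8], [1.0, 1.0], failed=0)
print(f"surviving link would need {load_after_failure:.0%} of its capacity")
print(f"safe ceiling per link with 2 links: {max_safe_utilisation(2):.0%}")
```

    With two links the pair is only redundant while each stays below half of its capacity, so 80% on both means the second link is carrying load, not providing a spare.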

    1. Giles C

      Re: 100% utilised redundant links

      Basic network monitoring needed here.

      SolarWinds

      MRTG

      ManageEngine

      None of these is that expensive, and any of them would alert you to potential problems like this.

      I once had a 1gb link that constantly ran around 99%; we got it upgraded to 10gb, which then ran at 50%. Things were a bit quicker afterwards. We knew what it was doing thanks to the SolarWinds monitoring we were using.

      Besides, management like this sort of thing because it alerts without needing someone sitting there constantly monitoring services.

      1. Anonymous Coward

        Re: 100% utilised redundant links

        The company I work for has a product that normally alerts us when O360 goes down before users notice...

        Enterprise Architecture 101. What's that? technical debt? Architectural runways? Not in this agile world, me laddo.

  5. Anonymous Coward

    Monitoring?

    A tiny VM or service at each end and a ping every 10 minutes (following the same route!): how much would it cost?
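
    Not much, going by a rough sketch. Note the swap: this uses a TCP connect rather than an ICMP ping, since raw ICMP usually needs elevated privileges. `tcp_probe`, `monitor`, the two-failure threshold and the `rounds`/`sleep` parameters are hypothetical, and nothing here guarantees the probe follows the same route as production traffic.

```python
import socket
import time
from typing import Callable, Optional, Tuple


def tcp_probe(host: str, port: int, timeout: float = 5.0) -> Tuple[bool, float]:
    """Attempt a TCP connection; return (reachable, seconds taken)."""
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        ok = False
    return ok, time.monotonic() - started


def monitor(probe: Callable[[], Tuple[bool, float]],
            alert: Callable[[str], None],
            interval_seconds: float = 600.0,
            max_failures: int = 2,
            rounds: Optional[int] = None,
            sleep: Callable[[float], None] = time.sleep) -> None:
    """Run `probe` every `interval_seconds`; call `alert` once when
    `max_failures` consecutive probes have failed."""
    consecutive_failures = 0
    completed = 0
    while rounds is None or completed < rounds:
        ok, _elapsed = probe()
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures == max_failures:
            alert(f"{consecutive_failures} consecutive probe failures")
        completed += 1
        if rounds is None or completed < rounds:
            sleep(interval_seconds)


# Demo with a fake probe that is always down, and no real sleeping.
alerts = []
monitor(probe=lambda: (False, 0.0), alert=alerts.append,
        rounds=3, sleep=lambda seconds: None)
print(alerts)
```

    In real use the probe would be something like `lambda: tcp_probe("far-end.example", 443)`, with the host name being a placeholder, and the alert would go to a pager rather than a list.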


Biting the hand that feeds IT © 1998–2019