Microsoft reveals terrible trio of bugs that knocked out Azure, Office 362.5 multi-factor auth logins for 14 hours

Microsoft has delivered its postmortem report detailing the failures that led to unlucky folks being unable to log into its cloud services for 14 hours last week. Redmond said on Monday this week that there were three separate cock-ups that combined to cause the cascading mess that left Azure and Office 363 users unable to …

  1. Anonymous Coward

    This hit us bad.

    The company I work for switched from locally hosted Exchange and Notes to Office 365 specifically on claims it was more reliable and faster (coincidentally, they also could lay off about 12 sysadmins). For the most part it works, though with much greater latency (this is a rural area). But when there is an emergency where multiple departments need to coordinate, it has proven useless on multiple occasions. When we have to enter data regarding an emergency almost in real time, we need 24/7 uptime. We're communicating at all hours. I work 0000-0800 CDT and need to know that my superior knows what happens before 0500. For the past couple of years, while everything has been sent through Office 362.5, the entire emergency services department I work in has defaulted to Apple's iMessage for sharing secured emergency info (which also meant everyone needed an iPhone) to make sure it actually gets through. This is entirely unofficial, of course, and despite the Idiot Tax involved, it's saved lives and likely millions of dollars since I've been employed here. Sometimes "It just works" is just what we need.

  2. A.P. Veening

    Cheaper, faster, more reliable

    "claims it was more reliable and faster (coincidentally, they also could lay off about 12 sysadmins)."

    Cheaper, faster, more reliable: Pick any two out of three.

    Hint: Neither faster nor more reliable is actually guaranteed.

  3. Anonymous Coward

    Re: This hit us bad.

    Cheaper yes, faster no, better no.

  4. Ben1892

    Re: This hit us bad.

    You're using email for Emergency services, is your name Moss by any chance?

    https://www.youtube.com/watch?v=EzRFoO8wSVA

  5. DJO Silver badge

    Re: This hit us bad.

    Cheaper yes

    You wait until the majority of users are 100% committed and returning to in-house would be almost impossible due to "letting go" the staff with the necessary skills.

    Random CIO, Approx April 2020: "Ohh look, MS just jacked up the 365 subscriptions"

  6. Antron Argaiv Silver badge
    Alert

    Re: This hit us bad.

    Cheaper? ...for now

    Faster? (does anyone else remember diskless workstations, and before them, diskless X-terminals?)

    More reliable? From Microsoft? Surely, you jest...

    In theory, computing in the cloud should just work...multiple redundant servers, load balancing, unlimited storage and blindingly fast speed -- all those goodies. And The Internet hardly ever goes down, right?

    Thankfully, my company has not yet converted to O362.5, but the indications are that it will eventually happen -- Microsoft will force us to.

    We also have (an outstanding) in-house IT staff, but have recently been bought by a much larger corporation. As long as they don't outsource IT support, we'll probably be OK...

  7. Anonymous Coward

    Re: This hit us bad.

    No different to them hiking up Exchange licensing, or Windows licensing whenever they want. So what you gonna do? Stick to Pegasus Mail?

    We had one guy who did Exchange, and we have one guy who does 365, the same guy...

    Mail is one of the no-brainer things to SaaS; on-prem was a pain in the arse. TBs of storage, TBs more to back up...

  8. bombastic bob Silver badge
    FAIL

    DDoS'able logins - who'd a thunk it?

    Seems to me that having a login system that is _SO_ inefficient, and SO reliant on a single "provider", that a 30 second timeout on a login token is sufficient [under the right conditions] to create RACE conditions and other 'token expiration' related problems, that maybe... JUST maybe... the entire design needs to be COMPLETELY re-thought.

    All eggs: one basket. Yeah, THAT isn't a recipe for FAIL !!!

    It's COMPLETELY DDoS'able, as it only took "everyone flushing at once" (more or less) to cause the system to 'overflow' heh heh heh. Must've been REALLY fun in the basement bathrooms.

    MSDN has a somewhat 'paranoid' security model as well, one that expires a download link after about 4 hours. This means that very very large files over moderate connection speeds CAN NOT COMPLETE DOWNLOADING. When Micro-shaft's IIS servers did NOT follow the RFCs (a couple of years back), you couldn't even pick up where you left off - it was 'start from the beginning again' every time. Fortunately, they fixed that last part, eventually... [making it usable again with proper browser plugins or through-the-hoop jumping].

    NOW they're "at it again" with their "all eggs, one basket" approach to logins, and unrealistically short timeout periods on the tokens, not allowing for very busy networks, slow connections, or DDoS attacks.

    Wheeeee.

    This reminds me of a computer back in the late '70s that had an old-style 12" floppy drive connected to a serial terminal (access via serial and control chars on a shared serial line at 1200 baud). A grad student wrote an application in BASIC that allowed you to store things on it [inefficiently]. But if the mini-computer had more than a handful of users on, when you tried to retrieve your stored files, you'd get buffer overruns and lost data. Often it was COMPLETELY unusable. I wrote a new version in assembly language that had proper buffering [and an actual file system on the disk]. I'd ask the drive for ONLY a track at a time (not 'flood me with all at once'), which fit nicely into the mini-computer's serial buffer, and no data was lost, even if the system was THRASHING because of too many users.
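The track-at-a-time idea above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the buffer capacity, the track size and all class and function names are hypothetical, and the original was assembly on a mini-computer, not Python. It contrasts a sender that floods a fixed-size receive buffer with a reader that requests one track, drains it, and only then asks for the next.

```python
from collections import deque

TRACK_SIZE = 256  # hypothetical track size, chosen to match the buffer capacity


class SerialBuffer:
    """Fixed-capacity receive buffer; bytes arriving while it is full are dropped."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = deque()
        self.dropped = 0

    def receive(self, chunk):
        for byte in chunk:
            if len(self.data) < self.capacity:
                self.data.append(byte)
            else:
                self.dropped += 1  # buffer overrun: this byte is lost

    def drain(self):
        out = bytes(self.data)
        self.data.clear()
        return out


def read_flood(tracks, buf):
    """Push model: the drive sends everything before the reader drains anything."""
    for track in tracks:
        buf.receive(track)
    return buf.drain()


def read_track_at_a_time(tracks, buf):
    """Pull model: request one track, drain it, then request the next."""
    out = bytearray()
    for track in tracks:
        buf.receive(track)
        out += buf.drain()
    return bytes(out)


tracks = [bytes([i]) * TRACK_SIZE for i in range(8)]

flood_buf = SerialBuffer(TRACK_SIZE)
flooded = read_flood(tracks, flood_buf)

paced_buf = SerialBuffer(TRACK_SIZE)
paced = read_track_at_a_time(tracks, paced_buf)
```

In the flooded case only the first track survives and the rest is counted as dropped; in the paced case the buffer never holds more than one track, so nothing is lost however slowly the reader gets scheduled.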

    Anyway...

  9. Anonymous Coward

    Re: DDoS'able logins - who'd a thunk it?

    "Seems to me that having a login system that is _SO_ inefficient,"

    I don't see any evidence it's not efficient. It must cope with tens of millions of concurrent logins.

    "SO reliant on a single "provider""

    Well, that's cloud for you. Or your own, for you. Not many people use two solutions for 2FA.

    "that a 30 second timeout on a login token is sufficient [under the right conditions] to create RACE conditions"

    It was a bug, it happens.

    "All eggs: one basket. Yeah, THAT isn't a recipe for FAIL !!!"

    Sure, but email is not a BC1 application for most companies. You can spend more keeping it on prem if you need to.

    "It's COMPLETELY DDoS'able"

    As above this was a bug. And it probably isn't externally DOSable as you need to authenticate to get access to 2FA.

    "MSDN has a somewhat 'paranoid' security model as well, one that expires a download link after about 4 hours. This means that very very large files over moderate connection speeds CAN NOT COMPLETE DOWNLOADING. "

    Not true - links only expire to initiate new downloads - they don't kill downloads in progress.
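For readers wondering what "pick up where you left off" involves on the client side, here is a minimal sketch using the standard HTTP Range request header. It assumes the server honours range requests; the URL, file name and function name are hypothetical, and nothing here describes how MSDN's download links actually behave.

```python
import os
import tempfile
import urllib.request


def build_resume_request(url, partial_path):
    """Build a request that asks only for the bytes not yet on disk."""
    already = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    req = urllib.request.Request(url)
    if already > 0:
        # Open-ended byte range: everything from offset `already` to the end.
        req.add_header("Range", f"bytes={already}-")
    return req, already


with tempfile.TemporaryDirectory() as tmp:
    url = "https://example.com/image.iso"            # hypothetical URL
    partial = os.path.join(tmp, "image.iso.part")    # hypothetical partial file
    with open(partial, "wb") as fh:
        fh.write(b"\0" * 1024)                       # pretend 1 KiB already arrived

    req, offset = build_resume_request(url, partial)
    range_header = req.get_header("Range")

    fresh_req, fresh_offset = build_resume_request(url, os.path.join(tmp, "missing.part"))
    fresh_header = fresh_req.get_header("Range")
```

A server that supports ranges answers such a request with 206 Partial Content and the client appends to the partial file; a plain 200 means the whole file is being sent again and the client has to start from the beginning.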

  10. Picky

    Standard Operating Procedure

    They called the hell-line ..

    Microsoft would eventually solve the problem by turning the servers off and on again after applying mitigations.

  11. Mario Becroft
    FAIL

    What's in the fine print?

    How many 9's of uptime are you promised as a paying commercial MS customer? What is the recourse if they fail? This is basic due diligence you would do on any vendor. I wonder if MS etc. are perceived as "too big to fail," and the (arguable) convenience of SaaS leads organizations to move to services that simply have no assurance of service level.

  12. katrinab Silver badge
    Windows

    Re: What's in the fine print?

    I read the SLA. They promise one 9 of uptime (95%), so it should be renamed Office 347.

  13. TheVogon Silver badge

    Re: What's in the fine print?

    "How many 9's of uptime are you promised as a paying commercial MS customer?"

    In general, three 9s - for basic services and Office 365. Four 9s for certain cross availability zone clusters, etc.

    "What is the recourse if they fail?"

    Service credits.
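To put the "nines" in this sub-thread into hours, here is a rough back-of-the-envelope sketch. It assumes availability is measured over a 365-day year; real SLAs often measure per month and define downtime more narrowly, so treat the numbers as illustrative only. The function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability, period_hours=HOURS_PER_YEAR):
    """Downtime budget, in hours, implied by an availability fraction."""
    return period_hours * (1 - availability)


def equivalent_days(availability, days=365):
    """How many days of a year the service is 'up' at this availability."""
    return days * availability


three_nines = allowed_downtime_hours(0.999)    # roughly 8.76 hours per year
four_nines = allowed_downtime_hours(0.9999)    # roughly 0.88 hours per year
office_days_at_95 = equivalent_days(0.95)      # 346.75, hence the "Office 347" joke

outage_hours = 14  # the outage length given in the article headline
blows_three_nines_budget = outage_hours > three_nines
```

On these assumptions a single 14-hour outage already exceeds a full year's budget at three nines.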

  14. Dwarf Silver badge

    Yet more proof

    1. Code is riddled with bugs

    2. Insufficient in house testing

    3. Users are guinea pig testers

    4. Even with all that telemetry, they can’t see what’s happening

    I suppose somewhere, this will be used as a reason for yet more telemetry, rather than a review for better coding and testing methods.

    I wonder what the cost was to all the businesses affected by the outage ?

  15. LDS Silver badge
    Devil

    " Even with all that telemetry, they can’t see what’s happening"

    And do you believe they will address 2) or 3)?

    No, they will address only 4) adding even more telemetry - until latency in the telemetry cache will create race conditions in the telemetry processes that will bring down the whole system....

  16. Michael Habel Silver badge

    Re: Yet more proof

    I'm fairly sure any sane company out there has stuck with its local copy of Oriface, and is likely using Oriface362.8 as a collaborative (out of the office) effort. And those who are thinking in the short term of "Well, what do we need those twelve admins for?" will eventually find out...

  17. boltar Silver badge

    Re: Yet more proof

    5. Cloud services are not the always available panacea that various snake oil salesmen make them out to be.

    But then if a PHB actually understood technology he wouldn't be a PHB. Catch 22.

  18. N2 Silver badge

    Scaleable?

    Clearly not in its current manifestation.

  19. DJV Silver badge

    Re: Scaleable?

    Only the outages are scalable!

  20. Anonymous South African Coward Silver badge

    Reminds me of when OS/2 on a single CPU totally outperformed NT running on a quadprocessor setup.

  21. Anonymous Coward

    Reminds me of when OS/2 on a single CPU totally outperformed NT running on a quadprocessor setup.

    Nowadays that baton has been taken over by Linux. I cannot believe we're wasting so much processing power on, well, sh*t.

  22. Michael Habel Silver badge

    They had cores in the '90s?! Single, yeah, to be sure, but I'm gonna presume you meant multi-socketed procs running in SMP instead.

  23. Anonymous Coward

    >>Nowadays that baton has been taken over by Linux.

    You realise that Windows and Windows Server have tended to beat Linux in performance benchmarks for most things for years now?

  24. Anonymous Coward

    "Nowadays that baton has been taken over by Linux."

    No kidding. My desktop is the same machine I used to run WinXP on. It might run Win7, couldn't possibly run Win10, but is zipping along with Ubuntu 14.04 and the latest Chromium, Thunderbird, Firefox, and LibreOffice, not to mention my collection of Windows games under Wine.

  25. HighTension

    Yes, all those Windows supercomputers in the Top500 sure are impressive!

  26. Martin hepworth

    3 root causes???

    No.. a single root cause: an overloading of the service which wasn't gracefully handled by separate cascading systems....

  27. Jay Lenovo
    Facepalm

    Re: 3 root causes???

    Dumb and Dumber to Dumberer

    (Harry and Lloyd found new employment).

    We're left tripping all over that mess, but like the sequel, not very funny.

  28. This post has been deleted by its author

  29. Anonymous Coward

    Re: 3 root causes???

    "Yes, all those Windows supercomputers in the Top500 sure are impressive!"

    Yes, not bad for just running a script on a cloud:

    https://www.top500.org/site/50454

    You don't see many supercomputers running Windows these days though - not because of any scalability limitation - it certainly scales and tends to beat, say, Linux on wide-band, low-latency performance with the likes of Mellanox interconnects or 100GbE networking. The main reason is that it's licenced per core, and top end supercomputers can have more than 10 million cores!

  30. Anonymous Coward

    ha ha ha

    Thought the whole fucking point of cloud was scalability...

    First question that pops up is why didn't Azure simply add more backend servers automatically? Thought this was the whole point of that shit..

  31. Version 1.0 Silver badge

    Re: ha ha ha

    It worked, the outage scaled very well. I could go on with the old Claude Rains joke but let's face it - when you hand all your data off to another company, what do you think will happen? Are they going to be concerned with maximizing their profits or yours?

    Edit: Damn Autocorrect.

  32. Anonymous Coward

    Reminds me of the Gerard Hoffnung "Bricklayer's Lament" monologue. Unfortunately on this recording the audience were anticipating the story and kept interrupting the flow with laughter.

  33. steelpillow Silver badge
    Happy

    "Unfortunately on this recording the audience were anticipating the story and kept interrupting the flow with laughter."

    Oh, some of us have been doing the same with Microsoft for a very long time.

  34. Dan 55 Silver badge

    In other news...

    Windows 1809 breaks Win32 apps, Windows Media Player and the iCloud client. MS has decided machines with the iCloud client installed won't receive the 1809 update for the moment, so you know what to do...

  35. Michael Habel Silver badge
    Trollface

    Re: In other news...

    I think the cure may be worse than the disease. Besides, I don't know any iTards who'd be using their JebusPhones / MaxiPads to make iMessage pay its own rent.

  36. regadpellagru
    Joke

    the gaps in telemetry ...

    In a Microsoft article, World has gone banana !

  37. steelpillow Silver badge
    Megaphone

    System engineering

    This is what happens when you don't do your system engineering properly before rollout. Every part of this multifaceted crap was foreseeable, testable and hence avoidable.

  38. Vulture@C64

    I used to be a Windows Server advocate. On the whole it's been very stable (and easy to manage), from NT351 and NT4 through 2008R2, 2012 and now 2016 etc, but MS have ruined it now - telemetry, the update process, the memory it requires has increased despite MS saying it's decreased, and the CPU resource it takes has also increased.

    Whilst Centos 7 has matured and developed into a fantastically stable OS, rock solid, fast, needs very few resources and has also become more manageable with a range of tools - the manageability of it was what put me off years ago.

    Microsoft are ignoring the very things which made them useful and leaving the door open to Linux to walk right in . . . how many new builds are now done on Windows ? None that I know of. Same with SQL Server - was a great product but cost is massive now on SPLA so PostgreSQL it is - another tick in the enterprise box.

    Bye Microsoft . . . it's been fun :)

  39. oldcoder

    That is part of the problem when you artificially tie so many things together...

    You can no longer separate them to reduce loads...

    All that happens is the load keeps getting bigger and bigger - with more and more bugs that can't be fixed without breaking the entire thing.

  40. Michael Habel Silver badge

    Perhaps this is the point? I was under the impression that under a post-Ballmer Microsoft, they (i.e. Microsoft) LOVE Linux now?

  41. Steve Foster
    Trollface

    @oldcoder

    OC, it's not entirely clear from your post, are you talking about Microsoft now, or systemd?

  42. Don Pederson

    The title references Office 362.5 and in the article it refers to it as Office 363 and 364. Need a proofreader?

  43. druck

    I just think it emphasises the unreliability.

  44. Cavehomme_

    The beginning of the end...

    ...goodbye MS, you’ve shot yourselves in your feet. Muppets.

  45. Fatman
    WTF?

    Re: The beginning of the end...

    <quote>...goodbye MS, you’ve shot yourselves in your feet nuts. Muppets.</quote>

    FTFY!!!

  46. steviebuk Silver badge

    I'm a bit slow

    I assume the 362.5 then 363 then 364 was a piss take?

    And I'll end with my usual. "But the cloud never fails, it will cleanly fall over to the next data centre. We need to be infrastructure free. It will save thousands because the cloud costs nothing".

  47. Steve Foster
    Holmes

    Re: I'm a bit slow

    "I assume the 362.5 then 363 then 364 was a piss take?"

    Oh yes.

    cf. previous articles on Microsoft cloud outages (there are too many to cite individually, of course).

  48. DuchessofDukeStreet

    Re: I'm a bit slow

    I was just impressed it was up all the way to 364...

  49. Sir Runcible Spoon Silver badge

    Re: I'm a bit slow

    If we adopt the 'days since last incident' approach, isn't it like O7 or something?

  50. BeerTokens

    Re: I'm a bit slow

    Can we have this as a banner on the reg homepage please?

Biting the hand that feeds IT © 1998–2018