Microsoft Azure Europe embraced the other GDPR: Generally Down, Possibly Recovering

Microsoft Azure tumbled over in northern Europe – and services have effectively stayed down for unlucky customers for around five hours. The disruption was caused by problems within the cloud platform's storage and networking systems, we're told. Today, from 1744 UTC to now, at time of writing 2115 UTC, according to the …

  1. Anonymous Coward
    Anonymous Coward

    More Microsoft CloudFog

    It's been said many times before on here.... Cloud means someone else's computer. But you can't deny Cloud convenience is great when everything is working... However you've got to wonder, how many *if any*, senior executives jumping on the Cloud conveyor belt, planned for days like this...

    1. Anonymous Coward
      Anonymous Coward

      Re: More Microsoft CloudFog

      None of them - moved on to next gig by then most likely

      1. Steven Raith

        Re: More Microsoft CloudFog

        The plan was to be able to blame someone else.

        Not their servers, are they? So not their fault. Bonus please!

        Steven R

        1. Phil W

          Re: More Microsoft CloudFog

          "The plan was to be able to blame someone else."

          Yes that usually is the plan, and both IT management and senior management tend to be on board with the idea that not having any control or ability to fix it yourself during an outage is acceptable, right up until there actually is an outage at which point senior management shout at IT management to fix it and IT management shout at the people who actually know what they're doing to fix it, and we say "Sorry boss, but you took it out of our control".

    2. TheVogon Silver badge

      Re: More Microsoft CloudFog

      "Engineers have identified the root cause, and actively working to mitigate the issue. Our telemetry has shown improvement, and next update will be provided in 60 minutes or as events warrant."

      1. hplasm Silver badge
        Happy

        Re: More Microsoft CloudFog

        " next update will be provided in 60 minutes or not..."

    3. Scott Marshall

      Re: More Microsoft CloudFog

      They planned for it by creating BaaS - Blame As A Service.

    4. Fungus Bob Silver badge

      Re: More Microsoft CloudFog

      "you've got to wonder, how many *if any*, senior executives jumping on the Cloud conveyor belt, planned for..."

      Planned? You have some rather strange ideas about life in the corporate world...

  2. Anonymous Coward
    Anonymous Coward

    Cloudy, with a chance of outages

    What do you do when it rains in the cloud?

    1. Anonymous Coward
      Anonymous Coward

      Re: Cloudy, with a chance of outages

      Get pneumonia(?)

  3. Hans 1 Silver badge
    Coffee/keyboard

    Who can tell MS not to install updates on all servers in a tier at once?

  4. Anonymous Coward
    Anonymous Coward

    MTBF vs Blast Radius

    AWS, GCP and Azure are all more reliable and more secure than a traditional data centre but that doesn't mean you don't need a DR plan. Well Architected cloud applications can survive failures.

    The difference with public cloud is that if it has an outage it's major news due to its blast radius, which is not usually true when one of your own DCs has an issue.

    1. Remy Redert

      Re: MTBF vs Blast Radius

      Pretty much. If you must run in the cloud, make sure you're spread across multiple cloud providers.

      Fortunately their platforms are all highly interoperable to make this kind of safe arrangement much easier to achieve. Right?

      1. Jack of Shadows Silver badge

        Re: MTBF vs Blast Radius

        No, they aren't directly interoperable although there's a few players out there that can build across providers. However, if you are doing your design correctly, everything should be loosely-coupled which mitigates things a bit. The problem, as I see it, is that they're holding it wrong: the providers and the clients.

        1. AMBxx Silver badge

          Re: MTBF vs Blast Radius

          Azure gives you the option for geographic redundancy. Easy for storage and cold standby, I'm not sure how complex it becomes for VMs (not my thing really).

          Multiple providers not likely to be the solution - you're just adding complexity.

          1. Anonymous Coward
            Anonymous Coward

            Re: MTBF vs Blast Radius

            > Azure gives you the option for geographic redundancy.

            It does, but many cost benefits of moving to the cloud are wiped out once you add the redundancy options.

            Move to the cloud for any two (and possibly just one) of cost saving, scale on demand and redundancy.

          2. Anonymous Coward
            Anonymous Coward

            Re: MTBF vs Blast Radius

            "Azure gives you the option for geographic redundancy."

            Compliance with local legislation, such as GDPR, may limit your options for geographic redundancy.

            1. AMBxx Silver badge

              Re: MTBF vs Blast Radius

              >> Compliance with local legislation, such as GDPR

              There is no end of centres to choose from. Even in the UK there are 3.

              1. Anonymous Coward
                Anonymous Coward

                Re: MTBF vs Blast Radius

                Not all services are available in all regions and some fundamentals may only be delivered from the more mature regions but yes, you can deliver within the EEA if you do the work. I'd be reluctant to deploy in the UK until it becomes clear whether it can provide data adequacy post Q1 2019.

    2. johnnyblaze

      Re: MTBF vs Blast Radius

      I highly question that comment. Sure, cloud providers have distributed networks with umpteen layers of protection and resilience, but when they go down, and they do very regularly, they go down spectacularly and take down thousands of customers. On prem solutions may seem 'old hat' now, but a well designed system can be rock solid and reliable, and I'd take a good one over cloud any day of the week.

    3. teknopaul Bronze badge

      Re: MTBF vs Blast Radius

      Where I work we have never had five hours of down time ever.

      1. PM from Hell
        Devil

        Re: MTBF vs Blast Radius

        That's the most idiotic thing I have ever seen anyone publish. Most of us will never go further than stating that "things are OK" or there has not been a major incident for "a while" for fear of offending the gods of hardware or the demons of database recovery. Good luck with your weekend outage by the way & I hope you have the phone numbers for a 24 hour pizza delivery service.

  5. gerdesj Silver badge

    Title

    Wouldn't it be nice if threads that had a title with an OS in it had a no Anon rule?

    Has anyone really bothered to value their el Reg karma value?

    Does anyone really give a shit?

    1. Anomalous Cowturd
      Unhappy

      Re: Title

      Yes, no, and no.

      My silver karma badge fell off when I stopped posting regularly, quite some time ago. This site has gone downhill / changed in the past two years.

      I blame Sarah, for leaving.

      1. Inspector71
        Devil

        Re: Title

        I blame Eadon.

    2. ratfox Silver badge

      Re: Title

      Some employers know that their employees are posting on the Reg. The employees would like to freely comment, even when they disagree with their employer.

      1. Anonymous Coward
        Anonymous Coward

        Re: Title

        ... Which is why this Anon posts as such. 'plausible deniability'.

    3. Anonymous Coward
      Anonymous Coward

      Re: Title

      Has anyone really bothered to value their el Reg karma value?

      Well, I'm +50k up, 9k down. I reckon I can afford to piss people off, but I stay anon because I don't want badge or name to distract from the contents. I know why some people are anon: it's either that or they would not be able to comment and given that such people are generally of a level that their input is valuable I prefer they stay anon.

      Besides, you know that any comment can be used against you when you cross the US border, right? You may have the right to silence, but that's pointless if you have already spoken online..

      1. Anonymous Coward
        Anonymous Coward

        Re: Title

        If you look at "my posts" you see your anonymous posts. Anyone who thinks posting AC makes them secure at the US border is not a valuable poster.

        In fact I call BS on your "most ACs are of a level that their input is valuable". Most ACs I see are posting anonymous snark about MS but don't use MS products. How is that valuable?

        1. Anonymous Coward
          Anonymous Coward

          Re: Title

          I don't snark, I'll post a sarcastic light humoured comment you know like "Your holding it wrong" or something about Office 258, I don't discriminate, I take the piss out of all companies equally when they have a massive fail. I post anonymous because when I'm not trying (emphasis on the "Try") to be funny I also like to post questions or opinions on things and I don't want to be labelled a fool because of my silly posts. As for "karma" points 46.9k up and 7k down so I must be doing something right.

          1. Anonymous Coward
            Anonymous Coward

            Re: Title

            As for "karma" points 46.9k up and 7k down so I must be doing something right.

            Hah! I'm ahead of you on the downvotes. More! More, I say!

            (is this comment useless and snarky enough?)

            :)

        2. Anonymous Coward
          Anonymous Coward

          Re: Title

          In fact I call BS on your "most ACs are of a level that their input is valuable". Most ACs I see are posting anonymous snark about MS but don't use MS products. How is that valuable?

          I'm sort of hoping you also see some valuable anon posts. Not that I would hold myself up as an example due to an exceedingly dark sense of humour, but still :)

        3. Ken 16 Silver badge
          Holmes

          I post anonymous when it bears on my day job

          I find some of the best analysis comes from ACs who are obviously on the inside of the story.

  6. Lee D Silver badge

    Love the "we're losing millions" one.

    Shame, with all those millions, that you didn't think to use another service, another location, a backup, a failover, etc.

    If your restore process is JUST THAT COSTLY, then you need other live hot/warm sites ready to go. All the time. And it'll cost you much less than "millions".

    Whenever I see that, I have so little sympathy for whoever thought that kind of setup was a good idea that I think those people SHOULD be made to pay for their mistakes.

    Azure is one cloud. You could run your own cloud too. You could use another cloud. And you could use multiple datacentres in entirely different continents for each cloud. And then you could join them together and have them failover. It's really NOT that difficult.

    But, nah, just sling Microsoft a couple of grand and obviously everything will be alright even if being down for three hours will cost you "millions".

    1. Bob H

      Absolutely

      Our business has a large part of our capacity in AWS but we have it spread across two availability zones with 100% resilience, two identical and parallel systems designed to balance and scale to the demand. If one goes out then a geographically distinct zone will take over with customers hardly noticing.

      We can't easily expand to different cloud providers because some of the technology we use gets complicated at that level. But as our business expands we are further looking at to what extent we have eggs in baskets and what would happen if the whole of AWS went down (which probably won't happen, normally it will just be a geographical issue).

      1. Ian 7

        Re: Absolutely

        Multiple AZs is a good start, but that wouldn't help you in this case as an entire region went out. AZs help you with issues at an individual data center, but AWS/Azure/GCP roll out changes at a region level so you need to be multi-region to protect you from times when that kind of thing f***s up. And it will. It may be more cost effective for you not to bother; it depends on your RTO/RPO requirements.

        When you get into multi-cloud conversations, you're well into the law of diminishing returns.

      2. Claptrap314 Bronze badge

        Re: Absolutely

        Ooooo... You are in TWO AZs? You summer's child.

        You might want to read the book on SRE. Reliability STARTS when you are in three regions, with the DCs on non-overlapping maintenance schedules.

        Anything less, and you WILL have regular outages. If your business can handle those, fine. If not, you better get someone in house that at least knows the basics.

  7. SVV Silver badge

    Actively working to restore services

    Sounds like @AzureBarry is rushing to the data centre right now to put a few more 50p coins in the meter.

  8. Anonymous Coward
    Anonymous Coward

    Magic Rituals

    It added on its status page that "engineers have identified the root cause, and actively working to mitigate the issue."

    I'm guessing mandragore root? In that case, mitigate frog eyes, a hare's leg and magic mushrooms at midnight for best effect.

    1. Rich 11 Silver badge

      Re: Magic Rituals

      We're all used to sacrificing a black cockerel to Lucifer* to get shit going. You know it works.

      *Other gods are available.

      1. hmv

        Re: Magic Rituals

        And I thought it was supposed to be goats; no wonder I never got an answer.

    2. macjules Silver badge

      Re: Magic Rituals

      Engineers addressed the temperature issue, and performed a structured recovery of the affected devices and the affected downstream services.

      Translation:

      "We unfortunately re-arranged the server stacks into a pattern which very closely resembled the dread sigil Odegra. After realising this and recording temperatures of extreme cold and heat at the same time we made a decision to switch off the power. Unfortunately every website hosted in the North Europe zone will now be displaying 'Hail the Great Beast, destroyer of worlds!'.

      Microsoft apologise for any inconvenience caused"

  9. Anonymous Coward
    Anonymous Coward

    Why is the floor silvery?

    "We've asked Microsoft what exactly an "underlying temperature issue" is."

    Solder's on the floor now. Was in the cabinets before.

    1. whitepines Silver badge
      Joke

      Re: Why is the floor silvery?

      However, our active Halon based cooling system worked wonders and the glowing air around the cabinet is now at correct operating temperature....

  10. Anonymous Coward
    Anonymous Coward

    Microsoft: 9 myths about moving to the cloud

    https://media.bitpipe.com/io_14x/io_142497/item_1698353/9%20Myths%20About%20Moving%20to%20the%20Cloud.pdf

    Take note of point number 4:

    "Myth 4 Dealing with the cloud is just another hassle"

    "Fact: Office 365 cuts down on hardware and upgrade headaches"

    "When you move to the cloud, the inconveniences of maintaining hardware and upgrading software are a thing of the past. Now your team can focus on being a business rather than a repair service. That means more time spent improving business operations and launching initiatives. Instead of spending ever larger portions of your capital budget on servers for email storage and workloads, you can think strategically and support managers with more agility."

    1. Anonymous Coward
      Anonymous Coward

      Re: Microsoft: 9 myths about moving to the cloud

      Doesn't say you don't have to worry about redundancy, just that software and hardware upgrades will be taken care of. There's nothing in this story that invalidates that statement.

    2. Destroy All Monsters Silver badge

      Re: Microsoft: 9 myths about moving to the cloud

      Instead of spending ever larger portions of your capital budget on servers for email storage and workloads

      Yeah, have you LOOKED at the pricing for MS Server Editions. Luckily there is Linux.

      you can think strategically and support managers with more agility

      It's 2018 and managers can't into strategic thinking. They have to be supported by Morlock brains who can package their ideas with magic agility juice.

    3. macjules Silver badge

      Re: Microsoft: 9 myths about moving to the cloud

      "Fact: Office 365 cuts down hardware and upgrades your headaches"

      FTFY

  11. Scott Marshall

    Underlying temperature issue

    We've asked Microsoft what exactly an "underlying temperature issue" is.

    I'll take a punt that an "underlying temperature issue" is actually the clients getting seriously hot-under-the-collar about the outages.

    1. hplasm Silver badge
      Flame

      Re: Underlying temperature issue

      Underlying temperature issue - see icon.

      1. J. Cook Silver badge
        Coat

        Re: Underlying temperature issue

        The phrase 'Thermal reluctance anomaly' was already trademarked, and MS didn't want to pony up the ludicrously high usage fees. :)

        Mine's the Nomex one with the copy of a certain set of romance novels in the pockets.

  12. Anonymous Coward
    Anonymous Coward

    And no, no, NO for social media updates

    We have a policy at work that we will not use, under any circumstances, any company which does not have a normal website for updates, but instead compels us to agree to social media Terms & Conditions to remain up to date and/or interface with their customer support.

    Simply not acceptable.

    1. Anonymous Coward
      Anonymous Coward

      Re: And no, no, NO for social media updates

      Azure has a normal website for updates. What's your point?

      1. Anonymous Coward
        Anonymous Coward

        Re: And no, no, NO for social media updates

        C'mon, all the cool kids hate social media! I'm just trying to fit in and, maybe, get a date to the prom.....

        1. Anonymous Coward
          Anonymous Coward

          Re: And no, no, NO for social media updates

          There's an advantage to having your support ticket visible on Twitter, all the other customers and potential customers can see it too. A number of times I've got nowhere with phone support but one of the kids has tweeted about the problem and we've been called about it shortly after.

    2. Rob E

      Twitter is blocked on our network, so any status updates on there are useless :(

      Sure, use it if you must, but not in place of a proper support channel!

  13. rivergarden

    Oops...

    Sounds like some people migrated on-prem applications to the cloud in a hurry.

    You CANNOT migrate on-prem workloads to the cloud as-is, or this kind of stuff happens - and it is your fault, not the cloud provider.

    Cloud IaaS / PaaS can be deployed for production workloads with very high uptime and performance requirements. Migrating a "traditional" workload to the public cloud, without adapting to the infrastructure, improving security, etc, is insane...

    1. Alister Silver badge

      Re: Oops...

      You CANNOT migrate on-prem workloads to the cloud as-is, or this kind of stuff happens - and it is your fault, not the cloud provider.

      But, but, but... My Boss said that Microsoft said that we could migrate our on-prem Exchange to Office365 with no issues, and all the ickle birds would tweet, and every cloud would be rosy-pink, and all the flowers in the garden would bloom, and beer would be 10p a pint...

      Are you saying they LIED to him?

      1. Anonymous Coward
        Anonymous Coward

        Re: Oops...

        > My Boss said that Microsoft said that we could migrate our on-prem Exchange to Office365. ... Are you saying they LIED to him?

        No, M$ just paid him and his boss a fancy vacation aka bribed him.

      2. Anonymous Coward
        Anonymous Coward

        SQL Server licencing hikes

        They didn't lie to him, they just demonstrated how exorbitantly expensive it is to build a resilient SQL Server estate on-premises. Where I last worked we had to lock the SQL Server VMs onto a single physical host in each DC or licence SQL for every CPU core on every host in the private cloud we implemented. They didn't tell us that at the planning stage when we bought the O/S licences. This meant that whilst 80% of the load could transfer across DCs automatically, we had to manually move SQL Server images.

        We were consolidating several hundred physical servers into a virtual estate so there was enough benefit in making the virtualisation leap to give us a decent ROI over the 3 year life of the servers but the next step will probably be a move to the cloud as budgets are shrinking.

  14. Anne-Lise Pasch

    Much as it's easy to complain...

    My North Europe VMs were 'moved' to a new data centre automatically by Microsoft, keeping the same public IPs, and although specific web servers dropped, the load balancer did its job, and the clustered SQL kept working. Our websites stayed up; for me, this is the first time I've seen my company's investment in Azure bear fruit.

    That being said, "which in turn caused a structured shutdown of a subset" is utter bullshit. Our servers were bouncing faster than a stage set by Hilltop Hoods.

  15. TonyJ Silver badge

    As others have pointed out...

    ...moving to Azure doesn't remove the responsibility of you to ensure you have the relevant levels of resilience.

    And also as others have said, if you're running services that lose "millions" when said services are down, then stop being a cheapskate and pay for a resilient service.

    There are advantages to using other people's computers in other people's data centres for sure, but there are also disadvantages - especially if your approach is just to throw it over the wall, sit back and feel smug.

    1. Claptrap314 Bronze badge

      Re: As others have pointed out...

      "throw it over the wall, sit back and feel smug." You forgot to insert "collect bonus."

  16. hoola

    Cloud Panacea

    The root of all this is that it is sold as a service and management do not give a stuff. IT is not their problem if it goes down they just quote SLAs et al and phone their account manager. No one really cares in the way they would if it was on prem. There you can scream at technical staff and play the blame game. Once it is in the cloud and something goes wrong, shrug shoulders, complain and a few ineffectual meetings, job done. Repeat every time it happens and continue to pay because it is too difficult now to do anything else.

    Yes you can mitigate to a certain extent but the costs become prohibitive, significantly more than well managed on prem solutions. And that is where the second major point comes in, cloud is all recurrent expenditure driven and looks good on the books. This obsession with converting capital into recurrent keeps accountants happy as there are no lumps (just an ever-increasing expense). The ultimate cost to the business is ignored, it is just like everything else in society where the monthly payment is king. It does not matter if it costs four times the cost of owning it. Each payment looks small so everyone is happy.

  17. PiltdownMan

    That sounds like a pretty large "subset"!!

    "A subset of customers using Virtual Machines, Storage, SQL Database, Key Vault, App Service, Site Recovery, Automation, Service Bus, Event Hubs, Data Factory, Backup, API management, Log Analytics, Application Insight, Azure Batch Azure Search, Redis Cache, Media Services, IoT Hub, Stream Analytics, Power BI, Azure Monitor, Azure Cosmo DB or Logic Apps"

    Just sayin'

  18. Ken 16 Silver badge
    Flame

    That sounds like a fire to me

    Subject the witch to an underlying temperature event!

    1. returnofthemus

      Re: That sounds like a fire to me

      Wouldn't have been an issue if they had stuck their Data Centres under water ;-)

  19. Crisp Silver badge

    Status : Degraded.

    Funny... That's my status as well.

  20. steviebuk Silver badge

    I have just two words for you that will fix ALL your problems...

    ..."Infrastructure Free".

  21. returnofthemus

    Where have all the M$FT Fanbois Gone?

    No doubt still working out how to connect AzureStack

    Come out, come out wherever you are

  22. Anonymous South African Coward Silver badge

    The Cloud = another man's computer. Somebody else's responsible for the hardware and other things such as cooling, electrical supply etc etc.

    The Cloud = as soon as somebody else's experiencing DC issues, you can do nothing.

    Your own server = you are fully in control of things and you are responsible for everything.

    So therefore people who go over to cloud-based solutions want to offset part of the responsibility of their businesses to somebody else, whilst paying peanuts for that, and expecting no problems.

    Surprise. Whether you're in the cloud or hosting it yourself, you will still get issues, no matter what you do.

  23. Claptrap314 Bronze badge

    SRE view

    I've been SRE at Google. I want to gather a couple of points.

    1) Reliability requires engineering. (See what I did there?) If you don't have in-house expertise on reliability, things are going to crash, it doesn't matter where the things are.

    2) Redundancy does not provide reliability. Redundancy is a component of reliability. Application architecture MUST be set up to turn redundancy into reliability.

    3) Redundancy is a many-faceted gem. You need regional redundancy. But you also need non-overlapping maintenance schedules. I argue that you need to think about weather patterns. Think about hurricane tracks. Think about arctic blasts. Think about the ring of fire on the Pacific rim. Think about an Icelandic volcano dropping ash over 35% of Europe. Think about servers for emergency services being hosted someplace other than where the emergency being serviced happens. Think about raptors kissing across power lines in electrical substations. (Yes, it happens.)

    4) The magic number is two. You need two extras. One to take down on purpose (for maintenance) and one to take down because the moon went out of phase. If two DCs have overlapping maintenance schedules, they can NOT both be counted. If two DCs are in the same region, they can NOT both be counted. Same thing for individual servers in a DC. You need guarantees regarding server maintenance.

    5) If you don't have the budget to run an ops team in three different States, and your business can not tolerate major outages, then hire a reliability expert and go into the cloud. Once the numbers are run, you probably want to be in at least six DCs (again, in six States) before it makes sense to go independent.

    6) Monitor, monitor, monitor. Lots and LOTS of ways that things can go wrong in the world. If your customers ever are the ones that are telling you that there is a problem, you have a much bigger problem than the one your customers are telling you about.

    7) Keep an exit strategy. Actually, keep two. First, be sure that you have an executable plan to change cloud providers. Second, have a plan to build your own. Review annually. If you are an early stage startup, review quarterly.

    Yes, there is no free lunch. Probably not even a cheap one. But equipment leasing would not be a thing if it did not make business sense in particular circumstances. Some of those circumstances include IT.
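
    A rough sketch of the counting rule in point 4, in Python. The DataCentre fields, the hour-of-week maintenance windows and the function names are all made up for illustration; this is one way to read the rule, not how any provider actually models it.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class DataCentre:
    # All fields are hypothetical, chosen only to illustrate the rule.
    name: str
    region: str
    maintenance_window: tuple  # (start_hour, end_hour) within the week


def windows_overlap(a, b):
    """True if two half-open (start, end) windows share any time."""
    return a[0] < b[1] and b[0] < a[1]


def conflicts(a, b):
    """Two DCs cannot both be counted if they share a region
    or have overlapping maintenance windows."""
    return a.region == b.region or windows_overlap(
        a.maintenance_window, b.maintenance_window
    )


def countable(dcs):
    """Size of the largest set of DCs with no pairwise conflict.
    Brute force over subsets, which is fine for a handful of DCs."""
    for size in range(len(dcs), 0, -1):
        for subset in combinations(dcs, size):
            if all(not conflicts(a, b) for a, b in combinations(subset, 2)):
                return size
    return 0


def meets_n_plus_two(dcs, needed_for_load):
    """The 'two extras' rule: one spare for planned maintenance,
    one for whatever goes wrong at the same time."""
    return countable(dcs) >= needed_for_load + 2


fleet = [
    DataCentre("a", "north", (0, 4)),
    DataCentre("b", "north", (10, 14)),  # same region as "a"
    DataCentre("c", "west", (2, 6)),     # window overlaps "a"
    DataCentre("d", "south", (20, 24)),
]
print(countable(fleet), meets_n_plus_two(fleet, needed_for_load=1))
```

    Four DCs on paper, but only three of them can be counted once the shared region and the overlapping window are taken out, which is the point of the rule.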

    1. J. Cook Silver badge
      Joke

      Re: SRE view

      Re: point #3:

      You forgot to mention Kaiju attacks, the dead rising from the earth, dogs and cats living with each other, and (bog help us) a world that fully knows peace. :) (although the two raptors kissing across power lines was pretty good.)

      Otherwise, yes.

      1. Claptrap314 Bronze badge

        Re: SRE view

        If your goal is five nines, you need to engineer for 6. That's 30 seconds per year on average. If you are down for an hour, you've blown your budget for more than a century. What's the mean time between a multi-state event of hurricane, arctic blast, or volcano/earthquake? These things only sound outlandish until you run the numbers.
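
        For anyone who wants to run those numbers, a small Python sketch. The function names are mine and the year length of 365.25 days is an assumption; nothing here is an official SLA formula.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed average year length


def downtime_budget_seconds(nines):
    """Allowed downtime per year for an availability of N nines,
    e.g. nines=5 means 99.999% available."""
    unavailability = 10 ** (-nines)
    return SECONDS_PER_YEAR * unavailability


def years_of_budget_consumed(outage_seconds, nines):
    """How many years of downtime budget a single outage uses up."""
    return outage_seconds / downtime_budget_seconds(nines)


for n in (3, 4, 5, 6):
    print(f"{n} nines: {downtime_budget_seconds(n):.1f} s/year")

one_hour = 3600
print(f"1 hour at 6 nines: {years_of_budget_consumed(one_hour, 6):.0f} years of budget")
```

        Six nines comes out at a little over 30 seconds a year, so a one hour outage eats more than a hundred years of budget, which matches the figures above.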

  24. LeahroyNake Bronze badge

    Air con failure

    They could have just said that the air con failed, the servers had a sh!t fit and turned off to stop themselves melting.

    Or someone left the door open.

    Or Sales guy doing a tour thought it was a bit nippy so adjusted the temperature.

    Or.......

    We will never know. It's somebody else's server farm after all.

  25. Anonymous Coward
    Anonymous Coward

    5 hours isn't bad actually

    Over the years I've worked for many companies who have had on-prem 'heat events', usually due to cost cutting in the air conditioning department thanks to the facilities manager not understanding that it's not the same as keeping a meeting room comfortable. One of those led to about a million quid's worth of Sun hardware barbecuing itself over a weekend (although to its credit it kept running even with half the kit smoking).

    When they happen on-prem, my experience is that it often takes days to get engineers on site and replacement parts installed, especially if the facilities manager hasn't negotiated a proper SLA as 'who cares if the meeting room is a bit warm for a day'.

    So the ludicrous statements in this thread that this happening in Azure is somehow a fundamental failure of the cloud model are nonsense. These things happen on-prem as well, and if I was able to fix an AC failure on prem in five hours I'd think I was doing pretty well.

    1. Claptrap314 Bronze badge

      Re: 5 hours isn't bad actually

      If there is any SRE in the room, 5 hours is ridiculous in this case.

      1) EVERY microprocessor has an internal temperature sensor that signals thermal runaway, and even WinBloze is not stupid enough to ignore this. This condition can be read from the OS. It is insanely irresponsible for this alarm not to be aggregated to the dashboards under the limited set of "IGNORE EVERYTHING ELSE AND LOOK HERE" alerts.

      2) Same thing with the aircon units. If any of these fail, there should be klaxons in the data center, and an alert in the previously noted set for the operations center(s).

      3) As soon as the problem was identified as aircon, jobs should have been migrated away from the datacenter as quickly as possible as the root cause analysis proceeds. (I would hesitate to try to automate this. Deciding which DC to move jobs to can be a real art.) This also requires that the jobs in question can in fact be migrated between DCs. It is irresponsible to sell cloud without educating customers about the need to support involuntary DC moves.

      4) Note: SRE does not alert on RED. SRE alerts on YELLOW. In other words, unless there was a catastrophic failure of the air con, the alerts in 2 happen before the aircon actually fails. Likewise, the alerts in 1 happen before any servers actually shut down.

      5) If we assume that 1 & 2 are implemented as 4 specifies, then as servers are drained of applications, they are shut down. If the aircon failure is only partial, the heat budget of the dc can balance while a lot of jobs are still running. (And if the dc is set up to include passive cooling, there is only ever a partial aircon failure.)

      6) None of the above involves any level of management approval, or even knowledge unless some enterprising first-liner decided to read through some playbook.

      In other words, an aircon failure in a properly instrumented dc in a mature cloud offering has a decent chance of having 0 customer impact. This is the job of SRE.
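
      To make point 4 concrete, a minimal Python sketch of paging on YELLOW rather than RED. The thresholds, sensor names and function names are invented for illustration; real limits depend on the hardware and the room.

```python
from enum import Enum


class Level(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


# Hypothetical inlet temperature thresholds in degrees Celsius.
YELLOW_AT_C = 27.0
RED_AT_C = 35.0


def classify(inlet_temp_c):
    """Map a temperature reading to an alert level."""
    if inlet_temp_c >= RED_AT_C:
        return Level.RED
    if inlet_temp_c >= YELLOW_AT_C:
        return Level.YELLOW
    return Level.GREEN


def should_page(level):
    """Page a human as soon as things leave GREEN, so the alert
    fires while there is still time to drain jobs."""
    return level is not Level.GREEN


def sensors_to_page_on(readings):
    """Return the names of sensors whose reading warrants a page."""
    return [name for name, temp in readings.items()
            if should_page(classify(temp))]


readings = {"rack-01": 24.5, "rack-02": 29.0, "rack-03": 36.2}
print(sensors_to_page_on(readings))
```

      The design choice is only that the paging threshold sits below the shutdown threshold; the gap between the two is the time you have to migrate work.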

  26. Anonymous Coward
    Anonymous Coward

    Rapid ambient temperature increase

    The ambient temperature in Dublin shot up by 10 degrees yesterday afternoon.

    1. Anonymous Coward
      Anonymous Coward

      Re: Rapid ambient temperature increase

      In Ireland that's the 3 non-rainy days we call "summer"

  27. RobertsonCR7

    I wonder what really happened ...


Biting the hand that feeds IT © 1998–2019