AWS outage killed some cloudy servers, recovery time is uncertain

Parts of Amazon Web Services' US-East-1 region have experienced about half an hour of downtime, but some customers' instances and data can't be restored because the hardware running them appears to have experienced complete failure. The cloud colossus’ status page reports an investigation of “connectivity issues affecting some …

  1. wsm

    Not just Virginia

    EC2 instances in other regions ran so slowly that most data traffic could not get through. Since our proud cloud-first management had mandated the move of the authentication and authorization SSO systems to the cloud less than a month ago, the entire organization had a holiday from web services for most of the day.

    Interesting how the art of conversation is suddenly revived in such circumstances.

  2. Denarius

    Isn't cloud supposed to be fault tolerant?

    So clouds barf just like any other server in a datacenter, with even less customer control. Colour me snarking.

    1. Pascal Monett Silver badge

      Re: Isn't cloud supposed to be fault tolerant?

      Oh, but it is supposed to be. It's also supposed to be reliable, as near always available as marketing can dare say without actually writing 100% (because that'd make 'em legally liable), and fast.

      It is supposed to be all that and more - the operative word being supposed.

      I remember times when we used to argue that a server going down in a company's server room would only affect the customers of that company, but the cloud won anyway and now we know that when a cloud server goes down, it's the customers of many companies that are affected.

      Yay progress.

      1. John Robson Silver badge

        Re: Isn't cloud supposed to be fault tolerant?

        It is pretty reliable - but that's why you have multiple AZs.

        You don't get to complain about an outage caused by a hard disk failure if you aren't running RAID.

        If you aren't running the fault tolerant options on a cloud service... you don't get to complain about an outage that those options are designed to deal with.

        1. ArrZarr Silver badge

          Re: Isn't cloud supposed to be fault tolerant?

          This. More of this.

          If you build your system properly then it will never* go down. On top of that, if the power goes out, any data centre you manage will have just as hard a stop as a cloud data centre.

          *If the power goes out in the US and EU, then something big is probably happening and your site is probably the least of your worries.

      2. Anonymous Coward

        Re: Isn't cloud supposed to be fault tolerant?

        It isn't supposed to be magic, just fault tolerant. Magic would be preventing anything failing ever - no-one who knows anything about Cloud, including the vendors, says they have that kind of magic. Fault-tolerant means providing capabilities to hide, mitigate and recover from failures. Cloud vendors do say that you need to build and architect applications to expect failures, and they provide lots of capabilities to allow you to do that. For example, AWS separates each region into Availability Zones (think "isolated data centre") and specifies that the way to get high availability and fault tolerance is to split your application hosting over at least two AZs. How difficult is it to do that? Basically check a few boxes in the web console or add a parameter to a couple of CLI/API commands.

        It is completely trivial to get an application running on servers in multiple AZs talking to a database that has a master in one AZ which is real-time replicated to a read-only replica in another AZ, and then to promote that replica to be the master automatically in case of failure. Do that, and problems like the one mentioned here are barely noticeable. AWS (and Azure and GCP do similar things) handles high-speed connections, load balancing and automatic data replication between AZs for you. This handles issues with a single data-centre going down very elegantly.
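
The failover described above can be illustrated with a toy model. This is only a sketch of the idea, not AWS's implementation or API: the `Node` and `Cluster` classes, the zone names `az-a` and `az-b`, and the synchronous replication are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical database node living in one availability zone."""
    az: str
    role: str  # "primary", "replica" or "failed"
    healthy: bool = True
    data: dict = field(default_factory=dict)


class Cluster:
    """Toy single-master cluster with automatic replica promotion."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def primary(self):
        return next(n for n in self.nodes if n.role == "primary")

    def failover_if_needed(self):
        current = self.primary()
        if current.healthy:
            return
        # Promote the first healthy replica, if there is one.
        candidate = next(
            (n for n in self.nodes if n.role == "replica" and n.healthy), None
        )
        if candidate is None:
            raise RuntimeError("no healthy replica to promote")
        current.role = "failed"
        candidate.role = "primary"

    def write(self, key, value):
        self.failover_if_needed()
        self.primary().data[key] = value
        # Assumption: replication is synchronous, so replicas never lag.
        for n in self.nodes:
            if n.role == "replica" and n.healthy:
                n.data[key] = value


cluster = Cluster([Node("az-a", "primary"), Node("az-b", "replica")])
cluster.write("user", "alice")
cluster.nodes[0].healthy = False  # az-a loses power
cluster.write("order", 42)        # triggers promotion of the az-b replica
print(cluster.primary().az, cluster.primary().data)
```

The write made before the failure survives because it had already been copied to the second zone. With only one zone, the same power loss would raise the `RuntimeError` instead.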

        Occasionally entire regions do go down, but *almost* never multiple regions at the same time. If you want to be clever and super-fault-tolerant, you build your application to work over multiple regions, not just multiple AZs. That isn't quite as trivial, because you have to understand the data replication model (single-master, multi-master, eventually consistent) plus issues like latency. But there are good patterns out there to allow that, and if you are building from scratch rather than lifting-and-shifting and can take advantage of some of the globally-replicated NoSQL services all platforms provide, then the problem largely goes away.

        And yes, if you're super-super-paranoid, you can even build a multi-cloud-provider solution. That definitely isn't easy. Or cheap. But then that hasn't changed from the old on-prem days.

    2. sweh

      Re: Isn't cloud supposed to be fault tolerant?

      No, clouds are not meant to be fault tolerant. "The cloud" may always be there and running, but individual instances inside the cloud may die at any time.

      Clouds allow you to build applications that are fault tolerant. Indeed, applications should be designed to assume failure. There are many design patterns that can help with this.

      This is why "lift and shift" doesn't buy you anything except "outsourced data center". If you build traditional applications and deploy them to the cloud then you need traditional HA solutions as well; duplicated service in a different datacenter, data copying, "DR" processes...

      The responsibility for availability in the cloud rests solely on the application owner.

  3. Anonymous Coward
    FAIL

    But I thought The Cloud solved all problems...

    What happened?

    1. Anonymous Coward

      Re: But I thought The Cloud solved all problems...

      The real world. :)

    2. Eric 23

      Re: But I thought The Cloud solved all problems...

      Thunderstorms happened. I live in the area of the US-East-1 data centers. My UPSs were freaking out for about 5-15 minutes yesterday late afternoon/early evening.

  4. Anonymous Coward

    Everything Fails...

    The quote "Everything fails, all the time." from Werner Vogels (CTO - Amazon.com) should indicate the level of due diligence required when deploying into the cloud.

    It's an AWS principle that when deploying into AWS you deploy into 2 Availability Zones (AZs) for fault tolerance.
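
A rough back-of-the-envelope sketch of why a second zone helps. It assumes zones fail independently and that the service is up if any one zone is up. The 99% per-zone figure is a made-up number for illustration, not an AWS figure, and the function name is hypothetical.

```python
def combined_availability(per_zone: float, zones: int) -> float:
    """Chance that at least one zone is up, assuming independent failures."""
    return 1.0 - (1.0 - per_zone) ** zones


for n in (1, 2, 3):
    # 0.99 is a hypothetical per-zone availability, not a published value.
    print(n, f"{combined_availability(0.99, n):.6f}")
```

Real failures are not fully independent (a regional storm or a shared control plane can hit several zones at once), so treat the result as an upper bound.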

  5. Anonymous South African Coward Bronze badge

    But the company says some instances haven't come back yet because they were "hosted on hardware which was adversely affected by the loss of power."

    What? No graceful handover to UPS power, then to generator power?

  6. Andy A
    Meh

    Now I know where Ancestry keep their servers

    Got thrown off Ancestry around quarter past eleven UK time last night. Attempts to log back in, using any of their domains, received a "something went wrong" type error.

    Now I know why!

    Not sure how long it took to restore service, since I just went to bed instead. It was back again when my alarm went off this morning.

  7. Claptrap314 Silver badge

    Only 2 AZs? Don't make me laugh

    2 AZs does NOT give fault tolerance. It gives maintenance tolerance or fault tolerance, not both at once. You need 3 AZs, in each of 3 regions, to be fully fault tolerant.

    Otherwise, you WILL have outages.

    The real question is: "How much redundancy should the business pay for?"
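
The "maintenance or fault, not both" point can be checked with a small counting sketch. It covers only the AZ count inside one region, and it assumes a single surviving zone can carry the whole load. The function name is hypothetical.

```python
def still_serving(zones: int, in_maintenance: int = 0, failed: int = 0) -> bool:
    """True if at least one zone is left to serve traffic."""
    return zones - in_maintenance - failed >= 1


for zones in (2, 3):
    print(
        zones,
        "AZs:",
        "maintenance only ->", still_serving(zones, in_maintenance=1),
        "| fault only ->", still_serving(zones, failed=1),
        "| both at once ->", still_serving(zones, in_maintenance=1, failed=1),
    )
```

With 2 zones, a fault that arrives while the other zone is down for maintenance leaves nothing running. The third zone is what covers that overlap.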

    1. Anonymous Coward

      Re: Only 2 AZs? Don't make me laugh

      The real question is: "How much redundancy should the business pay for?"

      Should read.... "how much is the customer prepared to pay for?"

  8. Timbo 1
    Trollface

    Never mind that...

    ...the real question is: was Netflix affected by it?
