Strava
Strava is down due to this! How can I check how many miles I've ridden so far this month?
Amazon Web Services is scrambling to recover from a cockup at its facility in Virginia, US, that is causing its S3 cloud storage to fail. The internet giant has yet to reveal the cause of the breakdown, which is plaguing storage buckets hosted in the US-East-1 region. The malady kicked off around 0944 Pacific Time (1744 UTC) …
Too many people (non-IT folk) seem to think that the cloud is this magical place that never has an issue. No matter how many outages Amazon, Azure, etc. have, people still seem to think that it's made of magic.
Deploy in the cloud by all means, but still back up, replicate, and ensure that you don't have a single point of failure.
"Too many people (non IT folk) seem to think that the cloud is this magical place that never has an issue."
True, but whose fault is that? Isn't this exactly their whole selling point to begin with?
I also don't think you should dismiss the whole argument that easily, because when properly set up you can get a redundant environment if you want one. The fact that it didn't work that way at AWS this time tells me more about their infrastructure than about the (in)abilities of virtualised hosting.
"Deploy in the cloud by all means but still backup, replicate, ensure that you don't have a single point of failure."
Unfortunately, that is what they've done. This fault affects a specific region, each of which contains multiple availability zones. Each zone constitutes a logical datacentre, comprising multiple physical datacentres (between 3 and 6 in each AZ, I believe). Deployment across two or more AZs in a given region *is* removing the single points of failure. Supposedly. Didn't work this time.
AWS don't particularly recommend deploying across more than one region, because each region is effectively a completely different cloud, common in branding, usage etc., but connected to the others only via the public internet. Replication between zones within a region is fast and free, but replication between regions is slower and costs.
Ultimately though, a well-designed AWS deployment, with all the fault-tolerant bells and whistles, still has no upfront cost and is thus far more achievable than doing it on-prem. Said bells and whistles will make nuclear outages like this the cause of the rare downtime you do get.
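For the curious, here's roughly what that region/AZ split looks like from the API side. This is a minimal boto3 (Python) sketch, not AWS's recommended tooling - the AMI ID and instance type are placeholders - that lists the AZs in a region and drops an instance into each one so no single AZ is a point of failure:

# Minimal sketch: enumerate the AZs in one region and spread instances
# across them. Region/AMI/instance-type values are placeholders; a real
# deployment would use CloudFormation/Terraform or similar.
import boto3

region = "us-east-1"  # the region that fell over in this story
ec2 = boto3.client("ec2", region_name=region)

# A region is a collection of availability zones (logical datacentres).
zones = [z["ZoneName"]
         for z in ec2.describe_availability_zones()["AvailabilityZones"]
         if z["State"] == "available"]
print(f"{region} has {len(zones)} AZs: {zones}")

# Round-robin one instance into each AZ.
for az in zones:
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder AMI
        InstanceType="t3.micro",
        MinCount=1, MaxCount=1,
        Placement={"AvailabilityZone": az},
    )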
Isn't the selling point of all this cloudy stuff that it does not go down???????
No.
It's that 'IT stuff' has become a utility, as in you only pay for what you use.
This means you can build highly resilient and/or scalable systems without huge upfront costs.
Doesn't mean people do though. ;)
Fact is, any business running anything critical to the business on other people's servers had better have a contract guaranteeing they get back more than the downtime costs (goodwill, for example, ain't cheap), or the people responsible are simply shirking their fiduciary duty to the company.
"Fact is, any business running anything critical to the business on other people's servers had better have a contract guaranteeing they get back more than the downtime costs (goodwill, for example, ain't cheap), or the people responsible are simply shirking their fiduciary duty to the company."
No, that's exactly the opposite of what you should be doing: you're looking to apportion blame after the fact. That's of little use if your business has gone bust due to the downtime. Better to design systems that minimise the risk of this happening in the first place.
Using the cloud allows you to build complex systems with little upfront cost.
That's it.
This does mean that smaller companies can build an infrastructure that's distributed and resilient in a way that wasn't financially feasible 10-15 years ago; and larger companies can potentially significantly reduce their DR expenditure.
It doesn't mean it'll never fail or require administration or backup or all the other things you should be doing with an IT infrastructure. It just means you don't spend a boatload upfront on kit.
>It just means you don't spend a boatload upfront on kit.
And you generally have less say in how things are set up and run. Which is fine, I guess, for some, but I personally wouldn't work for a company where I was responsible for production mission-critical software running on systems not owned by my company, contract or not. The edge to building a lifetime of skills is getting a say, directly and indirectly, on such matters.
"It just means you don't spend a boatload upfront on kit."
That is understating it.
One of the huge advantages of public cloud is that you pay for actual utilization rather than scaling to peak. That is huge. It would be worth using public cloud for that benefit alone. As anyone who has ever sized on-prem infrastructure knows, you scale to peak (meaning you pay for infrastructure every day as though it were the busiest day in the history of the company, even though most days are not), and then you add 20% to the sizing because no one can be certain that the peak will not increase at some point and you cannot just elastically add scale. That equals many, many billions of dollars every year in infrastructure which is purchased and never, or very rarely, used.
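A back-of-envelope sketch of that argument, with entirely made-up numbers - plug in your own:

# "Size to peak + 20%" vs "pay for what you use". All figures invented
# for illustration; the unit price is assumed equal on-prem and in cloud.
peak_servers = 100           # servers needed on the busiest day ever
headroom = 0.20              # the "add 20% just in case" factor
avg_utilisation = 0.35       # typical fraction of peak actually used
cost_per_server_hour = 0.10
hours_per_year = 24 * 365

on_prem = peak_servers * (1 + headroom) * cost_per_server_hour * hours_per_year
cloud   = peak_servers * avg_utilisation * cost_per_server_hour * hours_per_year

print(f"on-prem (peak + 20%): ${on_prem:,.0f}/yr")
print(f"cloud (pay per use):  ${cloud:,.0f}/yr")
print(f"capacity paid for but idle: {1 - cloud/on_prem:.0%}")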
"Doesn't mean people do though."
Maybe because it's been sold as cheaper than running your own data centre.
When IT tries to persuade the business to make provision for this sort of thing, it's probably dismissed as IT being profligate again, or even IT trying to bump up costs so that its own in-house service still looks competitive.
>Guys, EVERYTHING goes down on you at sometime or another.
Of course, but when you have a good working personal relationship with gentlemen every bit as professional as yourself, whose badges contain only a slightly different number to your own, it causes a lot less panic. It's much easier to contact exactly the right people at exactly the right time and get answers you can count on and the service you need, without, as others say, having to worry about whether someone is putting your company's interests first. If this is not the case with your company, then you should start thinking about finding a new company.
>Guys, EVERYTHING goes down on you at sometime or another.
The network goes down and occasionally hardware goes down, but fun fact: even after years of supporting it, I have never seen an HP-UX OS crash due to software. Ever. Of course, thanks to the rise of Red Hat and cheap commodity hardware (not giving 2 shits about POSIX) and HP squeezing its last few customers, I probably, sadly, have more Linux kernel panics in my future. Sigh.
Yes, exactly. All our deployment and storage services are dependent on S3 or S3-backed apps and were all critically impacted, but you wouldn't have noticed, because our cloud-based infrastructure was spread over many zones with enough resources (and cache) to weather the storm. A Fortune 500 company managing many hundreds of web services.
"our cloud based infrastructure was spread over many zones with enough resources (and cache) to weather the storm."
Righto.
Cache doesn't have everything in it though, so what happens when something uncached is required from somewhere else?
Works, but slowly?
Total failure of that request and anything related thereto?
"High error rate"?
Interested readers want to know.
"Isn't the selling point of all this cloudy stuff that it does not go down???????"
Not without multiple levels of geographic redundancy. It's hugely expensive for an event that might only happen once every few years. Those dumb pipes known as the carriers have it in spades*. The likes of Amazon and Google, not so much. I like carriers (from a technical perspective).
* Even for voice mail, and no one uses that.
"Just to be smug, it took us 3 minutes from the first alert to switch from serving from US East and Ireland to Ireland and Frankfurt."
This, times a thousand. Any website or service pinning itself to a single node of a by-design distributed storage facility deserves whatever arse-kicking their customers choose to administer. The cloud, as is so often the case, is not the problem here - it's how it's being (mis)used that is the cause of any woes.
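As a rough illustration of not pinning yourself to one node or region: a hedged boto3 (Python) sketch that tries a primary bucket and falls back to a replica in another region. The bucket names are placeholders, and the replication that keeps them in sync is assumed to exist separately:

# Minimal sketch of a client-side fallback across regions. Bucket names
# are placeholders; keeping the replica in sync is a separate concern.
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

ENDPOINTS = [
    ("us-east-1", "my-assets-us-east-1"),   # primary (the region that broke)
    ("eu-west-1", "my-assets-eu-west-1"),   # replica in another region
]

def fetch(key: str) -> bytes:
    last_err = None
    for region, bucket in ENDPOINTS:
        s3 = boto3.client("s3", region_name=region)
        try:
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except (ClientError, EndpointConnectionError) as err:
            last_err = err          # primary unhappy: try the replica
    raise last_err

# fetch("css/site.css")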
Yes, the "Cloud" is the problem. The way it's hyped, priced and marketed encourages beancounters to outsource to it.
Almost Zero regulation.
No 3rd party audit or oversight
No transparency on backup, resilience, security or privacy. Just vendor hype.
There are things that are appropriate for the "Cloud". However, increasingly, thanks to the cloud vendors' marketing, the applications being put there are inappropriate.
You left out some key steps the auditors follow:
1) Pay us
2) Show us you don't do dumb things
3) Here are some pissant concerns/findings so we can say we did something. Oh, and here are some meaningless pain-in-the-ass findings to address because they are one auditor's special area of expertise - you should make his book mandatory reading.
4) Your own in-house staff know about the real problems. But, "A prophet is not without honor except in his own country, among his own relatives, and in his own house.."
5) Set up the next audit. Don't forget about (1)
I've worked at a place where the internal risk reviews, done by an employee of a different department in the same company, were exactly like that.
Real serious issues were not allowed to be raised. By order of the management, the only issues that were allowed to be mentioned were the ones that could be acceptably mitigated at no cost.
So something like only having one developer who knew anything serious about the company's internally developed customer-specific architecture-specific version of gcc, one not used (let alone maintained) anywhere else in the world, wasn't considered a recordable risk by the auditor.
Then one year the developer in question went on holiday and didn't come back. Never seen again.
Still, it mustn't have been a problem, because it wasn't recorded as a risk.
"Almost Zero regulation"
Almost? Care to list any?
I'd like to see the actual energy bill. Not a percentage estimate of what you save, but a percentage estimate of what Amazon does NOT save. Where's that at, in an NSA vault perhaps?
"...most audited data centres on the planet!"
Audited for what? Do you actually know, honestly know? Do you believe everything you read? Read this: the USA doesn't spy on its citizens.
"Audited for what? Do you actually know, honestly know?"
Yes. I and everyone else who bothered to look do know. It's quite well covered actually, and has to be to allow architects to do our work properly.
Azure details are in the trust centre.
https://azure.microsoft.com/en-gb/support/trust-center/
AWS is in their compliance and assurance pages
https://aws.amazon.com/compliance/
"Audited for what? Do you actually know, honestly know"
There are 2 main types of data centre audit - security and environmental.
Usually a security audit would be a one-off and would certify the facility to a specific standard, or just generally that it was secure by design and process with no significant security risks.
An environmental audit should be conducted yearly on any critical datacentres, MERs, SERs, etc. Usually after your annual deep clean... This will give you an extensive report on everything from aircon, UPS and fire alarms to the type and size of the particles in the air! If you have any of the above facilities and aren't doing this, you should be. Two companies that can help are Bureau Veritas and Aquacair...
It's not "high error rates", it's total failure to accept connections!
$ telnet s3.amazonaws.com 443
Trying 54.231.82.140...
^C
$ telnet s3-external-1.amazonaws.com 443
Trying 54.231.33.168...
^C
These are the endpoints listed at http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region
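For anyone wanting to repeat the test without leaving telnet hanging until Ctrl-C, a small Python sketch that attempts the same TCP connection with a timeout - it only checks the handshake, not the service behind it:

# Same check as the telnet session above, but with a timeout.
import socket

for host in ("s3.amazonaws.com", "s3-external-1.amazonaws.com"):
    try:
        with socket.create_connection((host, 443), timeout=5):
            print(f"{host}:443 accepted the connection")
    except OSError as err:
        print(f"{host}:443 failed: {err}")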
"An advanced cloud storage service fails to accept telnet connections. Shocker. Telnet and ping are not reliable test tools. I'd expect these services to drop such fake connections as security risks."
How is telnet to port 443 a 'fake connection and a security risk'?
How can you drop telnet connections to port 443 but allow legitimate SSL traffic to the same port?
"How is telnet to port 443 a 'fake connection and a security risk'?"
The lack of any legitimate data would flag it up as a security risk. Using Telnet without encryption to connect to a TLS service is a dead giveaway that it's not legit, since Telnet doesn't set up the TLS before the connection.
If you lot think ping is a good way to test a network then you need to get out more. For ping to work, it needs the service accessible and running on the endpoint you're testing, and requires that nothing drops the traffic in between. It's used all the time and a response might confirm a connection is up, but the lack of a ping response tells you nothing about whether that connection is down, and certainly nothing about a non-ping service on that same endpoint.
@Lusty,
You put:
The lack of any legitimate data would flag it up as a security risk. Using Telnet without encryption to connect to a TLS service is a dead giveaway that it's not legit, since Telnet doesn't set up the TLS before the connection.
And just how do you imagine a TLS session starts? If you are using telnet to prove or disprove that connectivity exists to a host, then the initial connection attempt is all you need, and that is the same for any TCP connection, whether it be a TLS negotiation or any other protocol.
I agree with you about ping: most secured environments block ICMP traffic nowadays. However, it and traceroute are still useful for investigating latency and routing, so long as you temporarily enable ICMP on the endpoint.
TLS works at the transport layer; the clue is in the name. The security device sitting between the AWS/Azure host and the network would likely terminate any connections which are not actually setting up a secure transport as part of that connection. In case you missed it, both services have installed custom silicon on the network side of the NIC for exactly this purpose.
Telnet doesn't expose the transport layer, and so if this were terminated it would indeed show as no connectivity when the service is up for legitimate traffic.
I've not tested whether these services work with a Telnet test - my point was that just like ICMP, it proves nothing about the service itself.
Umm, do you know the basics of networking? Even if Amazon had the most amazing WAF that specifically looked for telnet vs. curl or code, they'd have to let them connect first on the standard port to start talking. Until a program starts talking a specific protocol, the WAF is going to have to let the connection start.
Telnet (or nc, or anything else in the world that can make a TCP connection) all operate the same way at the most basic level: connecting out to a remote server on a specific port.
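To illustrate the point being argued: a TLS session begins with an ordinary TCP connect (exactly what telnet does) and only then layers the handshake on top. A minimal Python sketch, nothing AWS-specific:

# Step 1 is a plain TCP connect, identical to what telnet attempts;
# step 2 is the TLS handshake layered on top of that connection.
import socket, ssl

host = "s3.amazonaws.com"
ctx = ssl.create_default_context()

raw = socket.create_connection((host, 443), timeout=5)   # plain TCP, same as telnet
tls = ctx.wrap_socket(raw, server_hostname=host)         # TLS handshake on top
print("negotiated", tls.version(), "with cipher", tls.cipher()[0])
tls.close()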
The status page may be running on AWS gear.
Oh the hypocrisirony.
Steven "tempting fate" R
Good sir, I have voted you upwards for using the concatenation of "shitgibbons." Excelsior!
"my increased error rate is 100%"
That was a winner too! I love it!
'We had 10% errors, at FIRST, which is pretty bad, then it increased all the way to 100% errors, which should be total failure, but until the dashboard that it clobbered recovers to tell us otherwise, we are calling this "increased error rate." It sounds nice. Like saying: We have no services for you at this time, but you're important to us, so have a great day!'
"If you have single points of failure you deserve everything you get."
Are you suggesting that multiple points of failure are better? Maybe I'm being pedantic, but I've never quite understood the expression. I've had to deal with people who wanted to put part of our system on AWS and another part on Azure to avoid "a single point of failure". That the two parts are required for the system to operate and thus the chances of the system being down would increase didn't seem to cross people's minds.
The concept does not imply that there be multiple points, each of which is required for proper operation of the system, but multiple redundant paths, processes, structures etc., such that failure of any one does not compromise the system. Think of a physical mass held up by a chain of links, wherein the failure of any single link would cause the load to fall, versus a multi-stranded cable, wherein after the failure of any single strand the rest would continue to hold the load.
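The arithmetic behind the chain-versus-cable picture, with made-up availability numbers:

# Components in series (all required) multiply their availability;
# redundant components in parallel multiply their *un*availability.
p = 0.99                               # availability of one component

chain_of_two = p * p                   # both links must hold
cable_of_two = 1 - (1 - p) ** 2        # either strand is enough

print(f"series (e.g. AWS + Azure both required): {chain_of_two:.4f}")   # ~0.9801, worse
print(f"parallel (either one will do):           {cable_of_two:.4f}")   # ~0.9999, better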
1) us-east-1 was for a looong time the cheapest AWS region (and it's still joint cheapest), so plenty of people will have their eggs in that particular basket
2) Plenty of people are pretty dumb and trust the "multi-az" aspect of a single region. You're on the cloud. Use more than one region (or, frightening thought, more than one provider?). It's exactly the same effort and saves you from nightmares like this. Same as using more than one datacentre. The AZ should be thought of as a (very large) rack, not as a DC.
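For anyone wondering what "use more than one region" can look like in practice for S3, here's a rough boto3 (Python) sketch of cross-region replication. The bucket names and IAM role ARN are placeholders, both buckets need versioning enabled first, and the exact rule schema may vary by API version:

# Rough sketch: replicate a primary bucket to a second region.
import boto3

buckets = {
    "us-east-1": "my-assets-us-east-1",   # primary (placeholder name)
    "eu-west-1": "my-assets-eu-west-1",   # replica (placeholder name)
}

# Versioning must be enabled at both ends before replication will work.
for region, name in buckets.items():
    boto3.client("s3", region_name=region).put_bucket_versioning(
        Bucket=name, VersioningConfiguration={"Status": "Enabled"}
    )

boto3.client("s3", region_name="us-east-1").put_bucket_replication(
    Bucket=buckets["us-east-1"],
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication",  # placeholder role
        "Rules": [{
            "ID": "replicate-everything",
            "Prefix": "",                 # whole bucket (older-style rule syntax)
            "Status": "Enabled",
            "Destination": {"Bucket": "arn:aws:s3:::" + buckets["eu-west-1"]},
        }],
    },
)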
Yes. Our Echo Dot cannot play any music and, while I can log on to our Amazon Music Library web site, it cannot play any of our tracks - "We're Sorry We are unable to complete your action. Please try again later."
Good job I also uploaded everything to Google Play and still have local copies for our Sonos system.
"I just have multiple copies of all my music"
Are we supposed to care?
Amazon Music comes free with Prime, which I bought mostly for free delivery and a bit for Amazon-produced video content. As I've already paid for it, I occasionally feel obliged to browse and listen to some of the music included with Prime.
Today was one of those days and it went tits up. No great loss, the biggest annoyance being thinking it might be a problem with the tablet I was using or my Amazon account.
>AWS, for some reason, insists this isn't an "outage" but rather a case of "increased error rates" for its
>most popular cloud service.
"Outage" means that they will have to cough up money due to service level agreements.
We have the same issue with Google Cloud: it never has an "outage" when it goes titsup, it just has "issues".
Hah, they couldn't even update their own status page correctly:
"Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard. The service updates are below. We continue to experience high error rates with S3 in US-EAST-1, which is impacting various AWS services. We are working hard at repairing S3, believe we understand root cause, and are working on implementing what we believe will remediate the issue."
~ S3 down, ouch! But it won't impact cloud business much. Why? Corporations are addicted to cost cutting. It's the 'New Innovation, Stupid'!
~ But industry still doesn't want anyone thinking about 'Someone Else's Computer'. Instead we should use buzzwords like hyperscale:
http://www.zdnet.com/article/stop-saying-the-cloud-is-just-someone-elses-computer-because-its-not/
http://www.techrepublic.com/article/is-the-cloud-really-just-someone-elses-computer/
~ It's a hyperscale failure today. And next time there's an even bigger cloud config / data center / net outage, it's still not 'someone else's computer'...
Requires obligatory xkcd (908).
So Amazon Fire Sticks become completely unusable during an S3 outage, can't run any local apps or do anything. Lump of plastic.
And my Motorola security camera also stopped working due to it relying on .... S3
A classic example of how bad an idea it is to rely on cloud services from unreliable vendors.
Amazon's own webpile couldn't deliver my order history an hour ago....
I love the CIOs that mandate all internal critical systems run on high-availability, high-grade hardware, with redundant fibre switches, multipath network connections, SAN storage, etc., then decide it's all too expensive and outsource things to the likes of Amazon and Google, who are using the *cheapest* commodity hardware they can get away with.... The irony of this escapes the suits.
Our storage is on S3 in the US East and we've not experienced problems or losses. Maybe there is some other problem some users have which manifests itself as an S3 problem.
On the back of this story we've wasted time checking our store on the S3 service and have not found any issues.
People who flippantly use phrases such as:
"That's like half the internet."
"It's all over Facebook."
"It broke the internet."
Should be made to stand in a cherry picker in front of a blackboard the size of a skyscraper and write out a billion times "I will not exaggerate ever again."
This dramatically illustrates the U.S. national-security vulnerability of the whole interweb to hostile take-down. Like so many other infrastructure constructs - e.g., the electric grid, gas & oil pipelines, etc. - they have been developed with only the least expensive, most "efficient" criteria in mind, with security and reliability under duress as an afterthought. To add such security after the fact is extremely expensive, more so than would have been the case had it been designed into the original system. Such was the situation pointed out in the weeks following 9/11 by ex-CIA director Woolsey regarding existing infrastructure, and little or nothing has been done to remedy the problem in existing or new infrastructure since.
"The Cloud" has its uses, like shared docs stored on google docs, or source on github. But if you don't have some means of "failure override" (like using a private repository, or e-mail documents to people) you're totally b0rked when the cloud has another 'technicolor belch'.
I can imagine people using Office 365, Google's JavaScript document editors, or even a cloudy-based mail service, running about like chickens with their heads cut off, if their entire business model has them as a 'single point of failure'.
I have to wonder who didn't hear about "distributed load" "replication" and "automatic failover" over at AWS...
So I guess "The Cloud" now potentially means "you're system has gone up in smoke, and has been vapourised"
I guess The Cloud could be heaven for computers when they go and die.
"ah my machine is in the cloud ..."
As a society, I am now thinking Star Trek: The Next Generation's "Bynars" were actually a prophetic warning to us all, and that was about 30 years ago.
( sorry for the obvious icon choice ;) )
Look, everyone piles everything on AWS East because it's the cheapest (or among the cheapest) of their datacenters.
It's the cheapest because it's the oldest.
It's not hard to do the math. Or it shouldn't be. It just proves that people really do stink at assessing risk.
Also, as others have pointed out, it's not Amazon's fault that applications fail when it eventually has an outage - it's why Amazon (and other cloud providers) have multiple data centers that are geographically dispersed. It's up to application owners/users to design redundancy into their applications. Indeed, AWS makes it easier and far more accessible than ever before for everyone to build proper geo-diverse disaster recovery into their applications. Technology and functionality previously available only to the biggest organizations is now accessible to just about everyone.
People just don't want to pay for it, deluding themselves that it will never happen to them. Surprise!
It's not just about where you put your snazzy app stuff. It's also that a lot of supporting infrastructure (Console, Status Page, blah blah) is hosted in US-East-1 and not replicated out to other regions. So a failure of an important service like S3 (which seems to be the pillar of the supporting services) leaves you in the dark when reacting to the incident.
If you're in a co-lo DC, at least you can ring the DC support, ask a tech to check what's going on behind the scenes, and make a local switch to another piece of kit. On AWS... you need to automate that failover, and even that might break if the API breaks.
The android "walk my dog" app failed to sync last night's 4 mile walk with mitzy (my german shepherd) to magic cloud land which I'm going to attribute to this s3 outage debacle.
This is clearly an unacceptable disaster of biblical proportions. Not.
I'll be going out for an hour with the dog in the fresh air again tonight. During that I won't be worrying whether virtual clouds are present, but I do expect to be keeping an eye out for real clouds above.
Quick, dig out the contract to see what protections you've got.
Clause 10: “The service offerings are provided ‘As Is.’ We…make no representations or warranties of any kind…that the service offerings or third party content will be uninterrupted.” https://aws.amazon.com/agreement/
If you didn't like that one, you definitely won't like clause 11.
It broke our system in two places:
1. We take a data feed from TfL. That died for five hours, so no traffic updates. Nothing we could do, as it's not our kit; we just consume the data when it's there.
2. We then discovered that cdn.leafletjs.com was also down. We use their CDN. That was our fault, as we relied on a CDN server being up. Lesson learnt, and 15 mins later we were back up.
That was the worst outage we've had and it wasn't our fault. Highly annoying, but since we paid exactly 0p for the lot we cannot complain.
I have no doubt that far bigger businesses are talking to Amazon re outages and service penalties. Amazon can use weasel words like "100% error rate" but I'd be gobsmacked if money doesn't start flowing from Amazon to big clients (even if it's only service credits).