Change management 101
Bizarre. Who, in their right mind, would schedule multiple concurrent changes without fully testing that scenario first?
Google has explained an August 11th brownout on its cloud as, yet again, a self-inflicted wound. At the time of the incident Google said App Engine APIs were unavailable for a time. It's now saying the almost-two-hour incident meant “18% of applications hosted in the US-CENTRAL region experienced error rates between 10% and …
Might be possible in a small fred-in-the-shed organisation, but in something the size and complexity of Google, these sorts of things can't really be avoided, especially when humans are involved.
I suspect those on their high horses might want to look closer to home, as I suspect they work in IT and have yet to come across a perfect IT department, or anything close. Incompetence usually runs rife.
When there is no real definition of competence, and every IT professional considers everything not done their personally preferred way to be incompetent, then incompetence would run rife, wouldn't it?
However, if you go by a few measures:-
1) Do you have backups?
2) Can you restore them?
3) Have you tested that?
4) Is the uptime for all of your systems >99.9% per month? (That's under 45 minutes of unscheduled downtime per month; hardly difficult to meet.)
5) Is your environment virus/malware free without being user free?
6) Do the users have (working and adequate) tools to do their jobs?
7) Do you have a set of documentation for your site that will let somebody pick up with a minimum of fuss if you have an intimate encounter with a bus?
Then you'd probably come to the conclusion that the vast majority of people in IT are in fact competent, even if their approach to getting the job done is different to mine. Ultimately, "does it work" is the test.
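Point 4 above is easy to sanity-check with arithmetic. A small sketch (the function name is mine, nothing site-specific) that converts an availability target into a monthly downtime budget:

```python
def downtime_budget_minutes(availability_pct, period_minutes=30 * 24 * 60):
    """Allowed downtime (in minutes) for an availability target over a
    period; defaults to a 30-day month (43,200 minutes)."""
    return period_minutes * (1 - availability_pct / 100)

# 99.9% over a 30-day month allows roughly 43.2 minutes of downtime,
# which is where the "under 45 minutes" figure comes from.
print(round(downtime_budget_minutes(99.9), 1))    # 43.2
print(round(downtime_budget_minutes(99.99), 2))   # 4.32
```

Note how quickly each extra nine shrinks the budget: 99.99% leaves under five minutes a month.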
@Peter2
Your measures are pretty valid and an excellent example, but size really does matter.
For example:
1) Do you have backups?
Of everything? If your enterprise is 10 servers, and fifty clients it is likely that the answer is heading towards "of almost everything" or the "of everything we think is important right now" stage. Then it will turn out that most backups are weekly or less often and the failure comes at a critical period...
If you get much bigger than that, then there is a good chance backups are taken of critical systems at best and even then the definition of a critical system is harder to pin down than fog.
4) Is the uptime for all of your systems >99.9% per month? (That's under 45 minutes of unscheduled downtime per month; hardly difficult to meet.)
Again, the bigger the environment the harder it is for "all" systems to meet this. In the end this morphs into "all important" systems and people fight tooth and nail over what important really means.
5) Is your environment virus/malware free without being user free?
Has there ever been an enterprise with more than a few users which has never had a virus?
Ultimately, "does it work" is the test.
I actually agree with this - it is the ultimate test of competence at any given time. However, it also implies that competence is a moving target, as most things are working right up until the point they stop working.
Yes, you have your emails which you sent warning about it and the potential consequences that included a list of potential remediation plans, and their response where they rejected your conclusion and/or remediation plans.
But too often admins will point to stuff and say "that's broken" or "that's a disaster waiting to happen" where they include no remediation plan at all, or give only one option (replace it with something new) even when they damn well know there are more choices than that. They just look at it as an opportunity to replace that Windows Server they hate with Linux, or that Linux server with dedicated hardware that the vendor will manage so they can wash their hands of it, or buy a software upgrade they've been pushing for for other reasons, etc. and their boss is smart enough to know that.
"I suspect those on their high horses, might want to look closer to home, as I suspect they work in IT, and have yet to come across a perfect IT department"
I see what you are saying, but, even as someone who thinks everything Google does is the best thing ever, this is not acceptable. I think Google would agree. The chocolate factory is above the frailties of mere mortals.
"They must need a really powerful telescope to see Microsoft and Amazon in the distance ahead of them..."
Let me fix that for you: "They must need a really powerful telescope to see Amazon in the distance." There is no doubt Amazon is out front by a wide margin, but I don't think MSFT is that far ahead of Google. MSFT wraps in x thousand dollars of Azure with every renewal, much of it sitting idle, so it looks like they are a clear number two, but most Azure customers wouldn't really consider themselves major Azure users. Google has more large cloud implementations than MSFT... I think. Given that Google owns the dev platform for Android (pretty important), they are going to be big. MSFT is largely just bundling the cost of Azure with other stuff and/or telling people that, if they want a discount on legacy software, it needs to be on Azure (continuing on-prem will actually cost them more in some cases than using MSFT's servers).
When bean counters have outsourced everything they can to the Cloud, then one day it will be like E.M. Forster's "The Machine Stops".
We will get up on a Saturday to discover no Electricity (the network management), no ATM, no phone (the billing system down), no retail POS, etc.
Why Saturday?
-- A rushed release on a Friday evening. We have near mono-culture too.
Moral: NOT better management of updates and backups by the Cloud Provider, but if anything is part of your core business, run it on your own servers. Only use 3rd Party Datacentres for your Internet presence (probably core of it your own servers, co-lo to get connectivity) or temporary collaboration.
"Moral: NOT better management of updates and backups by the Cloud Provider, but if anything is part of your core business, run it on your own servers. Only use 3rd Party Datacentres for your Internet presence (probably core of it your own servers, co-lo to get connectivity) or temporary collaboration."
This, this a million times this!
I have a simple question: who are you going to sue when you lose business when it doesn't work? As a matter of fact, who are you even going to call if it goes titsup again?
That said, please go ahead, especially if you're the competition...
Simple answer to both of your questions - Google.
Now ask the same question for an on-premise solution.
If what you want is the satisfaction of being able to shout at someone then don't go cloud. If you want some of the world's best engineers fixing it then do. If you want the ability to design around failure then go cloud or run multiple data centres and comms links yourself. If you think that you are perfect and never make a mistake then give up IT and form a new religion.
... but with your own infrastructure, you can plan your own changes to avoid conflicts, and also make sure that changes don't happen during your critical business periods. You know how good the people doing the work are, you have the responsibility for hiring them!
You're also free to analyze what happened to whatever depth you want post-incident, assigning the correct blame and improving future work without relying on partial, face-saving reports and promises from a service provider whose interests are not served by making the full details of their mistakes public.
I know Google have very skilled people in some places in their organization. But do you think they're the people actually doing the day-to-day grind? They're probably mostly in design/third-level escalation. The people doing the grunt work will be, as at every other company, the cheapest they can get to fill the roles.
And when it comes to suing service providers, taking legal action while trying to recover a business, especially against companies that employ good lawyers, is the last thing a company would need. Even if you won (after the appeal, of course), chances are the financial gain would come too late to save the business.
I would be pretty certain that, for most tiers of cloud that people are using, the contract terms and resultant SLAs that the likes of Google provide will not have clauses providing significant redress. And did I mention that they employ good lawyers? It might even be difficult to identify which legal jurisdiction any case should be brought in!
Your applications should be designed to be flexible enough to tolerate failures of servers, availability zones and even regions if they are that business critical.
The cloud can facilitate this in an amazing way which having multiple redundant DCs can't - you don't need to worry about invoking DR plans, failing over storage replication, and restarting things, they just keep running.
Have a look at the Simian Army which Netflix uses, specifically Chaos Monkey - this should be the end goal of your app: the ability to tolerate the loss of anything from a node to a region.
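In that spirit, here's a minimal illustrative sketch of the tolerate-the-loss-of-a-region idea. The region names and request function are made up for illustration; this is not any real cloud SDK:

```python
# Hypothetical region list, in preference order.
REGIONS = ["us-central1", "us-east1", "europe-west1"]

def call_with_failover(request_fn, regions=REGIONS):
    """Try each region in order; return the first successful response."""
    last_error = None
    for region in regions:
        try:
            return request_fn(region)
        except ConnectionError as err:  # treat as "this region is down"
            last_error = err
    raise RuntimeError("all regions failed") from last_error

# Example: the first region is down, so the call transparently
# falls through to the second.
def fake_request(region):
    if region == "us-central1":
        raise ConnectionError("region unavailable")
    return f"served from {region}"

print(call_with_failover(fake_request))  # served from us-east1
```

Real implementations layer on health checks, timeouts and backoff, but the principle is the same: no single region is load-bearing.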
"The cloud can facilitate this in an amazing way which having multiple redundant DCs can't - you don't need to worry about invoking DR plans, failing over storage replication, and restarting things, they just keep running."
You read the article? Scale brings its own set of problems.
"Your applications should be designed to be flexible enough to tolerate failures of servers, availability zones and even regions if they are that business critical.
The cloud can facilitate this in an amazing way which having multiple redundant DCs can't - you don't need to worry about invoking DR plans, failing over storage replication, and restarting things, they just keep running"
Anybody else just thrown up?
Application designers are exactly the wrong people to be designing resilience. Sensible infrastructure (servers, storage, connectivity) can be designed and configured to provide the appropriate level of resilience for particular applications, and the application designer should know as little as possible about it (though obviously there's a conversation to have about which chunks are most critical, e.g. business transactions, and which are less critical, e.g. lots of other stuff).
Then at some point application failover, DR plans, etc, need to be tested. And then testing needs to be repeated. It's not rocket science but doing it right requires understanding (of business and technology) not just PowerPoint and pointy clicky things.
Back in the 90s, before Google and AWS, this was routine business common sense, for those who cared about these things. Five nines availability wasn't just PR speak, it was deliverable.
"the ability to tolerate the loss of anything from a node to a region."
The ways to achieve that may be rather different depending on the nature of the applications, e.g. content distribution network replicated across multiple sites may work for lots of content delivery stuff but is no help at all to transactional stuff (buy/sell/stock management).
And all that's before we even start thinking about round-trip latency between input and response. The article says that some applications which continued working saw an extra 0.9 seconds of latency during the failure - in many soft real-time areas (factory scheduling, ticket sales, etc.) that's a disaster.
Needs vary. The cloud is a timesharing bureau by another name (and with lots more capabilities), but fundamentally, public cloud information systems are owned and operated by a company whose business interests do not map neatly onto the cloud users' interests. Tread carefully.
"Now ask the same question for an on-premise solution."
It's probably just me, but it bugs me when people refer to it as 'on-premise' when it should actually be 'on-premises'. Picking up the Concise Oxford (or rather, Googling it) we have:
premise
/ˈprɛmɪs/
noun
1. LOGIC
a previous statement or proposition from which another is inferred or follows as a conclusion.
"if the premise is true, then the conclusion must be true"
...as opposed to...
premises
/ˈprɛmɪsɪz/
noun
a house or building, together with its land and outbuildings, occupied by a business or considered in an official context.
"the company has moved to new premises"
So, "on-prem" kinda works, but I guess there's some sort of cloudy cognitive bias going on here. You may as well say "on-tomato" - it'll make about as much sense.
"It's probably just me, but it bugs me when people refer to it as 'on-premise' when it should actually be 'on-premises'."
You may want to rethink bringing logic into a debate about how language is formed:
three cars
two cars
one car
zero cars - logically, that "s" shouldn't be there (there's "no car", but that doesn't refer to the items themselves but to availability).
Just an example that transcends language barriers - in this case it's not just English.
:)
"Simple answer to both of your questions [who to sue & who to call] - Google [or other cloud provider]."
What do your contract terms say?
They'll limit the compensation to whatever the cloud provider can afford, which isn't necessarily within sight of your actual losses.
It doesn't matter how well the provider's staff compare with your own. The financial incentives for their management are such that if they can make more by favouring some other customer, by cutting a corner somewhere, or by scaling up beyond their ability to manage reliably than it will cost them in compensation for failing an SLA, then failing an SLA here or there won't be an issue for them.
It's something to watch out for. It may seem a risk to deal with a supplier much smaller than your company - how can they afford to provide the service you need? But the converse also applies: there's a risk in dealing with a supplier much bigger than you are, because you're too small to matter very much.
Who do you sue? Simple answer to both of your questions - Google.
Dear Google, I am a small business owner and your outage caused me to lose so much business I've ceased trading, I plan to sue you for £1m. Yours, Small business.
Dear Small Business, Bring it on you soon to be unemployed layabout. Yours, Google's Army of Lawyers.
Even if the exchange wasn't as blunt, the reality for most people who use a cloud service is that they don't have the legal foothold for recompense that they think they have - and even when they do, they have quite an uphill battle to overcome companies who spend huge amounts on legal defence teams...
"Even if the exchange wasn't as blunt, the reality for most people who use a cloud service is that they don't have the legal foothold for recompense that they think they have - and even when they do, they have quite an uphill battle to overcome companies who spend huge amounts on legal defence teams..."
OK, now talk me through a scenario where the above isn't the case. If you're not using cloud, you have a telecoms provider. Will they pay out more? No.
If you're not using cloud, you may use a colo. It's a no there too.
Perhaps a server vendor who didn't issue a firmware fix, or fudged one? Nope.
There is NO situation where you are self-sufficient, and no magical SLA outside of 1980s Sun/IBM that will compensate for business losses. Back when those existed, the profit margins were large enough to cover the cost, so the vendor still won. The ONLY difference here is that cloud providers aren't creaming it off your contract, so they also lose money from the outages.
<quote>Who do you sue? Simple answer to both of your questions - Google.
Dear Google, I am a small business owner and your outage caused me to lose so much business I've ceased trading, I plan to sue you for £1m. Yours, Small business.
Dear Small Business, Bring it on you soon to be unemployed layabout. We direct you to the Terms and Conditions of Service, Section (blah) Paragraph (blah): "Limit of Damages". You will note that those Terms of Service do not provide for compensation for Lost Revenue or Profit. Tough Luck! Yours, Google's Army of Lawyers.</quote>
FTFY!!!
As someone who has had those conversations with Google and AWS... I can't comment because I'm bound by NDA. As is everyone else with an informed opinion, ergo all opinions here are uninformed.
What I can say is that nobody in their right mind would accept the boilerplate terms that these providers start the negotiations with.
"What I can say is that nobody in their right mind would accept the boilerplate terms that these providers start the negotiations with."
That's what we've been saying. But how many potential customers are big enough to have them changed?
"If you want some of the world's best engineers fixing it then do."
News flash: Google doesn't have some secret farm where they breed "the world's best engineers"; they use the same labour pool that you do.
However, if you're just hiring someone to shout at, then just hire someone who passed his exams because his mother passed him notes. All you're looking for is a body to yell at; apparently no skill needed, and works cheap.
If you want "the world's best engineers", expect to pay a decent wage.
The bonus you get is that when things go tits-up, you'll have them there and working on your system first, instead of fixing some other company's system before they can get around to yours... eventually.
For Google and AWS, these outages are always interesting - they result in downtime/reduced availability, but in my experience in IT, downtime or unavailability of components isn't uncommon when trying to run 24x7.
The interesting thing is how you keep the larger system in a functioning state, capture enough information to identify the root cause AND get it back to a functional state within a few hours. Sure, it turned out to be human error (software updates combined with large-scale moves), but they had considered capacity during this work, and the thing that affected service was the retries rather than the expected load.
I agree and upvoted you. I am seriously impressed they managed to diagnose and sort this out. You can't always apply changes serially, and that means unintended consequences will happen. The measure of a good organisation is its reaction to those consequences, including lessons learned.
If you are an affected customer running stuff critical to your business, then I would hope you already had a resilient plan in place to mitigate something like this; if you haven't, then more fool you.
A big concern I have is the lack of competition in this space: when storage and compute are just utilities, then with just a few suppliers you're at their mercy for pricing and service.
"If you are an affected customer running stuff critical to your business, then I would hope you already had a resilient plan in place to mitigate something like this; if you haven't, then more fool you."
The reality is that such customers will be using cloud because it's supposed to provide that resilience.
"the thing that affected service was the retries rather than expected load."
Fancy that.
You've never had to ask how a setup copes under error conditions as well as under routine (and peak foreseeable routine) load? Some people have - it's a good question to ask, and some of those people are worth paying money for. But it doesn't happen as often as it should, partly because it's rare for businesses affected by failure to carry the cost impact themselves; it's just another outage that people have to live with (stock market transactions and air traffic control perhaps being obvious exceptions, plus a few others). As others have noted, SLAs are usually worth little more than the paper they're written on.
The first likely-to-be-familiar time I came across the "peak load in an error-recovery situation, including retries" issue was in the early 2000s, when PPPoA broadband was starting to be rolled out in volume in the UK, and it was always interesting watching how well an ISP coped when its connectivity dropped and recovered and tens or hundreds of thousands of users needed to re-authenticate over a few minutes.
Similar stuff in earlier decades involved LAN-connected kit which all ended up rebooting at the same time after various unfortunate failures. Lots of other similar scenarios, doubtless.
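The usual defence against that kind of synchronised reconnect stampede is exponential backoff with jitter, so clients that failed together don't all retry together. A minimal sketch (the function and parameter names are my own):

```python
import random

def backoff_delays(base=1.0, cap=60.0, attempts=6):
    """Exponential backoff with full jitter: each retry waits a random
    time up to min(cap, base * 2**attempt), spreading a crowd of
    failed clients out instead of letting them reconnect in lockstep."""
    return [random.uniform(0, min(cap, base * 2 ** attempt))
            for attempt in range(attempts)]

# Each client computes its own randomised schedule, so the recovering
# service sees a smear of retries rather than a spike.
print(backoff_delays())
```

Without the jitter, every client would retry at the same deterministic intervals and simply recreate the original thundering herd on each attempt.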
Pointy clicky, devops, etc. make doing some things easier. They don't replace experience.
"No mention, however, of trying to schedule upgrades so it is only doing one at a time."
Good.
Change control was a good idea, but it has mutated into a hellish bureaucracy solely designed to do three things:
1. Make it hard to get changes approved
2. Spread the blame so thinly that no-one can get fired if it goes wrong
3. Make everyone "look busy" without _achieving_ anything.
(I am sure some of them are taking inspiration from militant shop stewards of the '70s and '80s...)
I have plenty of horror stories which I cannot share here. Anyone who can, please do...!