Does anyone really care what “five nines” means anymore? For the record, it means 99.999 per cent availability, which means your business managers can founder in digital limbo for just under a second each day. That doesn’t sound too bad, does it? What about six seconds a week, or roughly five minutes a year? Talk is …
Just a way to measure
What we *really* want is for it to not fail ever. In reality, sometimes shit happens, but in that case you want things back online as soon as poss. But "as soon as poss" doesn't work for contracts. And an outage on one server needs to be averaged over a whole warehouseful of servers.
So you say "overall our servers will be online for 99.999% of the time as measured over the year". And then you and your customers can check (via tracker stats) whether you've met the contract.
Means downtime of just under a second a day (0.864s, but who's counting?)
With the volumes of traffic going through some systems, 0.864 seconds is a lot of downtime.
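For anyone who wants to check the 0.864 s figure, here is a rough sketch of the arithmetic. The function name and the 365-day year are my own assumptions, not anything from the article.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def allowed_downtime_seconds(nines: int, period_seconds: float) -> float:
    """Downtime budget for an availability of `nines` nines over a period."""
    unavailability = 10.0 ** -nines  # five nines -> 1e-5
    return unavailability * period_seconds

if __name__ == "__main__":
    # Assumes a 365-day year; a contract may define the period differently.
    for label, days in (("day", 1), ("week", 7), ("year", 365)):
        budget = allowed_downtime_seconds(5, days * SECONDS_PER_DAY)
        print(f"five nines, per {label}: {budget:.3f} s")
```

That comes to roughly 0.864 s a day, about 6 s a week and a little over 5 minutes a year.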
Only FIVE Nines?
We offer NINE fives.
Swearing in the title does not improve the article
"Shut up Flanders"
Unless I've missed out, and the title has in fact been changed from an earlier version, which read something like:
"WHO THE FUCK CARES ABOUT THIS FIVE NINES BOLLOCKS ANYMORE?"
I fail to see your point, Neighbourino...
Which word is swearing?
Hell is a profanity, though not a blasphemy, and by strict definitions not a swear as it doesn't invoke but merely names.
only 5 ?
I've worked on very highly available systems, where the sort of downtime offered by 99.999% uptime seems like a summer break holiday. Some environments (blue light, city trading) need much higher availability....take a look at the HP NonStop range for the sort of kit that underpins this sort of thing....it simply Will...Not....Stop....
Blue light and City trading
They might need it, but they don't necessarily get it.
Completely missed the point
The point is that in a modern operation people have stopped running Jurassic Park and have escaped the illusion that super availability is anything to do with overpriced tin.
Since the late 90s people have worked out that the tin is commodity, will fail anyway and building an overcomplex clustered, replicated storage anachronism will only result in a system too complex to operate reliably.
We learned in the 90s that availability really comes from a properly written application stack which is designed from the ground up to cope with server, network, storage and entire data centre failures as part of the design criteria. This doesn't mean "no single point of failure" idiocy, it doesn't mean "non stop" servers, it means understanding that failures do and will happen and it is far smarter to embrace this and design to ride through the failures than to waste money on overcomplicated tin which will make the inevitable outage even harder to recover from.
If we took a "non stop" approach to the fear of getting a flat tyre on a car we would have articulated 200 wheel cars with twelve engines which blocked an entire highway when they broke down, only an OEM mechanic would be able to do anything to them at $10,000 / hr and replacing a cupholder would need 6 months of change risk assessment.
...availability really comes from a properly written application stack which ...
is designed from the ground up to cope with server, network, storage and entire data centre failures as part of the design criteria. This doesn't mean "no single point of failure" idiocy, it doesn't mean "non stop" servers, it means understanding that failures do and will happen and it is far smarter to embrace this and design to ride through the failures than to waste money on overcomplicated tin which will make the inevitable outage even harder to recover from.
Seriously, WTF? how does software cope with 'entire data centre failures'? especially without the idiocy of 'no single point of failure' (ie. you've only got one data centre and it just cratered)?
Please explain how I can 'embrace' failures and 'ride through' a complete loss of power or network or esp. data centre.
I sit at your knee, I have much to learn from you... Your wisdom is awaited.
...I'm guessing he's a software guy.
"how does software cope with 'entire data centre failures'?"
What you do need is the entire infrastructure around it and to put a lot of thought into it, but effectively this is the approach behind the internet.
We have a varied selection of DNS root servers for the simple reason that one may fail.
What you can't cope with is thinking that one server in the data centre is key and if the data centre fails this server might still be available, but normally this shouldn't be allowed to happen. Take google for an example, do you think people can't search if one data centre was to be removed from the face of the planet? All traffic would simply be switched to a peer data centre (hopefully :)).
I work in banking where HP NonStops and the like are king, but we've managed to create very acceptable replicated systems that cost a fraction of the price with very similar availabilities, mostly through software but of course you need the infrastructure around it.
One problem that is common is that on an HP NonStop people are scared to have any downtime for the obvious reasons, so problems are worked around and systems not upgraded because any downtime would be horrendous. We can upgrade servers in the middle of the day with no downtime because the peer is handling all the load, then upgrade the peer later.
The system as a whole is 99.999% available (one customer rated it at 100%) but each node can (and will at some point) fail.
A nice story with a NonStop from years ago: the company was a bit tight fisted and decided to not replace a broken system fan, the backup fan broke, server overheated and shutdown. Millions of euros lost in downtime and a massive amount of anger due to a ~5 euro fan :)
HP NonStops do stop (if very, very, very rarely) and when they do the proverbial does definitely hit the fan! (see icon)
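A back-of-envelope sketch of how a pair of fallible nodes can add up to a five-nines system. The 99.7% per-node figure is purely illustrative (not a number from the post above), and the formula assumes the nodes fail independently, which a shared data centre, power feed or software bug will happily violate.

```python
def combined_availability(node_availability: float, nodes: int) -> float:
    """Availability of `nodes` redundant peers, any one of which can carry the load.

    Assumes independent failures; correlated failures make the real figure worse.
    """
    return 1.0 - (1.0 - node_availability) ** nodes

if __name__ == "__main__":
    single = 0.997  # hypothetical per-node availability, roughly a day down per year
    pair = combined_availability(single, 2)
    print(f"one node: {single:.6f}, peer pair: {pair:.6f}")
```

Under those assumptions the pair is down only when both nodes are down at once, which is why each node can be fairly ordinary while the service as a whole still clears 99.999%.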
Single point of failure
If you only have one data centre then that *IS* your biggest potential single point of failure. The application stack needs to incorporate cross data centre data replication, cross data centre load balanced applications. Diverse network routing into each data centre, diverse power, UPS's, generators. It's all possible if you really want it.
If you don't want it then develop a good DR solution for when your data centre "gets cratered". A significant percentage of companies do not recover from a catastrophic data loss.
I think the original poster has a point, but to include this statement..
"This doesn't mean "no single point of failure" idiocy,"
..kind of undermines it, as the poster who highlighted that the DC is a single pof pointed out.
In-built resilience through software is a lot cheaper than lots of fancy tin to do the same job, but it does need to take advantage of not having single pofs - otherwise the software wouldn't cope, n'est-ce pas?
I speet on your steenking titles
"there will be many times when we don’t need our computing resource, for example overnight when staff are not around"
Oh - just switch it off then.
Seriously - is anyone intending to do business with a company that thinks the computing resource isn't in use just cos there's no users on the system?
Might not be relevant in data centres,
but it's still relevant for broadcasters - particularly those whose output comes largely from little spinny discs these days.
So yes, transport to local servers before broadcast; replicate the server rack, the server room, and in some cases the whole building - because if the telly goes off the air for more than a few seconds there's hell to pay.
Must be nice
"In truth, of course, there will be many times when we don’t need our computing resource, for example overnight when staff are not around."
Must be nice to work in an industry where there's no overnight processing or 24-hour availability of services. Where I work, more computing resources are needed once the staff have gone home.
Nuclear reactors work with TWELVE zeroes safety...
...At least the PWR ones. You don't want even one of them failing in your lifetime, let alone 5 or 6.
Actually, they count chances of failure, and it's counted as 1e-12. From 1e-9 to 1e-12 is perfectly OK. Even so, nuclear power plants are designed to SAFELY shut down, and that's the number being evaluated. Should every possible (and acceptable) human error in its operation happen at once, that's the chance it will melt down. Fail-safe systems deal with what happens when they DO melt down, so they won't go Chernobyl....
I won't mention Fukushima, where that event was not predicted nor accounted for... and I won't mention Chernobyl itself, where the safety systems were DELIBERATELY disabled.
(It also means that statistically speaking, if you build 1e12 (1x10 to twelve power) nuclear power plants, one is bound to fail immediately...) Availability, though...
Iguaçu (or Iguazu?) hydropower plant was built to WITHSTAND an eventual flood that's supposed to happen in the Paraná river every 10,000 years. Mississippi dams (near New Orleans) were built to withstand a 100-year flood...
Cobol code was built with only 2 digits for years (that would be 100 years "boundary check"), and now it's been in use for more than 50% of that time... ohh yes Y2k.... but that didn't bother lots of people...
So yeah, few places DO CARE about five nines, but really not IT related ones. I guess medical equipment runs on those premises too.
Consider that when some flight company misplaces your luggage, or your delivery fails to get to your doorstep... or when your car's engine throws you a hissy fit or a flat tire.
I expect some downvotes here from those who didn't really understand the examples...
"(It also means that statistically speaking, if you build 1e12 (1x10 to twelve power) nuclear power plants, one is bound to fail immediately...) "
No. It doesn't.
Let's talk binomial distributions for a second. The simplest one is the coin flip. A balanced coin has a 1 in 2 chance of landing heads, a 1 in 2 chance of landing tails. By your logic, flipping a coin twice would mean you're bound to get one coin landing on heads. Actually, you only have a 75% chance of one coin landing on heads. That's because each flip is independent, so you have a 50% chance that flip 1 is heads, and then after that a 50% chance that flip 2 is heads. In other words, there are 4 equally likely outcomes (H-H, H-T, T-H, T-T) and 3 of them have at least one coin landing heads.
Generalizing and skipping a bit, given an event of probability P, the odds of N attempts containing at least one occurrence of the event are 1 - (1-P)^N. Plugging in 1e-12 as P and 1e12 as N, we get approximately a 63.2% chance that at least one of those trillion power plants would fail immediately.
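A quick sketch to check both numbers above. The function name is made up, and I've used log1p/expm1 rather than the naive power, because computing (1 - 1e-12) ** 1e12 directly in floating point loses accuracy.

```python
import math

def p_at_least_one(p: float, n: float) -> float:
    """1 - (1 - p)**n for independent attempts, computed stably for tiny p."""
    return -math.expm1(n * math.log1p(-p))

if __name__ == "__main__":
    print(f"two coin flips:     {p_at_least_one(0.5, 2):.4f}")
    print(f"1e12 plants, 1e-12: {p_at_least_one(1e-12, 1e12):.4f}")
```

For large N with P = 1/N the result tends to 1 - 1/e, which is where the 63.2% comes from.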
Re: Nuclear reactors work with TWELVE zeroes safety...
"I expect some downvotes here from whom didn't really understand the examples..."
I was appalled to see you hadn't collected a great big heap of downvotes. Do you think that 1e-12 is the chance that the PWR is broken out of the box, or something like that? I can't account for your conclusion any other way and, as the poster next after you pointed out, even that insane assumption wouldn't actually lead to your conclusion.
I wish the mathematically/statistically illiterate would stop posting nonsense about probability and other fields of maths.
I've worked as a service manager in a hugely complex multiple application set of data centres (gosh).
The customers wanted 100% no interruption service but were never prepared to pay for it. Yet, when pressed, they had no plans for 'people backup' if staff offices burned down or buildings lost power, or a flu epidemic struck. How we laughed.
Lessons I have learned: For any fully redundant system there is always a single point of failure.
Availability? Who cares?
Some people do care.
If folk want their high availability services designed by clueless people who some might call "presentation layer people", e.g. the kind that Google and the certified Microsoft dependent PHBs and many others rely on, here is obviously the place to be (especially a couple of standout comments here).
Hardware and software failures are pretty much inevitable. Service interruptions aren't, if you get the right people on the job.
There's a lot to learn, not much of it written about round here, and I don't have time.
For a different look at availability, one which might be more comfortable reading to those from the NonStop Kernel and VMS eras (where are they now in Apotheker's HP?), readers could do worse than have a look at Availability Digest's articles at
http://www.availabilitydigest.com/articles.htm (their website obviously isn't designed by a Presentation Layer Person).
Will nobody think about the WAN/MAN
DC five nines are all well and good, but when you work with end-user availability in a global environment, five nines is a whole new ball of wax. There are many obstacles to overcome before a user at a site can be given a guarantee of any figure.
How many ISPs operate in each country? If only one, then oh dear, hold on to your ankles at your next SR.
Which ISP owns the infra if running with multiple ISPs for separacy?
How much influence do you have on SIPs after major outages affecting ISPs?
If running dual links, where does your ISP consider the last mile for convergence to be? No good if it's 5km from site and the copper gets dug up by poor people and all your links go down, including RF.
Will your landlord let you add a new server room and separate LAN in your location should you feel the need?
What are the local break-fix support SLAs? Do you have 4-hour resolution times for H/W failures, desktops, network, file and print?
Are spares easily available to swap out in less than 4 hours?
How much revenue would you lose as a site if you do not get five 9s? Is this less than the cost of implementing? Do you want to spend €1m to protect €250k in losses over 3 years?
Who will fund these improvements? If you operate in an empire environment dominated by P&L, that's never an easy question to answer.
Is the user base running low attrition? Is knowledge retained on incident management and business continuity management?
Is the user base well trained? As above.
Back at the DC
If the market has not provided an off-the-shelf solution and you A-team yourself a new one, will the original vendor(s) still support the core of their product that you have not ripped apart to deliver what the business needs?
Has the business given you a forecast of business for capacity planning?
Just my thoughts.
As I say to my business partners off the record, you can have whatever you like; just give me explicit and clear requirements, a blank cheque to deliver and run it (inc. for undocumented features), a well-trained user base, good communication matrices with named people to own and review them, and good SOPs, owned and reviewed, inc. DR and BCP. Plus a heads-up on the pipeline of volumes inc. users, and a pipeline of SIPs and the funding to implement them.
PS IT tries its best to deliver world-class service, more so than other service providers like Finance, HR etc.
One is sorely tempted
"Back in the mid-80s, when businesses were expected to talk the IT department’s language, all this may have meant something. These days it is like discussing angels dancing on the head of a pin. Business managers just want the damn stuff to work when they need it so they can make money. End of story."
To tell the business manager he is an arrogant know-nothing shit and find another way of earning a living. Almost anything would do as long as one did not have to put up with that kind of manager.
The non-technical point of failure
Genuinely having NO single point of failure takes a lot more thought and redundancy than you might think. Google's got multiple data centres scattered all over the place, and a setup which shifts load between them to deal with big failures ... and that setup itself has failed at least once (routing requests onto the wrong place, where not all the traffic could be handled) causing an outage for a while.
What about google.com itself - the domain? Yes, the .com servers are replicated and fully redundant, as are the google.com ones - but if the registrar screws up and expires the domain, or mis-delegates it, or someone pulls off a social-engineering attack to seize control, or it gets swept up in one of the US government's over-zealous crackdowns on "bad" websites, it's still gone.
Likewise, if one of that huge cluster of servers takes my incoming mail and dumps it in the bit-bucket - again, as happened not long ago with a huge batch of accounts - they're gone. They pulled the last tape backups out, but of course everything since that last tape backup will still have been lost that way.
However much technical redundancy - RAID, dual PSUs, UPS protection, even synchronous replication to another continent - if a bug, rogue or fat-fingered sysadmin hits delete on the wrong thing, stuff can disappear for hours or even forever.
This bit a company I sometimes do work for a few years ago: register.com had a little tantrum when they were told the domain was being transferred, and started publishing new DNS content which bit-bucketed all their incoming email. It was another 12-24 hours before the delegation transfer completed and mail stopped being discarded!
Ericsson has a telecom switch powered by Erlang, which has nine nines. Erlang makes such stuff as hot patching code easy.
Guinness world record...
I guess a sleeve bearing belonging to a hydropower plant generator was installed in 1897 and had been working non-stop until 2007. But I can't find a link to it. (I'm guessing San Francisco too...)
How many nines could that be? For a MECHANICAL device? Shame on everything else...
Ow, memory fails...
(Please, help me find it. I want to find that article again.)
Well what can you do when you can't meet the 5 9's ?
Say it really isn't needed.
Like a lot of non-IT people, they do not have the slightest idea what they pay night staff for. Where I work we come close at 4 9's. Night time is when all the batch work is done, and there is a lot of it. One key thing that everyone forgets is backups. If you don't back up your data you are going to be in a lot of legal hassle, the kind where the justice department (amongst others) will shut your company down in record time. Do not ever try to argue with an FBI or SEC investigator; they will have you in handcuffs faster than you can dial a lawyer. Oh yes, and you must service your equipment, and that is usually done at night when it has the least impact. Can you shut down a system while someone replaces, say, a disk drive or a motherboard? Can you all say NO?