Calling Tuesday's one-hour-and-forty-minute Gmail outage a "Big Deal," Google has pinned the breakdown on some recent changes to the request routers that direct queries to the service's web servers. Ironically, at least some of the changes were meant to improve Gmail's ability to stay online. But Google underestimated the load …
Typical Google response, which I think is one of their strongest assets. It's a clear, honest, technical explanation of the problems and what they intend to do in the future to prevent a repeat.
Not just a 'we're sorry, we're looking into it' response that you get from 99% of all the other big names.
The googits don't know what "testing" means ...
"The company says it will spend the next few weeks correcting the problem"
Weeks? WEEKS?? WTF? I've almost been convinced to use an icon button!
Pardon me while I continue to run my own email system.
Such a classic case of designing individual equipment to protect itself when under overload, instead of designing the *system* as a whole to be resilient.
It's the same problem that has caused power outages in the past, when an overloaded generator would disconnect to protect itself, thus just increasing the overload on the rest of the network. A better response is to shed load.
I suppose it's the usual dot.com problem of a new company hiring lots of whizz kids with bright ideas and absolutely no experience of the real world and historical events. They just keep re-inventing the same old wheels, with the same old problems. Electricity and phone companies learned that lesson a long time ago. Looks like the dot.com industry wasn't listening, and hasn't read the books yet. Maybe someone should Google for "network resilience"
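To make the trip-versus-shed contrast above concrete, here is a toy model. It is only a sketch under stated assumptions: the load is split evenly across whatever nodes are still alive, the function names and capacity figures are invented for illustration, and none of this describes Google's actual request routers or any real power grid.

```python
def cascade_trip(capacities, total_load):
    """Self-protecting nodes: any node offered more than its capacity
    disconnects, and the full load is re-split among the survivors.
    Returns (nodes still alive, load actually served)."""
    alive = list(capacities)
    while alive:
        share = total_load / len(alive)          # even split (assumption)
        survivors = [c for c in alive if c >= share]
        if len(survivors) == len(alive):
            return len(alive), total_load        # stable: everything served
        alive = survivors                        # overloaded nodes drop out
    return 0, 0.0                                # total blackout


def shed_load(capacities, total_load):
    """Load-shedding nodes: each node serves up to its capacity and
    refuses the excess, so no node ever disconnects.
    Returns (nodes still alive, load actually served)."""
    share = total_load / len(capacities)
    served = sum(min(c, share) for c in capacities)
    return len(capacities), served


if __name__ == "__main__":
    # Four nodes of capacity 100 carry a load of 360 comfortably.
    print(cascade_trip([100, 100, 100, 100], 360))
    # Take one node out for maintenance and compare the two policies.
    print(cascade_trip([100, 100, 100], 360))
    print(shed_load([100, 100, 100], 360))
```

In this model, removing one node pushes every survivor over its limit, so the self-protecting policy ends with nothing served, while the shedding policy keeps all three nodes up and drops only the excess.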
And they want...
...us to hand all our apps and business over to them....
1 1/2 hrs downtime for a business can seem like a lifetime. Maybe not to a normal user, but to a business, this could cost millions of lost revenue.
My Hardware, My Software, My Downtime. Stuff this Web2.1 cloud crap....
It is outrageous that they could let something like this happen.
Won't someone think of the children??!1?!!!9.
Hardly. Okay, so they've had, what, 2-3 outages this year, totalling about 3 hours? Not exactly a bad record, considering the infrastructure they're running and the costs people pay to use it.
Great that they've learned from the experience and can increase resiliency, but I think they do pretty well, all things considered.
You can bitch and moan as much as you like about this failure (but you'd be a dick for doing so because nobody ever gets 100% uptime); the good part is that it allows us a fascinating glimpse inside the world of scalability on a Google scale.
is the reason this whole cloud for everything idea is utterly bloody stupid!
It simply cannot be trusted, be it the companies holding all your data, or the fact that when anything goes wrong you are left in the lurch for hours, or having to trust a company like BT to actually reach it.
This whole cloud idea is like opening a cheapass bar in the middle of a minefield. The idea behind the cheapass booze isn't bad, the problem is just getting it reliably.
I heard that Google uses BT Home Hubs for its routers, bloke down the pub said so.
Didn't tell them to bring other request routers online at top speed???
I'd have thought that would have been automated..
Maybe Google could make this a selling point: if you're on Gmail, everyone loses their email at once - whereas in the old days your Exchange Server might be down during the one brief period of the week when your client's Internet Mail Server was actually up.
It is not quite "business in real time", but at least it's "business in Google's Own Sweet Time".
Still good reliability
I'll qualify this by saying I only check my gmail about 5 or 6 times a day, but since I signed up when it was an early Beta 2004 I've only seen about 5 outages, which makes it way more reliable than the mail service provided by huge globo-corp I work for...
By my mathematical genius I reckon 3 hours per year outage is approximately 99.966% uptime which is a long way (in real terms) from the 5 nines Holy Grail of service delivery.
Yes, I know it is free, and if you can't go without your "free" email for a couple of hours then you really need to take stock of your priorities in life, but as Google want us all to trust them with data, apps and everything webby they really have to do much better than this.
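For anyone who wants to check the uptime arithmetic in this thread, a minimal sketch follows. It assumes a 365-day year (8,760 hours) and counts every hour equally; the function names are made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8,760 hours


def uptime_percent(downtime_hours: float) -> float:
    """Uptime percentage for a given number of hours down per year."""
    return 100.0 * (1 - downtime_hours / HOURS_PER_YEAR)


def allowed_downtime_hours(uptime_pct: float) -> float:
    """Hours of downtime per year that a given uptime percentage permits."""
    return HOURS_PER_YEAR * (1 - uptime_pct / 100.0)


if __name__ == "__main__":
    print(f"3 hours down per year = {uptime_percent(3):.3f}% uptime")
    for pct in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_hours(pct) * 60
        print(f"{pct}% uptime allows about {minutes:.1f} minutes down per year")
```

On these assumptions, three hours a year works out to roughly 99.966%, three nines allows a little under nine hours, and five nines allows only about five minutes.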
And the moral of this story is...
Always choose an email supplier in a radically different time zone that does their upgrades when you're in bed...
One can only assume that you have bought a dedicated connection, dispensed with the need for an ISP, handle your own DNS, email etc and have no interaction with any other persons or company when you connect to the web.
The fact remains that whether it's Hotmail in 1997 or GMail today, we all like a little web-based email from time to time! That comes with its risks (downtime). I don't know anyone who relies solely on Gmail to run a business (and if they did, I'm pretty sure they'd have other email addresses and/or the savvy to use POP3/SMTP)
It really hit me hard...
... I couldn't access my spam, fake signup account details, newsletters or twitter alerts.
I was lost, running around the office crying "oh my God, oh my God, my life is over!"
I learnt a hard lesson yesterday, I will never again entrust my important email correspondence to the Chocolate Factory. (Shakes fist at sky in anger)
RE: @Allan Rutland / AC
"dispensed with the need for an ISP, handle your own DNS, email etc and have no interaction with any other persons or company"
lol you fail ... Allan is saying don't put all your eggs in one basket, which is exactly what these "cloud" service providers are asking you to do; when it goes tits up, so does your whole business.
I love how many armchair critics there are on El Reg, ready to chip in with their superior knowledge and hindsight.
Isn't IMAP wonderful
So I was worried? Was I inconvenienced? No and No.
Cloud Computing->Terminal Services->Mainframes
Mainframes used originally for things like docs etc (think TV: Drop the Dead Donkey). Then moved to client PCs.
PCs get Terminal Services ....... TS dies a fad's death, back to client PCs
PCs get Cloud computing, dump the PC, use a cheap desktop web client........
anyone see the pattern?
Oh FYI, I've been a paid member of mail2web for over 2 years now and it's NEVER been down for me using Exchange via WinMob and desktop Webclient. So yes, you can have 100% uptime, at least that's my experience.
Cheers, that's the best cloud analogy I've heard yet!
FREE... I'll say it again. FREE. 99.9% uptime on a service you don't pay for and yet some smug gits still want to be clever and bleat on about their own mail servers etc. Really guys, get a grip. Not every company has the time and resources (my own included) to maintain a mail server and even if we did, if it went down we'd then have to spend time and money fixing it.
Gmail goes down, OK, it might cost you some time and, as a result, some business money, but at least you won't have to replace that gubbed hard drive, router, cabling or whatever caused the problem, and spend hours figuring out what's broken in the first place.
They go down, they fix it, zero cost to us. We use gmail and frankly didn't even realise it had been down until I read it on the Reg, zero time spent fixing it, zero cost, zero need for being a smug git.
Meh but Duh.
Hasn't this 'failover' issue happened several times before at Google? IIRC the last time it took out the Google Search engine? Their upgrades do seem to throw spanners into the cogs on a regular basis, and the cause often seems to be other data centres refusing to take the extra load and shutting down.
But hey, it's not like I pay for the service (other than handing over the wholesale records of my life), it's not like I couldn't still receive the emails (my non-iPhone not on an O2 network uses IMAP) and since when did email become something that required immediate response? Guys, if it's that critical that you need to communicate with someone, there are a few inventions out there such as a "telephone", "pub" or even, if you need to send written communications, a "fax".
Compare this to O2's outage response
This is exactly the reason I continue to trust Google - they keep you informed throughout an outage, take downtime very seriously, and quickly release the information on why it happened. They are not matched by any other service in terms of sheer number of users and size of infrastructure. Managing this is never going to be easy, and all things considered, I think they do a pretty good job.
When data was unavailable through my O2 phone (and most others in the country) for pretty much all of Saturday, their Twitter account was silent, their service status webpage was non-existent, and when they finally admitted to the outage, it was said to be 'down for 2 hours'. No reason, no technical info, no reassurances it was going to be fixed to prevent it happening again. I had to rely on 3rd parties (Carphone Warehouse's Twitter account) for information. O2 could learn a lot from Google's honest approach to customer service.
Most people realise 100% uptime is unlikely at best, but it's how you manage downtime that matters. Staying silent is the worst thing any company can do.
<mumble> It's only free if your privacy, confidentiality, sanity and time have no value...</mumble>
/me walks away shaking head at the latest fad of cloud (nebulous and vapour-like) computing.
Gmail, web mail? Cloud? Is this something I should know about?
title goes here
Any of you ever tried to reroute traffic between data centers while taking down several load balancers / request routers? Oh, and I mean on the fly, with people attempting to access those very services you're working on.
If you can answer that with yes then your comments might have some validity.
As for the rest of these commenters.
You may not have noticed, but google "mail2web outage" and you'll see plenty of evidence.
No mail provider of any scale is immune to outages, and certainly not at the cost to the customers that Google provides (even for those enterprise Gmail users). As has been said a few times, it's how the organisation communicates and handles an outage that defines it, not some > 100% uptime figure.
So that means they're down nearly 9 hours a year (which of course is complete crap on a weighted average basis per user). I wonder how they count scheduled maintenance windows?
Don't they understand that getting the last additional 0.09% (hence 99.99% - or less than an hour of downtime a year - an acceptable number) is the key mark of stability? Problem is so much money (even for Google) needs to be spent at that scale to get there.
What? Nine hours a year is complete crap? I'd love to see how you justify that, even with your "weighted average basis" of course (whatever that means). Even assuming your weighted average basis means the CEO gets all nine hours of downtime at once, I suspect you'd be hard pressed to make a case that it caused anything other than inconvenience.
But the real question isn't one of absolutes. Say Google only gets 99.9%, or 99.7, or 99.99. How does that compare to what the alternatives are? No service is perfect, even ones administered internally. Perhaps *especially* ones administered internally, if you're a small company.
So, figure out if your favorite other service is better. If it is, figure out by how many hours better it is, divide that by its cost, and compare that to your potential lost revenue from down time. If it's cheaper to just suffer through a bit of down time, guess which way companies will go? And if it's cheaper to pay to not suffer the downtime, guess which way they'll go?
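One way to read that comparison is as a simple break-even check. The sketch below is only an illustration: the function name and every figure in it are hypothetical, and it assumes lost revenue scales linearly with hours of downtime.

```python
def worth_paying_for(extra_cost_per_year: float,
                     hours_saved_per_year: float,
                     revenue_lost_per_hour: float) -> bool:
    """True if the downtime avoided is worth more than the extra cost
    of the more reliable service (all inputs are your own estimates)."""
    avoided_loss = hours_saved_per_year * revenue_lost_per_hour
    return avoided_loss > extra_cost_per_year


if __name__ == "__main__":
    # Hypothetical: the alternative costs 5,000 more per year and saves
    # about 8 hours of downtime.
    print(worth_paying_for(5000, 8, 200))    # small shop losing 200/hour
    print(worth_paying_for(5000, 8, 1000))   # business losing 1,000/hour
```

With these made-up numbers the small shop is better off riding out the downtime, while the business losing 1,000 an hour is better off paying.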
it indeed sucks
By weighted average, I mean the countless times that users or blocks of users are down but it's not large enough (or in the public eye) to be counted as a technical downtime.
If inconvenience means that a business is crippled in supporting its customers or communicating with potential customers for an extended period of time then sure, it really doesn't matter.
It all really comes down to who you are. As an individual with the typical service level individuals demand, 99.9% (or even lower) is fine. As a business that relies on email as a critical mode of communication it is not an acceptable level.
As for alternatives: considering how inexpensive things are, with minimal effort and expense you can well exceed 99.9% managing an internal mail system. Obviously this is appropriate only for businesses of some size (say ~50 employees or more). Agreed that no system is perfect.
My point is that Google touts 99.9% as a great number and as someone who has had overall responsibility for managing internal operations of a small business (email included) I would have been fired quickly if that was my number.
ROFL Good one
Google do email now?
"As a business that relies on email as a critical mode of communication it is not an acceptable level."
Email is not suitable for critical communications. It is "best effort" only ...
That said, I can provide 100% uptime of YOUR end of the email system. For a price. Everyone else, of course, will be on their own ... which is why most email systems will make several tries to deliver outbound email, over [insert user configurable time frame here] ... which in turn kinda makes 100% uptime at your end a silly waste of money. (My own email system has been available to me, friends & family for about a quarter century. It's a research thingie, it ain't cheep, and I don't recommend trying this at home.).
Or, you can use google ... oh, wait ...