Words like Redundancy and FailOver come to mind.
Blackberry bosses held a short press conference at 10am EST (15.00 GMT) today to calm investors, answer media questions and shed (a little bit) more light on the faulty switch that caused the three-day service outages across the UK, Europe, Africa and Latin America. RIM’s co-CEOs Jim Balsillie and Mike Lazaridis, as well as …
Thursday 13th October 2011 17:22 GMT the spectacularly refined chap
What a pity then, that they didn't have someone with your tremendous insight.
What introductory networking courses don't show you is that in the real world it is often difficult to impossible to avoid pinch points that are not protected by redundancy, either at all or to an adequate extent that can cope with the load following the failure.
Thursday 13th October 2011 21:41 GMT Danny 14
"What introductory networking courses don't show you is that in the real world it is often difficult to impossible to avoid pinch points that are not protected by redundancy"
Yes, and in these cases you have hot or cold swapouts or spares. These can be swapped at a few minutes' notice. He is talking about core switches failing. It is quite hard to make core switches fault tolerant, however. Most ISPs simply link and duplicate core switches (at great cost); when one dies they swap the interlink (this is all in software, not hardware) and all is good again with about 10 mins downtime.
I can only assume that their issues cascaded or had the same inherent problem on all their switches.
Friday 14th October 2011 09:45 GMT Anonymous Coward
While the original poster oversimplifies things, it is not impossible to avoid pinch points as long as you have a) the willpower and b) the cash.
And before I get the lame "aaah ... but in the real world" comments.
I've worked on exactly this problem for ohh .. about half a dozen FTSE 100 companies where this sort of outage could have significantly affected share prices, possibly the economy and, in at least one case, lives.
That's why it's not covered in "introductory" network courses. Still with this level of ignorance proudly displayed and supported at least I can be confident of not being out of work anytime soon.
Monday 17th October 2011 22:20 GMT Anonymous Coward
Let's see, the last time the server at a travel agent's checked with the airline there were two seats left on a flight. The server is now unavailable. Can you sell those tickets? What if there is redundancy and two servers that may be checked: one says those two seats are available, the other says they have been sold. Can you sell those tickets?
At the start of the day you had £1m in your bank account. One server says that £800k went out this morning. The other doesn't. How much is in your account?
When it comes to managing real world resources with real world implications fail over can only take you so far. Alternatively, the resources in question could be credentials or authorisations. These are often impossible to decentralise when an authoritative answer is needed.
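To make the two-servers-disagree point concrete, here is a minimal sketch of one common answer: only accept a value that a strict majority of replicas agree on, and refuse otherwise. The function name, the replica counts and the seat numbers are all made up for illustration; real booking or banking systems are far more involved than this.

```python
from collections import Counter

def quorum_read(replies, total_replicas):
    """Return the value a strict majority of ALL replicas agree on, else None.

    replies: values from the replicas that answered (unreachable ones omitted).
    total_replicas: how many replicas exist, reachable or not.
    """
    needed = total_replicas // 2 + 1  # strict majority
    if not replies:
        return None
    value, votes = Counter(replies).most_common(1)[0]
    return value if votes >= needed else None

# Two replicas that disagree on seats left: no majority, so refuse to sell.
print(quorum_read([2, 0], total_replicas=2))     # None
# Three replicas, one stale: the majority says 0 seats left.
print(quorum_read([0, 0, 2], total_replicas=3))  # 0
# Three replicas but only one reachable: no authoritative answer.
print(quorum_read([2], total_replicas=3))        # None
```

Note that with only two servers a disagreement can never be resolved this way, which is roughly the situation described above: adding a second box gives you availability, not an authoritative answer.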
I'm not impressed by your CV, why should I be? The thing about FTSE 100 companies is they tend to employ quite a lot of junior staff. These kinds of very real issues are what become prevalent at the true architect grades. Yes, I've no doubt you'll still be in work but you'll be reporting to someone like me, who gets paid to manage the issues you claim do not exist.
Thursday 13th October 2011 17:23 GMT Martin Gregorie
When fault tolerant isn't
Some time ago I was developing on a Stratus box that shared a server room and mains connection with a Tandem Non-Stop box. Both are fault tolerant machines.
We came in one Monday to find that our Stratus was dead. It turned out that the Tandem PSU had shorted during the weekend which tripped out the mains, shutting down the 2nd half of the Tandem PSU and leaving the Stratus to run on its backup battery until that went flat after 3 hours or so.
Exactly the same thing happened again a month later, proving it was no fluke.
Moral: if the backup(s) aren't in different buildings which are connected to different substations and standby generators the system can't be considered fault tolerant - and still may not be due to other circumstances.
Thursday 13th October 2011 17:26 GMT Matt Bryant
RE: Zippy the Pinhead
If you read the article, it looks like they did have a distributed switch infrastructure (which means they did have redundancy and failover), but it didn't have the bandwidth required to handle the traffic load after losing a switch. The resultant backlog swamped the servers (application caches overflowed?). It sounds like they need to go back to basics and redesign the whole system with greater caching and bandwidth, and then do some proper failure testing.
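The "redundant but not enough bandwidth" case is easy to check on paper. A rough sketch, assuming traffic can be spread freely across the surviving switches (often not true in practice) and using invented capacity figures, not anything RIM has published:

```python
def survives_single_failure(capacities_gbps, peak_load_gbps):
    """True if the remaining switches can carry peak load after losing any one.

    The worst case is losing the largest switch, so only that case is checked.
    """
    worst_case_remaining = sum(capacities_gbps) - max(capacities_gbps)
    return worst_case_remaining >= peak_load_gbps

# Two 10G switches sharing a 14G peak: fine while both are up, not after a failure.
print(survives_single_failure([10, 10], peak_load_gbps=14))      # False
# A third switch gives enough headroom to lose one.
print(survives_single_failure([10, 10, 10], peak_load_gbps=14))  # True
```

The first case is the trap: each switch runs at a comfortable 70% in normal operation, so nothing looks wrong until one of them dies.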
BTW, no probs with a BES! :P
Thursday 13th October 2011 20:57 GMT Rob Crawford
I think it's just possible they may probably have heard those words.
Odds are that their most junior network staff know more about the subject than you ever will.
Like it or not shit happens, sometimes you are fortunate and it's the shit that you have planned for.
Just wait till you discover an obscure IOS or Junos bug, which would never have been noticed except for the failure of a major piece of your infrastructure.
From there it's all downhill.
Of course when your experience is a couple of hubs you can be a little snide shouting from the edges
Friday 14th October 2011 09:45 GMT Anonymous Coward
Friday 14th October 2011 13:06 GMT Rob Crawford
Firstly I was responding to the very simplistic comment by the poster.
That is of course assuming that the information released is the entire story (I doubt it)
The point is no matter what you plan for something totally unexpected will happen at some point and it's the speed that your staff react that matters
For example : Wait until your electricity supplier and your rotoblock system (firmware issue) decide to fuckup at the same time. Causing the generators to constantly cycle through starting up & shutting down, the result was much worse than a simple fail over to a DR site.
By the time our DC staff got the power under control we had a lovely range of kit needing replaced.
A Datacentre being wiped out tends to be a nice clean fault if you have designed things even half way sensibly
Friday 14th October 2011 09:46 GMT Anonymous Coward
Redundancy and FailOver
Routing all of your communications traffic through RIM is, itself, an example of a single point of failure.
Trusting any single communications service provider to maintain their service is a spof risk.
That said, it beggars belief that after days of lost service they weren't able to reinstate (or route around) a switch.
Which makes me wonder, as a high profile security target, whether there might be more to the 'root cause' analysis than they are letting on.
Friday 14th October 2011 13:06 GMT Anonymous Coward
You are looking at them in the wrong place
Most SaaS, IaaS, BlaaS and in the RIM case MobileEmailaaS are driven by software development. The network seriously lags and is quite usually an afterthought. It is nowhere near where it should be for the growth which they have experienced. The sole exception to that is Google.
Requiring both network and software/service competence is a tall order. It costs money too. So it takes a major incident for companies to realize that it is money well spent.
Amazon's rude awakening was this summer, RIM's was last week. Microsoft - well, it had more than one major clusterf*** and it has not awakened them yet by the look of it. The smaller fry - all is yet to come and believe me it will come.
Thursday 13th October 2011 16:22 GMT Anonymous Coward
Reminds me of our hosting co ...
had a 2 hour outage last year - Saturday pm-Sunday am.
Turned out that our solution (which we *paid* for to be bomb proof) had all 4 servers on the same PSU, in the same room. Legal action was considered, until they refunded 15% of the annual fee for the past 2 years, as we clearly *hadn't* been protected. I'd love to know what they told their other customers. And they were ISO27001
Thursday 13th October 2011 17:22 GMT frank ly
" all 4 servers on the same PSU, in the same room."
Yes, stuff like that happens all the time in lots of places. I've seen dual redundant, critical data feeds get run in the same trunking in one place I worked at. Hence my belief that an independent technical assessor should be invited to examine critical systems and their report be made available to shareholders and regulatory authorities. (I know, it's not going to happen.)
Friday 14th October 2011 12:10 GMT Hud Dunlap
Friday 14th October 2011 15:06 GMT Anonymous Coward
I used to work for a very large UK bank, with offices in London and Edinburgh. The north/south links, which were triple redundant and supplied by three separate telcos, all went down simultaneously. It turned out that despite all three telcos having signed up for not using the same fibre backbones, all having signed up for redundant routes and having shown us documentation to prove it, they had in fact routed all their traffic through the same tier one provider.
Even if you see "proof" that you've got everything you asked for, the only thing that you can be sure about in a system is that the system will fail.
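If you can get honest path data out of your suppliers (a big if, as the story above shows), checking for this kind of hidden shared dependency is simple. A small sketch; the telco and component names are invented for illustration:

```python
from collections import defaultdict

def shared_dependencies(paths):
    """Map each underlying component to the links that depend on it,
    keeping only components used by more than one link."""
    users = defaultdict(set)
    for link, components in paths.items():
        for component in components:
            users[component].add(link)
    return {c: sorted(links) for c, links in users.items() if len(links) > 1}

# Hypothetical path data for three "independent" north/south links.
paths = {
    "telco_a": ["a_metro_london", "tier1_x", "a_metro_edinburgh"],
    "telco_b": ["b_metro_london", "tier1_x", "b_metro_edinburgh"],
    "telco_c": ["c_metro_london", "tier1_x", "c_metro_edinburgh"],
}
print(shared_dependencies(paths))
# {'tier1_x': ['telco_a', 'telco_b', 'telco_c']}
```

Anything that shows up in the result is a single point of failure shared by links that were sold as independent. The check is only as good as the data fed into it, which is exactly the problem.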
Thursday 13th October 2011 17:24 GMT Anonymous Coward
All these standards are box ticking exercises. If you want proper quality control, test for it. The standard IME is normally set for the lowest(cheapest) possible quality level which everybody will agree upon. The only standards I ever saw which were really OK were those for aircraft, where failure is not a good idea! The bigger the organisation, the more chance there is of a major fail.
Thursday 13th October 2011 17:24 GMT Anonymous Coward
Thursday 13th October 2011 18:56 GMT Anonymous Coward
Thursday 13th October 2011 20:59 GMT Anonymous Coward
Thursday 13th October 2011 21:41 GMT Danny 14
It's not hard, you generally have cascading from sources. Simply switch the cascades bit by bit. That's how ISPs do it when they have core routing issues. When I worked for BT they would routinely cut out chunks of backhaul and bring them back gradually to avoid massive rerouting issues (ironically from the redundancy).
Thursday 13th October 2011 17:26 GMT Anonymous Coward
Thursday 13th October 2011 18:56 GMT Dick Emery
Thursday 13th October 2011 20:38 GMT Anonymous Coward
Someone has cut corners
A couple of possible causes:
1. failover has never been tested properly (like faithfully making backups, but never testing the restore)
2. design fault(s), like single power feeds, or wrongly connected power feeds, failover links connected wrongly etc.
3. no active network monitoring at that level, maybe the switches were of the non-managed type to save cost (I have seen weirder bean-counter decisions)
4. Dare I say: spanning tree issues, and due to the load the switches never recovered (I know this should not happen, but I have seen switches going into all ports forwarding mode instead of blocking mode).
I don't like spanning tree, for me it is layer 3 to the edge, this is done in HW so no lower throughput than layer 2.
Thursday 13th October 2011 20:47 GMT PowerSurge
"The systems are designed not to fail this way..."
I remember once in a TV discussion of an aircraft crash where an engine had fallen to bits the interviewer asked why there weren't containment rings to catch the bits. The expert replied that they'd be massively heavy and anyway they weren't required because the blades are designed not to fall off. The interviewer was incredulous as this plane might have been saved had such things been used. No, said the expert, you don't understand. The blades are designed not to fall off in the same way that the wings are designed not to fall off.
Thursday 13th October 2011 20:50 GMT Johan Bastiaansen
lack of imagination
I'm always surprised at how little imagination is used in setting up doom scenarios. Remember the nuclear incident in Japan, where hot rods were supposed to be cooled. Apparently nobody could imagine that the cooling would fail if the whole thing would be cut from power.
Now nuclear reactors in Europe will be put through a stress test. How big is the chance that this scenario, cut the power completely for several days and see what happens with these rods, will be put in the stress test? No really, who would put any money on this scenario being in the test?
Remember the stress test that every European bank was supposed to take. Dexia came out 11th of 90. Now they’re down because “gee, who could have thought this would happen”.
Remember that ship, adrift at sea with the engines shut down. And with the engines down, they didn’t have any electricity. So they couldn’t start the engines.
Being an engineer, I have to ask, what backward university did these engineers graduate from?
Or is it these modern times with their “naah, it will be ok” attitude?
No shit Sherlock, for obvious reasons.
Thursday 13th October 2011 23:32 GMT Jean-Luc
@Lack of imagination
>How big is the chance that this scenario, cut the power completely for several days and see what happens with these rods, will be put in the stress test? No really, who would put any money on this scenario being in the test?
IIRC somebody once thought something very close to your test scenario was worth checking for.
You might have heard of the place. Shernovyl something or other ;-)
Friday 14th October 2011 09:48 GMT Annihilator
You're right, there's only so far that you take doomsday scenarios and it's generally based on probability. Would you have them crash 50 planes into a power plant for example? 100? Draw the line for me somewhere, and while you're at it, figure out a way they can meaningfully run these tests.
Aeroplanes today are generally built with triple-redundancy, why not four?
Thursday 13th October 2011 20:52 GMT tin 2
Absolutely nothing wrong with....
....HP procurve switches!
They are impressively thought out pieces of kit, speaking as a reasonably long-time Cisco-only engineer recently been forced to make use of them. Of course, they're designed down to a cost to do a specific job but if they're being used for the wrong thing that's not HP's fault.
Regarding BB - I am incredulous that "a switch" could cause this. A long time ago I took over a network where you could genuinely turn bits off as you pleased and users would not notice. DR testing could be done whenever and at will. Granted, to see something so well designed, elegant and testable is rare, but RIM, especially the bigger they got, could afford to do such a thing. Particularly as their reputation would take a beating should it be down for any considerable time.
I've said before on the reg that infrastructure redundancy protocols seem to throw a paddy more often than the actual hardware fails, so code and protocol should be designed to failover at its level at least. For everything to need to go through one "thing" even if that thing has several hot-standby devices ready and waiting is asking for trouble. Often as an implementer that is hard to design around but when you control the client and server code it should be a piece of cake.
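As a rough illustration of failover at the code level rather than the infrastructure level, here is a sketch of a client that walks a list of endpoints itself instead of trusting one "thing" in the middle. The endpoint names and the `send` callable are hypothetical; this says nothing about how RIM's actual clients work.

```python
def call_with_failover(endpoints, send):
    """Try each endpoint in order and return the first successful result.

    send(endpoint) is a caller-supplied function that raises ConnectionError
    when that endpoint cannot be reached.
    """
    errors = []
    for endpoint in endpoints:
        try:
            return send(endpoint)
        except ConnectionError as exc:
            errors.append((endpoint, str(exc)))  # remember why, then move on
    raise ConnectionError(f"all endpoints failed: {errors}")

# Stand-in transport for the example: the first relay is down.
def fake_send(endpoint):
    if endpoint == "relay-1.example":
        raise ConnectionError("relay-1 down")
    return f"delivered via {endpoint}"

print(call_with_failover(["relay-1.example", "relay-2.example"], fake_send))
# delivered via relay-2.example
```

The catch, as others in this thread point out, is that the surviving endpoints still need the capacity to take the extra load, and every client retrying at once can make a bad day worse.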
Thursday 13th October 2011 20:52 GMT LarsG
Thursday 13th October 2011 21:48 GMT Annihilator
Give a complete refund for the percentage of however much they paid for Blackberry services. I can't see a difference in price plans over at O2 - as far as I can tell the Blackberry services are free.
Those who think you're paying a Blackberry charge, look close enough and it's the same price as the equivalent data tariff.
Samsung Galaxy or a Blackberry on O2: 50mins + 250 texts + 500MB data = £16.50
Friday 14th October 2011 21:02 GMT Annihilator
Selling their corporate solutions. BES hardware and licensing turns a pretty penny.
To prove a point, I've just slung my O2 SIM card in an old BB I have lying around and created a new O2 email via O2. I don't have an official blackberry contract. http://bit.ly/qQ8dQb
You may as well ask how gmail or hotmail makes any money. It's not their primary source of income, but sell the BB idea to a person, and that person may well be in a position to request BB in their place of work...
Thursday 13th October 2011 20:54 GMT Battsman
Observation 1: This sounds like a situation where the caching/bandwidth was inadequate post failure.
Observation 2: RIM's Blackberry products suck. errrr -strike that- RIM's BB products appear less broadly functional and exciting when compared to Apple and Android product competition.
Observation 3: Prior to the most recent outage, RIM was losing an astounding number of customers per month.
Result: If RIM just waits a little bit longer, there won't be any need to add capacity to accommodate 1 because the resource requirements will have significantly reduced due to 2 & 3.
(Just trying to help...)
Thursday 13th October 2011 20:56 GMT Anonymous Coward
By coincidence only yesterday I got to see an internal root cause analysis on a carrier grade routing device manufactured by my employer.
A backplane scheduler chip had suffered an internal hardware failure and autonomously started taking (and delivering) traffic from/to the standby fabric - all without bothering to provide any external indication that it had done so.
While rare, this sort of thing usually takes quite a while to track down, especially in a large network with multiple redundant paths, as all of the normal diagnostic tools report normal functionality.
Anonymous for obvious reasons
Friday 14th October 2011 09:47 GMT Displacement Activity
> A backplane scheduler chip had suffered an internal hardware failure and
> autonomously started taking (and delivering) traffic from/to the standby
> fabric - all without bothering to provide any external indication that it had done so.
I call BS. For an IC of any complexity to fail in such a way that it carries out a logical series of rerouting ops has got to be next to impossible. There might be hundreds of potential reasons why your switch failed, and this has to be the least likely. Sounds like the guy who wrote your report is invoking black magic to cover his ass.
Thursday 13th October 2011 22:51 GMT flingback
My God, it's amateur hour amongst El Reg's readers...
I am an avid BlackBerry user, and I have suffered considerably as a result of the last 72 hours' worth of switch aggravation. However, I am also an engineer who designs routing protocols and hardware level circuits. I am astounded by the stupidity of many of the people who have been posting on these threads. Am I the only true technologist amongst you who really knows the USP of the BlackBerry network?
Let's get something straight; IPV4/IPV6 switching is pretty vanilla - there are many vendors who will provide equipment that will literally drop in when there is a failure, and provide failover at a millisecond's notice. However, this is not about IPV4/IPV6, this is about BlackBerry's proprietary PIN routing which is a layer 3+ tunnel that provides the peer-to-peer capability that BlackBerry has (and no one else does).
I agree that this should never have happened, and I am furious with the management at RIM for everything they are doing to ruin what was an excellent company with good business tools (including the "let's bury our heads in the sand" attitude for the first 48 hours). However, networks DO fail and if you think that this will never happen to Apple/Android/MS in the next five years then dream on. It has already happened many times - the difference is that you've not detected it yet because if your "push" email doesn't work instantly you forgive it because it's an Apple. If you looked carefully you'd see that you'd lost your reverse tunnel for five minutes. You can only plan for so much and occasionally something really does go so wrong that everything comes tumbling down like a stack of cards. What I am more amazed about is that BlackBerry managed to de-queue millions of messages and emails in less than 6 hours when they finally caught and fixed the exact problem.
Cutting edge delivery technology comes at a cost. Personally, whilst I love my Galaxy S2 for controlling my home AV and looking at Google Skymaps, and my iPhone for flying my Parrot AR Drone, I absolutely refuse to run my business on anything else but a BlackBerry - the rest are sheer toys by comparison and those of you that carry two phones and are blatantly honest know just what I am talking about.
Cut BlackBerry some slack. Fire the management and put *real* enthusiasts in charge, make the development environment more open, halve the price of the Playbook and for God's sake get Android compatibility working quickly. RIM are not dead, yet... but things have to change.
Paris - because she's Queen of the Press Release and RIM need to learn from her!
Friday 14th October 2011 09:48 GMT Anonymous Coward
Friday 14th October 2011 09:48 GMT Displacement Activity
Yes, it does look like amateur hour, and you clearly don't work on switches. BlackBerry has apparently said that a "high-capacity core switch" failed. Nothing to do with "IPV4/IPV6". And none of the kit and protocols I work on (LACP and MSP) "will provide failover at a millisecond's notice". A few seconds, maybe, if you're lucky, which BB clearly wasn't.
Thursday 13th October 2011 22:52 GMT Anteaus
Danger of relying on proprietary services.
If a standard email service fails then it's a relatively simple matter to set-up a temporary one elsewhere, and change your domain settings to suit. With a proprietary service using its own protocols you have no such option.
I reckon the lesson here is that it is most unwise to place reliance on closed, proprietary systems for mission-critical business IT. Adherence to open standards is the way to provide resilience.
Thursday 13th October 2011 22:52 GMT Anonymous Coward
Thursday 13th October 2011 22:55 GMT Chris 211
Redundant networks are always a compromise between money, physical space, power, ISP connection and time given to test fully. Keep drilling down and there is always, always a single point of failure. What the likes of Cisco are good at is selling redundant networks based on very reliable kit, while unreliable (read untested/unproven) kit is left to fend for itself: some tinpot PC which is the AD/radius/dns/share/etc.
Some companies expect kit to just go in, and question your confidence if you say a failover test should be conducted. Of course then the sales/account managers kick in and say of course we know what we are doing, we don't need to test the failover.
Network DOOMED - Due to over confident pricks in suits.
Friday 14th October 2011 07:42 GMT Richard Jones 1
All this talk about testing is interesting. I well remember we had an established test routine for new 'devices' and had followed it for several years. We were then forced to use a new supplier --- they took one look at the test schedules and went screaming from the room, straight to 'higher management', complaining that we wanted to 'break their kit'. A long investigation followed, including test results from a dozen or so other installations involving a competitor's kit, showing the tests performed and no ill effects. The inclusion of a comment that this supplier was not up to much if they could not withstand a known possible failure path without the risk of serious hardware damage caused ructions, and the tests were watered down. The supplier is now no longer in business.... The operating company is now also a shadow of its former self.
Friday 14th October 2011 07:43 GMT amanfromMars 1
Jumping in at the deep end ...... with no visible means of support.
" "..and then, after a series of events that haven’t quite been explained, the service went down in the US as well.
Security has been a prime selling point for the devices, but that’s only part of the equation."
Pay back time from a disgruntled administration, or vulnerable cabal, blind sided by the encryption that allows Spring movements in dark environments? I wonder who would be leading that assault on RIM systems/servers/tools?"
A Wired comment on the same subject, which can be easily, plausibly denied but whenever "The systems are designed not to fail this way..." must one conclude that they have been attacked by systems which have failed?
Be careful out there, IT's a jungle and real murky in dark corners.
Friday 14th October 2011 13:05 GMT Anonymous Coward
Do the BB T&Cs commit to a level of service availability? I don't know as I've never bothered reading them! But if not, then why do people believe they're entitled to a continual level of service?
Now with my more reasonable head on......... 3 days probably breaks the general trading laws in a lot of countries as to what consumers can expect in return for a service that they procure!!
Last point........ high availability isn't a problem. It's determining the acceptable level of cost vs. availability that is.
Aside from mistakes at initial design stage (e.g. everything in 1 building, dual connections running the same cable duct, applications not designed and tested for stateful failover, etc) it's the change control and continual assessment and adjustment to account for changes in capacity requirements, upgrades and monitoring (including test traffic) that's usually less well conducted on a thorough and consistent basis.
Fanboi avatar - because it looks like an ideal weekend :)
Friday 14th October 2011 19:30 GMT Tom 13
He didn't necessarily dodge the question,
he may have answered it honestly. If they are still working on a root cause analysis, they are still working on it.
I've fixed many a problem without doing a proper root cause analysis because I knew a fix and it was cheaper to fix than to analyze. They may have gotten the systems back up the same way, and now they are working on the root cause analysis because they can't afford to have it happen again.
I'm not even a network engineer let alone a systems engineer capable of analyzing their issues. But I've worked often enough with the network engineer who bitched about the hardware failing because despite assurances that the connecting protocols were hardware agnostic, for some reason they weren't, but only in about 0.1% of the cases so the manufacturers never hunted down the solutions. It is also possible they did actually find a Honest-to-God new bug. Someone does, it's just that when you are in the field it isn't likely so you go looking for known bugs first.
Where's the soapbox icon?