RIM: 'Faulty switch took out faulty-switch-proof network' • The Register Forums

Thursday 13th October 2011 16:04 GMT Zippy the Pinhead

Words like Redundancy and FailOver come to mind.

2 13
1. Thursday 13th October 2011 17:22 GMT the spectacularly refined chap
  
  What a pity then, that they didn't have someone with your tremendous insight.
  
  What introductory networking courses don't show you is that in the real world it is often difficult to impossible to avoid pinch points that are not protected by redundancy, either at all or to an adequate extent that can cope with the load following the failure.
  
  19 0
  1. Thursday 13th October 2011 21:41 GMT Danny 14
    
    wut?
    
    "What introductory networking courses don't show you is that in the real world it is often difficult to impossible to avoid pinch points that are not protected by redundancy"
    
    Yes, and in these cases you have hot or cold swap-outs or spares. These can be swapped at a few minutes' notice. He is talking about core switches failing. It is quite hard to make core switches fault tolerant, however. Most ISPs simply link and duplicate core switches (at great cost); when one dies they swap the interlink (this is all in software, not hardware) and all is good again with about 10 mins downtime.
    
    I can only assume that their issues cascaded or had the same inherent problem on all their switches.
    
    1 0
  2. Friday 14th October 2011 09:45 GMT Anonymous Coward
    
    Nonsense
    
    While the original poster oversimplifies things, it is not impossible to avoid pinch points as long as you have a) the willpower and b) the cash.
    
    And before I get the lame "aaah ... but in the real world" comments.
    
    I've worked on exactly this problem for ohh .. about half a dozen FTSE 100 companies where this sort of outage could have significantly affected share prices, possibly the economy and in at least one case lives.
    
    That's why it's not covered in "introductory" network courses. Still with this level of ignorance proudly displayed and supported at least I can be confident of not being out of work anytime soon.
    
    1 3
    1. Monday 17th October 2011 22:20 GMT Anonymous Coward
      
      Let's see, the last time the server at a travel agents checked with the airline there were two seats left on a flight. The server is now unavailable. Can you sell those tickets? What if there is redundancy and two servers that may be checked: one says those two seats are available, the other says they have been sold. Can you sell those tickets?
      
      At the start of the day you had £1m in your bank account. One server says that £800k went out this morning. The other doesn't. How much is in your account?
      
      When it comes to managing real world resources with real world implications fail over can only take you so far. Alternatively, the resources in question could be credentials or authorisations. These are often impossible to decentralise when an authoritative answer is needed.
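The two-server seat example can be put in code. This is only a rough sketch of a majority ("quorum") read, assuming each replica either reports a value or does not answer at all; the function name and the idea of passing answers in as a list are made up for illustration, not taken from any real booking system.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def quorum_read(answers: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Return a value only if a strict majority of all replicas report it.

    `None` stands for a replica that did not answer. If no value reaches
    a majority, return None: the caller must refuse rather than guess.
    """
    needed = len(answers) // 2 + 1
    counts = Counter(a for a in answers if a is not None)
    for value, count in counts.most_common(1):
        # Only the most common value can possibly hold a majority.
        if count >= needed:
            return value
    return None
```

With only two servers a strict majority means both must agree, so one dead or dissenting server leaves you with no authoritative answer, which is exactly the problem described above. Note that a legitimate answer of 0 seats is returned as 0, so callers should compare against `None` explicitly.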
      
      I'm not impressed by your CV, why should I be? The thing about FTSE 100 companies is they tend to employ quite a lot of junior staff. These kind of very real issues are what become prevalent at the true architect grades. Yes, I've no doubt you'll still be in work but you'll be reporting to someone like me, who gets paid to manage the issues you claim do not exist.
      
      0 0
2. Thursday 13th October 2011 17:23 GMT Martin Gregorie
  
  When fault tolerant isn't
  
  Some time ago I was developing on a Stratus box that shared a server room and mains connection with a Tandem Non-Stop box. Both are fault tolerant machines.
  
  We came in one Monday to find that our Stratus was dead. It turned out that the Tandem PSU had shorted during the weekend which tripped out the mains, shutting down the 2nd half of the Tandem PSU and leaving the Stratus to run on its backup battery until that went flat after 3 hours or so.
  
  Exactly the same thing happened again a month later, proving it was no fluke.
  
  Moral: if the backup(s) aren't in different buildings which are connected to different substations and standby generators the system can't be considered fault tolerant - and still may not be due to other circumstances.
  
  5 0
3. Thursday 13th October 2011 17:26 GMT Matt Bryant
  
  RE: Zippy the Pinhead
  
  If you read the article, it looks like they did have a distributed switch infrastructure (which means they did have redundancy and failover), but it didn't have the bandwidth required to handle the traffic load after losing a switch. The resultant backlog swamped the servers (application caches overflowed?). It sounds like they need to go back to basics and redesign the whole system with greater caching and bandwidth, and then do some proper failure testing.
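The "enough bandwidth after losing a switch" point is a simple sum. A minimal sketch, assuming you know a throughput figure per switch and the total offered load in the same units; the function name and all figures are invented for illustration and say nothing about RIM's actual kit.

```python
from typing import Sequence


def survives_single_failure(capacities: Sequence[float], load: float) -> bool:
    """Return True if the load still fits after losing the largest switch.

    `capacities` are per-switch throughput figures and `load` is the
    total offered traffic, in the same (arbitrary) units.
    """
    if len(capacities) < 2:
        return False  # a single switch has no redundancy at all
    # Worst case for a single failure: the biggest switch is the one that dies.
    remaining = sum(capacities) - max(capacities)
    return load <= remaining
```

This only covers steady-state load. It says nothing about the backlog that builds up during the outage, which is the part that seems to have swamped the servers here.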
  
  BTW, no probs with a BES! :P
  
  2 1
4. Thursday 13th October 2011 20:57 GMT Rob Crawford
  
  I think it's just possible they may probably have heard those words.
  
  Odds are that their most junior network staff know more about the subject than you ever will.
  
  Like it or not shit happens, sometimes you are fortunate and it's the shit that you have planned for.
  
  Just wait till you discover an obscure IOS or Junos bug, which would never have been noticed except for the failure of a major piece of your infrastructure.
  
  From there it's all downhill.
  
  Of course when your experience is a couple of hubs you can be a little snide shouting from the edges
  
  6 2
  1. Friday 14th October 2011 09:45 GMT Anonymous Coward
    
    "Like it or not shit happens, sometimes you are fortunate and it's the shit that you have planned for."
    
    Which is why, like any other testing you get someone else to come and review your work.
    
    Start with the premise, "what if a plane crashes into our data centre" and work backwards.
    
    0 1
    1. Friday 14th October 2011 13:06 GMT Rob Crawford
      
      Firstly I was responding to the very simplistic comment by the poster.
      
      That is of course assuming that the information released is the entire story (I doubt it)
      
      The point is no matter what you plan for something totally unexpected will happen at some point and it's the speed that your staff react that matters
      
      For example : Wait until your electricity supplier and your rotoblock system (firmware issue) decide to fuckup at the same time. Causing the generators to constantly cycle through starting up & shutting down, the result was much worse than a simple fail over to a DR site.
      
      By the time our DC staff got the power under control we had a lovely range of kit needing replaced.
      
      A Datacentre being wiped out tends to be a nice clean fault if you have designed things even half way sensibly
      
      0 0
5. Friday 14th October 2011 09:46 GMT Anonymous Coward
  
  Redundancy and FailOver
  
  Routing all of your communications traffic through RIM is, itself, an example of a single point of failure.
  
  Trusting any single communications service provider to maintain their service is a spof risk.
  
  That said, it beggars belief that after days of lost service they weren't able to reinstate (or route around) a switch.
  
  Which makes me wonder, as a high profile security target, whether there might be more to the 'root cause' analysis than they are letting on.
  
  2 0
6. Friday 14th October 2011 13:06 GMT Anonymous Coward
  
  You are looking at them in the wrong place
  
  Most SaaS, IaaS, BlaaS and in the RIM case MobileEmailaaS are driven by software development. The network seriously lags and is quite usually an afterthought. It is nowhere near where it should be for the growth which they have experienced. The sole exception to that is Google.
  
  Requiring both network and software/service competence is a tall order. It costs money too. So it takes a major incident for companies to realize that it is money well spent.
  
  Amazon's rude awakening was this summer, RIM's was last week. Microsoft - well, it had more than one major clusterf*** and it has not awakened them yet by the look of it. The smaller fry - all is yet to come and believe me it will come.
  
  0 2
7. Friday 14th October 2011 14:10 GMT Chris007
  
  @Zippy the Pinhead
  
  certainly lived up to your name...
  
  2 0
Thursday 13th October 2011 16:22 GMT Anonymous Coward

Reminds me of our hosting co ...

had a 2 hour outage last year - Saturday pm-Sunday am.

Turned out that our solution, (which we *paid* for to be bomb proof) had all 4 servers on the same PSU, in the same room. Legal action was considered, until they refunded 15% of the annual fee for the past 2 years, as we clearly *hadn't* been protected. I'd love to know what they told their other customer. And they were ISO27001

3 2
1. Thursday 13th October 2011 17:22 GMT frank ly
  
  " all 4 servers on the same PSU, in the same room."
  
  Yes, stuff like that happens all the time in lots of places. I've seen dual redundant, critical data feeds get run in the same trunking in one place I worked at. Hence my belief that an independent technical assessor should be invited to examine critical systems and their report be made available to shareholders and regulatory authorities. (I know, it's not going to happen.)
  
  2 0
2. Friday 14th October 2011 12:10 GMT Hud Dunlap
  
  Old Military saying
  
  You don't get what you expect, you get what you inspect.
  
  So no one from your company actually checked to see if you got what you were paying for?
  
  0 0
  1. Friday 14th October 2011 15:06 GMT Anonymous Coward
    
    @Hud
    
    I used to work for a very large UK bank, with offices in London and Edinburgh. The north/south links, which were triple redundant and supplied by three separate telcos, all went down simultaneously. It turned out that despite all three telcos having signed up for not using the same fibre backbones, all having signed up for redundant routes and having shown us documentation to prove it, they had in fact routed all their traffic through the same tier one provider.
    
    Even if you see "proof" that you've got everything you asked for, the only thing that you can be sure about in a system is that the system will fail.
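If you can get an honest list of what each "independent" route actually depends on, finding the common element is trivial. A small sketch, with entirely hypothetical supplier and component names (the hard part in real life is getting accurate data, as the story above shows):

```python
from typing import Dict, Set


def shared_dependencies(routes: Dict[str, Set[str]]) -> Set[str]:
    """Return the components that every supposedly independent route uses."""
    if not routes:
        return set()
    # Anything present in all routes is a single point of failure.
    return set.intersection(*routes.values())


# Hypothetical example data: three telcos, one common upstream.
EXAMPLE_ROUTES = {
    "telco_a": {"telco_a_pop", "tier1_backbone"},
    "telco_b": {"telco_b_pop", "tier1_backbone"},
    "telco_c": {"telco_c_pop", "tier1_backbone"},
}
```

Calling `shared_dependencies(EXAMPLE_ROUTES)` picks out the one component that all three routes have in common. An empty result only means no single component is shared by all routes; two out of three could still share one.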
    
    1 0
  2. Saturday 15th October 2011 17:52 GMT Anonymous Coward
    
    Why have a dog and bark yourself ?
    
    And besides, they were 500 miles away ...
    
    0 0
Thursday 13th October 2011 17:24 GMT Anonymous Coward

ISO2700x

All these standards are box ticking exercises. If you want proper quality control, test for it. The standard IME is normally set for the lowest(cheapest) possible quality level which everybody will agree upon. The only standards I ever saw which were really OK were those for aircraft, where failure is not a good idea! The bigger the organisation, the more chance there is of a major fail.

1 0
Thursday 13th October 2011 17:24 GMT Anonymous Coward

Sounds like a massive failure in

Elections after a problem occurred on the network, each time some part took the leader position it was knocked down and everyone tried to reelect and then tried to demote themselves.

Just guessing mind.

0 0
1. Thursday 13th October 2011 18:54 GMT Bruce Grunewald
  
  Fourth installment of Transformers?
  
  RIM's explanations so far have as many plot holes as a Transformers movie.
  
  We need a Deep Berry to give us the real skinny.
  
  0 0
Thursday 13th October 2011 17:26 GMT CJ 1

My bets are on....

The switches being HP ProCurves.

0 3
1. Thursday 13th October 2011 18:56 GMT Anonymous Coward
  
  cisco?
  
  Only cisco kit scales up to this level that I know of, and yes it can fail.
  
  Then once down, the backlog overloaded all the storage. In fact, if, say, only half the net went down, that other half could just overload, leading to cascade failures. Seen that.
  
  0 1
  1. Thursday 13th October 2011 20:59 GMT Anonymous Coward
    
    Cisco... or?
    
    There are other vendors making these type of core switches, not necessarily household names like Cisco but I wouldn't be surprised to see Juniper or others in the frame.
    
    1 0
    1. Thursday 13th October 2011 21:41 GMT Danny 14
      
      so?
      
      It's not hard, you generally have cascading from sources. Simply switch the cascades bit by bit. That's how ISPs do it when they have core routing issues. When I worked for BT they would routinely cut out chunks of backhaul and bring them back gradually to avoid massive rerouting issues (ironically from the redundancy).
      
      0 0
Thursday 13th October 2011 17:26 GMT Anonymous Coward

Brilliant quote: "The systems are designed not to fail this way..."

3 0
1. Thursday 13th October 2011 18:53 GMT Bruce Grunewald
  
  Yes, I noticed that too
  
  So obviously they were designed to fail a different way and this failure caught them off guard.
  
  2 0
  1. Friday 14th October 2011 13:00 GMT JeepBoy
    
    I expect the wrong kind of leaves were on the line.
    
    0 0
Thursday 13th October 2011 18:56 GMT Dick Emery

Hey at least they put their hands up to it. Had it been Microsoft it would be a 'feature' or if Apple 'the customers are to blame' or if it was Sony 'let's stick our heads in a hole and pretend it's not happening for a month'.

3 0
1. Thursday 13th October 2011 20:56 GMT Anonymous Coward
  
  You missed...
  
  ...'You're holding it wrong' for Apple.
  
  3 0
Thursday 13th October 2011 20:38 GMT Thomas 4

I'm looking forward to RIM's new phone

The Blackberry Chumbawamba: "It gets knocked down, but it gets up again...."

3 0
Thursday 13th October 2011 20:38 GMT Anonymous Coward

Someone has cut corners

A couple of possible causes:

1. failover has never been tested properly (like faithfully making backups, but never testing the restore)

2. design fault(s), like single power feeds, or wrongly connected power feeds, failover links connected wrongly etc.

3. no active network monitoring at that level, maybe the switches were of the non-managed type to save cost (I have seen weirder bean counter decisions)

4. Dare I say: spanning tree issues, and due to the load the switches never recovered (I know this should not happen, but I have seen switches going into all ports forwarding mode instead of blocking mode).

I don't like spanning tree, for me it is layer 3 to the edge, this is done in HW so no lower throughput than layer 2.
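On point 1, the backup analogy is easy to make concrete. A minimal sketch of "actually test the restore", assuming the backup is nothing more than a file copy; `backup` and `restore` are hypothetical stand-ins for whatever tool you really use, and the same idea (exercise the recovery path and compare the result) applies to failover drills.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def backup(source: Path, backup_dir: Path) -> Path:
    """Hypothetical backup step: copy the file into the backup directory."""
    target = backup_dir / source.name
    shutil.copy2(source, target)
    return target


def restore(backup_file: Path, restore_dir: Path) -> Path:
    """Hypothetical restore step: copy the backup to a scratch location."""
    target = restore_dir / backup_file.name
    shutil.copy2(backup_file, target)
    return target


def restore_is_good(source: Path) -> bool:
    """Back up, restore to a scratch directory, and compare checksums."""
    with tempfile.TemporaryDirectory() as b, tempfile.TemporaryDirectory() as r:
        restored = restore(backup(source, Path(b)), Path(r))
        return sha256(restored) == sha256(source)
```

The point is that the check runs the whole round trip, not just the half that everybody remembers to do.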

0 1
Thursday 13th October 2011 20:47 GMT PowerSurge

"The systems are designed not to fail this way..."

I remember once in a TV discussion of an aircraft crash where an engine had fallen to bits the interviewer asked why there weren't containment rings to catch the bits. The expert replied that they'd be massively heavy and anyway they weren't required because the blades are designed not to fall off. The interviewer was incredulous as this plane might have been saved had such things been used. No, said the expert, you don't understand. The blades are designed not to fall off in the same way that the wings are designed not to fall off.

That's engineering.

3 0
Thursday 13th October 2011 20:50 GMT Johan Bastiaansen

lack of imagination

I'm always surprised at how little imagination is used in setting up doom scenarios. Remember the nuclear incident in Japan, where hot rods were supposed to be cooled. Apparently nobody could imagine that the cooling would fail if the whole thing would be cut from power.

Now nuclear reactors in Europe will be put through a stress test. How big is the chance that this scenario, cut the power completely for several days and see what happens with these rods, will be put in the stress test? No really, who would put any money on this scenario being in the test?

Remember the stress test that every European bank was supposed to take. Dexia came out 11th of 90. Now they’re down because “gee, who could have thought this would happen”.

Remember that ship, adrift at sea with the engines shut down. And with the engines down, they didn’t have any electricity. So they couldn’t start the engines.

Being an engineer, I have to ask, what backward university did these engineers graduate from?

Or is it these modern times with their “naah, it will be ok” attitude?

No shit Sherlock, for obvious reasons.

5 4
1. Thursday 13th October 2011 23:32 GMT Jean-Luc
  
  @Lack of imagination
  
  >How big is the chance that this scenario, cut the power completely for several days and see what happens with these rods, will be put in the stress test? No really, who would put any money on this scenario being in the test?
  
  IIRC somebody once thought something very close to your test scenario was worth checking for.
  
  You might have heard of the place. Shernovyl something or other ;-)
  
  0 0
2. Friday 14th October 2011 09:48 GMT Annihilator
  
  No...
  
  You're right, there's only so far that you take doomsday scenarios and it's generally based on probability. Would you have them crash 50 planes into a power plant for example? 100? Draw the line for me somewhere, and while you're at it, figure out a way they can meaningfully run these tests.
  
  Aeroplanes today are generally built with triple-redundancy, why not four?
  
  1 0
Thursday 13th October 2011 20:52 GMT tin 2

Absolutely nothing wrong with....

....HP ProCurve switches!

They are impressively thought out pieces of kit, speaking as a reasonably long-time Cisco-only engineer recently been forced to make use of them. Of course, they're designed down to a cost to do a specific job but if they're being used for the wrong thing that's not HP's fault.

Regarding BB - I am incredulous that "a switch" could cause this. Long time ago I took over a network that you could genuinely turn bits off as you pleased and users would not notice. DR testing could be done whenever and at will. Granted to see something so well designed, elegant and testable is rare, but RIM, especially the bigger they got, could afford to do such a thing. Particularly as their reputation would take a beating should it be down for any considerable time.

I've said before on the reg that infrastructure redundancy protocols seem to throw a paddy more often than the actual hardware fails, so code and protocol should be designed to failover at its level at least. For everything to need to go through one "thing" even if that thing has several hot-standby devices ready and waiting is asking for trouble. Often as an implementer that is hard to design around but when you control the client and server code it should be a piece of cake.
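Failing over in the client code rather than relying on one "thing" in the network could look something like the sketch below. It is only an illustration: the endpoint names and the `send` callable are made up, and nothing here reflects how RIM's clients actually work.

```python
from typing import Callable, Iterable


class AllEndpointsFailed(Exception):
    """Raised when no endpoint in the list accepted the request."""


def send_with_failover(
    endpoints: Iterable[str],
    send: Callable[[str], str],
) -> str:
    """Try each endpoint in order and return the first successful reply.

    `send` is a hypothetical transport function that takes an endpoint
    name and either returns a reply or raises ConnectionError.
    """
    errors = []
    for endpoint in endpoints:
        try:
            return send(endpoint)
        except ConnectionError as exc:
            # Remember the failure and move on to the next endpoint.
            errors.append(f"{endpoint}: {exc}")
    raise AllEndpointsFailed("; ".join(errors))


def fake_send(endpoint: str) -> str:
    """Stand-in transport: pretend the first relay is down."""
    if endpoint == "relay-a.example":
        raise ConnectionError("no route")
    return f"delivered via {endpoint}"
```

The client carries its own list of places to try, so losing any one of them is a retry rather than an outage. A real client would also want timeouts and some back-off so that a mass failover does not itself flatten the surviving endpoints.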

2 0
Thursday 13th October 2011 20:52 GMT LarsG

COMPENSATION CULTURE

Sometimes things happen, it is unfortunate but that is how it is.

I suggest nothing is offered or given to these greedy pariahs to teach them a lesson that Blackberrys are not infallible and they may learn a life lesson.

0 0
1. Thursday 13th October 2011 21:48 GMT Annihilator
  
  Easy
  
  Give a complete refund for the percentage of however much they paid for Blackberry services. I can't see a difference in price plans over at O2 - as far as I can tell the Blackberry services are free.
  
  Those who think you're paying a Blackberry charge, look close enough and it's the same price as the equivalent data tariff.
  
  Samsung Galaxy or a Blackberry on O2: 50mins + 250 texts + 500MB data = £16.50
  
  0 0
  1. Friday 14th October 2011 15:34 GMT John Rose
    
    Could you explain how RIM make profits?
    
    0 0
    1. Friday 14th October 2011 21:02 GMT Annihilator
      
      @John Rose
      
      Selling their corporate solutions. BES hardware and licensing turns a pretty penny.
      
      To prove a point, I've just slung my O2 SIM card in an old BB I have lying around and created a new O2 email via O2. I don't have an official blackberry contract. http://bit.ly/qQ8dQb
      
      You may as well ask how gmail or hotmail makes any money. It's not their primary source of income, but sell the BB idea to a person, and that person may well be in a position to request BB in their place of work...
      
      0 0
Thursday 13th October 2011 20:52 GMT Anonymous Coward

A switch?!?

Bollocks, they're hiding something.

0 0
Thursday 13th October 2011 20:54 GMT Richard Pennington 1

A timely reminder ...

... of the availability risks of the Cloud.

5 1
Thursday 13th October 2011 20:54 GMT Battsman

RIP RIM

Observation 1: This sounds like a situation where the caching/bandwidth was inadequate post failure.

Observation 2: RIM's Blackberry products suck. errrr -strike that- RIM's BB products appear less broadly functional and exciting when compared to Apple and Android product competition.

Observation 3: Prior to the most recent outage, RIM was losing an astounding number of customers per month.

Result: If RIM just waits a little bit longer, there won't be any need to add capacity to accommodate 1 because the resource requirements will have significantly reduced due to 2 & 3.

(Just trying to help...)

0 0
Thursday 13th October 2011 20:56 GMT Anonymous Coward

Another option

By coincidence only yesterday I got to see an internal root cause analysis on a carrier grade routing device manufactured by my employer.

A backplane scheduler chip had suffered an internal hardware failure and autonomously started taking (and delivering) traffic from/to the standby fabric - all without bothering to provide any external indication that it had done so.

While rare, this sort of thing usually takes quite a while to track down, especially in a large network with multiple redundant paths, as all of the normal diagnostic tools report normal functionality.

Anonymous for obvious reasons

4 1
1. Friday 14th October 2011 09:47 GMT Displacement Activity
  
  @AC
  
  > A backplane scheduler chip had suffered an internal hardware failure and
  
  > autonomously started taking (and delivering) traffic from/to the standby
  
  > fabric all without bothering to provide any external indication that it had done so.
  
  I call BS. For an IC of any complexity to fail in such a way that it carries out a logical series of rerouting ops has got to be next to impossible. There might be hundreds of potential reasons why your switch failed, and this has to be the least likely. Sounds like the guy who wrote your report is invoking black magic to cover his ass.
  
  0 1
Thursday 13th October 2011 22:51 GMT flingback

My God, it's amateur hour amongst El Reg's readers...

I am an avid BlackBerry user, and I have suffered considerably as a result of the last 72 hours worth of switch aggravation. However, I am also an engineer who designs routing protocols and hardware level circuits. It astounds me of the stupidity of many of the people who have been posting on these threads. Am I the only true technologist amongst you who really knows the USP of the BlackBerry network?

Let's get something straight; IPV4/IPV6 switching is pretty vanilla - there are many vendors who will provide equipment that will literally drop in when there is a failure, and provide failover at a millisecond's notice. However, this is not about IPV4/IPV6, this is about BlackBerry's proprietary PIN routing which is a layer 3+ tunnel that provides the peer-to-peer capability that BlackBerry has (and no one else does).

I agree that this should never have happened, and I am furious with the management at RIM for everything they are doing to ruin what was an excellent company with good business tools (including the "lets bury our heads in the sand" attitude for the first 48 hours). However, networks DO fail and if you think that this will never happen to Apple/Android/MS in the next five years then dream on. It has already happened many times - the difference is that you've not detected it yet because if your "push" email doesn't work instantly you forgive it because it's an Apple. If you looked carefully you'd see that you'd lost your reverse tunnel for five minutes. You can only plan for so much and occasionally something really does go so wrong that everything comes tumbling down like a stack of cards. What I am more amazed about is that BlackBerry managed to de-queue millions of messages and emails in less than 6 hours when they finally caught and fixed the exact problem.

Cutting edge delivery technology comes at a cost. Personally, whilst I love my Galaxy S2 for controlling my home AV and looking at Google Skymaps, and my iPhone for flying my Parrot AR Drone, I absolutely refuse to run my business on anything else but a BlackBerry - the rest are sheer toys by comparison and those of you that carry two phones and are blatantly honest know just what I am talking about.

Cut BlackBerry some slack. Fire the management and put *real* enthusiasts in charge, make the development environment more open, halve the price of the Playbook and for God's sake get Android compatibility working quickly. RIM are not dead, yet... but things have to change.

Paris - because she's Queen of the Press Release and RIM need to learn from her!

7 1
1. Friday 14th October 2011 09:48 GMT Anonymous Coward
  
  "It astounds me of the stupidity of many of the people who have been posting on these threads"
  
  Why be astounded, just read replies on other topics, and you can clearly see that the majority of posters are just trolls. Stupid spoilt little brats who have grown to be stupid spoilt big brats.
  
  0 0
2. Friday 14th October 2011 09:48 GMT Displacement Activity
  
  @flingback
  
  Yes, it does look like amateur hour, and you clearly don't work on switches. BlackBerry has apparently said that a "high-capacity core switch" failed. Nothing to do with "IPV4/IPV6". And none of the kit and protocols I work on (LACP and MSP) "will provide failover at a millisecond's notice". A few seconds, maybe, if you're lucky, which BB clearly wasn't.
  
  0 0
Thursday 13th October 2011 22:52 GMT Anteaus

Danger of relying on proprietary services.

If a standard email service fails then it's a relatively simple matter to set up a temporary one elsewhere, and change your domain settings to suit. With a proprietary service using its own protocols you have no such option.

I reckon the lesson here is that it is most unwise to place reliance on closed, proprietary systems for mission-critical business IT. Adherence to open standards is the way to provide resilience.

2 1
