back to article Skype's mega-FAIL: exec cops to cause

Chastened by its pre-Christmas mega-FAIL, Skype on Wednesday explained in detail why the titanic titsup takedown happened and how the company plans to ensure that the globe will never again go VoIP-less for an extended period of time. For those of you tuning in late, last Wednesday the online telephony service began to wobble, …

COMMENTS

This topic is closed for new posts.

Silver badge

Sounds familiar

Didn't the US POTS phone system go down for a similar reason years ago?

They introduced an upgrade into a switch which failed - after it had sent the upgrade to the next switch, and so on.

3
0
Silver badge

Also...

Kinda reminds me of the Morris worm...

http://en.wikipedia.org/wiki/Morris_worm

0
0

The PSTN...

...cascading failure was apparently caused by the omission of a single semicolon at the end of a C statement. (Western Electric ESS switches are UNIX driven.)

0
0
Bronze badge
Boffin

Actually, it was a missing "break"

As I remember it, the reason for that failure was a missing "break" keyword at the end of one branch in a C switch statement. A common programming error in C, and one not caught by the compiler, because according to the rules of C, the execution then continues in the next switch branch if there is one. This is one of the worst design flaws in C (and amazingly, aped by many related languages)!
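A minimal sketch of the kind of slip described above, not the code from the switch that actually failed. The message types and function names are invented for illustration. Newer GCC and Clang versions can flag this pattern with -Wimplicit-fallthrough, but plain C still compiles it without complaint.

```c
/* Hypothetical message types, made up for this example. */
enum msg_type { MSG_STATUS = 1, MSG_RESET = 2 };

/* Buggy: MSG_STATUS has no break, so control falls through into
 * MSG_RESET and the reset counter is bumped by mistake. */
int handle_buggy(enum msg_type t, int *resets)
{
    int handled = 0;
    switch (t) {
    case MSG_STATUS:
        handled = 1;
        /* break is missing here */
    case MSG_RESET:
        (*resets)++;
        handled = 1;
        break;
    default:
        break;
    }
    return handled;
}

/* Fixed: each branch ends with its own break. */
int handle_fixed(enum msg_type t, int *resets)
{
    int handled = 0;
    switch (t) {
    case MSG_STATUS:
        handled = 1;
        break;
    case MSG_RESET:
        (*resets)++;
        handled = 1;
        break;
    default:
        break;
    }
    return handled;
}
```

Calling handle_buggy with MSG_STATUS increments the reset counter; handle_fixed leaves it alone.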

3
5
Anonymous Coward

One man's flaw...

It's not a flaw at all - it's as designed and works very elegantly when used properly as a cascading switch.

2
1
Silver badge

Design flaw?

Sure it's not a feature? It's very handy if you have a couple of operations that are very similar, but where one requires a little extra activity beforehand. Very, very handy. Almost like relay logic.
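For what it's worth, the "little extra activity beforehand" pattern might look roughly like this. The command names and the bitmask return value are assumptions made up for the sketch.

```c
/* Hypothetical commands, invented for this example. */
enum cmd { CMD_OPEN_AND_SEND, CMD_SEND, CMD_PING, CMD_NOOP };

/* Returns a bitmask of the steps performed: 1 = opened, 2 = sent. */
int run_cmd(enum cmd c)
{
    int steps = 0;
    switch (c) {
    case CMD_OPEN_AND_SEND:
        steps |= 1;          /* the extra activity this case needs first */
        /* fall through */
    case CMD_SEND:
        steps |= 2;          /* work shared by both cases */
        break;
    case CMD_PING:
    case CMD_NOOP:           /* grouped cases with identical handling */
        break;
    }
    return steps;
}
```

CMD_OPEN_AND_SEND does its own step and then cascades into the shared CMD_SEND code; the explicit "fall through" comment marks the missing break as deliberate.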

1
0
Stop

@MacroRodent: Nope, Please Go Back To School

It IS desirable for a switch() statement to have the "fall through" capability, as there is quite often the need to group several possible cases together and process them with identical or nearly identical code.

Look at this:

http://codesearch.google.com/codesearch/p?hl=de#1oUPVh-C1Wg/trunk/eval/gx/symfony/web/sf/prototype/js/controls.js&q=switch&sa=N&cd=5&ct=rc

4
1
Silver badge

It is a flaw

The "fall through" feature should be an option, not the default. Lousy design, sure, but C is pretty old now and some progress has been made in language design theory. Although the continuing use of C++ suggests that such progress has had limited impact.

3
5
Unhappy

C 'a bit old now'? Good grief...

Read some daft posts on here, but dismissing C (one of the most elegant and constantly relevant languages on the planet today) as being '...a bit old now', and stating that the use of C++ (a different language altogether) shows that limited progress has been made, just took the biscuit.

I'm not normally confrontational on these forums, but to harbour such beliefs indicates a pretty narrow field of personal expertise. I have to say that anyone even reasonably knowledgeable in computer language theory, or simply having practical expertise in many different programming languages, whether compiled (C, Delphi, VB, assembler), interpreted (BASIC, Java etc) or JIT semi-compiled (VB .Net, C#, J#), would have found these comments unbelievably crass and ill-informed.

3
2
Anonymous Coward

@ It is a flaw

No it's not.

Just because you don't understand it doesn't mean it's wrong.

This is well documented and as designed.

I guess kids nowadays need stabilisers on their coding too :)

0
0
WTF?

Someone, as you say...

..."simply having practical expertise in many different programming languages" would know better than to place Java in the interpreted group, instead of "JIT semi-compiled".

Ever heard of HotSpot? And do you even know how Java programs are executed?

On another note - you criticize the predecessor's arguments as not relevant (which is true), yet you don't seem to provide your own - e.g. just how is the fall-through switch statement "elegant"?

0
0
Happy

Java 'semi-compiled'?

@AC - Do I even know how Java is executed? Having worked on compilers and interpreters most of my working life, I think so, a little. Pure Java is pseudo-compiled down to P-code, which is a form of super-tokenisation really, not JIT-precompiled. Not the same thing. And the elegance of C code is in its many other aspects, such as being able to both modify and test variables at the same time, as in (very simply):

if (y--) {
    // Do this
}

Of course this can be achieved in most other languages with a little more code, but the particular beauty of C is that it was built to do work this way, keeping statements succinct.
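To spell out what that one-liner does, here is a small sketch (the helper names are made up): the condition sees the old value of the variable, and the decrement happens whether or not the test passes.

```c
/* Returns 1 if *y was nonzero before the decrement, 0 otherwise.
 * Either way, *y ends up one lower than it started. */
int test_and_decrement(int *y)
{
    if ((*y)--) {
        return 1;
    }
    return 0;
}

/* The same idiom as a countdown loop: the body runs exactly n times. */
unsigned count_iterations(unsigned n)
{
    unsigned runs = 0;
    while (n--) {
        runs++;
    }
    return runs;
}
```

Note the side effect on the failing path: testing a y of 0 leaves it at -1, which is easy to overlook.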

However, to answer your valid point, default fall-through is indeed elegant and should be the default in all cases. It allows you to create cascaded tests. Understandably, not everyone will agree though.

1
1
Silver badge
Alert

It's been going down hill in quality..

3.8.0.188 Seems to be last decent version

2
0
Thumb Down

says it all

Skype for Windows version 5.0.0.152............

says it all

1
3
Silver badge
FAIL

yer doin' it rong...

50% of the userbase on a flawed version caused a cascade failure that rendered the network fully inoperable, so the solution is to make sure 100% of the userbase is on the same release? Seriously?

I would SLOW the dissemination of releases, so that 20% of users are two releases behind, 30% are one release behind, 35% are "current" and 15% are "experimental." Any release that was fatally flawed could be marked as "KOS" at the same time a new "experimental" release is declared. This way at MOST 35% of users will be on a flawed release.
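A rough sketch of how such a split could be assigned, assuming each client has a numeric id and that id % 100 is a good enough stable bucket. The channel names, the kos flags and the rule for moving users off a KOS release are all assumptions for illustration, not anything Skype has described.

```c
enum channel { TWO_BEHIND, ONE_BEHIND, CURRENT, EXPERIMENTAL, CHANNEL_COUNT };

/* Shares from the paragraph above: 20% / 30% / 35% / 15%. */
static const unsigned share[CHANNEL_COUNT] = { 20, 30, 35, 15 };

/* Map a client id to a release channel using id % 100 as its bucket. */
enum channel channel_for(unsigned client_id)
{
    unsigned bucket = client_id % 100;
    unsigned upper = 0;
    for (int ch = 0; ch < CHANNEL_COUNT; ch++) {
        upper += share[ch];
        if (bucket < upper)
            return (enum channel)ch;
    }
    return CURRENT; /* not reached while the shares sum to 100 */
}

/* kos[ch] != 0 marks that channel's release as kill-on-sight.
 * Assumed policy: move the client to the nearest newer release that is
 * not KOS, else the nearest older one. Returns -1 if every release is KOS. */
int effective_channel(unsigned client_id, const int kos[CHANNEL_COUNT])
{
    int ch = (int)channel_for(client_id);
    for (int c = ch; c < CHANNEL_COUNT; c++)
        if (!kos[c])
            return c;
    for (int c = ch - 1; c >= 0; c--)
        if (!kos[c])
            return c;
    return -1;
}
```

With the "current" release marked KOS, a client in bucket 60 lands on the experimental release while a client in bucket 10 stays two releases behind.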

Not upgrading everyone at once is actually a good move, for the exact reason stated in the story. Their plan is not bad, it just didn't actually account for a failure on the scale that was possible. The correct solution is to DECREASE the scale of potential failures, not INCREASE it.

8
4

Exactly

The best way to make a distributed arch stable is to deliberately make it more diverse. Having no more than let's say 50% of clients on the same release is the right approach here. It is not a guarantee of course, but it tends to make things more reliable. Having sufficient numbers on different platforms is also a good idea. And so on as long as a complete failure of one release can allow the rest of the network to function.

1
2
FAIL

@Ominoshiko

I'm sorry, but this is a pretty poor idea. Bugs tend to persist across multiple versions until they are fixed. A bug like this which doesn't crop up until a particular failure mode triggers it (delayed messages in this case) might lie dormant for years, meaning it is remarkably easy to have the bug exist in 2 or more of the versions you have in the wild. Not being able to rapidly upgrade the software you have there means you can't fix major bugs quickly because people don't upgrade quickly. If you follow this methodology you end up with half the Internet running a bugged, security disaster like IE6!

The obvious "right thing to do"TM is pretty simple. The Skype Client and Skype Server should be separate processes on every machine. The server should do very little other than talk to the P2P network and pass messages on to the client. The client can parse all the messages and do the stuff that will likely crash. Hopefully over time the server becomes very stable and is rarely updated and hence likely to have very few bugs; the client becomes the thing you keep changing as new features come on line.

11
1
Silver badge
Stop

not as poor as running everyone on a release that has not had adequate testing.

Fixing and upgrading everyone at once sounds like a good idea on the surface, but is only so if you can be completely sure that you do not introduce a MORE problematic bug during the operation. From what was said in the article they use their software to update itself, so by forcing an update to everyone at once, a newly introduced bug (or did you think that your fixes couldn't introduce new bugs, or exacerbate existing ones?) is forced onto the entire network, leaving no healthy systems in the network to provide service to the systems coming back online as they are repaired.

From what Mr. Rabble has said, having forced EVERYONE onto 5.0.0.152 would have made the network less resilient, as everyone would have experienced the problem immediately, rather than having some time for it to cascade. If fewer peers had to fail over at the same time, because they were running a different release (from his report, ANY different release), the network would likely have suffered only the hosts running the affected versions falling off, and a minor slow-down for everyone else, rather than a complete cascade failure.

Yes, I recognise that you can have bugs across multiple releases, that is exactly WHY I put it all the way up at four. Your IE6 comparison is a complete non sequitur, as I EXPLICITLY INCLUDED a way to force the elimination of particularly problematic releases from the network. I am not saying users decide when to upgrade (on a normal basis), but the network does. While this may leave some users vulnerable for some time, the objective is to protect the network from catastrophic failure.

Frankly, dividing the software into multiple interacting programs (I'm not sure stopping at two makes sense) would probably make upgrades far more seamless and transparent, and would work well with making sure there are always multiple versions in the network.

In addition, if you were on Windows, and having problems with Skype, what would be high on your list of things to try... maybe reinstalling Skype? In which case most of the affected users pull the latest version OOB (from your web site) and it doesn't matter.

2
2

also familiar to us ancients who had VAXclusters

simply put, there was a small flaw in the architecture there. if the cluster controller went away, the rest of the cluster looked for a new arbitrator. DEC had determined and enforced that the earlier a MAC address you had, the better qualified you were to serve as the cluster controller, since of course it would always fall back onto a classic VAX.

until.... the physical MAC address pool ran out, and they needed to start reusing hardware MACs for PC controllers.

if you've had a VAXcluster fall onto a 286 PC as cluster controller, you'd know empirically that you have to define a class of trusted systems that you always look for first.
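As a sketch of that lesson only, and not of how VMS actually picked anything: the "lowest MAC wins" rule as described above, next to a variant that looks in a trusted class first. The struct layout and the trusted flag are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct node {
    uint64_t mac;   /* 48-bit MAC address held in the low bits */
    int trusted;    /* nonzero if the node is in the trusted class (hypothetical) */
};

/* Naive rule as described: the lowest MAC wins. Returns an index, or -1 if n == 0. */
int elect_lowest_mac(const struct node *nodes, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++)
        if (best < 0 || nodes[i].mac < nodes[best].mac)
            best = (int)i;
    return best;
}

/* Safer rule: lowest MAC among trusted nodes; fall back to the naive
 * rule only when no trusted node is present. */
int elect_trusted_first(const struct node *nodes, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++)
        if (nodes[i].trusted && (best < 0 || nodes[i].mac < nodes[best].mac))
            best = (int)i;
    return best >= 0 ? best : elect_lowest_mac(nodes, n);
}
```

With an untrusted PC holding the lowest address, the first function hands it the job and the second skips it in favour of the lowest-addressed trusted machine.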

way too early for the Sky Hype guys, although they could have read about it.

4
0

cluster controller ?

What is a cluster controller? I suppose you mean the node that holds the locking database?

That was governed by the system parameter LOCKDIRWT but that was on VMS 5 and higher.

Anyway, I never hit on that flaw in 16 years, must have been solved a long time ago.

0
0
Heart

Classic VAX?

Actually, that could be a real pain if you had a mixed bag of a cluster. Our first (V4) cluster included a 11/785, 11/750, an 8600 and an 8700. You didn't want the 750 to own any of the master capabilities (including mastering any distributed locks) if you wanted any decent performance!

Happy (and simpler) days!

0
0
FAIL

DECnet design

@swschrad

The underlying design flaw there was to have the network address be the MAC address, and to decide to override the hardware MAC address with the DECnet network/MAC address.
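For anyone who hasn't met it, the mapping works roughly like this, as far as I recall the DECnet Phase IV convention: the 16-bit node address is area * 1024 + node, and the station address becomes AA-00-04-00 followed by that value, low byte first. The function name and error handling below are my own.

```c
#include <stdint.h>

/* Build the Phase IV style MAC for a DECnet address "area.node".
 * area is 1..63 and node is 1..1023. Returns 0 on success, -1 if out of range. */
int decnet_mac(unsigned area, unsigned node, uint8_t mac[6])
{
    if (area < 1 || area > 63 || node < 1 || node > 1023)
        return -1;

    uint16_t addr = (uint16_t)((area << 10) | node);

    mac[0] = 0xAA;                    /* fixed DECnet prefix AA-00-04-00 */
    mac[1] = 0x00;
    mac[2] = 0x04;
    mac[3] = 0x00;
    mac[4] = (uint8_t)(addr & 0xFF);  /* low byte first */
    mac[5] = (uint8_t)(addr >> 8);
    return 0;
}
```

So node 1.1 ends up as AA-00-04-00-01-04 regardless of what was burned into the card, which is why a driver that cannot change its MAC address is a problem.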

Most dumbass design I've ever seen. Fell off a chair when I learnt of it.

Made my BICC MPS (Multi-Protocol Support) DECnet driver hard to get to play well with the ISO/OSI driver. And had to jump through some hoops I'm quite ashamed of when the card driver for the 16-bit card didn't allow the MAC address to be changed - like scanning through the driver code looking for the MAC address and then changing it there.

Only in MessyDOS.......

0
0
Anonymous Coward

Simples

Bug in Windows version 5.xxxxxx, simple solution would be to ban Windows me thinks..... Linux for the Win

4
10
FAIL

of course since few use linux

You'd never have an issue. Problem solved.

2
3
Flame

@A. Coward. Can't blame Bill and cronies this time, it's Skype's problem....

...But why?

Now ask yourself why would a specific version, especially a new version, cause such a catastrophe?

The questions we should be asking are (a) exactly what was the bug, and (b) what's this latest version doing that's so radical that no previous version (over all those years) has ever caused?

Seems to me that Skype needs to become an open standard. Final questions: who has vested interests in keeping Skype closed, and why would they want to keep it closed?

Encrypted end-to-end it might be, but I still don't trust it.

2
0
Linux

Mr

So who provided the 'thousands' of 'mega-supernodes'?

1
0
Thumb Down

Oh for the love of God!

This ISN'T about Windows you useless troll.

6
0
Flame

Absolutely

If it was a linux-only client, it could have been down for a week before anyone even noticed!

4
4
Linux

open source VoIP

but how would you make a call to a PSTN? The client is, or will be, open sourced -- but that's just the GUI. At the end of the day, someone has to pay for those telephone calls, so, no, skype can't just open source the whole thing and remain a business.

0
0
Anonymous Coward

Use a SIP service provider?

There are any number of companies providing SIP to PSTN gateway services at prices similar to skype. They don't care if you use open or closed software clients, or a physical SIP handset. Since many (most?) home ADSL routers now include transparent SIP proxying the old problems of getting it through the firewall should be gone.

You could even be your own gateway provider with a box like the Linksys SPA3000 plugged into your landline. Then there are things like Asterisk and FreeSwitch, but they're probably beyond most home users.

1
0
Gates Halo

@yosemite

Actually it is. The Linux version of Skype has been stalled at... let me check... 2.1.0.81 BETA for quite a while now. Because it hasn't been upgraded for ages, it just couldn't pull such a stunt as the Windows version. Programmers need to "improve" and "fix" the software for shit to hit the fan. Although I'm not happy that the Linux version is 3 major versions behind the Windows version, it seems it's not entirely a bad thing :)

0
0
Go

Suggestion Regarding Open Source Voip

At least here in Germany people normally have free fixed-line-calling as part of the telephone/DSL contract.

The OS community could simply allow other people to use their land lines to call fixed-line numbers when they don't need it themselves. ISDN cards can be fully programmed.

1
0

exponential back off?

Would some form of back off on clients trying to reconnect have helped with this? I'm thinking that this way, the supernodes would have had time to reestablish themselves without being slammed with a huge amount of traffic?
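Something along these lines is what I have in mind. It is only a sketch: the base delay, the cap and the random jitter are my own assumptions, and I have no idea what the Skype client really does on reconnect.

```c
#include <stdlib.h>

#define BASE_DELAY_MS 1000u     /* assumed first retry delay */
#define MAX_DELAY_MS  300000u   /* assumed ceiling of five minutes */

/* Upper bound for a given retry attempt: min(MAX, BASE * 2^attempt). */
unsigned backoff_cap_ms(unsigned attempt)
{
    unsigned delay = BASE_DELAY_MS;
    while (attempt-- > 0 && delay < MAX_DELAY_MS)
        delay *= 2;
    return delay < MAX_DELAY_MS ? delay : MAX_DELAY_MS;
}

/* Pick a random delay in [0, cap] so that clients which failed at the
 * same moment do not all retry at the same moment. */
unsigned backoff_delay_ms(unsigned attempt)
{
    unsigned cap = backoff_cap_ms(attempt);
    double u = rand() / ((double)RAND_MAX + 1.0);   /* u is in [0, 1) */
    return (unsigned)(u * (cap + 1.0));
}
```

A client would sleep for backoff_delay_ms(attempt) before each reconnect and bump attempt on every failure; the randomness matters as much as the doubling, since identical delays would just move the stampede later.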

Any idea where they found the bandwidth/processing for their mega-super-duper-nodes to fix the system? I guess it'd be one of those things that processing on demand would be pretty handy for?

2
0

Umm

"I'm sorry Dave, I'm afraid you can't login right now". Yeah, that'd be 'better'.

0
0
Bronze badge

Copy; Start; Goto 10.

--Any idea where they found the bandwidth/processing for their mega-super-duper-nodes to fix the system?--

Presumably one of those virtual instance resellers, Amazon AWS, Rackspace, Azure et al. At least that's one advantage of virtual machines: you can copy/start almost ad infinitum until either the cloud can hold no more or your bank balance holdeth no more either.

0
0
Silver badge
FAIL

Mono-Cultures

Don't you just love them.

6
0
Anonymous Coward

skype failure - who really cares?

From the above..." - globe will never again go voipless" is a crazy thing to say, after all Skype is only one of literally thousands of voip networks. However Skype already has the severe limitation of being proprietary and thus does not interface with anything else, hardware or software. This is exacerbated by the P2P system it relies upon which as described above lends itself readily to an "avalanche" type of failure. The VoIP industry's open standard SIP is far more widespread and far more flexible and so the majority of global VoIP users were not affected by the Skype catastrophe.

8
3
FAIL

Wrong

You clearly know little about the SME sector. Very many use Skype as an excellent and low cost communications system, let alone all the personal users. As I write this there are nearly 18m online.

Whilst big business has its big IT budgets and largely wasteful IT departments (and yes I speak from experience) the small guys make their hard-earned money work hard. Some will have learned though that putting their trust in Skype alone without a fallback, like anything else in IT, was a stupid thing to do.

But most of us missed Skype because it is to us, an excellent comms tool despite the fact that since v3.8 the GUI has gone, well, gooey, like all other apps it seems.

0
0
Coat

Take Mega Nodes offline?

Maybe ... since they introduced Mega SuperNodes to help alleviate the problem, they should run some of their own (or use some VMs in the EC2/Azure/Whatever cloud) so they can help stave this off in the future?

"We found the fix. We added more computers to handle the increased load."

"Great. Now that we have a band-aid on it, what are we doing?"

"We're going to take all those bloody extra computers offline!"

Mine's the one with the mega super node in the cloud.

1
0
Bronze badge

I know but...

Presumably running thousands of mega-super-nodes costs a lot of money.

0
0
WTF?

Where did they get 1000s of Mega supernodes at zero notice?

Just a thought that popped into my head, but you have to wonder where the heck they pulled 1000s of mega supernode servers from on basically zero notice. Seriously, where and how did their engineers activate so many in such a short time? Provisioning thousands of servers onto any network isn't a trivial task.

Then I wondered, why doesn't Skype, with its wonderful P2P model that generates revenue on Internet capacity paid for by someone else, have a server farm of really big supernodes to handle this kind of thing? And if they do, and this is how they activated 1000's of mega supernodes so quickly, why are they so keen to withdraw them from service as soon as possible?

Then it struck me. Those clever chaps had activated their own botnet of Skype clients and promoted thousands of ordinary customer peers to be mega supernodes. I can't think of any other way they could so quickly provision so many servers on a distributed basis in such a short time. It's no wonder they want to retire as many as possible as quickly as possible, I might be somewhat miffed if my PC and Internet bandwidth were suddenly being eaten alive to serve Skype.

Two suggestions above are strikingly logical. 1) stagger the software releases so that you don't have a high predominance of a single version of your peer server code, just in case, and 2) alter the back-off code so that when a new peer server attempts to join and finds the network is busy it doesn't just hammer the servers into submission, nor do all peers back off for the same time.

Seriously though, where can I get 1000's of supernodes at zero notice?

5
1
Silver badge

Think virtual

Either, they provisioned a bunch of virtual devices from an online provider, or your suggestion of promoting "ordinary" users was what they did. To be honest though, as I understand it *any* Skype user can be promoted to supernode - although there is a way to disable it if you want... It's all part of the Skype experience that you sign up for though.

Though IIRC even being a supernode isn't a massive drain on your network.

0
0

Re: Where did they get 1000s of Mega supernodes at zero notice?

> I can't think of any other way they could so quickly provision so

> many servers on a distributed basis in such a short time.

Amazon EC2.

4
0

Cloud 2

"Then it struck me. Those clever chaps had activated their own botnet of Skype clients and promoted thousands of ordinary customer peers to be mega supernodes"

Just go to Amazon (or another cloud provider) and ask for them? Or a non-cloud traditional hosting company.

1
0

close

Per Skype's comments elsewhere online, super-nodes don't work unless you have Skype on the external IP though. Super-nodes won't NAT by design.

0
0
Flame

Similar mechanism to the Great Northeastern Blackouts!

Here we have another unpredictable 'complexity' failure in a gargantuan system. Like the Northeastern Blackouts, they strike when least expected, never get fully understood and cannot be properly analyzed with the tools available.

State analysis methods, if possible, would require a computer the size of 'Deep Thought' and take just as long as it did to calculate '42'. The fact is many of our large engineering systems are vulnerable to 'complexity/scaling' failures and we shouldn't be quite as surprised when they happen.

Let's face it, we've all been aware of them for over 40 years.

4
0
Happy

Brainwashed

I read the computer name as "Deep Throat" and was wondering: who'd name a computer that and why? It took a few passes to read it correctly.

0
0
Anonymous Coward

Only if you're an IT illiterate who never read the Adams books.

The rest of us see the 42 and don't have to read the name.

2
0
Silver badge
FAIL

You're all missing the point

Rik actually wrote "normalcy"!

For this he should be locked in El Reg's darkest cupboard for a week.

7
0

Normalcy

Better placed in the irony cage.

0
0
