26 posts • joined 11 Jul 2008
Hang on, seems like you guys missed out on the real fun in this story:
Q: How and when did Cisco find out about this issue?
A: Cisco first became aware of this issue in December 2010.
Only in late 2012 did field failures and supplier review data point to a potential customer impact.
So it took an engineering company 2 years to figure out it was bad memory, and the same time frame to admit to customers they were exploring an issue (They denied all knowledge until this PR stunt was ready to be pushed out the door).
I don't know about you, but that seems like a very long time - a significant portion of the shelf life of most of these products.
Re: Right on.
TaabuTheCat is indeed correct in this - one primary concern is that the blast radius (regardless of your specific architecture) is fundamentally larger.
Another point I neglected to make is that this brings us way back in terms of network stability as a whole. It'll be like the late '90s again in terms of OSPF/ISIS - running fresh builds, having outages because the implementations are just not mature. You can argue about architecture all you want, but the fact is that you just cannot afford outages. In some cases, it's better to solve the problem with existing protocols rather than throw everything out and start again (though I would argue that _some_ protocols should be thrown out by default).
The fact is that SDN is built only for scale, and nobody running at that scale can afford _any_ downtime. I actually fully support most of the key arguments behind SDN; it's just that some of the principles seem to come directly from the VM world, and won't have a 1:1 translation to the networking world. I am a huge believer in automation and in partitioning services. However, I'm also a believer in correctly architecting the network to your users' requirements, and automating deployment.
Lastly - it's not so much FUD as paranoia and scepticism, born of watching sales promises go up in smoke with a frequency that's left me bitter right through to my core.
So we have finally started to convince the world that large layer 2 domains are a very bad idea (think: blast radius), and that lots of individual devices working together (Distributed) is much better than centralising everything...
Until along came SDN to re-centralise everything once again (controller-wise). I can already see the mega-outages a bug at the controller level will cause, and the lack of individual node optimisation that will come with it.
It's not that I think SDN is a bad idea, I just think it's half-baked right now. I also think that simplicity in design, a good provisioning tool and excellent engineers will trump the cost of SDN in terms of man hours, resources and impact.
Beer; because this is what SDN drives me to.
Can any of you tell malice from incompetence? Why assume malevolence when incompetence is the more likely answer?
This saga is non-trivial, and it's got many moving parts, namely:
1. Netflix's peering policy ( https://signup.netflix.com/openconnect/guidelines , min 2Gbit _each way_ at the 95th percentile - see the sketch after this list for how that's measured).
2. Netflix's Open Connect program - ISPs don't want to lose rack space and power to these boxes.
3. ISPs want to make the content providers pay for content traversing their network.
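For anyone unfamiliar with 95th-percentile billing, here's a toy sketch of the arithmetic in Python (the sample values are invented purely for illustration; a real month of 5-minute samples is ~8640 data points):

    # Toy 95th-percentile calculation over 5-minute traffic samples.
    # Values are invented Mbit/s readings, just to show the arithmetic.
    samples_mbps = [800, 1200, 950, 2200, 1900, 2100, 3000, 1400, 1750, 2050]
    samples_mbps.sort()
    # Discard the top 5% of samples; the highest remaining value is the
    # measured rate (for 8640 samples, the top 432 are thrown away).
    idx = int(len(samples_mbps) * 0.95) - 1
    print(f"95th percentile: {samples_mbps[idx]} Mbit/s")
    # Netflix's policy wants >= 2000 Mbit/s each way at this mark.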
What's more than likely happening is that Netflix traffic is taking a congested path inside Verizon's network (they are a victim of their own success). I don't think there is malevolence involved here, but at the same time, there's nothing to be gained right now for Verizon by fixing it. They're not losing customers, and _if_ they do lose customers over this Netflix saga, it's the customers who tend to cost them more in transit bills.
BCP38 people - ask for it from your upstream provider.
Also, for any of you running any networks: drop NTP at your border, drop ingress traffic arriving with a source address from your own address space, and drop egress traffic whose source address does not match your address space, as sketched below.
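For illustration, the border-filter decision looks something like this - a Python sketch of the logic only, since on real kit this is an ACL or uRPF, and 203.0.113.0/24 is just an example prefix:

    import ipaddress

    # Your own allocation - example prefix, substitute your real space.
    LOCAL_SPACE = ipaddress.ip_network("203.0.113.0/24")

    def drop_at_border(src_ip: str, direction: str) -> bool:
        """Return True if the border should drop this packet (BCP38 logic)."""
        src = ipaddress.ip_address(src_ip)
        if direction == "ingress":
            # Traffic arriving from outside must not claim our addresses.
            return src in LOCAL_SPACE
        if direction == "egress":
            # Traffic leaving must be sourced from our own space.
            return src not in LOCAL_SPACE
        return False

    print(drop_at_border("203.0.113.7", "ingress"))   # True: spoofed inbound
    print(drop_at_border("198.51.100.9", "egress"))   # True: spoofed outbound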
Been saying it for years - Cisco is going to lose the switching market, and probably the routing market after that.
I see them being relevant in corporate IT (Voice, user access) and servers (I'm guessing that their switching market share will collapse into the embedded switches they're building into UCS chassis).
I'm not 100% sure that home-grown devices are mature enough right now for the regular DC. However, from an OS perspective, Cumulus Networks should change this in the coming years, and hopefully a larger install base of Trident I/II-based boxes will also level the playing field.
Interesting times ahead, and not even one mention of SDN (d'oh!)
What I'd really like to see is event correlation in an intelligent way...
By parsing flow data in near-real time, looking for patterns in syslog, watching interface changes (ie: flaps, or an interface counter moving +/- X% across samples), and snarfing up accounting data. Hell, even take an iBGP feed of updates from my eBGP peers and a feed of OSPF LSAs, and correlate an event with a specific set of updates. There's so much room for correlation; there's just nothing I've found that works for me.
I think overall we have all the tools we need to do this, but the time needed to integrate them all, make them talk nicely, and set intelligent thresholds, relative thresholds and even a little historical prediction based on previous events is just not worth it. Lately, instead of spending time on this, I'm fighting to get nfsen/observium/smokeping/homegrown scripts to talk to each other and give me a coherent view of my traffic patterns.
I'm battling with the stupidity of SNMP traps, SNMP's format and the absurdity of 5 minute samples when I have 40Gbit interfaces.
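If anyone wants the back-of-the-envelope on why 5-minute polling falls apart at 40Gbit, here it is:

    # A saturated 40Gbit link moves 5 GB/s; a 32-bit SNMP octet counter
    # wraps long before a 5-minute poll comes around.
    LINK_BPS = 40e9
    BYTES_PER_SEC = LINK_BPS / 8                               # 5 GB/s
    print(f"32-bit wrap: {2**32 / BYTES_PER_SEC:.2f} s")       # ~0.86 s
    print(f"64-bit wrap: {2**64 / BYTES_PER_SEC / 3.15e7:.0f} years")  # ~117

So at line rate the 32-bit counter wraps roughly every 0.86 seconds - you need the 64-bit HC counters, and even then a 300-second sample hides every burst inside the average.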
Monitoring makes me stabby.
Where were your articles when I was in my 20s? I've encountered every type of boss listed. Actually, sometimes, multiple types in the same person.
Excellent article, as always. Look forward to the next.
Why do people need to reinvent the wheel? The B.A.T.M.A.N. protocol exists to solve this exact problem in IPv4, and can be easily ported to IPv6.
Nothing new here....
> Our Future Internet should know no barriers, least of all barriers created because we did not prepare for the data revolution.
Sorry, darlin', the speed of light is a constant.
In this case, it's a problem with how WiFi is set up, rather than with TCP. WiFi is a shared medium, so you're going to get collisions (CSMA/CA attempts to give fair access to the medium). TCP is affected because it sees a collision as a drop, so it scales back throughput-wise. That's why they're dropping new sessions and giving priority to the existing data flows (it's kinda cheating, throughput-wise).
TCP has quite a few throughput hacks in it (window scaling, SACK, binary exponential backoff etc), and is quite predictable and mature. The real issue here is wireless Ethernet being "non-switched", thus having collisions and packet loss with many users.
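A toy sketch of why a loss event hurts so much, using TCP's AIMD behaviour (the loss pattern here is invented; real stacks are fancier than this):

    # Additive increase, multiplicative decrease: each drop (e.g. a WiFi
    # collision misread as congestion) halves the congestion window.
    cwnd = 10.0              # congestion window, in segments
    loss_at = {4, 9}         # RTTs where a "collision" appears as a drop
    for rtt in range(12):
        if rtt in loss_at:
            cwnd = max(cwnd / 2, 1.0)   # multiplicative decrease on loss
        else:
            cwnd += 1.0                 # additive increase per RTT
        print(f"RTT {rtt}: cwnd = {cwnd:.1f}")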
So it's adaptive, intelligent tail dropping? I dont see what's earth shattering about this.
It's like a bloody soap opera over there on Wikipedia.
Oh, but they did.
ArrowPoint created the CSS platform; the CSM was homegrown, and so was the ACE.
I'm not surprised about the ACE at all, it's not exactly the best platform out there.
We're Facebook. We are trying to get people to take us seriously with our super-dooper server designs and by showing all you guys that we live in the past.
...before FF steals this feature?
I suppose this is the end of facebook as we know it.
Didn't Apple make the same mistake with Cisco over the iPhone as they're making now with IOS?
" In 2008, he attempted to traffic 100 gigabit interface converters that were bought in China"...
100Gbit? If this is not a typo, which dumb ass actually bought this and considered it legit (in 2008)?
What's that? Does this man make any sense at all?
""through regular discussions with management" and the compensation committee of the board of directors."
Haha, yeah. We all know how effective committees are.
RE: Unified communications. Pfaw
>> The reality is that no matter how many lines of communication you offer a given employee, you can't change the fundamental fact that it is their CHOICE to respond or not.
Agreed; however, you do have presence awareness. At least now the boss knows when the employee is unable to respond.
... Milking his media coverage.
I'll get me coat.
Nope, one machine can never live up to that level of service :)
But say you have a farm of 10 machines behind a load balancer (in fault-tolerant mode), with the correct type of probing configured on the load balancers (so that machines which still ping but, say, don't serve out HTTP are excluded from the load-balancing decision) - 99.999% should be an attainable goal.
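Quick back-of-the-envelope (the 99% per-box figure is an assumption, purely for illustration):

    # Probability the whole farm is down, assuming independent failures.
    per_server = 0.99                    # assumed availability of one box
    n = 10
    p_all_down = (1 - per_server) ** n   # all ten down at once
    print(f"P(farm down) = {p_all_down:.1e}")   # 1.0e-20
    # Five nines allows ~5.26 minutes of downtime a year:
    print(f"budget: {(1 - 0.99999) * 365 * 24 * 60:.2f} min/yr")

In practice the load balancer pair, power and upstream connectivity become the limiting factors long before the servers do.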
You hardly think that MS or Apple use just one machine for these types of services?
I'd say their setup is a little more complicated than that; I'd suspect they have GSLB in place to direct you to the nearest (geographically) server farm.
Anything less than 99.999% these days is just not acceptable from a corporate entity.
Paris, 'cause she's had more downtime than Ubuntu's site...
This story is..
Err, mine's the denim one