C For Hell: Data centre meltdown for irate customers as C4L GOES TITSUP

C4L has been battling a major outage for the best part of a day and customers are becoming increasingly angry about the lengthy downtime. The Bournemouth-based colocation and cloud provider, which switched its network over to Juniper kit in October last year, said it was working with the vendor to resolve the outage. C4L said …

  1. TimR

    Would BT's DNS problems be connected to this?

    Currently getting no response from ns3.bt.net (194.72.6.57) or ns4.bt.net (194.73.82.242)

    1. Anonymous Coward
      Anonymous Coward

      No, BT have nothing to do with C4L.

  2. Anonymous Coward
    Anonymous Coward

    Have to feel pity for the bods beavering away trying to fix

    Did consider colo-ing with them.. Pretty glad I didn't now.

  3. Anonymous Coward
    Anonymous Coward

    Colocation

    co... location......

    co...

    location....

    So, this is everyone's secondary system, right? They're doing their own redundancy... or are the tweeting customers expecting C4L to manage ALL the redundancy?

    (I'd still be bitching, but worth pointing out that it's never down to the supplier alone).

  4. Anonymous Coward
    Anonymous Coward

    More fool the idiots...

    .. who think "cloud" based storage is somehow magic and won't ever suffer this sort of thing. If PT Barnum were alive today he'd probably revise his famous saying to "swarms of suckers born every minute".

    1. Roj Blake Silver badge

      Re: More fool the idiots...

      Outages don't only happen in the cloud; they also happen at in-house facilities, and are typically more likely to happen there.

      1. Vic

        Re: More fool the idiots...

        > Outages don't only happen in the cloud; they also happen at in-house facilities, and are typically more likely to happen there.

        Not at mine, they aren't...

        Vic.

  5. Mephistro
    Devil

    Another day...

    ... another cloud SNAFU.

    To me, the IT world is beginning to look like a huge pack of lemmings on crack. The cloud, tablets, BYOD...

    What worries me the most is that I'm beginning to remember with nostalgia the days when we complained about WinXP's lack of security. At least we had some kind of control over our hardware and software. Ahhh, those were the days!

    1. Anonymous Coward
      Anonymous Coward

      Re: Another day...

      Or maybe, the fact that there are thousands of hosting providers out there, and only a handful of outages like this one, means that on the whole things in "the cloud" are done extremely well?

      Of course, you can take your own colo, and you can connect to multiple ISPs with BGP. Could you do it better than a larger provider? How good is the BGP expertise among your 24x7 NOC staff?

      Suppose you were the subject of the same Juniper bug as C4L have apparently succumbed to. Do you think Juniper would spend as much time with you, a tiny customer, as with a large provider?

      Things which are based on complex software have bugs, and bugs interact in complex ways. Equipment may be tested in a simulated environment, but bugs may not show up until production traffic is put on it, or until it has been running for an extended period of time.

      Short of building a completely separate parallel network, with a completely different vendor's kit, bugs are going to surface once in a while.

      1. Mephistro

        Re: Another day...(@ac)

        "Things which are based on complex software have bugs."

        I think you answered your own comment with that sentence. Problems and bugs grow exponentially with complexity. Taking all your IT infrastructure into "The Cloud" means that you'll suffer errors from many different sources -ISPs, cloud providers, VM software, datacenter software and hardware... plus possible combinations of these- and you don't have any control over (most of) said problems or bugs. Add the issues with consumer lock-in, security and privacy and the picture gets even darker.

        There are cases where the cloud makes sense, but nowadays it's being peddled for just about everything, and companies are swallowing the bait, the line, the sinker and the f**king fishing rod.

        1. Anonymous Coward
          Anonymous Coward

          Re: Another day...(@ac)

          I don't use "Cloud" so that's out of the bias equation. What I do have is broad experience of problems in systems ranging from small ones to one of the largest concerns on the planet. If you can pull off IT at your scale better and cheaper than "Cloud," then local is best. Do notice the "and" in there. It has to be both. Keeping a log book of downtime usually puts paid to the notion of on-site part-timers doing it.

          It's outsourcing, pure and simple, and needs to be weighted properly in the analyses.

          1. Vic

            Re: Another day...(@ac)

            > If you can pull off IT at your scale better and cheaper than "Cloud," then local is best. Do notice the "and" in there. It has to be both.

            No, it really doesn't.

            If cloud doesn't fulfill your needs, it really doesn't matter how cheap it is.

            As my grandfather used to say, "a cheap solution that doesn't work is neither".

            Vic.

      2. vmistery

        Re: Another day...

        If your business depends on uptime then yes, duplicating things with another provider is exactly what you should be doing, along with a tested plan to flip over to it.

      3. Anonymous Coward
        Anonymous Coward

        Re: Another day...

        "Short of building a completely separate parallel network, with a completely different vendor's kit, bugs are going to surface once in a while."

        If you can't guarantee 24/7/365 uptime, don't put it in the marketing.

        Aside from that, vendor bugs don't all surface at exactly the same time in all similar pieces of equipment, unless the "bug" is actually a hack they don't want to admit to.

      4. P. Lee

        Re: Another day...

        Not sure why you got all the downvotes there.

        Cloud uptime is generally quite good. Having said that, a seventeen-hour outage is long, even for a small business with no support.

        The only thing I would say is that if you run your own stuff, it is generally simpler and therefore less likely to go wrong. You are more likely to be able to hack together a workaround than a cloud provider can. Popping an extra 1G NIC into a device is easier than working out how to deal with an extra 40G going somewhere unexpected.

        Multiple cloud providers are an option, but then you need to be careful about who controls DNS and how wide-area load-balancing is carried out.

      5. Anonymous Coward
        Anonymous Coward

        Re: Another day...

        > Do you think Juniper would spend as much time with you, a tiny customer, as with a large provider?

        I do if I have paid for support, because that is what it's there for. They might pay a little less per device for support due to a bulk discount, but the level of support shouldn't be any less.

  6. Anonymous Coward
    Anonymous Coward

    Unfortunate timing, as I'm expecting a call from a C4L sales droid about a leased line this afternoon! Still, when I've dealt with them in the past they've always given the impression of being one of the few infrastructure and hosting companies in the UK that actually has a clue what they are doing.

    1. anothercynic Silver badge

      @AC

      You *do* realise that it is possible they *do* know what they're doing, but not what Juniper's f***ed up in their software? It's not always the messenger who cocks things up, so shooting them is pointless...

      1. Anonymous Coward
        Anonymous Coward

        Re: @AC

        "You *do* realise that it is possible they *do* know what they're doing, but not what Juniper's f***ed up in their software? It's not always the messenger who cocks things up, so shooting them is pointless..."

        They're not the messenger, they're the army. And any army that has no backup for faulty kit deserves to be defeated.

      2. Andrew Dancy

        Re: @AC

        Agreed - didn't mean to say it was their fault as it does look like a Juniper issue from what has been said so far. I was just commenting on the ironic timing. I've always taken the view that everyone in the IT industry is going to have problems some day, it's how they deal with them that matters and so far (to an outsider) their communication has been reasonable.

  7. Velv
    Facepalm

    He who laughs last...

    I love the schadenfreude comments idiots make about cloud providers.

    For every cloud outage that makes the media there are hundreds of minions running round in-house data centres recovering their business right now that never make the media.

    Stuff breaks. Once you accept that fact, you plan how you will work around those times. It doesn't matter if it's in-house, outsourced, hybrid or distributed: plan for it to break, and test it.

    1. Rich 11

      Re: He who laughs last...

      But a cloud outage hits hundreds or thousands of companies at a time.

  8. Doctor_Wibble

    Software problem?

    Software problems like this don't usually appear out of thin air, something surely had to have triggered it - so is it their fault (e.g. an unfortunate patch) or has someone found a cleverly-shaped packet to spaff at their network? If the latter, then are they hosting anyone particularly controversial or just a random test target?

    1. Anonymous Coward
      Anonymous Coward

      Re: Software problem?

      BFD (bidirectional forwarding detection) is used to determine if a link is operational. My guess is that it is a mismatch of software with newer interfaces - no idea if it is a known issue.

      I've experienced software issues with BFD on another vendor's kit, and it isn't a pleasant experience watching all of your redundancy melt away as the software decides all your links are to be marked as failed rather than in service, and routing chaos follows. That was a few years ago, in an in-house data centre.
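
      Roughly, the idea is something like this (an illustrative sketch, not vendor code; the 300 ms interval and multiplier of 3 are just example values). Each end sends hello packets at an agreed interval, and the session is declared down if nothing arrives within the interval multiplied by the detect multiplier:

          # Simplified sketch of BFD-style failure detection (illustrative only).
          # A session is declared down when no hello packet has arrived within
          # rx_interval * detect_multiplier.
          import time

          class BfdSession:
              def __init__(self, rx_interval_ms=300, detect_multiplier=3):
                  self.rx_interval_ms = rx_interval_ms
                  self.detect_multiplier = detect_multiplier
                  self.last_rx = time.monotonic()
                  self.state = "up"

              def on_hello_received(self):
                  # Peer is alive; reset the detection timer.
                  self.last_rx = time.monotonic()
                  self.state = "up"

              def poll(self):
                  # Called periodically; marks the session down if the peer has
                  # been silent for longer than the detection time.
                  detect_time_s = (self.rx_interval_ms * self.detect_multiplier) / 1000.0
                  if time.monotonic() - self.last_rx > detect_time_s:
                      self.state = "down"  # routing would now pull routes via this link
                  return self.state

      With those example values a genuine link failure is detected in under a second, but the flip side is that a bug which stops hellos being sent or processed makes perfectly healthy links look dead just as quickly.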

  9. hamiltoneuk

    The hosting firm that I use uses C4L. I was not aware of them until it went wrong. Never heard of a NOC in my life. I still have no idea what BFD/100G issues are. What a clever world we live in.

    1. Dwarf
      Joke

      Obligatory Quake Reference

      There's the problem - you should have used the BFG 9000.

      It solves all your problems in one go.

  10. Nate Amsden

    sounds like

    a big SDN fail clusterfuck. I'd bet they tried to automate a few too many things and when shit exploded it blew up big time and now they don't know how to recover (I bet every time they try it just breaks itself again).

    Either that, or perhaps they are running QFabric if it's Juniper, another network architecture that I would never use myself (nothing against QFabric specifically, more against TRILL's complexity). Of course I'm not a network engineer (by trade), so don't take my advice.

    1. Anonymous Coward
      Anonymous Coward

      Re: sounds like

      " (nothing against Qfabric specifically more against TRILL's complexity). Of course I'm not a network engineer (by trade)",,,,

      If you were a network engineer, then you would know Qfabric does not use TRILL.. :>)

    2. Anonymous Coward
      Anonymous Coward

      Re: sounds like

      From what I gather they are not using SDN or QFabric; they run Juniper MX routers with MPLS for the CoreTX stuff, maybe some Extreme kit (it was in one of their blog posts, but I haven't seen them say anything about Extreme for ages). The old network seems to be Cisco based.

  11. MHZawadi

    C4L customer

    I wouldn't mind the odd issue, but C4L have had issues for the last 6 months. Some lasting only 5-10 minutes, others, like this one, lasting hours.

  12. Henry Wertz 1 Gold badge

    Pushing the hardware?

    I'm wondering if this isn't an issue that shows up when the Juniper (or some links on it) hits 100% load, i.e. if they'd had a slightly larger Juniper, or one additional Juniper on site, it wouldn't have happened. It shouldn't just blow up, but I've heard about some of this higher-speed hardware (25gig, 40gig, 100gig) not implementing flow control, and having to decide whether to buffer packets or drop them if a link actually hits 100% load. Buffering can cause unreasonable latency fluctuations. But with dropping, some types of traffic will apparently assume perfect delivery, despite using UDP or raw Ethernet frames (neither of which guarantees delivery). That's on top of other potential problems people have already brought up.

    Edit: maybe not - the NOC post linked to suggests a routing issue.

    1. Anonymous Coward
      Anonymous Coward

      Re: Pushing the hardware?

      My understanding was that this was a BFD issue - BFD is generally used to rapidly detect link failures and fail traffic over to a second link by interacting directly with the routing processes. If BFD is failing then it is probably taking routes out of service by marking links as down.

      In terms of the cause of the BFD issue, it was termed a software bug, but that would cover numerous potential causes...
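
      To illustrate the "interacting directly with the routing processes" part, a toy sketch (illustrative only; the upstream names and preference values are made up): a next-hop is only usable while its link's BFD session reports up, so a BFD implementation that wrongly declares every session down leaves the box with no usable routes at all.

          # Toy illustration (not vendor code) of BFD state feeding route selection:
          # a next-hop is only eligible while its BFD session is reported "up".
          link_state = {"upstream-a": "up", "upstream-b": "up"}  # fed by BFD sessions

          routes = [
              {"prefix": "0.0.0.0/0", "next_hop": "upstream-a", "preference": 10},  # primary
              {"prefix": "0.0.0.0/0", "next_hop": "upstream-b", "preference": 20},  # backup
          ]

          def best_route(prefix):
              # Pick the lowest-preference route whose link BFD still reports up.
              usable = [r for r in routes
                        if r["prefix"] == prefix and link_state[r["next_hop"]] == "up"]
              return min(usable, key=lambda r: r["preference"]) if usable else None

          print(best_route("0.0.0.0/0"))     # normal operation: primary via upstream-a

          link_state["upstream-a"] = "down"  # real failure: traffic fails over to backup
          print(best_route("0.0.0.0/0"))

          link_state["upstream-b"] = "down"  # buggy BFD marks everything down...
          print(best_route("0.0.0.0/0"))     # ...and no usable route is left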

    2. Anonymous Coward
      Anonymous Coward

      Re: Pushing the hardware?

      "I've heard about some of this higher-speed hardware (25gig, 40gig, 100gig) not implementing flow control"

      (1) Nobody does link-by-link flow control these days, because you can't push it all the way back to the sender anyway. That belongs to the world of X.25.

      Packets are received into buffers, and if the buffers fill, packets are thrown away. Better devices will start throwing away a few packets *before* the buffers are completely full (Google "random early drop").

      (2) Due to the random arrival times of packets and other effects, buffers can fill long before the link hits 100% utilisation. Google "network microburst".
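
      For anyone curious what random early drop looks like in practice, a simplified sketch (the thresholds and drop probability are made-up example values, not any vendor's defaults): the drop probability ramps up as the queue fills, so senders see congestion feedback before the buffer is completely full.

          import random

          # Simplified random-early-drop sketch (illustrative only): between the
          # two thresholds packets are dropped with increasing probability;
          # above the high threshold everything is dropped.
          MIN_THRESHOLD = 40
          MAX_THRESHOLD = 80
          MAX_DROP_PROB = 0.1

          queue = []

          def enqueue(packet):
              depth = len(queue)
              if depth >= MAX_THRESHOLD:
                  return False  # hard drop: queue is effectively full
              if depth >= MIN_THRESHOLD:
                  # Drop probability rises linearly between the two thresholds.
                  drop_prob = MAX_DROP_PROB * (depth - MIN_THRESHOLD) / (MAX_THRESHOLD - MIN_THRESHOLD)
                  if random.random() < drop_prob:
                      return False  # early drop, signalling congestion to the sender
              queue.append(packet)
              return True

          for i in range(200):
              enqueue(f"packet-{i}")
          print(len(queue), "queued,", 200 - len(queue), "dropped")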

  13. Anonymous Coward
    Anonymous Coward

    The real issue

    The real issue here is that C4L are testing their network design on their production system, with their customers. They rushed in the CoreTX network without proper planning or testing because their old network was running spanning tree as the only form of resilience! They've already had to replace a lot of their switches because the convergence time following a failure was far too slow.

    Both old and new networks have experienced significant downtime over the past few years; they have probably the least reliable network of any provider in their sphere. Fortunately I'm not a customer, but I know a few people who are and the service they have received is poor.

  14. Anonymous Coward
    Anonymous Coward

    Not cloud

    I don't think they are a "cloud". It looks like they mostly provide rack space and connectivity (when it works). I suspect it's more their customers who do "cloud", not them.

  15. Anonymous Coward
    Anonymous Coward

    Auto rollover of contract

    In our business over the last 10 years we have used a total of 5 data centres, from dedicated servers to colocated kit and finally rack space.

    In order to bring some new services online last year we took a 1/4 rack at C4L in Poole, which is now proving to be a nightmare, as more than 12 months down the road we have not switched any traffic to the servers. This is purely because of the number of periods when the servers are simply not available.

    We use an external DNS service which is able to switch A records between our locations in the event of any becoming unavailable. For the servers located at Rackspace and Rapidswitch, this actually never happens. For C4L we get alerted to such outages at least twice a week.
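
    For anyone wanting to do something similar, the gist of our setup is roughly this (a rough sketch only; the addresses are documentation placeholders and the final step would call whatever API your DNS provider exposes):

        # Rough sketch of DNS-based failover (illustrative; the real thing would
        # call the DNS provider's API instead of printing).
        import socket

        LOCATIONS = {
            "rackspace": "192.0.2.10",   # placeholder addresses (RFC 5737 range)
            "c4l-poole": "192.0.2.20",
        }

        def is_healthy(ip, port=443, timeout=5):
            # Crude health check: can we open a TCP connection to the service?
            try:
                with socket.create_connection((ip, port), timeout=timeout):
                    return True
            except OSError:
                return False

        def choose_record():
            # Return the first healthy location, in order of preference.
            for name, ip in LOCATIONS.items():
                if is_healthy(ip):
                    return name, ip
            return None, None

        name, ip = choose_record()
        if ip:
            print(f"point the www A record at {ip} ({name})")
        else:
            print("no healthy location - alert the on-call")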

    It seems that any network event within the core in London propagates through the whole network and results in some level of outage.

    As confidence is low and we were out of our 12-month agreement, there is no solution other than cancelling and relocating to a more stable location. I have just received an email from the account manager stating that the contract rolls over for a subsequent 12 months with a 90-day cancellation period!

    Looking for a decent data centre? C 4 Hell isn't an option.

    Frustrated London.
