C For Hell: Data centre meltdown for irate customers as C4L GOES TITSUP

C4L has been battling a major outage for the best part of a day and customers are becoming increasingly angry about the lengthy downtime. The Bournemouth-based colocation and cloud provider, which switched its network over to Juniper kit in October last year, said it was working with the vendor to resolve the outage. C4L said …

  1. TimR

    Would BT's DNS problems be connected to this?

    Currently getting no response from ns3.bt.net (194.72.6.57) or ns4.bt.net (194.73.82.242)

    1. Anonymous Coward
      Anonymous Coward

      No, BT have nothing to do with C4L.

  2. Anonymous Coward
    Anonymous Coward

    Have to feel pity for the bods beavering away trying to fix

    Did consider colo-ing with them.. Pretty glad I didn't now.

  3. Anonymous Coward
    Anonymous Coward

    Colocation

    co... location......

    co...

    location....

    So, this is everyone's secondary system, right? They're doing their own redundancy... or are the tweeting customers expecting C4L to manage ALL the redundancy?

    (I'd still be bitching, but worth pointing out that it's never down to the supplier alone).

  4. Anonymous Coward
    Anonymous Coward

    More fool the idiots...

    .. who think "cloud" based storage is somehow magic and won't ever suffer this sort of thing. If PT Barnum were alive today he'd probably revise his famous saying to "swarms of suckers born every minute".

    1. Roj Blake Silver badge

      Re: More fool the idiots...

      Outages don't only happen in the cloud; they also happen at in-house facilities, and are typically more likely to happen there.

      1. Vic

        Re: More fool the idiots...

        > Outages don't only happen in the cloud; they also happen at in-house facilities, and are typically more likely to happen there.

        Not at mine, they aren't...

        Vic.

  5. Mephistro
    Devil

    Another day...

    ... another cloud SNAFU.

    To me, the IT world is beginning to look like a huge pack of lemmings on crack. The cloud, tablets, BYOD...

    What worries me the most is that I'm beginning to remember with nostalgia the days when we complained about WinXP's lack of security. At least we had some kind of control over our hardware and software. Ahhh, those were the days!

    1. Anonymous Coward
      Anonymous Coward

      Re: Another day...

      Or maybe, the fact that there are thousands of hosting providers out there, and only a handful of outages like this one, means that on the whole things in "the cloud" are done extremely well?

      Of course, you can take your own colo, and you can connect to multiple ISPs with BGP. Could you do it better than a larger provider? How good is the BGP expertise among your 24x7 NOC staff?

      Suppose you were the subject of the same Juniper bug as C4L have apparently succumbed to. Do you think Juniper would spend as much time with you, a tiny customer, as with a large provider?

      Things which are based on complex software have bugs, and bugs interact in complex ways. Equipment may be tested in a simulated environment, but bugs may not show up until production traffic is put on it, or until it has been running for an extended period of time.

      Short of building a completely separate parallel network, with a completely different vendor's kit, bugs are going to surface once in a while.

      1. Mephistro

        Re: Another day...(@ac)

        "Things which are based on complex software have bugs."

        I think you answered your own comment with that sentence. Problems and bugs grow exponentially with complexity. Taking all your IT infrastructure into "The Cloud" means that you'll suffer errors from many different sources -ISPs, cloud providers, VM software, datacenter software and hardware... plus possible combinations of these- and you don't have any control over (most of) said problems or bugs. Add the issues with consumer lock-in, security and privacy and the picture gets even darker.

        There are cases where the cloud makes sense, but nowadays it's being peddled for just about everything, and companies are swallowing the bait, the line, the sinker and the f**king fishing rod.

        1. Anonymous Coward
          Anonymous Coward

          Re: Another day...(@ac)

          I don't use "Cloud" so that's out of the bias equation. What I do have is broad experience of problems in systems ranging from small ones to one of the largest concerns on the planet. If you can pull off IT at your scale better and cheaper than "Cloud," then local is best. Do notice the "and" in there. It has to be both. Keeping a log book of downtime usually puts paid to the notion of on-site part-timers doing it.

          It's outsourcing, pure and simple, and needs to be weighted properly in the analyses.

          1. Vic

            Re: Another day...(@ac)

            > If you can pull off IT at your scale better and cheaper than "Cloud," then local is best. Do notice the "and" in there. It has to be both.

            No, it really doesn't.

            If cloud doesn't fulfill your needs, it really doesn't matter how cheap it is.

            As my grandfather used to say, "a cheap solution that doesn't work is neither".

            Vic.

      2. vmistery

        Re: Another day...

        If your business depends on uptime then yes, duplicating things with another provider is exactly what you should be doing, along with a tested plan to flip over to it.

      3. Anonymous Coward
        Anonymous Coward

        Re: Another day...

        "Short of building a completely separate parallel network, with a completely different vendor's kit, bugs are going to surface once in a while."

        If you can't guarantee 24/7/365 uptime, don't put it in the marketing.

        Aside from that, vendor bugs don't all surface at exactly the same time in all similar pieces of equipment, unless the "bug" is actually a hack they don't want to admit to.

      4. P. Lee

        Re: Another day...

        Not sure why you got all the downvotes there.

        Cloud uptime is generally quite good. Having said that, a seventeen-hour outage is long, even for a small business with no support.

        The only thing I would say is that if you run your own stuff, it is generally simpler and therefore less likely to go wrong. You are more likely to be able to hack together a workaround than a cloud provider can. Popping an extra 1G NIC into a device is easier than working out how to deal with an extra 40G going somewhere unexpected.

        Multiple cloud providers are an option, but then you need to be careful about who controls DNS and how wide-area load-balancing is carried out.

      5. Anonymous Coward
        Anonymous Coward

        Re: Another day...

        > Do you think Juniper would spend as much time with you, a tiny customer, as with a large provider?

        I do if I have paid for support, because that is what it's there for. They might pay a little less per device for support due to a bulk discount, but the level of support shouldn't be any less.

  6. Anonymous Coward
    Anonymous Coward

    Unfortunate timing, as I'm expecting a call from a C4L sales droid about a leased line this afternoon! Still, when I've dealt with them in the past they've always given the impression of being one of the few infrastructure and hosting companies in the UK that actually has a clue what they are doing.

    1. anothercynic Silver badge

      @AC

      You *do* realise that it is possible they *do* know what they're doing, but not what Juniper's f***ed up in their software? It's not always the messenger who cocks things up, so shooting them is pointless...

      1. Anonymous Coward
        Anonymous Coward

        Re: @AC

        "You *do* realise that it is possible they *do* know what they're doing, but not what Juniper's f***ed up in their software? It's not always the messenger who cocks things up, so shooting them is pointless..."

        They're not the messenger, they're the army. And any army that has no backup for faulty kit deserves to be defeated.

      2. Andrew Dancy

        Re: @AC

        Agreed - didn't mean to say it was their fault as it does look like a Juniper issue from what has been said so far. I was just commenting on the ironic timing. I've always taken the view that everyone in the IT industry is going to have problems some day, it's how they deal with them that matters and so far (to an outsider) their communication has been reasonable.

  7. Velv
    Facepalm

    He who laughs last...

    I love the schadenfreude comments idiots make about cloud providers.

    For every cloud outage that makes the media there are hundreds of minions running round in-house data centres recovering their business right now that never make the media.

    Stuff breaks. Once you accept that fact, you plan how you will work around those times. It doesn't matter if it's in-house, outsourced, hybrid or distributed: plan for it to break, and test it.

    1. Rich 11

      Re: He who laughs last...

      But a cloud outage hits hundreds or thousands of companies at a time.

  8. Doctor_Wibble

    Software problem?

    Software problems like this don't usually appear out of thin air, something surely had to have triggered it - so is it their fault (e.g. an unfortunate patch) or has someone found a cleverly-shaped packet to spaff at their network? If the latter, then are they hosting anyone particularly controversial or just a random test target?

    1. Anonymous Coward
      Anonymous Coward

      Re: Software problem?

      BFD (bidirectional forwarding detection) is used to determine if a link is operational. My guess is that it is a mismatch of software with newer interfaces - no idea if it is a known issue.

      I've experienced software issues with BFD on another vendor's kit, and it isn't a pleasant experience watching all of your redundancy melt away as the software decides all your links are to be marked as failed rather than in service, and routing chaos follows. That was a few years ago, in an in-house data centre.
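
      Roughly, the idea is something like this (an illustrative sketch, not vendor code; the 300 ms interval and multiplier of 3 are just example values). Each end sends hello packets at an agreed interval, and the session is declared down if nothing arrives within the interval multiplied by the detect multiplier:

          # Simplified sketch of BFD-style failure detection (illustrative only).
          # A session is declared down when no hello packet has arrived within
          # rx_interval * detect_multiplier.
          import time

          class BfdSession:
              def __init__(self, rx_interval_ms=300, detect_multiplier=3):
                  self.rx_interval_ms = rx_interval_ms
                  self.detect_multiplier = detect_multiplier
                  self.last_rx = time.monotonic()
                  self.state = "up"

              def on_hello_received(self):
                  # Peer is alive; reset the detection timer.
                  self.last_rx = time.monotonic()
                  self.state = "up"

              def poll(self):
                  # Called periodically; marks the session down if the peer has
                  # been silent for longer than the detection time.
                  detect_time_s = (self.rx_interval_ms * self.detect_multiplier) / 1000.0
                  if time.monotonic() - self.last_rx > detect_time_s:
                      self.state = "down"  # routing would now pull routes via this link
                  return self.state

      With those example values a genuine link failure is detected in under a second, but the flip side is that a bug which stops hellos being sent or processed makes perfectly healthy links look dead just as quickly.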

  9. hamiltoneuk

    The hosting firm that I use uses C4L. I was not aware of them until it went wrong. Never heard of a NOC in my life. I still have no idea what BFD/100G issues are. What a clever world we live in.

    1. Dwarf
      Joke

      Obligatory Quake Reference

      There's the problem - you should have used the BFG 9000.

      It solves all your problems in one go.

  10. Nate Amsden

    sounds like

    a big SDN fail clusterfuck. I'd bet they tried to automate a few too many things and when shit exploded it blew up big time and now they don't know how to recover (I bet every time they try it just breaks itself again).

    Either that, or perhaps they are running QFabric if it's Juniper, another network architecture that I would never use myself (nothing against QFabric specifically, more against TRILL's complexity). Of course I'm not a network engineer (by trade), so don't take my advice.

    1. Anonymous Coward
      Anonymous Coward

      Re: sounds like

      " (nothing against Qfabric specifically more against TRILL's complexity). Of course I'm not a network engineer (by trade)",,,,

      If you were a network engineer, then you would know Qfabric does not use TRILL.. :>)

    2. Anonymous Coward
      Anonymous Coward

      Re: sounds like

      From what I gather they are not using SDN or QFabric; they run Juniper MX routers with MPLS for the CoreTX stuff, maybe some Extreme kit (it was in one of their blog posts, but I haven't seen them say anything about Extreme for ages). The old network seems to be Cisco based.

  11. MHZawadi

    C4L customer

    I wouldn't mind the odd issue, but C4L have had issues for the last 6 months. Some lasting only 5-10 minutes, others, like this one, lasting hours.

  12. Henry Wertz 1 Gold badge

    Pushing the hardware?

    I'm wondering if this isn't an issue that shows up when the Juniper (or some links on it) hits 100% load, i.e. if they'd had a slightly larger Juniper, or one additional Juniper on site, it wouldn't have happened. It shouldn't just blow up, but I've heard about some of this higher-speed hardware (25gig, 40gig, 100gig) not implementing flow control, and having to decide whether to buffer packets or drop them if a link actually hits 100% load. Buffering can cause unreasonable latency fluctuations. But with dropping, some types of traffic will apparently assume perfect delivery, despite using UDP or raw Ethernet frames (neither of which guarantees delivery). That's on top of other potential problems people have already brought up.

    Edit: maybe not - the NOC post linked to suggests a routing issue.

    1. Anonymous Coward
      Anonymous Coward

      Re: Pushing the hardware?

      My understanding was that this was a BFD issue - BFD is generally used to rapidly detect link failures and fail traffic over to a second link by interacting directly with the routing processes. If BFD is failing then it is probably taking routes out of service by marking links as down.

      In terms of the cause of the BFD issue, it was termed a software bug, but that would cover numerous potential causes...
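
      To illustrate the "interacting directly with the routing processes" part, a toy sketch (illustrative only; the upstream names and preference values are made up): a next-hop is only usable while its link's BFD session reports up, so a BFD implementation that wrongly declares every session down leaves the box with no usable routes at all.

          # Toy illustration (not vendor code) of BFD state feeding route selection:
          # a next-hop is only eligible while its BFD session is reported "up".
          link_state = {"upstream-a": "up", "upstream-b": "up"}  # fed by BFD sessions

          routes = [
              {"prefix": "0.0.0.0/0", "next_hop": "upstream-a", "preference": 10},  # primary
              {"prefix": "0.0.0.0/0", "next_hop": "upstream-b", "preference": 20},  # backup
          ]

          def best_route(prefix):
              # Pick the lowest-preference route whose link BFD still reports up.
              usable = [r for r in routes
                        if r["prefix"] == prefix and link_state[r["next_hop"]] == "up"]
              return min(usable, key=lambda r: r["preference"]) if usable else None

          print(best_route("0.0.0.0/0"))     # normal operation: primary via upstream-a

          link_state["upstream-a"] = "down"  # real failure: traffic fails over to backup
          print(best_route("0.0.0.0/0"))

          link_state["upstream-b"] = "down"  # buggy BFD marks everything down...
          print(best_route("0.0.0.0/0"))     # ...and no usable route is left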

    2. Anonymous Coward
      Anonymous Coward

      Re: Pushing the hardware?

      "I've heard about some of this higher-speed hardware (25gig, 40gig, 100gig) not implementing flow control"

      (1) Nobody does link-by-link flow control these days, because you can't push it all the way back to the sender anyway. That belongs to the world of X.25.

      Packets are received into buffers, and if the buffers fill, packets are thrown away. Better devices will start throwing away a few packets *before* the buffers are completely full (Google "random early drop").

      (2) Due to the random arrival times of packets and other effects, buffers can fill long before the link hits 100% utilisation. Google "network microburst".
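
      For anyone curious what random early drop looks like in practice, a simplified sketch (the thresholds and drop probability are made-up example values, not any vendor's defaults): the drop probability ramps up as the queue fills, so senders see congestion feedback before the buffer is completely full.

          import random

          # Simplified random-early-drop sketch (illustrative only): between the
          # two thresholds packets are dropped with increasing probability;
          # above the high threshold everything is dropped.
          MIN_THRESHOLD = 40
          MAX_THRESHOLD = 80
          MAX_DROP_PROB = 0.1

          queue = []

          def enqueue(packet):
              depth = len(queue)
              if depth >= MAX_THRESHOLD:
                  return False  # hard drop: queue is effectively full
              if depth >= MIN_THRESHOLD:
                  # Drop probability rises linearly between the two thresholds.
                  drop_prob = MAX_DROP_PROB * (depth - MIN_THRESHOLD) / (MAX_THRESHOLD - MIN_THRESHOLD)
                  if random.random() < drop_prob:
                      return False  # early drop, signalling congestion to the sender
              queue.append(packet)
              return True

          for i in range(200):
              enqueue(f"packet-{i}")
          print(len(queue), "queued,", 200 - len(queue), "dropped")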

  13. Anonymous Coward
    Anonymous Coward

    The real issue

    The real issue here is that C4L are testing their network design on their production system, with their customers. They rushed in the CoreTX network without proper planning or testing because their old network was running spanning tree as the only form of resilience! They've already had to replace a lot of their switches because the convergence time following a failure was far too slow.

    Both old and new networks have experienced significant downtime over the past few years; they have probably the least reliable network of any provider in their sphere. Fortunately I'm not a customer, but I know a few people who are and the service they have received is poor.

  14. Anonymous Coward
    Anonymous Coward

    Not cloud

    I don't think they are a "cloud". It looks like they mostly provide rack space and connectivity (when it works). I suspect it's more their customers who do "cloud", not them.

  15. Anonymous Coward
    Anonymous Coward

    Auto rollover of contract

    In our business over the last 10 years we have used a total of 5 data centres, from dedicated servers to colocated kit and finally rack space.

    In order to bring some new services online last year we took a 1/4 rack at C4L in Poole, which is now proving to be a nightmare, as more than 12 months down the road we have not switched any traffic to the servers. This is purely because of the number of periods when the servers are simply not available.

    We use an external DNS service which is able to switch A records between our locations in the event of any becoming unavailable. For the servers located at Rackspace and Rapidswitch, this actually never happens. For C4L we get alerted to such outages at least twice a week.
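
    For anyone wanting to do something similar, the gist of our setup is roughly this (a rough sketch only; the addresses are documentation placeholders and the final step would call whatever API your DNS provider exposes):

        # Rough sketch of DNS-based failover (illustrative; the real thing would
        # call the DNS provider's API instead of printing).
        import socket

        LOCATIONS = {
            "rackspace": "192.0.2.10",   # placeholder addresses (RFC 5737 range)
            "c4l-poole": "192.0.2.20",
        }

        def is_healthy(ip, port=443, timeout=5):
            # Crude health check: can we open a TCP connection to the service?
            try:
                with socket.create_connection((ip, port), timeout=timeout):
                    return True
            except OSError:
                return False

        def choose_record():
            # Return the first healthy location, in order of preference.
            for name, ip in LOCATIONS.items():
                if is_healthy(ip):
                    return name, ip
            return None, None

        name, ip = choose_record()
        if ip:
            print(f"point the www A record at {ip} ({name})")
        else:
            print("no healthy location - alert the on-call")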

    It seems that any network event within the core in London propagates through the whole network and results in some level of outage.

    As confidence is low and we were out of our 12-month agreement, there is no solution other than cancelling and relocating to a more stable location. I have just received an email from the account manager stating that the contract rolls over for a subsequent 12 months with a 90-day cancellation period!

    Looking for a decent data centre? C 4 Hell isn't an option.

    Frustrated London.
