Expired cert... Really? #O2down meltdown shows we should fear bungles and bugs more than hackers

It's a bit of a cliche that "everything's connected", but O2's stunning outage yesterday – chalked up by Swedish kitmaker Ericsson to an expired software certificate – is a reminder of how true that is. Payment terminals croaked, bus displays went blank. Strangers blinked at each other in the street, like Robinson Crusoe …

  1. Alan Bourke

    Bad news. The fog's getting thicker.

    And Leon is getting laaaaarrrrrger.

    1. Ledswinger Silver badge

      Re: Bad news. The fog's getting thicker.

      The fog's getting thicker.....And Leon is getting laaaaarrrrrger.

      In fog, the time to worry is when the word "Scania" looms into view and is getting rapidly larger.

      1. Ozumo

        Re: Bad news. The fog's getting thicker.

        Or ovloV

  2. fedoraman
    Flame

    Acronyms

    FFS (For F£$k Sake) expand your acronyms the first time you use them!

    I've got better things to do on a Friday mid-morning than work out whether M2M means made to measure, machine-to-machine, or some defunct Norwegian pop duo!

    Well, slightly better, I mean - reading the Reg ......

    1. upsidedowncreature

      Re: Acronyms

      Indeed. '"MVP" mentality' - Model/View/Presenter mentality? Most Valued Professional mentality?

      1. Anonymous Coward

        Re: Acronyms

        Minimum Viable Product

        What normal people would call an alpha release

        1. hoola

          Re: Acronyms

          MVP, what counts for any normal techy solution in the current day. Deliver the absolute minimum, promise the earth & walk away, safe in the knowledge that unless the customer is really, really big there is sod all anyone can do about it.

          And even if you are really big, there is still probably sod all you can do about it.

      2. Ozumo

        Re: Acronyms

        Most Valuable Player

    2. Dr Who

      Re: Acronyms

      Beat me to it

    3. Anonymous Coward

      Re: Acronyms

      As this is a "co.uk" site they're abbreviations not acronyms

      1. A.P. Veening

        Re: Acronyms

        "As this is a "co.uk" site they're abbreviations not acronyms"

        No, these are all TLAs (Three Letter Acronyms).

        1. The First Dave
          Headmaster

          Re: Acronyms

          All of these were, indeed, Three Letter Abbreviations (TLAs)

          1. Semtex451 Silver badge

            Re: Acronyms

            Agreed, in my book an acronym should be pronounceable, as in SNAFU

        2. Doctor Syntax Silver badge

          Re: Acronyms

          "No, these are all TLAs (Three Letter Acronyms)."

          Two out of three ain't bad.

          Three Letter Abbreviations.

          1. TRT Silver badge

            Re: Acronyms

            MJE.

            Minimum Journalistic Effort

            or

            Maximum Jargon Enclosure

            1. Ragarath

              Re: <strike>Acronyms</strike> Initialism

              Came up on the first Google search so it must be right.

              Acronym = Letters that form words

              Abbreviation = Shortened word E.G. St, Dr etc

              Initialism = First letter of each word and enunciated E.G. VIP

              If I'm wrong blame Google, it's not that I'm lazy... honest!

              1. #define INFINITY -1 Bronze badge

                Re: <strike>Acronyms</strike> Initialism

                Wow, on a dot-uk site, no-one seems to have a copy of Fowler? This all falls under 'curtailment', and Britons do not need to keep their vocabulary in the Victorian era. Acronym is a 20th century invention.

              2. Danny 4

                Re: <strike>Acronyms</strike> Initialism

                Abbreviation = Shortened word E.G. St, Dr etc

                I always thought St and Dr were contractions but never bothered with the apostrophe.

                1. TRT Silver badge

                  Re: <strike>Acronyms</strike> Initialism

                  No. St. and Dr. are abbreviations.

                  Can't and Don't are contractions - they are made up from two or more words.

        3. John Brown (no body) Silver badge
          Happy

          Re: Acronyms

          "No, these are all TLAs (Three Letter Acronyms)."

          Ok, so what is M2M then? TLAAN? (Two Letters And A Number)

        4. illiad

          Re: Acronyms

          what about ETLAs??? :P

  3. Aladdin Sane Silver badge

    Hanlon's razor strikes again.

    1. Anonymous Custard Silver badge

      Most definitely - Never attribute to malice that which is adequately explained by stupidity.

      1. Wayland Bronze badge

        Hanlon's razor

        How about DBN. Don't Be Naive. People can be bad so stop giving them the benefit of the doubt. If they did something wrong don't let them off just because you think they did not mean it.

  4. djstardust Silver badge

    Was this

    Not the same Ericsson who caused a series of outages on O2 in 2012?

    Lessons learned of course .......

    1. Alister Silver badge

      Re: Was this

      Lessons learned of course

      Not sure which of the lessons from the 2012 outage would be applicable to yesterday's situation?

      1. Threlkeld

        Re: Was this

        “They're funny things, Accidents. You never have them till you're having them.”

        ― A.A. Milne, The House at Pooh Corner

    2. Voland's right hand Silver badge

      Re: Was this

      Do not blame Ericsson here.

      UK telco operations have a well established and entrenched fear of certificates for anything.

      Once upon a time, before I went back to write software, I still did network architecture including security aspects. So while working in a major UK telco I proposed the idea of certificates everywhere for purposes of inventory, identification and security of provisioning. I was freshly out of a vendor where I did most of the design and implementation of an X509 retrofit into everything, and the certificates became the foundation of how the system fits together. So I was expecting some questions or a technical discussion.

      I got none.

      The faces around the table looked like they were a still frame from The Shining. They looked at the idea like I was serving a disemboweled body with maggots and suggesting they eat it. They were horrified at the idea despite having less than 60% accurate inventory and a long standing requirement to secure key aspects of the network management.

      This fear has its roots in incidents like the one in O2. It is also the root cause of incidents like O2.

      UK telcos (and most telcos in general) fail to understand the most basic principle of using X509 for infrastructure purposes.

      It is: YOU RUN YOUR OWN CA. No vendor roots. The root is yours. And so are ALL certs.

      Because they do not understand it and fear it, they either use vendor certs (which expire at the most unfortunate moment) or outsource it to an external CA which defeats the purpose of the exercise as you are no longer in control of your network. Either one of these results in an incident like O2 which in turn results in more fear, more vendor use and more outsourcing.

      Ad nauseam, rinse, repeat.

      Oh, and by the way, no lessons will be learned from this incident - O2 will NOT start running its own CA as it should.

      1. Anonymous Coward

        Re: Was this

        How difficult is it to put the certificate expiry date in the electronic diary with a reminder a fortnight before?

        1. DougS Silver badge

          Re: Was this

          In what electronic diary? Notifying whom?

          Do you know how many certificates large enterprises have to manage now? It would be a full time job for someone - but if you made it that, you'd be screwed when they went on vacation or quit and the reminder from their electronic diary went to /dev/null.

          The whole system around certificates is irretrievably broken if you require humans to be in the middle of it. It has to be automated - a subscription service that automatically updates. We will never see the end of such issues so long as humans have to be "reminded", because we are fallible. If the certificate for some weird page hardly anyone visits expires, it might be weeks before the company is notified. If the certificate required for mobile data to work at a large provider expires, it could do a lot of damage in the hours required for the problem to be diagnosed and corrected.

          1. Roland6 Silver badge

            Re: Was this

            The whole system around certificates is irretrievably broken if you require humans to be in the middle of it. It has to be automated - a subscription service that automatically updates.

            Suggest you dust down the risk assessments from the mid-1990's for Single-Sign-On solutions - these worked well whilst everything worked, break something and everything fell into a rather big heap, from which it was easier to reset and start again than trying to recover...

            The obvious issue with subscription services is ensuring the bank account(s) from which monies are automatically taken always have sufficient funds (or haven't been closed) and if there is a hiccup in payment processing things get escalated so that action can be taken before certificates expire...

            1. DougS Silver badge

              Re: Was this

              True, payment processing can be a problem, but no more of a problem than it is for manual payment. Ideally it would be done with a yearly subscription for all your certificates in a lump sum, or paid in monthly installments, rather than dribbling out a small payment each time a certificate is renewed. The accounting department would HATE YOU if you managed 3000 certificates and each was a separate charge for yearly renewal!

              Automated renewal also makes it practical to have certificates that last only a month, making the cumbersome process of revoking them if compromised less of a factor.

              1. Roland6 Silver badge

                Re: Was this

                If your organisation relied on certificates and you were using more than a handful, I suggest you would be well advised to set up your own PKI, it isn't all that difficult. That would reduce your 3000 certificate (subscriptions) to one root certificate.

                It also makes it practical to have as you suggest short lived certificates as they would be wholly managed within your own infrastructure.

                BTW, if your Accounts department can't handle 3000 certificate renewals a year then there is something wrong with it - it's not that difficult in many accounts/financial systems to set up a bank account and ledger for recurring IT expenditure/subscriptions. But I expect the problem is that in many companies IT doesn't talk finance to Finance and so get things neatly structured.

        2. Oneman2Many

          Re: Was this

          Not so simple when you have thousands of certs to look after. However when you have that many certs then all the more reason to have processes in place to manage certs properly

          1. werdsmith Silver badge

            Re: Was this

            Thousands of certs is precisely why they should be electronically tracked.

            1. DougS Silver badge

              Re: Was this

              And that still requires a manual process to ensure EVERY certificate finds its way into that electronic monitoring system. This is better than a manual process around every renewal since you only need to do it once for a certificate and then you are good for as long as that particular certificate-requiring function remains exactly the same.

              Better, but not good enough.

      2. ItWasn'tMe

        Re: Was this

        Alternatively I have first hand experience of a UK telco that did act as a CA, but then managed to 'lose' the passphrase to their root cert! You couldn't make it up

        1. werdsmith Silver badge

          Re: Was this

          Cheap almost free open source monitoring software can keep an eye on certificates and give you prior warning that the date in one is approaching. You can choose how much warning you want and it will display it on a dashboard in red, send you an email or automatically open an ITIL compliant helpdesk ticket for you, with P1 urgency if you want.

          Even the most shoddy IT shops I've dealt with have this sorted. It's really simple stuff.
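A rough sketch of the kind of check such monitoring tools perform, using only Python's standard `ssl` module. The function names, the 14-day threshold and the idea of polling a host are assumptions for illustration, not the behaviour of any particular product:

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days left before a certificate's notAfter timestamp.

    not_after uses the text format returned by ssl's getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_warning(not_after: str, now: datetime, warn_days: int = 14) -> bool:
    """True when the certificate expires within warn_days (or already has)."""
    return days_until_expiry(not_after, now) <= warn_days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect over TLS and return the server certificate's notAfter field.

    With the default context an already-expired certificate fails the
    handshake with an ssl.SSLError, which is itself worth alerting on.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Something like `needs_warning(fetch_not_after("example.org"), datetime.now(timezone.utc))` could then feed a dashboard, an email or a ticket. This only covers certificates reachable over TLS; one baked into vendor software would need a different way in.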

          1. Mandoscottie

            Re: Was this

            werdsmith, you're missing a vital point, you're assuming O2 (the company) actually give a fook (shareholders will if share price slides longer than 24 hours).

            Give it a week and nobody will even remember they had an outage, once they can upload fish face pictures to instatwat or pictures of their lunch to twatbook

      3. Dave Bell

        Re: Was this

        I can see what you're getting at. The certificate system has a different purpose for this situation. It isn't about somebody such as me, downloading software from a myriad of possible suppliers, possibly via intermediaries, where the certificate is about blocking access to possible malware, now with such things as HTTPS. Secure delivery still needs attention, but once a genuine copy of the software is delivered and authorised for use, the supplier's action (or inaction) shouldn't be able to stop it working.

        Yeah, I suppose contracts can set up something like software rental, and that's nothing new. But if you shut down your customer I am sure the lawyers would be interested in the procedures you followed.

  5. el kabong

    Move fast, break things

    break your neck too.

  6. Semtex451 Silver badge

    What was it that Giffgaff did that they come in for so much stick?

    1. MrMerrymaker

      Appealed to a customer base of the lowest common denominator?

      1. MrMerrymaker

        Why the thumbs down? I'm with giffgaff!

        But a look in their forums shows tons of people just screaming at them, who didn't even bother reading the news. Even the Grauniad mentioned it in enough depth to say it wasn't Giffgaff at fault

        1. werdsmith Silver badge

          You see people in Grauniad comments doing the same.

          In fact all over the internet.

        2. John Brown (no body) Silver badge

          "But a look in their forums shows tons of people just screaming at them, who didn't even bother reading the news."

          How were they supposed to read the news when their phone data connection was down? You don't honestly think they would have something old fashioned like a landline based connection or a radio or even a TV, do you? No, of course not. The world had just ended!

          1. Roland6 Silver badge

            "How were they supposed to read the news when their phone data connection was down?"

            How were they able to post in forums if they had no data connection...

            I suggest that those able to access forums weren't those truly impacted by this outage, whose smartphone would have been reduced to a games console for Snake and Tetris (aside: showing my age here)

        3. Mandoscottie

          Giff, Gaff, you mean Telefonica aka O2?

          Maybe it's just me but their adverts really get on my goat, more so than any other telco's ads (which are bad). Every ad they spout, all I can hear in my head is Liar Liar Bums on fire, you're Telefonica in disguise you charlatan!

          replace Giff Gaff with Tesco, Sky and Lyca......it fits!

    2. Rogerborg 2.0

      GiffGaff made a point of not blamesplaining that it wuz O2 wut dun it, they just apologised to their customers as though they were at fault.

      Worse than Hitler, really.

      1. caffeine addict Silver badge

        blamesplaining

        There are currently 30 google results for that abomination for a word.

        If it becomes popular, we're holding you directly responsible. The tar is already being warmed and the chickens are being plucked...

        1. xanda
          Trollface

          ...The tar is already being warmed and the chickens are being plucked...

          Please spare some too for the pillock(ess) who came up with "functionality"...

  7. The Vociferous Time Waster

    Painter's 2nd Law of IT

    If an IT organisation has to manage something that can expire and must be renewed then it follows that it shall, at some point expire without having been renewed.

    1. lglethal Silver badge
      Joke

      Re: Painter's 2nd Law of IT - Fixed it for you...

      If an IT organisation has to manage something that can expire and must be renewed then it follows that it shall, at some point expire without having been renewed at the worst possible moment.

      Certificates always expire at the time when a) the responsible IT bod is on annual leave, or b) there has been a change in management/HR/re-organisation such that no-one is sure who is responsible for the certificates or who can approve paying for their renewal, or c) just after a major IT upgrade, so everyone thinks that the failure is due to the new equipment. Other options are also available...

      1. #define INFINITY -1 Bronze badge

        Re: Painter's 2nd Law of IT - Fixed it for you...

        That's if you're using MS-Windows. cron doesn't need to be 'updated' when the root password changes.

        Oh, you're manually updating certificates?

        1. werdsmith Silver badge

          Re: Painter's 2nd Law of IT - Fixed it for you...

          "the system is down, nothing is working...."

          "what time did it stop working? "

          "midnight...."

          Ahh, I see.

        2. Oneman2Many

          Re: Painter's 2nd Law of IT - Fixed it for you...

          Are you auto renewing certs without checking with app owner first ?

  8. tfewster Silver badge
    Facepalm

    The difference is that buses failed safe - the network connections failed, but the buses still ran.

    As I heard it, the Ericsson software was just used for billing usage. But because O2 couldn't track customers' usage, they denied them access completely.

    I think O2 should have to credit every account with 20p, even if the customer didn't complain (or get through to complain). Costly enough to impact execs' bonuses, but cheaper to implement than handling 32m complaints, so even then they get off lightly. And if I have to waste £10-worth of my time to get a 20p credit, that just adds insult to injury - I'd be asking for the £10 rather than 20p

    1. el kabong

      the buses still ran...

      Yes, this time they still ran but in the future maybe not.

      As things are headed now everything will stop working when a network meltdown occurs.

    2. TRT Silver badge

      The buses ran... but customers on O2 networks couldn't pay their bus fares (in London at least), nor did their shiny work on the Tube, and if you delay more than 100ms at the gateline in London, then you are to be crucified against the TfL roundel by your former fellow commuters that you have held up. But that's OK because all the little iPads they gave to Tube staff when they eliminated ticket offices a few years ago (the ticket offices where the machines had ludicrously outdated bits of electric string) run off O2 (except they fall back to the WiFi which was mostly Virgin, so the effect was minimal).

    3. John Brown (no body) Silver badge

      "The difference is that buses failed safe - the network connections failed, but the buses still ran."

      There was a comms failure on our local metro system the other week. Complete shutdown of the system resulted despite the fact there are far fewer vehicles involved, no vehicles other than authorised ones with trained operators, very few junctions, but, no, to be safe, it all has to stop. Could you imagine the reaction to roads being closed because traffic lights failed?

      Admittedly, there are stretches of single line operation and even sections where the light rail shares track with main line trains, so I suppose those sections might be more dangerous to operate without comms or signalling.

      1. Roland6 Silver badge

        Admittedly, there are stretches of single line operation and even sections where the light rail shares track with main line trains, so I suppose those sections might be more dangerous to operate without comms or signalling.

        Suggest you read up on the early railways and why signalling systems were developed...

  9. MrMerrymaker

    The wrong day

    I picked the wrong day to quit heroin

    1. Aladdin Sane Silver badge

      Re: The wrong day

      Is there ever a right day?

      1. m0rt Silver badge

        Re: The wrong day

        The day you overdose?

        1. Sir Runcible Spoon Silver badge

          Re: The wrong day

          "The day you overdose?"

          Technically that's quitting too, after all , you won't be doing any more will you?

          1. caffeine addict Silver badge

            Re: The wrong day

            Technically that's quitting too, after all , you won't be doing any more will you?

            I'm not sure I'd consider a hit that lasts the entire rest of your life quitting. That's the holy grail of getting high.

        2. Anonymous Coward

          Re: The wrong day

          @mort

          I can see where you got your name from.

  10. TheCynic

    Maybe the network needs a friend

    Travel on the trains and when one carrier is having issues its tickets are valid on the others for a short duration. Maybe if one carrier is having a 'Really bad day'[tm] the others could let their customers on theirs.

    You never know it might be like good for everyone.. so it will never happen

    1. el-keef

      Re: Maybe the network needs a friend

      The article explains why this is a bad idea - sudden influx of customers onto another network might bring that network down too, causing a cascade effect.

    2. james_smith

      Re: Maybe the network needs a friend

      "Travel on the trains and when one carrier is having issues its tickets are valid on the others for a short duration."

      Not valid on other carriers from Paddington station, and I suspect that's true of most commuter terminals.

  11. Pen-y-gors Silver badge

    V2X

    I assume this is something to do with controlling autonomous vehicles.

    If it is, then it's worrying. An autonomous vehicle must be able to work without a network connection! For emergencies and for areas without 5G. All it needs is to know what is around it - it doesn't need the latest news on traffic problems 300 miles away. It should be able to rely on its own sensors, and, possibly, short-range comms to chat to nearby vehicles. That's it. Updates can wait until it's next connected, like phones.

    1. The First Dave

      Re: V2X

      You mean it should actually be, like _Autonomous_ ??

    2. Doctor Syntax Silver badge

      Re: V2X

      "An autonomous vehicle must be able to work without a network connection!"

      If it needs a network connection it isn't autonomous.

    3. vtcodger Silver badge

      Re: V2X

      We've been told that the low-latency modes of 5G are required for V2X (vehicle-to-everything)

      V2X is going to be necessary for smooth traffic flow -- negotiating permission with oncoming traffic to make a left turn (for those of us who drive on the right)/right turn (for those who drive on the wrong) for example. And it's probably how the folks that are repairing yonder bridge are going to tell your car that that area that looks like a hole in the pavement is in fact a hole in the pavement. It's not clear that it needs a lot of bandwidth or especially high speeds. But it probably does need latencies never more than a few hundred ms. And of course it needs standards that are unambiguous and are actually adhered to.

      1. DougS Silver badge

        Re: V2X

        Vehicles will need (or at least want) to communicate with one another, yes. But there's absolutely no reason they need to communicate via a cell tower. They will be in close proximity to one another and can communicate directly, there's no need to go to/from a cell tower which will often be further away than the cars that need to talk to each other.

        As long as autonomous cars have to share the road with human driven vehicles they will need to be able to operate without any V2V communication though. They can't trust humans to always signal a turn etc. so they will still need to drive defensively and not fully trust the info they get from other vehicles.

        The exception to that trust would be for things like drafting bumper to bumper in the left lane, obviously you'd need to trust that the cars ahead will act appropriately and the lead car will alert the rest of a hazard that will require braking or steering. So sorry, no user modifiable software allowed!

  12. sanmigueelbeer Silver badge
    Trollface

    Could've been worse.

    O2 could've been managed by either IBM or Capita.

    1. m0rt Silver badge

      You never see, in those future dystopia movies, that the real cause of societal breakdown was due to shite service companies.

      Except for Douglas Adams, who hit the nail on the head with Sirius Cybernetics.

      1. Voyna i Mor Silver badge

        "Except for Douglas Adams, who hit the nail on the head with Sirius Cybernetics."

        Douglas Adams proved that an English degree from Cambridge can make you a better futurologist than someone with a STEM degree. I'm not sure what that proved.

        1. Anonymous Coward

          Douglas Adams proved that an English degree from Cambridge can make you a better futurologist than someone with a STEM degree.

          Please don't. You're only encouraging that Fry chap.

          1. Voyna i Mor Silver badge

            -->Please don't. You're only encouraging that Fry chap.

            Note in my post I wrote "can make", not "does make".

            DNA was a genius. (Possibly why he died young, quos amo deum morietur puer.) Fry's father is/was a bit of a genius. Fry is just very clever.

            1. Mike Pellatt

              Re: -->Please don't. You're only encouraging that Fry chap.

              Fry just thinks he's very clever.

              There, FTFY.

      2. antonyh
        Coat

        Shurely you can't be Sirius?

        1. Ozumo

          Must...not...

          OK, I am Sirius. And don't call me Shirley.

      3. MrMerrymaker

        I'm sure in Blade Runner 2049, there's a huge IBM logo. A dystopia indeed!

  13. Dan 55 Silver badge

    Seems to be many problems are down to large organisations not being able to use Outlook calendars or a big calendar on the wall with garish-coloured post-it notes.

    If the beancounters can get something done by a certain date, why can't the IT monkeys?

    1. storner
      Unhappy

      Because certificates typically expire after 2-3 years - beancounters and bosses cannot see that far ahead (except when pulling "strategies" out of various orifices).

      Even the IT monkeys doing the renewals have moved to new offices at least 3 times, so that two-year-old calendar with the post-it notes? No one remembers what it was for, so it goes down the bin.

      1. Jamie Jones Silver badge

        Only tangentially related, but it reminded me of a policy at my last place of work that I managed to change.

        If it was required to run a one off job on a machine overnight (yeah... no "at" batch command) then it was recommended that you put the job in cron, scheduled to run the next day, on that day-of-month, on that month.... so that your job wouldn't be run the next day too if you didn't remove the cron entry in time.

        Yes, you've got it - there were a number of times where some system would "randomly" cock up, and be traced to some date specific cron job that no-one remembers anything about, and which is presumably at least a year old.
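To illustrate why those entries came back to bite: a standard five-field crontab line has no year field, so a date-pinned entry such as `0 2 7 12 *` matches again twelve months on. A minimal Python sketch of that matching rule follows; the helper names and dates are made up for illustration:

```python
from datetime import date, timedelta


def cron_date_matches(day_of_month: int, month: int, when: date) -> bool:
    """Mimic the date fields of a crontab entry like '0 2 7 12 *'.

    There is no year field, so only day of month and month are compared.
    """
    return when.day == day_of_month and when.month == month


def firing_dates(day_of_month: int, month: int, start: date, end: date) -> list:
    """Every date between start and end (inclusive) on which the entry runs."""
    fired = []
    current = start
    while current <= end:
        if cron_date_matches(day_of_month, month, current):
            fired.append(current)
        current += timedelta(days=1)
    return fired


# A "one-off" job pinned to 7 December keeps firing every year
# until somebody removes the entry.
runs = firing_dates(7, 12, date(2018, 12, 1), date(2020, 12, 31))
```

Here `runs` holds 7 December of 2018, 2019 and 2020, which is the "random" cock-up a year later.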

    2. Anonymous Coward

      Generally beancounters delay payment as long as possible without actually getting sued. The idea that you cannot convince a cert that the cheque’s in the post is literally beyond their tiny minds

    3. Anonymous Coward

      Errr.

      Because the bean counters were the people responsible for outsourcing the IT department to a provider incapable of managing (or unaware of) things such as this.

      I can almost guarantee you it's the bean counters' prior actions in chasing the cheapest IT solution in order to line the pockets of those at the top that have led to this mess.

    4. Anonymous Coward

      I gave up on electronic calendars years ago and have reverted to a wall planner in the office. I haven't missed an SSL certificate expiry date since. Bloody technology.

    5. Doctor Syntax Silver badge

      "If the beancounters can get something done by a certain date, why can't the IT monkeys?"

      One of the things that the beancounters get done by a certain date is to outsource the IT monkeys who had their calendars sorted. And when the IT monkeys get outsourced are they really going to tell the beancounters "by the way, you need to keep an eye on this."? At some point beancounters get to discover that the IT people they outsourced weren't monkeys but there's a distinct possibility the outsourcers were - or maybe they were snake-oil salesmen.

    6. Anonymous Coward

      Except, and I have it on good authority, that this cert was hard coded with no access to it. The only option was to update the software.

    7. Anonymous Coward

      Beancounters can't

      At least in my case they didn't.

      Retired earlier this year but they still kept paying me for three months.

      Very nice but I had to give it back in the end.

    8. Mandoscottie

      due to said pesky bean-counters taking away ALL the beans.

  14. 87red

    I wonder why networks don't have a roaming agreement in place for such catastrophic events. The events yesterday would have cost o2 £millions in bad publicity, yet if there was the ability for them to allow customers to temporarily roam to say EE or Vodafone this could all be avoided.

    1. MrWibble

      The article explains why this is a bad idea - sudden influx of customers onto another network might bring that network down too, causing a cascade effect.

      1. Doctor Syntax Silver badge

        "The article explains why this is a bad idea"

        I wonder how many times this statement is going to have to be repeated.

        1. Adrian 4 Silver badge

          It will be repeated until answered with "properly engineered systems fail softly".

          Which is answered with "fail-safe systems fail by failing to fail safe".

        2. Wayland Bronze badge

          "The article explains why this is a bad idea"

          James Burke's Connections explained a failover system of the electrical grid in America. One relay tripped because a street was overloaded and it passed the current over to other circuits in a domino effect until the whole state was offline.

          If the load is too much for one then it will be too much when added to the next one that's still working.
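A toy Python model of that domino effect, under the simplifying assumptions that every circuit has the same capacity and that a tripped circuit's load is shared equally among the survivors. The numbers and the `cascade` name are invented for illustration:

```python
def cascade(loads, capacity):
    """Return how many circuits are still up once overloaded ones have tripped.

    Each round, every circuit above capacity trips and its load is split
    equally among the circuits that are still within capacity.
    """
    alive = list(loads)
    while alive:
        tripped = [load for load in alive if load > capacity]
        if not tripped:
            break  # everything left is within capacity
        survivors = [load for load in alive if load <= capacity]
        if not survivors:
            return 0  # nothing left to take the load
        share = sum(tripped) / len(survivors)
        alive = [load + share for load in survivors]
    return len(alive)


# Neighbours already running at 90% cannot absorb the extra: total blackout.
busy_grid = cascade([120, 90, 90, 90], capacity=100)

# Neighbours with plenty of headroom can: only the overloaded circuit is lost.
quiet_grid = cascade([120, 50, 50, 50], capacity=100)
```

In this model `busy_grid` is 0 and `quiet_grid` is 3, so whether the overflow sinks the neighbours comes down to how much headroom they have. That is the same question as whether other networks could absorb O2's customers.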

      2. Vince

        Also, but not mentioned in the article....

        If the user was "roaming" the traffic goes back to the home network, as does the check to allow roaming, voice access, etc etc and SMS.. so as O2 was not suffering "no signal" that would help in zero way.

    2. SWCD

      These sorts of decisions are often made by marketing type teams, where the brand identity is worth 10x more than the damage from down-time. The decision to allow customers onto a competitor's network as theirs is broken? No way! Get ours fixed!!

      20 or so years ago as a tech-support rep at Orange, a frequent issue at the time was SMS jamming. An easy fix was browsing for another network in the phone settings, attempting to join it (which would fail), then just joining the Orange network again. Within a minute or so, the "stuck" SMS would start coming in. Marketing or some similar dept caught wind of the advice being given out - and said it was to stop.

      No matter it worked 99% of the time, no matter there was no other fix available, no matter the customer was inconvenienced by it not working.. The sheer fright that another network's name would come up on the customer's screen? Unthinkable!

      1. Joe Harrison Silver badge

        The fix is not "failing to join the other network", it is more correctly "disconnecting and rejoining."

        Similar connectivity faults exist even today and can often be cured by temporarily going into airplane/flight mode then back again to normal mode. Or even by switching off and on again, but that's not recommended as boot times are getting ever longer because of all the crap with which we fill up our phones.

        1. Jens Goerke

          Graceful reconnect...

          ...was a concept that went out of fashion over a decade ago.

          The mere idea of software actually checking the status of its connection and then retrying, rechecking, disconnecting (cleanly!) and reconnecting before trying again has been deemed ancient cruft - programmers have become too used to reliable always-on connections and never experienced firewall timeouts or line noise causing a modem to hang up.

          $Deity, I feel old.
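
          For anyone who has not met the pattern, a minimal sketch of "check, retry, disconnect cleanly, reconnect" might look like this. The connection interface (`is_alive`, `send`, `close`), the `FlakyConnection` stand-in and the back-off figures are all hypothetical, made up for illustration rather than taken from any real library.

```python
import time


def send_with_reconnect(conn_factory, payload, attempts=3,
                        base_delay=0.5, sleep=time.sleep):
    """Send payload, reconnecting from scratch after each failure."""
    last_error = None
    for attempt in range(attempts):
        conn = None
        try:
            conn = conn_factory()                 # fresh connection each try
            if not conn.is_alive():               # check status before use
                raise ConnectionError("connection not alive after connect")
            return conn.send(payload)
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))  # exponential back-off
        finally:
            if conn is not None:
                conn.close()                      # always disconnect cleanly
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error


class FlakyConnection:
    """Demo stand-in: the first `failures` sends drop the line."""
    failures = 2
    opened = 0
    closed = 0

    def __init__(self):
        FlakyConnection.opened += 1

    def is_alive(self):
        return True

    def send(self, payload):
        if FlakyConnection.failures > 0:
            FlakyConnection.failures -= 1
            raise ConnectionError("line dropped")
        return f"sent:{payload}"

    def close(self):
        FlakyConnection.closed += 1
```

          Calling `send_with_reconnect(FlakyConnection, "hello", attempts=5)` should survive the two dropped lines and leave no connection open behind it, since every connection that was opened is closed in the `finally` clause.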

          1. sed gawk

            Re: Graceful reconnect...

            Not out of fashion everywhere,

            I'm consulting at a UK Uni at present, and I was pleasantly surprised to see something not very far from the classical clean disconnect and reconnect patterns in a recent pull request crossing my desk.

        2. ibmalone Silver badge

          Or even by switching off and on again, but that's not recommended as boot times are getting ever longer because of all the crap with which we fill up our phones.

          One of the things I've noticed about my current smartphone is it boots quicker than the one it replaced (both were mid-high end compact models), and probably about as fast as the feature phone I had before that. Brands omitted in case anyone thinks the data point is just shilling...

          ...although the 3310 was obviously quicker than any of them ;)

        3. SWCD

          "The fix is not "failing to join the other network", it is more correctly "disconnecting and rejoining.""

          Might be being a little over fussy there, Joe. The post said accurately the steps given to customers.. I think it's realised by all what those steps achieved (the disconnect/reconnect)!

  15. SVV Silver badge

    Nice to read a wider look at the issues raised by this screwup

    And the image it has painted in my head of Cameron flying into a red faced rage, because his magic smartphone kept failing in his artisan yoghurt eating Cotswold smugster's paradise has put a smile on my face for a few hours at least.

  16. Mage Silver badge
    Facepalm

    Incompetance

    I've been saying this for over 30 years.

    Even most users' computer infections rely on the user's lack of computer expertise (not disabling Autorun, unwanted services, adding toolbars, not disabling remote content in email viewer, clicking on OK boxes without reading them, opening unexpected documents to see what they are, not hovering to check links etc etc).

    Most really bad IT disasters I've seen have been human error. Even HW failures where everything was lost are human error in the sense of not having a backup, RAID or Cluster depending on the importance of the system. Once there was a server moved while running. Two reasons everything was lost. 1) The HDDs only had one or two screws. 2) You don't move stuff that's not portable while running. It's not even a good idea to move a laptop with a regular HDD while running. Dropping it is more likely to be fatal to the HDD than when off or asleep.

    I even wrote a book about an "apocalypse" caused by human error. Faulty patches to BGP on Routers and to HTTP and eMail on servers on the same late Friday.

    1. Mage Silver badge

      Re: Incompetance

      Also having RAID or a Cluster makes no difference to the need for a backup. Most data loss is caused by user error. Also, RAID or a Cluster is no protection against malware.

      A nasty malware may have a timed later activation so that your backups are infected. Thus you can't just keep rotating the backups or just using one USB HDD etc.

      You need to keep archived backups off site.

      You also don't know how long it might be before user error deletion or mess of data, or patch or new program shows a problem. You may need an earlier backup than you imagine.
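
      One common way to keep "an earlier backup than you imagine" without keeping everything is a tiered retention scheme. The sketch below is only an assumption about how that might be done, not the poster's method: it keeps the newest few backups, plus the newest backup in each of the most recent weeks and months. The function name and the default counts are made up for illustration.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, daily=7, weekly=4, monthly=12):
    """Return the subset of backup dates a tiered policy would retain."""
    ordered = sorted(set(backup_dates), reverse=True)   # newest first
    keep = set(ordered[:daily])                         # most recent backups
    weeks_seen, months_seen = [], []
    for d in ordered:
        week = tuple(d.isocalendar()[:2])               # (ISO year, ISO week)
        month = (d.year, d.month)
        # Newest backup in each of the most recent `weekly` weeks.
        if week not in weeks_seen and len(weeks_seen) < weekly:
            weeks_seen.append(week)
            keep.add(d)
        # Newest backup in each of the most recent `monthly` months.
        if month not in months_seen and len(months_seen) < monthly:
            months_seen.append(month)
            keep.add(d)
    return keep


# Hypothetical history: one backup a day for 100 days.
last = date(2018, 12, 7)
history = [last - timedelta(days=n) for n in range(100)]
for kept in sorted(backups_to_keep(history)):
    print(kept)
```

      The older tiers are the ones that matter for slow-burning problems such as delayed malware or an unnoticed deletion, and they are the ones that should live off site.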

      Most individuals, small companies and many Corporates have no real "disaster recovery" plan. What if your single shop or office is burgled, burnt down, blown up, flooded? You can buy new stock, office furniture and PCs. What about your accounts, supplier data, customer data / CRM, payroll, etc? Also do not rely on 3rd party "Cloud" CRM, Payroll or accounts. What is their backup, security etc? What do you do if you lose your broadband? How do you migrate to a different supplier? Can you make your own backups in case of error by one of your users, not just the failure of the provider?

      Cloud services may be essential for a Commerce Web site. Or two co-located servers in two data centres is cheaper than electricity and Fast Broadband to a single office. Cloud services or outsourcing for your core business, your backend data etc is really stupid. Banks are particularly crazy to do this.

      1. Anonymous Coward
        Anonymous Coward

        Re: Incompetance

        This. I am actually working on this kind of problem right now and nobody seems to understand that just because you have global SAN replication/synchronisation and additional backup copies (to the same SAN!) there is still value in having master snapshots and emergency backups per server kept completely off the grid for the sole purpose of that 'once in a career' real DR event such as a data centre fire or flood. My personal preference is the KISS approach and keep a rolling swap/set of air-gapped USB3 SATA disks on standby at your DR site, swapped out quarterly and immediately after the latest NFT patch/DR testing completes on your master data servers.

        1. Anonymous Coward
          Anonymous Coward

          Re: you have global SAN

          We have a very large SAN in our USA data centre. Everything looked good until one day some tech was replacing the backup PSU for routine servicing and got a little confused about which was the backup - once they put it all back and system rebooted it was then found the SAN had never saved configurations so it went back to day one.

          1. Doctor Syntax Silver badge

            Re: you have global SAN

            "once they put it all back and system rebooted it was then found the SAN had never saved configurations so it went back to day one."

            This is why you test your restore/recovery procedures.

            1. Roland6 Silver badge

              Re: you have global SAN

              >This is why you test your restore/recovery procedures.

              I've tended to make restore/recovery part of normal day-to-day operations - probably because of my initial training on non-stop and fail-safe computing systems and focus on business continuity. However, I suspect unless you've had your fingers singed (SSO) you probably haven't considered certificate expiry to be an operational risk.
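
              Treating certificate expiry as an operational risk can be as simple as a scheduled check that warns well before the date. A minimal sketch follows; the 30-day threshold and the example date string are assumptions, and `ssl.cert_time_to_seconds` is the standard-library helper for parsing a certificate's `notAfter` field.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Whole days left before a certificate's 'notAfter' time.

    not_after uses the format found in certificates,
    e.g. 'Jun  1 12:00:00 2030 GMT'. Negative means already expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (expires - now).days


def needs_renewal(not_after, warn_days=30, now=None):
    """True if the certificate expires within warn_days (or has expired)."""
    return days_until_expiry(not_after, now) < warn_days


# Hypothetical certificate, checked against a fixed 'today'.
today = datetime(2030, 5, 22, 12, 0, tzinfo=timezone.utc)
print(days_until_expiry("Jun  1 12:00:00 2030 GMT", now=today))
print(needs_renewal("Jun  1 12:00:00 2030 GMT", now=today))
```

              In practice the `notAfter` string would come from the live certificate, for example from the dict returned by `SSLSocket.getpeercert()`, and the check would run from whatever monitoring is already in place rather than from someone's calendar.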

        2. Doctor Syntax Silver badge

          Re: Incompetance

          "that 'once in a career' real DR event such as a data centre fire or flood."

          One of the things about having had your place of work burn down is that you realise such things can actually happen and potentially more than once in a career. Those who haven't experienced one tend to put them in the "won't ever happen" category.

    2. CustardGannet
      Windows

      "Most really bad IT disasters I've seen have been human error"

      That would definitely include Windows Vista, then.

      And Windows 8.

      And Windows 10.

      And probably whatever piece of crap they force on us next time.

    3. Ozumo

      Re: Incompetance

      Is this irony?

      1. Semtex451 Silver badge

        Re: Incompetance

        No

    4. Glenturret Single Malt

      Re: Incompetance

      You've been saying it wrong. It's incompetence.

  17. Anonymous Coward
    Anonymous Coward

    Counting MNOs is hard

    having four MNOs, the UK is more fortunate from this perspective than most nations, which have three

    It isn't really four. It depends on how you count them and that turns out to be a lot harder than you might think.

    Nowadays there is a lot of sharing going on: sharing of towers, radio, core network, back office and other things. And the sharing is different for different technologies (2G, 3G, LTE). And then there are (secret) roaming agreements where effective national roaming happens in some places (often to provide rural coverage). And the operation is mostly outsourced so the same outsourcer may be operating multiple networks (or parts of them, normally split geographically).

    I think the answer is that for this sort of thing there are about 2 1/2 networks in most places in the UK. If I remember correctly there are about three main core networks but they are split up geographically. So most places end up covered by 2 or 3 of them plus, sometimes, a much smaller piece of network (for example microcells in a city). So, call it 2 1/2!

    Anyone got better insight into the effective average number of networks with SGSNs covering a single point in the UK? And how many different SGSN vendors involved? And how many different operations companies?

    1. Mage Silver badge

      Re: Counting MNOs is hard

      ONE physical network, properly designed, resilient and regulated is best (A RAN). Then there can be as many MVNOs as want to play.

      1. lglethal Silver badge
        Go

        Re: Counting MNOs is hard

        Technically you're right, but we can all see how well that's worked out with National Rail and its maintenance/care of the nationwide rail infrastructure...

      2. Mage Silver badge
        Flame

        Re: Counting MNOs is hard

        Mobile spectrum, actually ANY spectrum is a very limited resource. Splitting it to different physical operators reduces performance by x2 to x5. Also operators will not increase mast density to improve performance (the ENTIRE concept of Cellular frequency reuse) once they have sufficient coverage. The issue of ROI. Adding more masts / performance doesn't generate more income.

        Just because Network rail is a disaster, doesn't mean the idea of managing and regulating fixed single resources shouldn't be done.

        The old Post Office management of Telegraphs and Phones was done wrong. The solution isn't to go to the opposite extreme and have multiple operators and a Regulator that cares more about income from Operators than coverage, performance or the Consumer.

  18. Alistair Dabbs

    No Plan B

    I keep banging on about this to customers and get ignored every time. For printed newspapers, there is a retainer contract with a backup printer in case the normal presses catch fire, break down, go on strike etc. For their app editions, there's bugger all: when the tech falls over, that's it. I think the problem is that having a Plan B is extremely unfashionable at the moment, in business as in politics.

    1. Semtex451 Silver badge
      Coat

      Re: No Plan B

      But you'll get the blame, so take the initiative and say "Ikabai-Sital". Present them with options.

      FYI: there's an amusing article about Ikabai-Sital you should read here:

      https://www.theregister.co.uk/2015/08/08/all_hail_ikabaisital_destroyer_of_worlds_mender_of_toilets/

  19. Jove Bronze badge

    Unsuitable owner

    The other side to consider here is whether Telefonica is a suitable owner of a UK utility given Spanish stance on Brexit.

    1. A.P. Veening

      Re: Unsuitable owner

      I was wondering if I would find a post blaming it on Brexit, my compliments on the way you split it.

      1. Jove Bronze badge

        Re: Unsuitable owner

        Well now you know.

        It is of concern, and it has been looked at, as is the ownership of other businesses in the UK in cases where another EU member state Government has a significant interest.

    2. David Nash Silver badge

      Re: Unsuitable owner

      It's not a utility, it's a private company offering a service, and not even a monopoly.

      Are you seriously suggesting sanctions against any country where the government doesn't agree with every policy of our government?

      1. Voyna i Mor Silver badge

        Re: Unsuitable owner

        "Are you seriously suggesting sanctions against any country where the government doesn't agree with every policy of our government?"

        That appears to be US government foreign policy right now, so as America's poodle shouldn't we follow suit?

        1. Mandoscottie

          Re: Unsuitable owner

          No, let them spiral into twatdom with the orange fanny.

          We have home grown fannies to screw us over, we don't need to replicate stateside.

      2. Jove Bronze badge

        Re: Unsuitable owner

        It provides a utility service.

        "Are you seriously suggesting sanctions against any country where the government doesn't agree with every policy of our government?"

        EU member states have already discussed sanctions against specific UK businesses. Engage.

    3. Dan 55 Silver badge

      Re: Unsuitable owner

      Given the UK up until now hasn't really cared who owns stuff, just that someone owns it, it's a bit late in the day to get precious about foreign-owned utilities.

      1. Rameses Niblick the Third Kerplunk Kerplunk Whoops Where's My Thribble? Silver badge

        Re: Unsuitable owner

        Indeed, I would love to see what happens if the non-UK owners of utilities / manufacturing were made to pack up and go home after Brexit. Mass unemployment in most manufacturing industries (Japanese, German, American mostly) and electricity blackouts, since every nuclear power plant in the UK is owned by EDF (French). And the Dartford crossing would be closed down out of spite by the (French) toll taking company. Let's brick up the channel tunnel while we're at it, eh?

        1. TRT Silver badge

          Re: Unsuitable owner

          At least we still make our own bricks.

          1. Roland6 Silver badge

            Re: Unsuitable owner

            >At least we still make our own bricks.

            According to the British Geological Survey the UK isn't self-sufficient in bricks and imported bricks account for a significant percentage of the market...

            Mind you perhaps this might be a benefit of Brexit - we won't be able to build all those rabbit hutches various parties say need to be built...

  20. Anonymous Coward
    Anonymous Coward

    At least..........

    No-one is suggesting running anything important like an Emergency Services Network over a commercially focused mobile provider.

    Oh they are - what's the worst that could happen?

    1. Anonymous Coward
      Anonymous Coward

      Re: At least..........

      You're suggesting Airwave don't care how much money they make and aren't using kit that's out of support? Right.... Running emergency services over commercial networks could be more resilient than the current setup if they had roaming (there aren't enough blue light users to cause a cascade failure). That doesn't even need network trickery - just a multi IMSI sim or a dual SIM handset.

  21. BinkyTheMagicPaperclip Silver badge

    Mandated roaming for critical services is not a bad idea

    What is a poor idea is roaming everyone off a failed network, and not having a two tier service, as fixed line installations do.

    It's probably forgotten more often these days that in the case of widespread telephone line disruption the average punter will be disconnected, and essential users (doctors, for instance) remain contactable.

    I'd be surprised if this isn't part of the mobile networks, and if not, it needs to be.

    So, in the event of a major mobile network outage, mountain rescue retain their access (they generally use 2G/pagers for alerts, although they may have radios too), but bus availability displays don't, as there is (or should be) a timetable printed on the bus shelter.

    You can't work this without a two tier service, because ultimately businesses will work round unreliable networks by implementing their own multi network/SIM solutions.

  22. Anonymous Coward
    Anonymous Coward

    OH YEAH?

    We just had our software developer outfit leisurely let the "Apple Developer Certificate" (whatever that is) for their mobile app expire.

    Consequence: app won't mysteriously start on several hundreds of mobile devices (not even an error message, Apple QUALITY interface there). And these are used in a role which I would personally consider a "high assurance" because if it is not working then lots of dinarii go down the drain per minute.

    Of course, no hotline, developers at home etc.

    No-one was responsible because "you should have noticed that our developer certificate would expire by looking at your Mobile Device Management Platform".

    Yeah, thanks? I guess.

    No, I haven't seen an SLA either.

  23. Mike Lewis

    Never attribute to malice...

    that which can be blamed on outhouse staff.

  24. Anonymous Coward
    Anonymous Coward

    this happens SOOO often, there has to be a better way.

    1. Zmodem

      motherboard batteries you can change without turning the power off

  25. cam

    And IT Auditors get stick for being an unskilled, ineffective, waste of money.

    Maybe pass the cert and compliance over to a compliance manager?

    Just saying.

  26. Anonymous Coward
    Anonymous Coward

    What if all electric meters needed to be notworked?

    What if somebody decided that all energy supplies (consumer, industrial, etc) needed to be remotely managed, and the contractors (a) forgot to build robust connectivity into the scheme (b) forgot to test what happened when the notworking was inevitably unreachable in remote areas (c) forgot to care what happened when the notworking was unusable across wide areas?

    Who would/should pay the price for this level of incompetence?

    As far as I can tell, the people at the top don't pay the price of failure, at least not in the UK, at least not in the same way as they reward themselves when things "go well".

    Obviously there's no way that the successful rollout of a genuinely robust sensible-data-throughput national network with decent availability and uptime could be considered a prerequisite for a Smart Meter rollout. Oh no. That would never work. Not at board level anyway.

    How many other countries were (not) affected by the Ericsson foulup? Why might that be?

  27. terrythetech
    Happy

    Waiting...

    for the Who, Me? article on this one!

  28. MooseMonkey

    Meltdown

    The only things that melted down more quickly than the O2 network were the O2 customers! I saw no end of "my phone is critical to my business" whinging going on, demands for huge compensation and stories of life changing events.

    To that I say, my £100 smartphone has two SIM cards in it, one O2, one EE.

    I thank you, good night.

    1. Roland6 Silver badge

      Re: Meltdown

      >To that I say, my £100 smartphone has two SIM cards in it, one O2, one EE.

      So do you have both numbers on your business card, or do you use a virtual number and call redirection service?

      Personally, given dual SIM phones aren't generally available in the high st. but unlocked phones are, I have two handsets (latest toy and previous toy), each on different networks (EE and Three) and my tertiary fallback is a quick trip to a local shop where I can pick up a Vodafone/O2 SIM or a suitable MVNO SIM.

      1. Anonymous Coward
        Anonymous Coward

        Re: Meltdown

        >dual SIM phones aren't generally available in the high st

        Three sell a decent range of them.

  29. hairydog

    V2X "vehicle to everything" - really? To pedestrians? cyclists? horse riders? flocks of sheep? cows going to milking? Circus parade elephants? Sleepy kangaroos? Spilled loads? Fallen trees?

    Technology needs to address itself to the real world, not the "simplest case" that the spec had in mind.

    Software and systems should be designed from failure backwards: every function should initially be designed to report and cope with failure, then the "non-failure" case should be added as an exception.

    But this doesn't often happen because the developers are so focussed on what they want it to do.
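
    A small illustration of what "designed from failure backwards" could look like in code: every way the call can go wrong is handled and reported first, and success is the last branch. This is a sketch of one reading of the idea. The `Outcome` type, the function name and the plausible temperature range are all invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    """Result of an operation: failure is a first-class value, not an afterthought."""
    ok: bool
    value: Optional[float] = None
    error: Optional[str] = None


def read_temperature_celsius(raw):
    """Turn a raw reading into a temperature, reporting every failure."""
    # Failure cases come first and each one says what went wrong.
    if raw is None:
        return Outcome(False, error="no reading received")
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return Outcome(False, error=f"unparseable reading: {raw!r}")
    # Assumed plausible range; NaN and infinities fail this test too.
    if not -90.0 <= value <= 60.0:
        return Outcome(False, error=f"reading out of range: {value}")
    # Only after every failure has been ruled out do we report success.
    return Outcome(True, value=value)


for sample in ("21.5", None, "garbage", "999"):
    print(sample, read_temperature_celsius(sample))
```

    The caller gets an explicit `ok` flag and a reason for every failure, so coping with the bad cases is part of the interface rather than something bolted on after the demo works.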

    1. sed gawk

      Bollocks

      But this doesn't often happen because the developers are so focussed on what they want it to do.

      Devs these days largely work under the thumb of a fragile project manager, the incentives for the fragile project manager are to ensure that delivery deadlines are met.

      Of course the delivery date is often a fantasy date that is rarely based on the work required for completion.

      In short, shit buggy software on time = bonus.

      High quality, robust software, 10-20% late = no bonus = no chance.

      The quality of the software doesn't matter to delivery managers and so it's difficult to prioritise improvements to robustness over delivery dates. That's why the software developed using the fragile process is *fragile*.

      That's okay, keep blaming the devs and not the line of oversight all the way up to the board.

      1. hairydog

        Re: Bollocks

        You titled your reply well: it was indeed bollocks.

        Managers do indeed press developers to make things happen cheap and fast. But that doesn't stop developers having to say "no, it takes longer to do it properly"

        The reality is that it doesn't take much longer. Start with the "cope with error" template and it becomes second nature.

        The extra dev time is compensated for by easier integration testing.

        Few developers even understand the concept of a "failure first" approach, so it looks hard to them and they react with moronic comments like "Bollocks"..

  30. This post has been deleted by its author

    1. Anonymous Coward
      Anonymous Coward

      Re: Throw money at the problem... oh no - wait up....

      Er, what? When did UK mobile networks get a taxpayer subsidy? I've seen them sending billions *to* the government in spectrum fees, but aside from EE's emergency services network contract, I've not seen anything flowing the other way.

      1. TRT Silver badge

        Re: When did UK mobile networks get a taxpayer subsidy?

        BT's heyday.

        1. ARGO

          Re: When did UK mobile networks get a taxpayer subsidy?

          Pre-privatisation? They didn't do mobile then.

          1. TRT Silver badge

            Re: When did UK mobile networks get a taxpayer subsidy?

            Indeed, but it was less than a year after that date that BT-Securicor launched their service just a few days after Vodafone. Prior to that there had been the GPO's mobile radio telephone service, aka the Carphone service. So I expect that a lot of the investment in infrastructure, masts, planning permissions, operator's licenses, mast site applications, power supplies etc etc had to go in well before the network was actually launched, or made use of the existing GPO/British Telecom masts. Thus the funding for this was directly from the tax payer, rather than indirectly from the tax payer through their choice to subscribe of course. And then we see with Broadband the government subsidising the initial deployment of infrastructure. Did this not happen with mobile phones too?

  31. Robert Sneddon

    Not-so-hidden subsidies

    One "subsidy" mobile and other network infrastructure companies got from HMG was an abbreviated planning process to dig holes, run cable and build towers everywhere. I can see the point of doing that but it's something that would otherwise cost them lots of cash and stretch the time to market hence their ability to bill customers for new services.

    1. ARGO

      Re: Not-so-hidden subsidies

      Cables and holes aren't generally the mobile company's job - the backhaul is usually provided by Openreach, Virgin, or one of the dedicated business network companies. Of course the cost of that would increase if the overheads did, but it would increase for everyone, not just mobile co's

      Planning permission for towers is abbreviated only in certain cases. A remarkable percentage have to go to appeal to get built. (and the same folk who object sometimes complain about lack of a decent mobile signal!)

  32. Tom 7 Silver badge

    Management are now inspired to make their networks secure

    by employing people more incapable than them. So an IT qualification will now be a hindrance to getting a job in IT!

  33. J J Carter Silver badge
    FAIL

    Churnalism

    Why the lazy and tired snarky remark at The Donald?

  34. Anonymous Coward
    Anonymous Coward

    must have a fallback mode

    just like everybody "should" backup their data regularly. Great idea, if it wasn't for that pesky human nature...

  35. Stephen T

    Prof Stephen Temple

    It is perfectly possible for one UK mobile network to back up another without creating a domino effect. There are two solutions. The first is an official "telephone preference scheme" that has been around for years for the fixed network for a time of national crisis. This is where critical uses can be identified, put on a register and MNO's obliged to have a dormant contract with a second mobile operator to take over the "priority user's traffic" should the mobile network they subscribe to go down.

    The second solution is a commercial solution where the automatic back-up on a second network is a premium service anyone can buy into. The price of the premium service is set so that the number of such premium customers can be handled by the second network. This commercial approach is a dream solution - critical users have peace of mind and mobile operators have a new stream of revenue that is literally money for old rope. What is there not to like about it? Ofcom could do everybody a huge favor by mandating every mobile operator offers a premium back-up offer to their customers.
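
    Both schemes come down to the backup network admitting only a known, capped set of guests. A toy sketch of that admission rule follows, under stated assumptions: the register, the capacity figure and the function name are hypothetical, and a real network would enforce this in its core signalling rather than in a few lines of application code.

```python
def admit_to_backup(subscriber_id, priority_register, active_guests, guest_capacity):
    """Decide whether a stranded subscriber may use the backup network.

    priority_register: set of subscriber IDs entitled to backup service.
    active_guests: set of guests already admitted (updated in place).
    guest_capacity: the most guests the backup network agrees to carry.
    """
    if subscriber_id not in priority_register:
        return False                    # not on the register: never admitted
    if subscriber_id in active_guests:
        return True                     # already admitted, nothing to do
    if len(active_guests) >= guest_capacity:
        return False                    # protect the host network's own users
    active_guests.add(subscriber_id)
    return True


# Hypothetical example: three registered users, room for two guests.
register = {"ambulance-1", "gp-surgery-7", "lifeboat-3"}
guests = set()
for who in ("ambulance-1", "random-punter", "gp-surgery-7", "lifeboat-3"):
    print(who, admit_to_backup(who, register, guests, guest_capacity=2))
```

    Because the cap is fixed in advance, the extra load on the second network is bounded whatever happens to the first, which is what stops the failure from spreading.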

  36. bigtimehustler

    One thing: if it was such a simple cock up, how did sorting a certificate take 24 hours? Should have taken an hour at most to realise that was the problem and a further hour to install the sorted out certificate.

  37. Toni the terrible

    O2 Fail?

    Did anything happen? Everyone is saying there was a failure in the O2 network (due to Certs ceasing etc see above posts). I use O2 but never noticed as my interweb & land line kept on working. Do I get 20p compensation?

  38. mistersaxon

    Pedant mode=on

    "We only realise how pervasive machine-to-machine (M2M) mobile data connections are in our lives until they stop working"

    It's either "we DON'T realise" or "WHEN they stop working". As supplied the sentence only works because I assume I know what you mean, not because you've conveyed that.

    Under the circumstances a communication failure due to a mismatch of standards is . . . ironic? Only missing the bullseye by still just about functioning...

  39. xanda
    Mushroom

    Throw money at the problem... oh no - wait up....

    "The fact is building and operating a nationwide network requires huge capital expenditure..."

    In our limited understanding of the network side of things it does seem to us that enough has been ploughed in already. This being so, one might think that such protocols would have been in place as part of the standard package perhaps, or at the least been factored into the acceptance and testing regime prior to overall commissioning in the field.

    At the consumer level we pay top dollar for MVP kit and services that are fun and shiny but which have apparently poor resilience when put under modest stress.

    At both the network and consumer level, vendors have eschewed the need for decent local/offline fallbacks which would provide much needed continuity in such events; they have sacrificed this irritating niggle at the altar of The Cloud. A prime example of this was 'exposed' in this particular outage: According to the Beeb some plumber was unable to use their satnav - presumably Google Maps or the like - to get to jobs.

    Seems the smart device era ain't so smart after all.

Biting the hand that feeds IT © 1998–2019