Expired cert... Really? #O2down meltdown shows we should fear bungles and bugs more than hackers

It's a bit of a cliche that "everything's connected", but O2's stunning outage yesterday – chalked up by Swedish kitmaker Ericsson to an expired software certificate – is a reminder of how true that is. Payment terminals croaked, bus displays went blank. Strangers blinked at each other in the street, like Robinson Crusoe …

Re: Graceful reconnect...

Not out of fashion everywhere,

I'm consulting at a UK Uni at present, and I was pleasantly surprised to see something not very far from the classical clean disconnect and reconnect patterns in a recent pull request crossing my desk.

0
0
Silver badge

It will be repeated until answered with "properly engineered systems fail softly".

Which is answered with "fail-safe systems fail by failing to fail safe".

0
0
Bronze badge

"The article explains why this is a bad idea"

James Burke's Connections explained a failover system of the electrical grid in America. One relay tripped because a street was overloaded and it passed the current over to other circuits in a domino effect until the whole state was offline.

If the load is too much for one then it will be too much when added to the next one that's still working.
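A rough sketch of that domino effect, purely for illustration. The function name, the equal sharing of a tripped circuit's load, and the numbers are all assumptions, not a model of any real grid:

```python
def cascade(loads, capacity):
    """Return the indexes of circuits that trip, in the order they trip.

    Assumption: when a circuit trips, its whole load is shared equally
    among the circuits that are still running.
    """
    loads = [float(x) for x in loads]
    alive = set(range(len(loads)))
    tripped = []
    while alive:
        overloaded = [i for i in sorted(alive) if loads[i] > capacity]
        if not overloaded:
            break  # everything still running is within its rating
        i = overloaded[0]
        alive.discard(i)
        tripped.append(i)
        if not alive:
            break  # nothing left to take the load: total blackout
        share = loads[i] / len(alive)
        for j in alive:
            loads[j] += share
        loads[i] = 0.0
    return tripped


if __name__ == "__main__":
    # Four circuits rated at 100, three already running close to the limit.
    print(cascade([110, 90, 90, 90], capacity=100))
    # Same overload, but the neighbours have real headroom.
    print(cascade([110, 50, 50, 50], capacity=100))
```

With neighbours already near their rating, one overload takes out the lot; with spare headroom the failure stays local. Failover only helps if the spare capacity actually exists.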

0
0
SVV
Silver badge

Nice to read a wider look at the issues raised by this screwup

And the image it has painted in my head of Cameron flying into a red faced rage, because his magic smartphone kept failing in his artisan yoghurt eating Cotswold smugster's paradise has put a smile on my face for a few hours at least.

11
1
Silver badge
Facepalm

Incompetance

I've been saying this for over 30 years.

Even most users' computer infections rely on the user's lack of computer expertise (not disabling Autorun or unwanted services, adding toolbars, not disabling remote content in the email viewer, clicking OK boxes without reading them, opening unexpected documents to see what they are, not hovering to check links, etc.).

Most really bad IT disasters I've seen have been human error. Even HW failures where everything was lost are human error, in the sense of not having a backup, RAID or cluster, depending on the importance of the system. Once a server was moved while running. Two reasons everything was lost: 1) The HDDs only had one or two screws. 2) You don't move stuff that's not portable while it's running. It's not even a good idea to move a laptop with a regular HDD while it's running; dropping it is more likely to be fatal to the HDD than when it's off or asleep.

I even wrote a book about an "apocalypse" caused by human error: faulty patches to BGP on routers and to HTTP and email on servers, all on the same late Friday.

8
1
Silver badge

Re: Incompetance

Also, having RAID or a cluster makes no difference to the need for a backup. Most data loss is caused by user error, and RAID or a cluster is no protection against malware.

Nasty malware may have a delayed activation, so that your backups are already infected. Thus you can't just keep rotating the same backups or rely on a single USB HDD.

You need to keep archived backups off site.

You also don't know how long it might be before user error deletion or mess of data, or patch or new program shows a problem. You may need an earlier backup than you imagine.
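To make "an earlier backup than you imagine" concrete, here is a minimal sketch of a tiered retention rule. The function name and the 7 daily / 4 weekly / 12 monthly defaults are assumptions for illustration, not a recommendation for any particular site:

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, today, daily=7, weekly=4, monthly=12):
    """Choose which backups to retain under a simple tiered policy.

    Keeps every backup from the last `daily` days, plus the oldest backup
    in each of the most recent `weekly` ISO weeks and `monthly` months.
    """
    keep = set()
    first_in_week = {}
    first_in_month = {}
    for d in sorted(backup_dates):
        if d > today:
            continue  # ignore anything dated in the future
        if (today - d).days < daily:
            keep.add(d)
        iso = d.isocalendar()
        first_in_week.setdefault((iso[0], iso[1]), d)
        first_in_month.setdefault((d.year, d.month), d)
    # Tuple keys sort chronologically, so the tail holds the newest periods.
    keep.update(first_in_week[k] for k in sorted(first_in_week)[-weekly:])
    keep.update(first_in_month[k] for k in sorted(first_in_month)[-monthly:])
    return sorted(keep)


if __name__ == "__main__":
    today = date(2018, 12, 7)
    nightly = [today - timedelta(days=n) for n in range(100)]
    for kept in backups_to_keep(nightly, today):
        print(kept)
```

The older monthly copies are the ones that should go off site; the rule only decides what to keep, not where to keep it.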

Most individuals, small companies and many corporates have no real "disaster recovery" plan. What if your single shop or office is burgled, burnt down, blown up or flooded? You can buy new stock, office furniture and PCs. What about your accounts, supplier data, customer data / CRM, payroll, etc.? Also, do not rely on 3rd party "Cloud" CRM, payroll or accounts. What is their backup, security, etc.? What do you do if you lose your broadband? How do you migrate to a different supplier? Can you make your own backups in case of an error by one of your users, not just a failure of the provider?

Cloud services may be essential for a Commerce Web site. Or two co-located servers in two data centres is cheaper than electricity and Fast Broadband to a single office. Cloud services or outsourcing for your core business, your backend data etc is really stupid. Banks are particularly crazy to do this.

8
0
Windows

"Most really bad IT disasters I've seen have been human error"

That would definitely include Windows Vista, then.

And Windows 8.

And Windows 10.

And probably whatever piece of crap they force on us next time.

8
1
Anonymous Coward

Re: Incompetance

This. I am actually working on this kind of problem right now and nobody seems to understand that just because you have global SAN replication/synchronisation and additional backup copies (to the same SAN!) there is still value in having master snapshots and emergency backups per server kept completely off the grid for the sole purpose of that 'once in a career' real DR event such as a data centre fire or flood. My personal preference is the KISS approach and keep a rolling swap/set of air-gapped USB3 SATA disks on standby at your DR site, swapped out quarterly and immediately after the latest NFT patch/DR testing completes on your master data servers.

3
0
Anonymous Coward

Re: you have global SAN

We have a very large SAN in our USA data centre. Everything looked good until one day some tech was replacing the backup PSU for routine servicing and got a little confused about which was the backup. Once they put it all back and the system rebooted, it was found that the SAN had never saved its configuration, so it went back to day one.

1
0
Silver badge

Re: Incompetance

"that 'once in a career' real DR event such as a data centre fire or flood."

One of the things about having had your place of work burn down is that you realise such things can actually happen and potentially more than once in a career. Those who haven't experienced one tend to put them in the "won't ever happen" category.

7
0
Silver badge

Re: you have global SAN

"once they put it all back and system rebooted it was then found the SAN had never saved configurations so it went back to day one."

This is why you test your restore/recovery procedures.

0
0

Re: Incompetance

Is this irony?

1
0
Silver badge

Re: Incompetance

No

0
0

Re: Incompetance

You've been saying it wrong. It's incompetence.

0
0
Silver badge

Re: you have global SAN

>This is why you test your restore/recovery procedures.

I've tended to make restore/recovery part of normal day-to-day operations - probably because of my initial training on non-stop and fail-safe computing systems and focus on business continuity. However, I suspect unless you've had your fingers singed (SSO) you probably haven't considered certificate expiry to be an operational risk.
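For anyone who wants to put certificate expiry on the risk register, a minimal sketch using only the Python standard library. The 30-day warning threshold and the function names are assumptions; `ssl.cert_time_to_seconds` and `SSLSocket.getpeercert` are the real stdlib calls:

```python
import socket
import ssl
from datetime import datetime, timezone


def days_left(not_after, now):
    """Whole days from `now` until the certificate's notAfter time."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def status(not_after, now, warn_days=30):
    """Classify a certificate as OK, WARN or EXPIRED (threshold is assumed)."""
    left = days_left(not_after, now)
    if left < 0:
        return "EXPIRED"
    if left < warn_days:
        return "WARN"
    return "OK"


def fetch_not_after(host, port=443, timeout=5.0):
    """Read notAfter from a live server's certificate.

    Note: the default context verifies the chain, so an already expired
    certificate raises an SSL error here instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


if __name__ == "__main__":
    # Hypothetical expiry date, in the format getpeercert() returns.
    sample = "Dec 16 00:00:00 2018 GMT"
    print(status(sample, datetime(2018, 12, 1, tzinfo=timezone.utc)))
```

Run something like this from a scheduler against every endpoint you own and alert on anything that isn't "OK". That covers certificates served over TLS; certificates buried inside vendor software, as in this outage, need the vendor to do the equivalent.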

1
0
Anonymous Coward

Counting MNOs is hard

having four MNOs, the UK is more fortunate from this perspective than most nations, which have three

It isn't really four. It depends on how you count them and that turns out to be a lot harder than you might think.

Nowadays there is a lot of sharing going on: sharing of towers, radio, core network, back office and other things. And the sharing is different for different technologies (2G, 3G, LTE). And then there are (secret) roaming agreements where effective national roaming happens in some places (often to provide rural coverage). And the operation is mostly outsourced so the same outsourcer may be operating multiple networks (or parts of them, normally split geographically).

I think the answer is that for this sort of thing there are about 2 1/2 networks in most places in the UK. If I remember correctly there are about three main core networks but they are split up geographically. So most places end up covered by 2 or 3 of them plus, sometimes, a much smaller piece of network (for example microcells in a city). So, call it 2 1/2!

Anyone got better insight into the effective average number of networks with SGSNs covering a single point in the UK? And how many different SGSN vendors involved? And how many different operations companies?

9
0
Silver badge

Re: Counting MNOs is hard

ONE physical network, properly designed, resilient and regulated is best (A RAN). Then there can be as many MVNOs as want to play.

1
6
Silver badge
Go

Re: Counting MNOs is hard

Technically you're right, but we can all see how well that's worked out with Network Rail and its maintenance/care of the nationwide rail infrastructure...

5
5
Silver badge
Flame

Re: Counting MNOs is hard

Mobile spectrum, actually ANY spectrum is a very limited resource. Splitting it to different physical operators reduces performance by x2 to x5. Also operators will not increase mast density to improve performance (the ENTIRE concept of Cellular frequency reuse) once they have sufficient coverage. The issue of ROI. Adding more masts / performance doesn't generate more income.

Just because Network Rail is a disaster doesn't mean the idea of managing and regulating fixed single resources shouldn't be pursued.

The old Post Office management of Telegraphs and Phones was done wrong. The solution isn't to go to the opposite extreme and have multiple operators and a Regulator that cares more about income from Operators than coverage, performance or the Consumer.

1
1

No Plan B

I keep banging on about this to customers and get ignored every time. For printed newspapers, there is a retainer contract with a backup printer in case the normal presses catch fire, break down, go on strike etc. For their app editions, there's bugger all: when the tech falls over, that's it. I think the problem is that having a Plan B is extremely unfashionable at the moment, in business as in politics.

18
0
Silver badge
Coat

Re: No Plan B

But you'll get the blame, so take the initiative and say "Ikabai-Sital". Present them with options.

FYI: there's an amusing article about Ikabai-Sital you should read here:

https://www.theregister.co.uk/2015/08/08/all_hail_ikabaisital_destroyer_of_worlds_mender_of_toilets/

2
0
Bronze badge

Unsuitable owner

The other side to consider here is whether Telefonica is a suitable owner of a UK utility given the Spanish stance on Brexit.

13
7

Re: Unsuitable owner

I was wondering if I would find a post blaming it on Brexit, my compliments on the way you split it.

6
0
Silver badge

Re: Unsuitable owner

It's not a utility, it's a private company offering a service, and not even a monopoly.

Are you seriously suggesting sanctions against any country where the government doesn't agree with every policy of our government?

6
0
Silver badge

Re: Unsuitable owner

Given the UK up until now hasn't really cared who owns stuff, just that someone owns it, it's a bit late in the day to get precious about foreign-owned utilities.

8
0
Silver badge

Re: Unsuitable owner

Indeed, I would love to see what happens if the non-UK owners of utilities / manufacturing were made to pack up and go home after Brexit. Mass unemployment in most manufacturing industries (Japanese, German, American mostly) and electricity blackouts, since every nuclear power plant in the UK is owned by EDF (French). And the Dartford crossing would be closed down out of spite by the (French) toll-taking company. Let's brick up the Channel Tunnel while we're at it, eh?

9
1
TRT
Silver badge

Re: Unsuitable owner

At least we still make our own bricks.

2
0
Silver badge

Re: Unsuitable owner

"Are you seriously suggesting sanctions against any country where the government doesn't agree with every policy of our government?"

That appears to be US government foreign policy right now, so as America's poodle shouldn't we follow suit?

7
1

Re: Unsuitable owner

no, let them spiral into twatdom with the orange fanny.

We have home-grown fannies to screw us over, we don't need to replicate stateside.

1
0
Silver badge

Re: Unsuitable owner

>At least we still make our own bricks.

According to the British Geological Survey the UK isn't self-sufficient in bricks and imported bricks account for a significant percentage of the market...

Mind you perhaps this might be a benefit of Brexit - we won't be able to build all those rabbit hutches various parties say need to be built...

1
0
Anonymous Coward

At least..........

No-one is suggesting running anything important like an Emergency Services Network over a commercially focused mobile provider.

Oh, they are - what's the worst that could happen?

12
1
Anonymous Coward

Re: At least..........

You're suggesting Airwave don't care how much money they make and aren't using kit that's out of support? Right.... Running emergency services over commercial networks could be more resilient than the current setup if they had roaming (there aren't enough blue light users to cause a cascade failure). That doesn't even need network trickery - just a multi-IMSI SIM or a dual SIM handset.

0
1
Silver badge

Mandated roaming for critical services is not a bad idea

What is a poor idea is roaming everyone off a failed network, and not having a two tier service, as fixed line installations do.

It's probably forgotten more often these days that in the case of widespread telephone line disruption the average punter will be disconnected, and essential users (doctors, for instance) remain contactable.

I'd be surprised if this isn't part of the mobile networks, and if not, it needs to be.

So, in the event of a major mobile network outage, mountain rescue retain their access (they generally use 2G/pagers for alerts, although they may have radios too), while bus availability displays don't, as there is (or should be) a timetable printed on the bus shelter.

You can't work this without a two tier service, because ultimately businesses will work round unreliable networks by implementing their own multi network/SIM solutions.

2
1
Anonymous Coward

OH YEAH?

We just had our software developer outfit leisurely let the "Apple Developer Certificate" (whatever that is) for their mobile app expire.

Consequence: the app mysteriously won't start on several hundred mobile devices (not even an error message, Apple QUALITY interface there). And these are used in a role which I would personally consider "high assurance", because if it is not working then lots of denarii go down the drain per minute.

Of course, no hotline, developers at home etc.

No-one was responsible because "you should have noticed that our developer certificate would expire by looking at your Mobile Device Management Platform".

Yeah, thanks? I guess.

No, I haven't seen an SLA either.

6
1

Never attribute to malice...

that which can be blamed on outhouse staff.

1
0
Anonymous Coward

this happens SOOO often, there has to be a better way.

1
0

motherboard batteries you can change without turning the power off

1
0
cam

And IT Auditors get stick for being an unskilled, ineffective, waste of money.

Maybe pass the cert and compliance over to a compliance manager?

Just saying.

2
0
Anonymous Coward

What if all electric meters needed to be notworked?

What if somebody decided that all energy supplies (consumer, industrial, etc) needed to be remotely managed, and the contractors forgot to (a) build robust connectivity into the scheme, (b) test what happened when the notworking was inevitably unreachable in remote areas, and (c) care what happened when the notworking was unusable across wide areas?

Who would/should pay the price for this level of incompetence?

As far as I can tell, the people at the top don't pay the price of failure, at least not in the UK, at least not in the same way as they reward themselves when things "go well".

Obviously there's no way that the successful rollout of a genuinely robust sensible-data-throughput national network with decent availability and uptime could be considered a prerequisite for a Smart Meter rollout. Oh no. That would never work. Not at board level anyway.

How many other countries were (not) affected by the Ericsson foulup? Why might that be?

0
0
Happy

Waiting...

for the Who, Me? article on this one!

2
0

Meltdown

The only things that melted down more quickly than the O2 network were the O2 customers! I saw no end of "my phone is critical to my business" whinging going on, demands for huge compensation and stories of life changing events.

To that I say, my £100 smartphone has two SIM cards in it, one O2, one EE.

I thank you, good night.

0
0
Silver badge

Re: Meltdown

>To that I say, my £100 smartphone has two SIM cards in it, one O2, one EE.

So do you have both numbers on your business card, or do you use a virtual number and call redirection service?

Personally, given dual SIM phones aren't generally available in the high st. but unlocked phones are, I have two handsets (latest toy and previous toy), each on different networks (EE and Three) and my tertiary fallback is a quick trip to a local shop where I can pick up a Vodafone/O2 SIM or a suitable MVNO SIM.

0
0

V2X "vehicle to everything" - really? To pedestrians? cyclists? horse riders? flocks of sheep? cows going to milking? Circus parade elephants? Sleepy kangaroos? Spilled loads? Fallen trees?

Technology needs to address itself to the real world, not the "simplest case" that the spec had in mind.

Software and systems should be designed from failure backwards: every function should initially be designed to report and cope with failure, then the "non-failure" case should be added as an exception.
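A minimal sketch of what "failure first" can look like in practice. The `Result` type, the `parse_reading` function and the 0 to 100 range are hypothetical names and values chosen for illustration:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    ok: bool
    value: Optional[float] = None
    error: Optional[str] = None


def parse_reading(raw, low=0.0, high=100.0):
    """Turn raw input into a reading, reporting every failure explicitly."""
    # Failure cases come first, each with a reason the caller can report.
    if raw is None or not str(raw).strip():
        return Result(False, error="no data received")
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return Result(False, error=f"not a number: {raw!r}")
    if value != value:  # NaN is the only float not equal to itself
        return Result(False, error="reading is NaN")
    if not low <= value <= high:
        return Result(False, error=f"out of range {low}..{high}: {value}")
    # Only once every failure path is handled does the success case appear.
    return Result(True, value=value)


if __name__ == "__main__":
    for raw in ["42.5", "", "abc", "250", None]:
        print(repr(raw), parse_reading(raw))
```

The caller has to look at `ok` before it can touch `value`, so the failure path gets exercised in ordinary testing rather than discovered in production.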

But this doesn't often happen because the developers are so focussed on what they want it to do.

1
1

Bollocks

"But this doesn't often happen because the developers are so focussed on what they want it to do."

Devs these days largely work under the thumb of a fragile project manager, the incentives for the fragile project manager are to ensure that delivery deadlines are met.

Of course the delivery date is often a fantasy date that is rarely based on the work required for completion.

In short, shit buggy software on time = bonus.

High quality, robust software, 10-20% late = no bonus = no chance.

The quality of the software doesn't matter to delivery managers and so it's difficult to prioritise improvements to robustness over delivery dates. That's why the software developed using the fragile process is *fragile*.

That's okay, keep blaming the devs and not the line of oversight all the way up to the board.

4
0

Re: Bollocks

You titled your reply well: it was indeed bollocks.

Managers do indeed press developers to make things happen cheap and fast. But that doesn't stop developers having to say "no, it takes longer to do it properly".

The reality is that it doesn't take much longer. Start with the "cope with error" template and it becomes second nature.

The extra dev time is compensated for by easier integration testing.

Few developers even understand the concept of a "failure first" approach, so it looks hard to them and they react with moronic comments like "Bollocks".

0
0
Mushroom

Throw money at the problem... oh no - wait up....

"The fact is building and operating a nationwide network requires huge capital expenditure..."

...and they've had it - in spades, buckets, sheds and lorry loads.

It's seen from top to bottom: Networks have held Gov to ransom for years in holding out for ever increasing tax payer cash while failing to deliver the requisite value. At the local level i.e. us, we pay top dollar for MVP kit and services that are fun and shiny but which have poor resilience and value when put under modest stress.

Having decent local/offline fallbacks just hasn't been a priority for vendors - they've sacrificed this irritating niggle at the altar of The Cloud. A prime example was 'exposed' in this most recent outage: according to the Beeb some plumber was unable to use their satnav - presumably Google Maps - to get to jobs.

Seems the smart device era ain't so smart after all.

2
2
Anonymous Coward

Re: Throw money at the problem... oh no - wait up....

Er, what? When did UK mobile networks get a taxpayer subsidy? I've seen them sending billions *to* the government in spectrum fees, but aside from EE's emergency services network contract, I've not seen anything flowing the other way.

1
0
TRT
Silver badge

Re: When did UK mobile networks get a taxpayer subsidy?

BT's heyday.

0
1

Re: When did UK mobile networks get a taxpayer subsidy?

Pre nationalisation? They didn't do mobile then.

0
0

Not-so-hidden subsidies

One "subsidy" mobile and other network infrastructure companies got from HMG was an abbreviated planning process to dig holes, run cable and build towers everywhere. I can see the point of doing that but it's something that would otherwise cost them lots of cash and stretch the time to market hence their ability to bill customers for new services.

0
1


Biting the hand that feeds IT © 1998–2018