At the BCS CMSG conference in London earlier this year, Unisys CM manager Michel Delran spoke about how to design and implement a successful configuration management process and how a configuration management database can save you millions. He began with the real-life cautionary tale of a phone company which lost millions as a …
Non technical alternative
One place I was employed was going through the risk-assessment process. The finance guys were (as usual) baulking at the cost of it all and suggested that it would be cheaper to insure against the loss, rather than prevent it through technical / architectural means.
This view gained a lot of traction and would have been a hell of a lot simpler to implement (just sign a larger cheque to the corporate insurance people) than any of the proposed IT solutions, which in truth nobody really understood, least of all the technical architects who were proposing them. However, that plan fell apart when someone from the legal dept. piped up that there was a statutory requirement to have archives, backups, DR, plans and provisions in place that could be audited.
It does make you wonder whether this particular phone company could / would / did claim any of those losses back off their fire insurance? If so, was it really their loss, at all?
The guy is obviously selling a Change Management solution, which is why he focuses on that. But I would suggest that this was a Business Continuity / Disaster Recovery plan failure, rather than a configuration change issue.
But it does highlight the need to plan, implement and test something to get you out of the brown stuff after the HVAC has spread it all over.
"No-one plans to fail; they just fail to plan"
Agreed and also I would hazard that there was a monitoring fail as well, because how long did it take them to notice the server was missing? Were the hands on support not able to notice the missing/melted asset and route round the problem?
All these 'enterprise' tools for network management and asset management always seem to be based on proprietary technologies and so badly designed that they are either just accounting tools or don't provide 100% coverage. Frankly I don't even think SDN is up to much because these boys are too busy protecting their proprietary tools.
"I would suggest that this was a Business Continuity / Disaster Recovery plan failure, rather than a configuration change issue."
Layers of the same onion; difficult to properly build your BCP and DR approach if you don't know what you've got and what it links to. Change management should update your CMDB which should then flag that something has changed to your recovery team.
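A minimal sketch of that change-to-CMDB-to-recovery-team flow, purely for illustration. The `ToyCMDB` class, the item names and the flag wording are all hypothetical; nothing here is taken from a real CMDB product or from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigItem:
    """One entry in a toy CMDB (all names here are hypothetical)."""
    name: str
    depends_on: set = field(default_factory=set)


class ToyCMDB:
    def __init__(self):
        self.items = {}
        self.recovery_flags = []  # messages destined for the recovery team

    def add(self, name, depends_on=()):
        self.items[name] = ConfigItem(name, set(depends_on))

    def impacted_by(self, changed):
        """Return every item that directly or indirectly depends on `changed`."""
        impacted, frontier = set(), {changed}
        while frontier:
            current = frontier.pop()
            for item in self.items.values():
                if current in item.depends_on and item.name not in impacted:
                    impacted.add(item.name)
                    frontier.add(item.name)
        return impacted

    def record_change(self, changed, description):
        """Record a change and flag everything downstream for DR review."""
        for name in sorted(self.impacted_by(changed)):
            self.recovery_flags.append(
                f"review DR plan for {name}: {changed} changed ({description})"
            )


# Assumed example estate: a switch, a server behind it, a service on the server.
cmdb = ToyCMDB()
cmdb.add("switch-01")
cmdb.add("payment-server", depends_on=["switch-01"])
cmdb.add("card-processing", depends_on=["payment-server"])
cmdb.record_change("switch-01", "replaced after fire")
```

The point of the sketch is only that the flag comes from the recorded links, so it is no better than the data: an unrecorded server gets no flag at all.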
In the end it may well have been the result of outsourcing. Instead of a small team of people who worked for the Telecoms company managing things and knowing what was required and looking out for related problems they had desperate offshore teams who just run through their check-lists and pronounce that the job is done.
I've seen this type of problem many times before and yes you can define the contract better and improve check-lists but it's never as good as having a bunch of local people who see it as their responsibility to get things working and to keep them that way.
Was at the conference in question. Oddly, Delran was one of the few presenters not selling anything.
"Was at the conference in question. Oddly, Delran was one of the few presenters not selling anything."
Unisys CM manager Michel Delran, setting out a case for CM. It is all part of the sales pitch, even if he isn't talking prices.
@ElNumbre - That's a chicken-gum and chewing-wire approach. It also assumes recovery is in-house. Most likely it was outsourced to whoever ran their datacenter.
Do it properly the first time, which means consider your business criticality and DR requirements during design.
If you're dumb enough to sign off on a design that puts all your customer payment details on one server with no HA and no backup then you have problems that are beyond the ability of software to remedy.
@ Matt 21
Could be. Could just as easily be a big company with too many people protecting their individual silos and not actually worrying about the overall health of the business. Honestly, given that it was a Telco I'd bet on that and not outsourcing.
Re: dumb enough to sign off on a design
The design probably did have those elements. But it probably wasn't tested, so they didn't find the plan wasn't properly implemented so the alternate site couldn't be turned on.
I worked for a company that spent a fair chunk of change to have an onsite backup generator with a 24 hour fuel supply installed. If the power in the building failed, it was rigged to take over for the servers. Servers were on 30 minute UPS units, so it looked reasonably solid. Came into work one day and there was a leak in the kitchen on an upper floor. As a result, water was flowing down a pipe in the primary physical services core - right over the breaker box. Eventually the breakers shorted and all the magic smoke got out. We evacuated the building. It took management almost two hours to determine we would not be re-entering the building that day. Nobody on the network team was worried because the backups were all in place. Sr. Network Engineer drove home, connected up his work laptop and dialed in to perform a controlled shutdown of the facilities. Only he couldn't reach anything. It turned out the facilities management people, who thought nothing of the water running over the circuit box in the morning, decided that when the alternate power supply kicked in they needed to manually turn it off. So when the UPS units were drained, everything went down hard. Sr. Engineer commented that if he had known they were going to cut the power, he would have taken the laptop to the local Starbucks and used their WiFi to shut everything down before the UPS units were drained.
Short description - the plan worked fine right up to the point where a meatsack did something unexpected.
£30 switch? Do they mean network switch or power switch?
Well, if it was a network switch, we think we know where the problem was.
Obviously the solution is to invent yet another process. But who should we buy the consultancy from, I wonder?
Also is it me or does describing someone as a "champion" provoke a very BOFH-ish reaction?
If only they'd had a Jeff Bezos style CEO who'd go round their data centre unplugging cables at random and shouting at any team who had downtime on their systems....
Sounds like a guy we had at ICL (remember them?) in charge of testing user interfaces. When a software release was presented to him, the first thing he'd do was just slap his hand down on the keyboard a few times and see what happened. That got rid of about 50% of them with no further effort ...
Are companies really responsible...
for handling "Chip and PIN" data collected at the POS?
Re: Are companies really responsible...
Well someone has to handle it, those chips and pins won't find their way to charges on a credit card on their own.
Re: Are companies really responsible...
This is the sentence in the piece.
"Initially no one thought it important but this small fire resulted in an £8m loss of sales as the data centre contained the one server that was responsible for processing company’s Chip and PIN information on credit card purchases across Europe."
I find it confusing.
I swear I have heard this before...
Might have been one of those motivational goons that many corporations got during their conferences, but the story is so eerily familiar it is very Déjà vu without the French bits...
Maybe it is the same bullcrap story repeated many a time when someone wants to make a point that a relatively inexpensive and seemingly unimportant "thing" is big enough to take down a giant of a corporation or a vitally important system/procedure.
What would Trevorpott do?
Aside from not get into that stupid an error in the first place?
"And it wasn’t even the server that failed, Delran said, just a £30 switch on the server that burnt out as a result of the fire."
That doesn't sound dodgy to anyone else? A server with a built-in switch? A massively important server that they couldn't simply move the drives, and/or functions to another server? A cheap switch ("on the server"?!?!) that they couldn't replace?
I'd love el Reg to get in touch with Michel Delran & press him for more details of this, pardon my french, bullhonkey.
Re: Citation needed
I don't know Delran or the supposed phone company involved. I don't even really know exactly what 'configuration management' is.
BUT, I suspect the point is that the phone company didn't actually understand the configuration of their systems such that they didn't know that there was a server responsible for this process. Perhaps they thought it was handled elsewhere, by a third party.
The idea of implementing a 'configuration management' process is a bit odd but typical. The problem is that everything is managed by segmented teams with defined 'processes'. Sounds good but the result is that there is no one who really understands the system as a whole. Banging another 'process' over the top of that doesn't really address the issue.
If this is a 100% true story, I suspect that the phone company in question had lots of processes in place, managed by teams who didn't communicate with each other and this server just got 'lost' as it didn't fit neatly into those processes and/or teams.
While these comments are often quite well populated by people who are about doing everything exactly by the book, I'd wager almost everyone who's been in IT long enough has 'lost' or 'found' a server they never knew existed. Usually it's during a change - often a migration but sometimes a failure.
As one of the posters remarked - there's really no substitute for a dedicated sysadmin (or team) who understands the system and makes it his/her job to ensure that things work. The larger the system, the less that is possible but processes will never be able to replace the efforts and first-hand knowledge of that one odd dude with the ill-fitting shirt and abrasive attitude complaining about every change to the system.
And, regarding the specific question of the "£30 switch", perhaps it was some power switch? Or perhaps he meant 'network interface'?
Re: Citation needed
He doesn't give a specific date, but sure that's entirely believable. Of course I'm old enough to remember when motherboards were REAL motherboards and not a damn thing useful was pre-wired to it. Depending on what you were doing, ISA NICs ran about $30. Fire ignites the plastic sheath on the ethernet cable, burns up to the card, fries the card but nothing else in the server. Turns out the server was old and all the spares were PCI. So no NIC to replace it.
What I don't get is nobody being able to run to the local computer shop to buy one.
My introduction to configuration management was an e-mail from someone I had never heard of, with a meeting invitation that consisted almost exclusively of acronyms that I had not encountered before. Naturally, I assumed that this was a misdirected e-mail meant for someone else in the company with the same surname.
meant for someone else in the company
Surely you mean "meant for another internal resource in the company"
Naturally, I assumed that this was a misdirected e-mail meant for
someone else brain dead PHB in the company with the same surname.
This sounds like a corruption of a workplace event that happened in Oz.
On the railway network, a router in charge of sending commands to the rail switches (letting the trains change tracks) failed and the backup router was somehow plugged into itself.
As a result, all trains on one of the main arteries were brought to a halt and there was massive commuter chaos. Apparently a contractor unplugged a cable from the backup router, forgot where he unplugged it from and just plugged it back in any old place he could find.
The underlying message is the same. Don't use contractors :)
Know the business
Comments above are valid and highlight what can be missed. However, if we go beyond the technical causes, the one part missing is an appreciation of the business and the underpinning technology that delivers its key outcomes. If there had been an appreciation of this (and you can use Config to deliver this, as well as other options), then not only would there have been a greater appreciation of the risk, but steps could have been taken to mitigate this scenario. That means mapping the key business processes end to end with their key outcomes, and then adding in the technology footprint supporting them. This can be used in a number of areas: risk, SLA management, Change Management and ROI, to name a few.
In relation to the 'railway network' example, I am a supporter of taking the human chance for failure out where we can, so could there have been something as simple as labelling or a schematic, given the scenario happens on occasion? I have seen these examples in the past and unfortunately it feels we have to learn the hard way, but it would be more remiss if we did not apply the learning across the estate. So in this example, was there a solution, and did the organisation look at where they had a similar setup and apply that learning? Then did they take it up a level and identify similar situations based on people plugging a cable into the wrong hole?
Both are easy to write about and harder to get after but can be done.
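For what it's worth, the mapping exercise described above can be sketched in a few lines. This is a toy illustration only: the process names, component names and instance counts are invented, and "fewer than two instances" is an assumed stand-in for "no redundancy".

```python
# Hypothetical mapping: business process -> components it relies on.
process_footprint = {
    "card payments": ["payment-server", "core-switch"],
    "billing": ["billing-db", "core-switch"],
}

# Hypothetical inventory: component -> number of instances that can do the job.
instances = {"payment-server": 1, "core-switch": 2, "billing-db": 2}


def single_points_of_failure(footprint, instance_count):
    """Map each process to the components it relies on that have no spare."""
    risks = {}
    for process, components in footprint.items():
        # A component missing from the inventory counts as zero instances.
        weak = [c for c in components if instance_count.get(c, 0) < 2]
        if weak:
            risks[process] = weak
    return risks


risks = single_points_of_failure(process_footprint, instances)
```

With the made-up data above, only "card payments" is reported, because it hangs off a single payment server. The same table could feed risk, SLA or change discussions, but only if someone keeps it current.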
Re: Know the business
You can have an appreciation of the business and underpinning technology and still have these kinds of screwups. In particular I recall my first day as an actual wet behind the ears IT worker. Rode with the big boss who took an important call from the manager of a bank first thing in the morning. Seems the hard drive on their Fed Funds computer died the previous day and he needed a new one. Big boss moved heaven and earth, got a new drive and us to the site. Went to repair it, the drive was MFM and we had an IDE. I'm quite sure the manager knew how critical the system was. But the card for the Fed Funds transfers was so damned expensive and the system just always worked so they never upgraded it.
We did actually manage to save his ass with an FDisk and a whole lot of jury rigging to get data off the drive before we put it back on (thank God it was DOS). But we told him to replace the system because we couldn't guarantee another miracle.