Crashed and alone in a remote location: When paid help is no help

I took the plunge and became a freelance IT consultant in 2001. Through an unlikely series of coincidences (former colleague from London goes to a travel show in France and bumps into two guys from Yorkshire who are looking for a software and database architect) I ended up in North Yorkshire …

  1. MJI Silver badge

    Interesting read

    Real-life stories like this are always worth a nose

    1. NotBob

      Re: Interesting read

      Indeed, but the title seems to have nothing to do with the content.

  2. John Hawkins

    Wilds of Yorkshire

    "The Third World" sketch in the Python film "Meaning of Life" springs to mind here for some reason...

    1. John G Imrie

      Re: Wilds of Yorkshire

      Every sperm is sacred.

      1. Antron Argaiv Silver badge
        Windows

        Re: Wilds of Yorkshire

        Lost me job down 't mill.

        It's medical experiments for the lot o' you...

    2. gypsythief

      Re: Wilds of Yorkshire

      That was actually filmed only a couple of miles down the road from where this story was based.

      I've got to say though that Twistleton Scar is distinctly more wild than the rolling drumlins around the location in question.

      Still, nice to have a story from my backyard!

  3. itzman
    FAIL

    24 hour service response...

    Yup, Cisco tried to sell this to my customer: 'Anywhere in the UK, sir'.

    The customer was sceptical: 'And how does that work in a gale, when Guernsey Airport is closed and the ferry can't dock for three days due to high winds and seas?'

    Instead, they carried a complete set of spares themselves.

    1. Uncle Slacky Silver badge
      Holmes

      Re: 24 hour service response...

      Easy - Guernsey's not in the UK...

  4. Steve Davies 3 Silver badge

    The Server CPU Swap game

    There we were, working away in Central Asia, when one of the two servers died. We found that it was a CPU board failure. The nearest one was in Moscow (3 time zones away). After several long phone calls we dispatched one of the client team to the airport. He caught a flight to Moscow, where he was met by the Field Service Manager. A CPU board swap took place in Sheremetyevo Airport and the return flight was duly caught. The local went because he didn't need a visa to enter Russia; we westerners would have needed one. The airfare at the time for locals was also a quarter of that for us rich westerners.

    A little under 10 hours after the crash, the system was up and running again.

    The IT director took us out to dinner for fixing the system in the way we did.

    This was in the mid-1990s. Those were the days.

    1. Rich 11

      Re: The Server CPU Swap game

      Those were the days.

      Those were also the days when one or another Babyflot which had skimped on its maintenance schedule would have a bird drop out of the sky every month. Did your local ask for danger pay?

  5. PickledAardvark

    Quality service from DEC

    An employer in the 1980s had scheduled an upgrade to a VAX. The engineer was supposed to be on site for a few hours during which the system would be unavailable -- timed to cause minimum impact to customers in Europe and Japan.

    Everything was going well until the engineer stood on the last board to be fitted. Ouch. DEC found a replacement in Manchester and it arrived three hours after the accident. That's a pretty good time for handling a distress call after normal working hours, finding the board in a warehouse and driving it to the Midlands. The VAX was back in service later than scheduled but no customer complained.

    It was an early learning experience:

    * Competent plans go wrong in unexpected ways;

    * Great suppliers are the ones who do a good job when things go wrong, not organisations that (unrealistically) never make mistakes;

    * Set honest expectations -- with staff and customers -- and be frank about errors and problems.

    1. Dabooka

      Re: Quality service from DEC

      * Don't leave kit lying around on the floor

      I still haven't come to terms with that last one, despite the 'learning opportunities'.

      1. PickledAardvark

        Re: Quality service from DEC

        Not a lot of room between the back wall of the server room and the server rack. You are right though -- don't put kit on the floor unless there is nowhere else to put it. It's human to trip over.

        My argument was about how the organisation providing a service responded to a foul up.

    2. Anonymous Coward
      Anonymous Coward

      Re: Quality service from DEC

      "DEC found a replacement in Manchester and it arrived three hours after the accident. That's a pretty good time for handling a distress call after normal working hours, finding the board in a warehouse and driving it to the Midlands."

      DEC were good at that; sadly, it's probably the cost of that sort of service that killed them.

      Our office in Belfast was bombed late one Friday afternoon (we had the misfortune to share the building with a tax office). No-one was hurt, everyone evacuated in time, and the servers all came back up OK on Saturday once we had the all-clear, but the offices were uninhabitable (smashed windows, ceilings down) and most of the terminals on people's desks were wrecked.

      Our boss put the DR plan into effect on Friday evening, and phoned DEC. Saturday lunchtime, while our own guys were cabling up spare space in the building next door, DEC arrived with a vanload of new terminals and other kit, driven up from Dublin. Local DEC guys helped get them set up, and by 9am Monday morning everyone had a desk and working terminal.

      I'm not sure it would happen that smoothly these days...

  6. HmmmYes

    Well, it does reiterate what a redundant system is:

    One box here, the other box over there. Hopefully on a different network + power supply.

    In another building would be good.

    It's not a good idea to have a 'redundant' server that can be wiped out by a single cup of coffee.
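
    A minimal sketch of that idea (hypothetical Python with invented host data, not anything from the article): check whether a "redundant" pair actually sits in separate failure domains before trusting it.

        # Hypothetical inventory: where each box lives and what it depends on.
        servers = {
            "app-primary": {"building": "north", "power_feed": "A", "network": "core-1"},
            "app-standby": {"building": "north", "power_feed": "A", "network": "core-1"},
        }

        def shared_failure_domains(a, b):
            """Return the attributes both 'redundant' boxes have in common."""
            return [key for key, value in a.items() if b.get(key) == value]

        overlap = shared_failure_domains(servers["app-primary"], servers["app-standby"])
        if overlap:
            print("Not really redundant - shared single points of failure:", ", ".join(overlap))
        else:
            print("Pair spans separate buildings, power feeds and networks.")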

  7. captain_solo

    First off, it's almost always faster to listen to your user base and their ability to detect service failures than to rely on autonomic monitoring systems. Even the remote telemetry solutions that Sun/Oracle etc. have are generally slower than a user picking up the phone to cut a ticket. Monitoring still helps with failures that don't cause an outage, the kind you might not notice until you scrub a log, but when the server completely craps out you will likely know before your monitoring tools do.

    For remote locations it's often worth considering an onsite parts agreement, so that critical components are already in the DC for the engineer, or the customer themselves, to use to restore service without waiting for delivery of parts that could be delayed by weather, traffic, or because the one part you need is out of stock at your local stocking location.

    Most of the time it's probably cheaper (long-term TCO) to have N+1 redundancy than to rely solely on a premium support SLA to keep you in business. Depending on the cost of an outage, you might be able to get by with business-hours support on gear whose availability you can afford to lose for a few hours. Clustering, load-balancing, and now "serverless" application designs or VM/container mobility strategies can buy you time to diagnose and restore individual nodes without having to make the panic call to the vendor at 0-dark-thirty. Of course, back in the day of this story there were fewer options on that front, and the redundant gear tended to be a little pricey to be left idle.
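
    Back of an envelope, with entirely made-up numbers (a hypothetical Python sketch, nothing from the original story), the trade-off looks something like this:

        # Invented figures: compare a premium SLA against keeping an N+1 spare.
        outage_cost_per_hour = 5_000      # hypothetical cost of downtime per hour
        outages_per_year = 2              # hypothetical hardware failure rate
        sla_restore_hours = 4             # premium contract: hours until service is back
        failover_hours = 0.1              # N+1 cluster: time to fail over automatically

        premium_sla_per_year = 12_000     # hypothetical contract price
        spare_node_per_year = 8_000       # amortised cost of the idle extra node

        cost_sla = premium_sla_per_year + outages_per_year * sla_restore_hours * outage_cost_per_hour
        cost_n_plus_1 = spare_node_per_year + outages_per_year * failover_hours * outage_cost_per_hour

        print(f"Premium SLA route: ~{cost_sla:,.0f} per year")
        print(f"N+1 route:         ~{cost_n_plus_1:,.0f} per year")

    Swap in your own outage cost and failure rate; the point is only that the idle spare often pays for itself.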

    Cool Story Bro

  8. Tom 7

    Twistleton Scars - not very remote

    I could be in the Hill Inn in twenty minutes from there.

  9. Destroy All Monsters Silver badge
    Thumb Up

    That photo looks like something from "Dear Esther", with a better engine

    Effing rooohmanthic!!

  10. Matt Bryant Silver badge
    Happy

    Failed CPU crashing server, not uncommon.

    IIRC, this was an issue for the different UNIX flavours of the period: they could swap a failed CPU out as long as it wasn't the monarch CPU running some of the kernel threads. TBH, it was a great way to scare manglement and get budget for a second system and clustering software: point out that in a 4-way server a CPU failure was 25% likely to take out the monarch, meaning a crash and a total loss of service. "25%" sounded scary; I just used to leave the small likelihood of a CPU failure out of the maths.
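
    A quick back-of-envelope version of that pitch (hypothetical Python, with an invented per-CPU failure rate rather than real figures):

        # Invented per-CPU annual failure probability, just to show the arithmetic.
        n_cpus = 4
        p_cpu_fails_per_year = 0.02

        # Probability that at least one CPU fails during the year.
        p_any_failure = 1 - (1 - p_cpu_fails_per_year) ** n_cpus
        # Given a failure, any CPU is assumed equally likely to be the monarch.
        p_monarch_given_failure = 1 / n_cpus
        p_total_outage = p_any_failure * p_monarch_given_failure

        print(f"Given a CPU failure, chance it hits the monarch: {p_monarch_given_failure:.0%}")
        print(f"Chance of a monarch-failure crash in a year:     {p_total_outage:.2%}")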

    As for no "SSDs" - ahem - yes, there were solid state devices available. In 2001 I was using Texas Memory Systems' Ramsan solid state boxes to boost Oracle databases.

    1. Marshalltown

      Re: Failed CPU crashing server, not uncommon.

      Mmmm, unless the system somehow rolled random numbers during boot-up, the odds are that the very same CPU was boss after every boot, simply because of the physical layout of the system. That would mean one CPU would likely see greater wear and tear, so to speak, than all the others. So those 25% odds were probably weighted toward the house more than you might expect.

  11. Anonymous Coward
    Anonymous Coward

    Hmmm

    "Then one evening at about 6:45 I was having dinner"

    I do think sysops should really get into the habit of saying things like "18:45 ZULU". It lightens the load on ticket wrestlers and possibly lawyers...

    1. Will Godfrey Silver badge
      WTF?

      Re: Hmmm

      Who on earth wants to 'lighten the load' on lawyers!

      1. Crazy Operations Guy

        Re: Hmmm

        "Who on earth wants to 'lighten the load' on lawyers!"

        I worked for a law firm: a smaller load on the lawyers means they're in the office less; being in the office less means they don't have quite as long to break their systems in new and exciting ways.

  12. OzBob

    Interesting from the point of view of support

    Always found support in the Midlands to be 3-4 hours for one vendor; fascinating to hear that Yorkshire got a much better response (but I guess they were paying for it).

  13. Stoneshop

    SSD wasn't even heard of back then

    Well, NAND flash SSD, maybe.

    Basically, core memory is SSD too. And in the 1990s several manufacturers had a couple of solid-state drives in their programme. DEC had one, physically the size of an HSC50 (can't recall the model number; ESE50?), which was essentially a backplane filled with 150MB worth of DRAM boards and an SDI interface, plus an MVAX board with an RD54 hooked up and a UPS. If the power went out, the UPS was to keep the lot running while the memory contents were transferred to disk. Later they had a drive with a 3.5" form factor, SCSI interface, static RAM and a rechargeable battery. A couple of hundred MB, IIRC. No idea of the list price of either, but definitely well over that of their size in spinning rust.

  14. Marshalltown

    Service? What is this - service?

    I have only worked for one business where the owner was willing to pay for a service contract. For the rest, we made up a song, "The Electron' Swap" to cover how "service" was done:

    "I entered the office late one night,

    The hardware systems were a ghastly sight,

    Our two 'hardware specialists' had their screwdrivers out,

    there were pieces of gear all strewn about.

    They did the swap, the electron' swap..."

    The nearest Fry's was over an hour away.
