Wait... who broke that? Things you need to do to make your world diagnosable

You only ever discover the inadequacy of your system management, monitoring and diagnosis tools when something goes wrong and there's a gulf between what you want to do and what you need to do. Here are 10 things you can do to maximise your chances of diagnosing the problem when the brown stuff hits the ventilator. Ladies and …

  1. Duncan Macdonald
    Flame

    Nice when you have the resources

    In all too many outfits, the person on call is the ONLY system administrator - and he (or she) has other things to do as well.

    Small companies (under 200 people) are lucky if they have more than 2 IT professionals - often there is only 1 with a part time assistant to cover holidays.

    Even in larger companies (that should know better) the jobs of system/database/network administrator are all too often regarded as an overhead to be reduced as far as possible. The result is an IT department and systems that should be like a warship (able to take a lot of problems, human or technical, and keep going) but are more like a Panamanian freighter (only able to cope if nothing goes wrong). Some beancounters try to move support to an offshore team (e.g. in India) and wonder why they have IT problems down the line.

    1. RIBrsiq
      Pint

      Re: Nice when you have the resources

      On the plus side, being a one person show means that one can legitimately claim to be the first, last and only... everything, really, on the technical side of things.

      1. Anonymous Coward

        Re: Nice when you have the resources

        to be the first, last and only

        Wasn't it "first, last and everything" or was that Barry White?

        1. Roq D. Kasba

          Re: Nice when you have the resources

          Smaller businesses have to prioritise. If a server being down is business-critical and requires 24h scrambling, pulling people back from holiday, etc., then the business needs to prioritise and spend what that level of service costs. If that means a guy on retainer to cover holidays and weekends, that's the cost.

          You can't have it both ways: five nines for two-nines prices. You need to be frank and realistic. It's no different from any other insurance policy: higher premiums for better cover.
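
          To put rough numbers on "five nines" versus "two nines", here is a minimal sketch of the yearly downtime each figure allows. The function name is made up for illustration, and it assumes a plain 365-day year with availability measured around the clock.

```python
# Hypothetical helper: how much downtime per year a given availability allows.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours of downtime per year permitted at this availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


two_nines = downtime_hours_per_year(99.0)      # about 87.6 hours a year
five_nines = downtime_hours_per_year(99.999)   # about 5.3 minutes a year

print(f"99%     -> {two_nines:.1f} hours/year")
print(f"99.999% -> {five_nines * 60:.2f} minutes/year")
```

          The gap between several days and a few minutes a year is what the higher premium is buying.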

          1. yoganmahew

            Re: Nice when you have the resources

            @Roq

            "Smaller businesses have to prioritise."

            It's not any different for larger businesses. Cost is the reason for moving from legacy systems, many of which already embody these features. Hence the reason the legacy systems are expensive...

          2. JerseyDaveC

            Re: Nice when you have the resources

            Yup. And of course the beancounters opt for the 8x5 support option because it's way cheaper than the 24x7 one ... right up until something important dies on the Easter weekend and they expect someone to respond.

            1. Anonymous Coward

              Re: Nice when you have the resources

              > Yup. And of course the beancounters opt for the 8x5 support option because it's way cheaper than the 24x7 one

              Yeah, we have customers like that - and they are quite happy to be abusive to the support guy who's on call when the big cheese can't do something that's important to him (not the rest of the company) at 6am UK time.

              But several of those items are things I've been preaching about for a decade - and being ignored. Like it's sooooo hard to get the servers on the right time, well actually for us it does seem like it's too hard.

              Hmm, better post that as anon !

        2. Doctor_Wibble
          Trollface

          Re: Nice when you have the resources

          >> to be the first, last and only

          > Wasn't it "first, last and everything" or was that Barry White?

          No I think it's just a mis-quote of Chesney Hawkes.

          .

          Good 'reminder of what should be' article, though I echo remarks about the realities 'as enjoyed in person'...

      2. Alistair
        Windows

        Re: Nice when you have the resources

        First, last, and everything, till the bus comes round the corner.

        If you're in that spot, your documentation had better be impeccable.

        Why, hello Mr. Murphy.

  2. Steve Davies 3 Silver badge

    The whole article could have been replaced by:-

    What can go wrong, will go wrong.

    That leads to

    1) So be prepared for IT, whatever IT is.

    2) you can't cover all the possible bases so make sure that when that shit happens there is someone else to carry that particular can.

  3. jake Silver badge

    What else is new.

    IBM preached this in the 1950s

    MaBell preached this in the 1960s

    DEC preached this in the 1970s

    ... etc. etc. etc. etc.

    The more we forget history ...

  4. John Smith 19 Gold badge
    Go

    different times on their internal clocks, whatever logs..will be almost impossible to collate

    One of those "Oh a couple of minutes between devices, what's the harm?" points.

    I thought it was quite easy to set up a time server so all devices sync their clocks to it.

    Apparently not.

    His last point (lessons learned) is often ignored but it's the way you avoid doing this all over again in x weeks'/months'/years' time.

    1. Destroy All Monsters Silver badge

      Re: different times on their internal clocks, whatever logs..will be almost impossible to collate

      Time SNAFU was even mentioned in "The Cuckoo's Egg" if I remember well.

      1. JerseyDaveC

        Re: different times on their internal clocks, whatever logs..will be almost impossible to collate

        it certainly gets a mention in Shimomura's "Takedown" - that's where I learned it all those years ago and it's a lesson that has been useful many times.

        1. Destroy All Monsters Silver badge

          Re: different times on their internal clocks, whatever logs..will be almost impossible to collate

          But Shimo's "Takedown" is basically a hagiography of his cool self, written by that NYT ICT hack Markoff. Don't buy; possibly leaf through it when you find it at the disposal centre.

          Better read Jonathan Littman's "The Fugitive Game", even if Kevin M. says it ain't what really happened. At least it reads like journalism.

        2. jake Silver badge

          Re: different times on their internal clocks, whatever logs..will be almost impossible to collate

          Professionals sync their entire network to one of several atomic clocks.

          (Example, ntp.org ... I use something different.)

          My network time-keeper checks in on "ntp.org" once a week (and is accurate to under a quarter of a second every six months). The rest of the kit take clock from that box daily. It's close enough for government work, so it's close enough for me.

          Claiming "we can't match up the various machine logs to exact times" is a cop-out from an extremely clueless netadmin staff.
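
          For logs that were already written with drifting clocks, a rough sketch of after-the-fact collation might look like the following. Everything here is an assumption for illustration: the host names, the measured clock errors, the timestamp format and the sample entries are all made up, and it only works if you know (or can estimate) how far out each clock was.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical measured clock errors: how far each host's clock runs ahead of
# the reference time source (negative means it runs behind).
CLOCK_ERROR = {
    "web01": timedelta(seconds=95),
    "db01": timedelta(seconds=-40),
}


def corrected_utc(host, stamp, utc_offset_hours):
    """Turn a host-local log timestamp into UTC with the host's clock error removed."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    local_time = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=local_zone)
    return local_time.astimezone(timezone.utc) - CLOCK_ERROR.get(host, timedelta(0))


def collate(entries):
    """Return (utc_time, host, message) tuples sorted into one timeline."""
    return sorted(
        (corrected_utc(host, stamp, offset), host, message)
        for host, stamp, offset, message in entries
    )


# Made-up entries: (host, local timestamp, host's UTC offset in hours, message)
entries = [
    ("db01", "2016-03-25 11:59:50", 2, "connection dropped"),
    ("web01", "2016-03-25 10:01:35", 0, "request failed"),
]
timeline = collate(entries)
for when, host, message in timeline:
    print(when.isoformat(), host, message)
```

          In this made-up case the raw timestamps suggest db01 logged first, but once time zones and clock errors are removed the web01 entry turns out to be 30 seconds earlier. Keeping every box synced and logging in UTC avoids needing any of this.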

  5. circusmole
    Alert

    What about...

    ...strict control of administrator rights and a proper, enforced change management process?

    1. storner
      Meh

      Re: What about...

      What about it? In the words of Mahatma Gandhi (when asked about western civilization): "I think it would be a good idea".

      He wasn't impressed. Me neither.

      Change management usually means that updates are slow to trickle out. Just ain't gonna fly these days with developer teams rolling out hourly software updates and managers screaming to get the hottest new whizbang thing on the production systems. So it gets overridden by some PHB ("it's just a small UI change!") and things break.

      The article is about cleaning up when things go bad, not preventing them (save for the post-mortem analysis). For that it is a pretty nice list, although I think most experienced sysadmins could write it in about 10 minutes.

  6. Doctor Syntax Silver badge

    GMT?

    UTC!

    1. Anonymous Coward

      Re: GMT?

      Later!

  7. Destroy All Monsters Silver badge

    Unleash Chaos Monkey!

    DO it!

    (Not that I have dared to do this yet, I'm chicken this way)

  8. Anonymous Coward

    Oh yeah...

    Secondary comms infrastructure: a little reserve e-mail server, a well-maintained list of mobile phones, fixed phones and pagers, maybe even the walkie-talkies...

    1. Duncan Macdonald
      Facepalm

      Re: Oh yeah...

      But on a budget that barely pays for 2 tin cans and a bit of string to go between them !!!

      How many of you have taken to doing careful salvage of no longer used equipment to get the spares to keep production systems running ?

      For all too many firms, disaster recovery is something they only think about once the disaster has occurred.

      (Really small firms (1-5 people) may not even know what a backup is ! I have had to recover data from a corrupted hard disk more than once because the people concerned did not have a backup !)

  9. Anonymous Coward

    It is very important to check your procedures in the event of a crisis. After a power loss at a previous employer (which took out the whole campus) it was discovered that the ATS to move the DC power supply from the UPS to generator didn't work. I think it was only 1.5 to 2 years old at the time. This led to a complete loss of our main DC, which was also used as a DR site for 2 other institutions. Oops.

    During the same incident, while trying to restore services, the paper copy of the passwords was required. It was locked in a safe, which was kept in a secure location. So secure that the door had to be brought down by a member of estates with a sledgehammer, as the door had failed shut and required the mag lock to be powered to open. This eventually didn't matter, as it was discovered that the person who printed the passwords had only checked the first few pages of the document after printing and didn't realise that the printer had run out of ink halfway through, leaving several pages missing.

    That was a fun day.

    1. Disk0
      Boffin

      ...required the mag lock to be powered to open

      ...would be a beautiful example of failing design logic if it wasn't such a horrific blunder...

      1. Lord Elpuss Silver badge

        Re: ...required the mag lock to be powered to open

        I was thinking that a safe that could be opened by a sledgehammer, wasn't actually that safe at all.

  10. Alistair
    Windows

    major failures

    One *has* to do postmortem on anything that pulls down a system for a serious SLA violation. In my experience 85% of the majors I've seen can be attributed to human error (i.e. keyboard stupidity), roughly 8% can be attributed to the likes of a teenage house party (Holy s&&t we didn't expect that many of those), which is in itself human error. The remainder can be attributed to the comings and goings of Murphy - hardware failure cascades that were just so far off the map that there was no reasonable expectation that the series of events could ever occur.

    When planning things ahead of time, I try to find time to walk through systems with someone non-technical enough to ask what would, to a techie, be really stupid questions. Because we (as techies) tend not to ask those questions. And sometimes, if you stop and think about the question, it might *just* be a good thing to cover.
