Sysadmin’s plan to manage system config changes backfires spectacularly

Welcome once more to Who, Me?, the column for Reg readers to get their worst deeds off their chest. This week, "Ryan" tells us about a time many years ago when he got a little bit cocky with root-level commands. At the time, he was the senior systems and network administrator for a major research lab. "I administered a …

    1. Loyal Commenter Silver badge

      Re: Why use a revision control system?

      If there's something broken, one would restore from tape (or at least, restore the offending file from a tape).

      Hahaha. No.

      The reason why this is wrong is nicely illustrated by the following hypothetical situation:

      You get in at 8am and find there is some urgent configuration work to do and your client needs it all working by the end of the day. The changes aren't simple, and after making and testing several revisions, you're finally ready to go at 4:30 pm. You're just about to run your scripts, and you discover that you've accidentally deleted the folder they are in because Windows Explorer had the focus when you thought you were hitting the Delete key in a Word document (because you are documenting everything, and you are working over a laggy connection to a VM in another office). Do you:

      a) Restore from a tape backup and repeat 8.5 hours of work. This will take 24 hours to retrieve the backup tape from the secure off-site storage, followed by 3 hours to verify, find and restore the file in question. Or it will do, once you have got management authorisation to make a request to have the tape retrieved. Let's hope the backup completed successfully, eh?

      b) Retrieve the last good version of your script from version control and reapply the last 0.5 hours of work.

      Tape backups and version control systems are different tools, for different jobs, and both have their places. I wouldn't use a git repository for database backups, and I wouldn't use tape for version control.

  1. Terje

    I can think of several reasons to use revision control for config files.

    It makes it very easy to set up a new unit in a specific environment while keeping tabs on what changes are made. Say all the computers in lab B, where you are adding a new computer, share one config but those in lab A are different: you check out the correct branch (Lab B) and get all the correct configs for it.

    If down the line you find some issue that is not immediately apparent, you can easily see what has changed in the config since it last worked, no matter how long ago the change was made.
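
As a rough illustration of the "see what has changed since it last worked" point, here is a minimal Python sketch using the standard `difflib` module. The two fstab revisions and the function name `config_diff` are made up for the example; a real revision control system would supply the old revision for you.

```python
import difflib

# Hypothetical revisions of a config file; the contents are invented.
last_known_good = """\
/dev/sda1  /      ext4  defaults  0 1
/dev/sda2  /home  ext4  defaults  0 2
"""
current = """\
/dev/sda1  /      ext4  defaults  0 1
/dev/sda2  /home  ext4  ro        0 2
"""

def config_diff(old, new, name="fstab"):
    """Return a unified diff between two revisions of a config file."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile=name + " (last known good)",
            tofile=name + " (current)",
            lineterm="",
        )
    )

changes = config_diff(last_known_good, current)
print("\n".join(changes))
```

The diff shows one removed line and one added line for /home, which is the kind of answer you want when something that worked last month no longer does.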

  2. Saruman the White Silver badge

    Other screw-ups

    Many, many years ago when I was young and working in my first job, I was given responsibility for managing a small network of Sun workstations (one of which was a diskless node). One day I decided I needed to clean up /tmp since it was close to being full and causing problems, so I logged on as "root" and entered the command "rm -rf / tmp". Note the significant space!

    Control-C followed after about 2 seconds, but there was not enough of the system left to be usable (although I was able to dump the data directories to tape prior to a full re-install).
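
To see why that one space matters, this small Python sketch uses `shlex.split`, which tokenises a command line roughly the way a POSIX shell does. The helper `targets_root` is a hypothetical check written for this example, not part of any real tool.

```python
import shlex

intended = shlex.split("rm -rf /tmp")
typed = shlex.split("rm -rf / tmp")

def targets_root(argv):
    """Return True if any argument after the command name is exactly '/'."""
    return any(arg == "/" for arg in argv[1:])

print(intended, targets_root(intended))
print(typed, targets_root(typed))
```

With the space, the command receives two separate targets: the root directory and a relative path called tmp.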

    1. Waseem Alkurdi
      Thumb Up

      Re: Other screw-ups

      Safe aliases for 'rm' are a good thing to prevent this!

      1. Chairman of the Bored

        Re: Other screw-ups

        @Waseem, aye! Excellent point. Another trick is to have some zero length files in all directories you care about called '-i'

        If someone is running a rampaging rm -rf, having defeated the safe alias because of {reasons}, this may force rm back into interactive mode.
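
A sketch of why the '-i' file can work, under some assumptions: the trick relies on someone running something like rm -rf * inside the directory, so that the shell expands * into the file names and '-i' ends up among the arguments, where rm may read it as an option (POSIX specifies that a later -i overrides an earlier -f). It does nothing when the directory itself is named as the target. Python's `sorted()` stands in for glob ordering here; real shells order globs by locale, so the position of '-i' can differ.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Two ordinary files plus the zero-length guard file named '-i'.
    for name in ("notes.txt", "data.db", "-i"):
        open(os.path.join(d, name), "w").close()
    # sorted() orders by code point, similar to a glob in the C locale.
    expanded = sorted(os.listdir(d))

# What rm would receive after the shell expands "rm -rf *".
argv = ["rm", "-rf"] + expanded
print(argv)
```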

      2. CAPS LOCK

        "Safe aliases for 'rm' are a good thing to prevent this!"

        Testify! After, shall I say, bitter experience, I learned the wisdom of creating a 'del' command with 'mv'. I haven't created an alternate version of 'rm' just in case I need it to function as standard.
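
The 'del' idea can be sketched in Python as follows. This is only an illustration of "move instead of remove"; the name `soft_delete`, the `.trash` directory and the timestamp suffix are assumptions made for the example, not anything described in the thread.

```python
import os
import shutil
import tempfile
import time

def soft_delete(path, trash_dir):
    """Move path into trash_dir instead of removing it; return the new location."""
    os.makedirs(trash_dir, exist_ok=True)
    base = os.path.basename(os.path.normpath(path))
    # Timestamp suffix so repeated deletions of the same name do not collide.
    dest = os.path.join(trash_dir, "%s.%d" % (base, time.time_ns()))
    shutil.move(path, dest)
    return dest

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as work:
    victim = os.path.join(work, "scripts")
    os.makedirs(victim)
    with open(os.path.join(victim, "deploy.sh"), "w") as f:
        f.write("#!/bin/sh\n")
    trash = os.path.join(work, ".trash")
    moved_to = soft_delete(victim, trash)
    still_there = os.path.exists(os.path.join(moved_to, "deploy.sh"))
    original_gone = not os.path.exists(victim)
```

Because it has its own name, the real rm keeps its standard behaviour for the times you need it.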

      3. Nick Kew

        Re: Other screw-ups

        Safe aliases for 'rm' are a good thing to prevent this!

        Aliases for standard system commands are pure evil. They bugger up expectations, both for those who know the standard commands and may react unpredictably to unexpected behaviour, and for those who come new to the aliases and are then surprised by the real thing.

        If you want an "rm" you consider safe, use something else for the alias. "del", for instance.

    2. Danny 2

      Re: Other screw-ups

      One of my most embarrassing mistakes was working at a Cisco-kid software testing on Solaris.

      One young tester was distractingly chatting away while typing about how some idiot at his university had rm -rf'ed and ruined his project. And as he was telling the anecdote the young tester rm -rf'ed his own work, and admitted it.

      I was trying to tune out his monologue but thought, "What a bloody idiot". And then I rm -rf'ed my own system.

      Warnings are like ear-worms, they sink into your subconscious. When you are not paying full attention and you hear the last thing you want to do, it becomes the thing you do next.

  3. Rich 11
    Joke

    When was the last time your best-laid plans went very awry?

    Gallipoli. I really should have made sure the landing craft were loaded into the cargo holds last.

  4. MacroRodent

    SCCS hits you

    The version control system must have been SCCS, which was for years the standard tool for this on Unix. It has this weird default of removing the edited copy of the file when you check in the changes. There is an option to immediately check out the read-only copy, but it is not the default behaviour.

    1. ibmalone

      Re: SCCS hits you

      Thanks, I was puzzling over why not checking out a read-only version by itself could have caused this.

      1. Peter Gathercole Silver badge

        Re: SCCS hits you

        The problem (or maybe it's a strength) with SCCS is that you have embedded tags that are expanded, normally with dates, versions etc. as the file is checked out readonly. With SCCS, they are surrounded by % or some such. (RCS does use similar but incompatible tags, I'm not sure about other systems).

        The problem is that in some cases, these tags can mean something to other tools, which may also expect to use % as a special character, in which case deploying a copy that has not been checked in may cause undesirable effects.

        Of course, one solution to this is to use it with "make", which would allow you to perform additional processing around the versioning system. I'm not sure I remember how I did it, but I'm pretty certain when I used make and SCCS in anger, I had a method where I could spot that it was not checked in. Make is slightly aware of SCCS.

        But of course, you can't meaningfully compare SCCS with modern tools. I'm sure it wasn't the first versioning system around, but it must have been one of the earliest, dating back to the early 1970's. It was not meant to work with vast software development projects with many people working on them, but for its time, it did a pretty good job (Bell Labs used it to develop UNIX).

        Each iteration of version control since, like CVS, RCS, arch, Subversion, Git et al., has expanded on the functionality, meaning that as the grandaddy of them all, SCCS cannot come out favorably in any comparison.

        But I still use it on occasion, as it is normally installed on AIX, even when nothing else is.

        1. MacroRodent

          Re: SCCS hits you

          Tag expansion also happens in RCS, CVS and Subversion (in the latter it has to be enabled in the properties of the file). The difference is that the tag trigger notation in these ($Id: ...$ and some others) stays in the file, whereas in SCCS the magic strings expand to version numbers without the triggering character sequence.

          Git lost this feature, because it is seriously contrary to its idea of identifying versions with a hash of the file contents. Expanding a version tag would make the file be of a different version in the eyes of Git. A loss, because the embedded file version numbers have often saved my sanity by allowing a compiled program to identify what file versions it has been put together from.
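
The conflict between keyword expansion and content hashing can be shown with a short Python sketch. In a SHA-1 repository, Git names a file's contents (a "blob") by hashing a small header plus the bytes of the file. The two source lines below, including the expanded $Id$ string, are invented for illustration.

```python
import hashlib

def git_blob_id(content):
    """Object id Git assigns to file contents in a SHA-1 repository."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Hypothetical file contents before and after keyword expansion.
unexpanded = b'static const char rev[] = "$Id$";\n'
expanded = b'static const char rev[] = "$Id: main.c,v 1.4 2001/01/01 12:00:00 ryan Exp $";\n'

print(git_blob_id(unexpanded))
print(git_blob_id(expanded))
```

The two ids differ, so an expanded working copy would look like a different version of the file, which is the problem described above.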

    2. cdegroot

      Nothing new...

      My thought as well. I've used CVS for the same in the '90s, and it worked quite well - I hated the guts of SCCS and always tried to stick with RCS instead which didn't have the anal locking that SCCS sported.

      I've never gotten around to unleashing git on /etc/ though (although my "dotfiles" are there and it's very nice). There's enough stuff in there to make it maybe worth a try, although these days Chef/Puppet/Ansible/Salt/... are probably more appropriate.

  5. Chairman of the Bored

    Ok, we need some beer over here!

    Two pints:

    One for the OP to have the courage to admit the mistake, and the second for his management to have the wisdom to chalk this up to a learning experience

    Cheers!

  6. ibmalone

    I'm missing something...

    Why was a writeable fstab so fatal? Having it non-root writeable isn't good, but I wouldn't expect a writeable fstab to get wiped on boot on a modern system (every Linux I've seen has had it 644). Something different about Sun?

    1. Anonymous Coward

      Re: I'm missing something...

      I don't know the system but I presumed once checked in the file was locked or removed until it was checked out RO. Probably just a quirk of the vcs.

    2. OldCrow

      Re: I'm missing something...

      One of those older version-control systems that imitated a physical pile of cards. A check-in removes the file from your disk.

      I'm sure it had SOME kind of logical reason for doing that beyond trying to imitate carbon-copy shifting, but I wouldn't know what the reason is.

    3. Doctor Syntax Silver badge

      Re: I'm missing something...

      "Why was a writeable fstab so fatal?"

      I think what you've missed was that the revision control removed the file when checking in. That's why it had to be checked out again.

      Checking out read only would be a side issue. It would mean that the revision control system wouldn't have the version locked and it would also mean that the running version couldn't get edited to a state inconsistent with the version the revision control system had marked current.

      1. ibmalone

        Re: I'm missing something...

        I think what you've missed was that the revision control removed the file when checking in. That's why it had to be checked out again.

        Thanks, yes, looks like another commenter has fingered SCCS as the culprit. Never met a VCS that does that, but I'm sure it made sense to somebody at the time o.0

        Knowing that makes the whole thing seem a lot more rickety. I suppose I might have taken to copying the file and checking in the copy instead, but there's only one way to learn that kind of paranoia...

        1. Doctor Syntax Silver badge

          Re: I'm missing something...

          " I suppose I might have taken to copying the file and checking in the copy instead"

          I might have written a script that did the check-in/check-out as a single command. That's assuming there wasn't an option - as per the comment on SCCS - in which case just get used to that as the normal way to do things.
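
The "single command" idea can be sketched with a toy model in Python. `ToyVCS` below is not SCCS and does not call any real tool; it only imitates the behaviour described in this thread (check-in removes the working file) so the wrapper `check_in_and_restore` has something to wrap. All names are hypothetical.

```python
import os
import stat
import tempfile

class ToyVCS:
    """Toy model of a VCS whose check-in removes the working file."""

    def __init__(self):
        self.history = {}

    def check_in(self, path):
        with open(path, "rb") as f:
            self.history.setdefault(path, []).append(f.read())
        os.remove(path)  # the surprising default described in the thread

    def check_out_read_only(self, path):
        with open(path, "wb") as f:
            f.write(self.history[path][-1])
        os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

def check_in_and_restore(vcs, path):
    """Check in, then immediately put a read-only copy back in place."""
    vcs.check_in(path)
    vcs.check_out_read_only(path)

with tempfile.TemporaryDirectory() as d:
    fstab = os.path.join(d, "fstab")
    with open(fstab, "w") as f:
        f.write("/dev/sda1 / ext4 defaults 0 1\n")
    vcs = ToyVCS()
    check_in_and_restore(vcs, fstab)
    present = os.path.exists(fstab)
    writable = bool(os.stat(fstab).st_mode & stat.S_IWUSR)
```

Even with the wrapper there is a short window in which the file does not exist, so this reduces the chance of forgetting the check-out rather than removing the risk entirely.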

          1. ibmalone
            Joke

            Re: I'm missing something...

            That's assuming there wasn't an option - as per the comment on SCCS - in which case just get used to that as the normal way to do things.

            Steady on there!

        2. Anonymous Coward
          Headmaster

          Re: I'm missing something...

          I suppose I might have taken to copying the file and checking in the copy instead

          And in fact if you aren't doing that you are probably taking risks which you should not be taking, unless your VCS is very, very carefully written. For quite a significant number of files in /etc it is absolutely essential that a sane copy of the file exists all the time, and making sure that this is true is quite fiddly. As an example you need to deal with the filesystem filling as you save the file: if that happens you must leave the original in place.

          The trick to doing this right is typically: copy the file to a different name in the same directory ensuring all the permissions are right; modify this file to be correct; copy the original file again to a backup (alternative: make a hard link to it), then rename the new file to the original. This is safe because renames are atomic: they either happen or they don't, and you are not allocating space in the filesystem at the point of the rename, and nor are you increasing the number of inodes in use.

          (Someone is now going to point out I have got some part of this wrong, which I may have: the point is that it's not safe just to overwrite the file because you can end up with a partial copy.)
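
The sequence described above can be sketched in Python roughly as follows. This is one possible reading, with assumptions: it writes the new contents straight into the temporary file rather than copying and then editing, it copies permission bits but not ownership, and the `.bak` and `.new-` names are invented for the example.

```python
import os
import shutil
import tempfile

def replace_config(path, new_text):
    """Replace path with new_text so that a complete file always exists."""
    directory = os.path.dirname(os.path.abspath(path))
    # 1. Build the candidate in the same directory (same filesystem),
    #    so the final step is a rename and never a copy.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".new-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(new_text)
            tmp.flush()
            os.fsync(tmp.fileno())  # errors such as a full disk surface by here
        shutil.copymode(path, tmp_path)  # permission bits only, not ownership
        # 2. Keep the old version reachable under a backup name (hard link).
        backup = path + ".bak"
        if os.path.exists(backup):
            os.remove(backup)
        os.link(path, backup)
        # 3. Atomically swap the new file into place.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure the original file is untouched; drop the candidate.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

with tempfile.TemporaryDirectory() as d:
    cfg = os.path.join(d, "fstab")
    with open(cfg, "w") as f:
        f.write("old\n")
    replace_config(cfg, "new\n")
    with open(cfg) as f:
        current_text = f.read()
    with open(cfg + ".bak") as f:
        previous_text = f.read()
```

If writing the candidate fails, the exception propagates and the original is left exactly as it was, which is the property the comment is asking for.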

  7. Chairman of the Bored

    My worst config error?

    Been so many, but I think the worst one in terms of financial impact was dd'ing a hard drive image over a live, mission critical volume. An encrypted volume at that.

    This was my firm so I couldn't very well fire myself. Backups worked (*), but we were out many man-hours of work.

    But I was late on a deliverable and had to tell the customer it was because I had personally screwed up.

    Causative factors: impatience, overconfidence, lacking a questioning attitude. Performing a rather aggressive admin action on a production system. dd is a fairly blunt instrument, could have chosen a better tool.

    Things that went well: Having a comprehensive, tested backup. Honesty with customer and staff paid off in the long run.

    (*) Wish I had made a binary image of the boot sector and anti-forensic stripes of the encrypted volume key store though, might have been able to save some information
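
One cheap precaution before a raw write is to check whether the target device is in use. The Python sketch below parses text in the Linux /proc/mounts layout (device, mount point, type, options); the sample text, device names and function names are all invented, and the prefix match is deliberately crude. It would not catch every case, for instance a volume mounted through a mapper device.

```python
def mounted_devices(mounts_text):
    """Collect /dev/... entries from text in /proc/mounts layout."""
    devices = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith("/dev/"):
            devices.add(fields[0])
    return devices

def is_safe_target(device, mounts_text):
    """True if neither the device nor anything named after it is mounted."""
    return not any(
        m == device or m.startswith(device)
        for m in mounted_devices(mounts_text)
    )

# Hypothetical mount table for the demonstration.
sample_mounts = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sda2 /home ext4 rw,relatime 0 0
proc /proc proc rw 0 0
"""

print(is_safe_target("/dev/sda", sample_mounts))
print(is_safe_target("/dev/sdb", sample_mounts))
```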

    1. Anonymous Coward

      Re: My worst config error?

      IMHO, the OS should protect you from that by refusing to permit writes to a device that's mounted. There's no possible scenario where this could be useful.

    2. Anonymous Coward
      Linux

      Re: My worst config error?

      @Chairman of the Bored ".. dd'ing a hard drive image over a live, mission critical volume .."

      Yea, if you had stuck with the industry standard Windows, this kind of thing would never happen.

  8. Anonymous Coward

    Set the clock failed.

    To this day, I work in a PC where the clock is 15 minutes ahead. And there is a command prompt on login saying something on the lines of "user has no permission" to set the machine clock. A windows command prompt with NET TIME done with user's permissions that NEVER worked for anybody.

    1. Anonymous Coward

      Re: Set the clock failed.

      "And there is a command prompt on login saying something on the lines of "user has no permission" to set the machine clock."

      Sounds like your computer is part of an AD. The workstations get their time sync'd from the AD servers so they must have the wrong time.

      1. Danny 14

        Re: Set the clock failed.

        gpo for ntp too. you can set the ntp server to be outside your own dc, that gets fun when the two are out of sync.

        1. Trixr

          Re: Set the clock failed.

          Which is why your domain should be synced to a RELIABLE time source. And so too with any non-domain clients.

          If they're in the same network, the upstream timesource should be the same for the domain time source (the PDC Emulator) and non-domain clients. It's not rocket science.

        2. Trixr

          Re: Set the clock failed.

          If you're in a domain, why on earth would you be setting a different NTP time source on your domain clients via GPO? I can't describe how poor a practice that would be.

          (The only excuse would be if you're not using Windows NTP client at all and you're using another NTP client with better precision. In which case you should sync your DCs from the same time source).

          1. Anonymous Coward
            Linux

            Re: Set the clock failed.

            @Trixr "If you're in a domain, why on earth would you be setting a different NTP time source on your domain clients via GPO?"

            Sometimes in reading Microsoft documentation, I get the feeling I'm reading from the secret scriptures of some obscure cult, that's cult with an ‘L’ :]

  9. Anonymous Coward

    zfs snapshots

    if this was an oracle box, where are the snapshots?

    1. Stevie

      Re: zfs snapshots

      zfs? This is some new magic not available in solaris 9.

      8o)

  10. Stevie

    Bah!

    “Overconfident Sun SA”.

    Redundant phrasing, from my personal experience.

  11. Anonymous Coward

    Ah, SUN's pizza boxes

    In the very early years of the Net (think pre-URL) I was tasked with building and installing SUN based firewalls.

    Now, I personally *loved* the beautiful engineering that could be seen inside the pizza box design, but it had one b*stard of a gotcha involving a connected terminal. If you would switch it off before you disconnected, it would issue a STOP instruction to the system so it would basically be off as far as functionality is concerned.

    That's an *excellent* thing to forget when doing an install for which you have to drive 6 hours to get there, in the days when mobile phones were luxury items only given to directors which wanted to get into weightlifting. Oh, the joy of checking your email on return.

    I don't know who dreamt that up, but he must have been the one to originate the BOFH DNA.

    1. Down not across

      Re: Ah, SUN's pizza boxes

      Now, I personally *loved* the beautiful engineering that could be seen inside the pizza box design, but it had one b*stard of a gotcha involving a connected terminal. If you would switch it off before you disconnected, it would issue a STOP instruction to the system so it would basically be off as far as functionality is concerned.

      Close but no cigar. Some, not all, serial terminals effectively send BREAK when powered off. This is usually caused by a combination of the RS-232 driver and power supply causing logic low that appears as BREAK. SunOBP goes into PROM monitor on break. You can recover by typing 'go' and system should resume.

      I don't know who dreamt that up, but he must have been the one to originate the BOFH DNA.

      Hate to disappoint, but it is due to bad (cheap) design/engineering of terminals (and many terminal/console servers) and the way they have implemented RS-232. As an example Cisco 2511 would send break, whereas 26xx/36xx/28xx/38xx with NM or HWIC async cards IIRC don't. Likewise ISTR Cyclades mostly worked. Then there are some that send break when powered ON, just to be awkward.

      1. Anonymous Coward

        Re: Ah, SUN's pizza boxes

        Ah, nice to know at last the detail.

        Yes, telling a client to type "go" was the cure, but it still was rather annoying. Lesson learned, though, also because customers sometimes couldn't resist switching the screen on (I think we mainly had WYSE terminals hooked up). They didn't know that the off switch was a tad too thorough, so that would result in a support call where, naturally, nobody would admit to having taken a peek..

  12. Will Godfrey Silver badge
    Linux

    Don't test it

    Never try to test your own automation if it's marginally more than trivial. Get someone else to try it - with as little information as is reasonable. If it doesn't screw up you can be cautiously optimistic.

  13. redwine

    Wanna major problem everywhere very quickly?

    ... config management!

  14. Anonymous Coward

    We are safe in your hands?!

    My goodness, most of you lot are making it up as you go along! Call this a profession?! Bloody dangerous school boys, no wonder IT is a mess these days!
