Sysadmin’s plan to manage system config changes backfires spectacularly • The Register Forums

Monday 3rd December 2018 17:29 GMT Loyal Commenter

Re: Why use a revision control system?

If there's something broken, one would restore from tape (or at least, restore the offending file from a tape).

Hahaha. No.

The reason why this is wrong is nicely illustrated by the following hypothetical situation:

You get in at 8am and find there is some urgent configuration work to do and your client needs it all working by the end of the day. The changes aren't simple, and after making and testing several revisions, you're finally ready to go at 4:30 pm. You're just about to run your scripts, and you discover that you've accidentally deleted the folder they are in because Windows Explorer had the focus when you thought you were hitting the Delete key in a Word document (because you are documenting everything, and you are working over a laggy connection to a VM in another office). Do you:

a) Restore from a tape backup and repeat 8.5 hours of work. This will take 24 hours to retrieve the backup tape from the secure off-site storage, followed by 3 hours to verify, find and restore the file in question. Or it will do, once you have got management authorisation to make a request to have the tape retrieved. Let's hope the backup completed successfully, eh?

b) Retrieve the last good version of your script from version control and reapply the last 0.5 hours of work.

Tape backups and version control systems are different tools, for different jobs, and both have their places. I wouldn't use a git repository for database backups, and I wouldn't use tape for version control.

11 0 Reply

Monday 3rd December 2018 09:57 GMT Terje

I can think of several reasons to use revision control for config files.

It makes it very easy to set up a new unit in a specific environment while keeping tabs on what changes are made (say all computers in lab B, where you add a new computer, have the same config but those in lab A are different): you check out the correct branch (Lab B) and get all the correct configs for it.

If down the line you find some issue that is not immediately apparent, you can easily see what has changed in the config since it last worked, no matter how long ago the change was made.
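
As a rough illustration of the "see what has changed" point, the sketch below diffs two revisions of a config file with Python's standard difflib module. The file name, keys and values are made up for the example; in practice the older text would come out of whichever revision control system holds the configs.

```python
import difflib

# Hypothetical revisions: the last known-good config and the current one.
last_good = "hostname=lab-b-01\nntp_server=10.0.0.1\nlog_level=info\n"
current = "hostname=lab-b-01\nntp_server=10.0.0.9\nlog_level=debug\n"


def config_diff(old_text, new_text, name="app.conf"):
    """Return a unified diff (as a list of lines) between two revisions."""
    return list(difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=name + " (last good)",
        tofile=name + " (current)",
    ))


changes = config_diff(last_good, current)
print("".join(changes), end="")
```

Lines starting with "-" are what the last good revision had, lines starting with "+" are what is deployed now, and an empty result means nothing changed.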

15 0 Reply

Monday 3rd December 2018 10:20 GMT Saruman the White

Other screw-ups

Many, many years ago when I was young and working in my first job, I was given responsibility for managing a small network of Sun workstations (one of which was a diskless node). One day I decided I needed to clean up /tmp since it was close to being full and causing problems, so I logged on as "root" and entered the command "rm -rf / tmp". Note the significant space!

Control-C followed after about 2 seconds, but there was not enough of the system left to be usable (although I was able to dump the data directories to tape prior to a full re-install).
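
To show why that one space matters, here is a small sketch using Python's shlex module, which splits a command line into words roughly the way a POSIX shell does. Nothing is executed or deleted; the helper name `targets` is invented for the example.

```python
import shlex


def targets(command_line):
    """Return the non-option arguments the command would receive."""
    words = shlex.split(command_line)
    return [w for w in words[1:] if not w.startswith("-")]


intended = targets("rm -rf /tmp")
typed = targets("rm -rf / tmp")

print(intended)  # ['/tmp']
print(typed)     # ['/', 'tmp']
```

With the stray space, rm is handed two separate targets: the root directory and a relative path called tmp.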

29 0 Reply

Monday 3rd December 2018 10:30 GMT Waseem Alkurdi

Re: Other screw-ups

Safe aliases for 'rm' are a good thing to prevent this!

15 3 Reply
1. Monday 3rd December 2018 11:12 GMT Chairman of the Bored
  
  Re: Other screw-ups
  
  @Waseem, aye! Excellent point. Another trick is to have some zero length files in all directories you care about called '-i'
  
  If someone is running a rampaging rm -rf, having defeated the safe alias because of {reasons}, this may force rm back into interactive mode.
  
  12 2 Reply
2. Monday 3rd December 2018 19:05 GMT CAPS LOCK
  
  "Safe aliases for 'rm' are a good thing to prevent this!"
  
  Testify! After, shall I say, bitter experience, I learned the wisdom of creating a 'del' command with 'mv'. I haven't created an alternate version of 'rm' just in case I need it to function as standard.
  
  8 0 Reply
3. Monday 3rd December 2018 22:14 GMT Nick Kew
  
  Re: Other screw-ups
  
  Safe aliases for 'rm' are a good thing to prevent this!
  
  Aliases for standard system commands are pure evil. They bugger up expectations, both for those who know the standard commands and may react unpredictably to unexpected behaviour, and for those who come new to the aliases and are then surprised by the real thing.
  
  If you want an "rm" you consider safe, use something else for the alias. "del", for instance.
  
  7 0 Reply
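
The 'del' idea from the comments above (a separate command that moves things aside rather than an alias that changes what 'rm' does) might look something like this minimal sketch. The function name `del_`, the trash directory and the timestamp suffix are all assumptions made for illustration, not anyone's actual script.

```python
import os
import shutil
import time


def del_(path, trash_dir):
    """Move `path` into `trash_dir` instead of deleting it; return the new location."""
    os.makedirs(trash_dir, exist_ok=True)
    base = os.path.basename(os.path.normpath(path))
    # Timestamp suffix so deleting the same name twice does not collide.
    dest = os.path.join(trash_dir, f"{base}.{time.time_ns()}")
    shutil.move(path, dest)
    return dest
```

Calling `del_("notes.txt", os.path.expanduser("~/.trash"))` would leave the file recoverable under the trash directory, while the real 'rm' keeps its standard behaviour for the times it is actually wanted.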
Tuesday 4th December 2018 10:55 GMT Danny 2

Re: Other screw-ups

One of my most embarrassing mistakes came while I was software testing on Solaris at a Cisco-kid.

One young tester was distractingly chatting away while typing about how some idiot at his university had rm -rf'ed and ruined his project. And as he was telling the anecdote the young tester rm -rf'ed his own work, and admitted it.

I was trying to tune out his monologue but thought, "What a bloody idiot". And then I rm -rf'ed my own system.

Warnings are like ear-worms, they sink into your subconscious. When you are not paying full attention and you hear the last thing you want to do, it becomes the thing you do next.

1 0 Reply

Monday 3rd December 2018 10:32 GMT Rich 11

When was the last time your best-laid plans went very awry?

Gallipoli. I really should have made sure the landing craft were loaded into the cargo holds last.

17 0 Reply

Monday 3rd December 2018 10:51 GMT MacroRodent

SCCS hits you

The version control system must have been SCCS, which was for years the standard tool for this on Unix. It has this weird default of removing the edited copy of the file when you check in the changes. There is an option to immediately check out the read-only copy, but it is not the default behaviour.
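
A guard against that default could be sketched as below. This is only an illustration under stated assumptions: `checkin` and `checkout_readonly` are hypothetical caller-supplied functions wrapping whatever the real revision control commands are (with SCCS, presumably a delta followed by a get), and the wrapper simply refuses to return quietly if the working file has vanished.

```python
import os


def checkin_keeping_file(path, checkin, checkout_readonly):
    """Check in `path`, then make sure a copy of it still exists on disk.

    `checkin` and `checkout_readonly` are hypothetical callables that wrap
    the real revision control commands; each takes the file path.
    """
    checkin(path)
    if not os.path.exists(path):
        # The check-in removed the working copy: fetch a read-only one back.
        checkout_readonly(path)
    if not os.path.exists(path):
        raise RuntimeError(f"{path} is missing after check-in; do not reboot yet")
    return path
```

The point of the final check is that a missing system file gets noticed at check-in time rather than at the next boot.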

22 0 Reply

Monday 3rd December 2018 11:35 GMT ibmalone

Re: SCCS hits you

Thanks, I was puzzling over why not checking out a read-only version by itself could have caused this.

13 0 Reply
1. Tuesday 4th December 2018 09:00 GMT Peter Gathercole
  
  Re: SCCS hits you
  
  The problem (or maybe it's a strength) with SCCS is that you have embedded tags that are expanded, normally with dates, versions etc. as the file is checked out readonly. With SCCS, they are surrounded by % or some such. (RCS does use similar but incompatible tags, I'm not sure about other systems).
  
  The problem is that in some cases these tags can mean something to other tools, which may also expect to use % as a special character, in which case deploying an un-checked-in copy may cause undesirable effects.
  
  Of course, one solution to this is to use it with "make", which would allow you to perform additional processing around the versioning system. I'm not sure I remember how I did it, but I'm pretty certain when I used make and SCCS in anger, I had a method where I could spot that it was not checked in. Make is slightly aware of SCCS.
  
  But of course, you can't meaningfully compare SCCS with modern tools. I'm sure it wasn't the first versioning system around, but it must have been one of the earliest, dating back to the early 1970s. It was not meant to work with vast software development projects with many people working on them, but for its time, it did a pretty good job (Bell Labs used it to develop UNIX).
  
  Each iteration of version control since, like CVS, RCS, arch, Subversion, Git et al., has expanded on the functionality, meaning that as the grandaddy of them all, SCCS cannot come out favorably in any comparison.
  
  But I still use it on occasion, as it is normally installed on AIX, even when nothing else is.
  
  1 0 Reply
  1. Tuesday 4th December 2018 10:34 GMT MacroRodent
    
    Re: SCCS hits you
    
    Tag expansion also happens in RCS, CVS and Subversion (in the latter it has to be enabled in the properties of the file). The difference is that the tag trigger notation in these ($Id: ...$ and some others) stays in the file, whereas in SCCS the magic strings expand to version numbers without the triggering character sequence.
    
    Git lost this feature, because it is seriously contrary to its idea of identifying versions with a hash of the file contents. Expanding a version tag would make the file be of a different version in the eyes of Git. A loss, because the embedded file version numbers have often saved my sanity by allowing a compiled program to identify what file versions it has been put together from.
    
    2 0 Reply
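
The clash between keyword expansion and content hashing can be seen in a few lines. This sketch computes the object ID Git gives a file's contents in its SHA-1 object format (the hash of a "blob <length>" header, a NUL byte, then the bytes themselves); the source line and the expanded $Id$ string are hypothetical examples written in the RCS/CVS style.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object ID Git assigns to a file with these contents (SHA-1 format)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# What is stored, versus a hypothetical RCS-style expansion of the same line.
unexpanded = b'static const char rcsid[] = "$Id$";\n'
expanded = (b'static const char rcsid[] = '
            b'"$Id: main.c,v 1.7 2018/12/04 10:34:00 user Exp $";\n')

print(git_blob_id(unexpanded))
print(git_blob_id(expanded))
```

The two IDs differ, so a working file with the tag expanded would look to Git like a different version from the one it stored.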
Monday 3rd December 2018 14:09 GMT cdegroot

Nothing new...

My thought as well. I've used CVS for the same in the '90s, and it worked quite well - I hated the guts of SCCS and always tried to stick with RCS instead which didn't have the anal locking that SCCS sported.

I've never gotten around to unleashing git on /etc/ though (although my "dotfiles" are there and it's very nice). There's enough stuff in there to make it maybe worth a try, although these days Chef/Puppet/Ansible/Salt/... are probably more appropriate.

3 0 Reply

Monday 3rd December 2018 10:52 GMT Chairman of the Bored

Ok, we need some beer over here!

Two pints:

One for the OP to have the courage to admit the mistake, and the second for his management to have the wisdom to chalk this up to a learning experience

Cheers!

21 0 Reply

Monday 3rd December 2018 10:52 GMT ibmalone

I'm missing something...

Why was a writeable fstab so fatal? Having it non-root writeable isn't good, but I wouldn't expect a writeable fstab to get wiped on boot on a modern system (every Linux I've seen has had it 644). Something different about Sun?

4 0 Reply

Monday 3rd December 2018 11:40 GMT Anonymous Coward

Re: I'm missing something...

I don't know the system but I presumed once checked in the file was locked or removed until it was checked out RO. Probably just a quirk of the vcs.

6 0 Reply
Monday 3rd December 2018 13:06 GMT OldCrow

Re: I'm missing something...

One of those older version-control systems that imitated a physical pile of cards. A check-in removes the file from your disk.

I'm sure it had SOME kind of logical reason for doing that beyond trying to imitate carbon-copy shifting, but I wouldn't know what the reason is.

6 0 Reply
Monday 3rd December 2018 13:08 GMT Doctor Syntax

Re: I'm missing something...

"Why was a writeable fstab so fatal?"

I think what you've missed was that the revision control removed the file when checking in. That's why it had to be checked out again.

Checking out read only would be a side issue. It would mean that the revision control system wouldn't have the version locked and it would also mean that the running version couldn't get edited to a state inconsistent with the version the revision control system had marked current.

4 0 Reply
1. Monday 3rd December 2018 13:58 GMT ibmalone
  
  Re: I'm missing something...
  
  I think what you've missed was that the revision control removed the file when checking in. That's why it had to be checked out again.
  
  Thanks, yes, looks like another commenter has fingered SCCS as the culprit. Never met a VCS that does that, but I'm sure it made sense to somebody at the time o.0
  
  Knowing that makes the whole thing seem a lot more rickety. I suppose I might have taken to copying the file and checking in the copy instead, but there's only one way to learn that kind of paranoia...
  
  3 0 Reply
  1. Tuesday 4th December 2018 01:28 GMT Doctor Syntax
    
    Re: I'm missing something...
    
    " I suppose I might have taken to copying the file and checking in the copy instead"
    
    I might have written a script that did the check-in/check-out as a single command. That's assuming there wasn't an option - as per the comment on SCCS - in which case just get used to that as the normal way to do things.
    
    2 0 Reply
    1. Tuesday 4th December 2018 08:35 GMT ibmalone
      
      Re: I'm missing something...
      
      That's assuming there wasn't an option - as per the comment on SCCS - in which case just get used to that as the normal way to do things.
      
      Steady on there!
      
      1 0 Reply
  2. Tuesday 4th December 2018 18:35 GMT Anonymous Coward
    
    Re: I'm missing something...
    
    I suppose I might have taken to copying the file and checking in the copy instead
    
    And in fact if you aren't doing that you are probably taking risks which you should not be taking, unless your VCS is very, very carefully written. For quite a significant number of files in /etc it is absolutely essential that a sane copy of the file exists all the time, and making sure that this is true is quite fiddly. As an example you need to deal with the filesystem filling as you save the file: if that happens you must leave the original in place.
    
    The trick to doing this right is typically: copy the file to a different name in the same directory ensuring all the permissions are right; modify this file to be correct; copy the original file again to a backup (alternative: make a hard link to it), then rename the new file to the original. This is safe because renames are atomic: they either happen or they don't, and you are not allocating space in the filesystem at the point of the rename, and nor are you increasing the number of inodes in use.
    
    (Someone is now going to point out I have got some part of this wrong, which I may have: the point is that it's not safe just to overwrite the file because you can end up with a partial copy.)
    
    1 0 Reply
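
The write-aside-then-rename trick described above can be sketched as follows. This is a minimal illustration, not a hardened tool: it writes the new contents straight into the temporary file rather than copying and then editing, it copies permission bits but not ownership (which would need root), and the ".bak" and ".tmp-" names are assumptions for the example.

```python
import os
import shutil
import tempfile


def replace_config_atomically(path, new_text):
    """Swap new contents into `path` without ever leaving a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file in the same directory, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(new_text)
            tmp.flush()
            os.fsync(tmp.fileno())  # a full disk fails here; the original is untouched
        shutil.copymode(path, tmp_path)  # match the original's permission bits
        backup = path + ".bak"
        if os.path.exists(backup):
            os.remove(backup)
        os.link(path, backup)  # hard link keeps the old contents reachable
        os.replace(tmp_path, path)  # the atomic step: a rename, no new space needed
    finally:
        # Only true if something failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

If any step before the rename raises (for instance the disk filling during the write), the original file is still in place and only the temporary file is cleaned up.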

Monday 3rd December 2018 11:06 GMT Chairman of the Bored

My worst config error?

Been so many, but I think the worst one in terms of financial impact was dd'ing a hard drive image over a live, mission critical volume. An encrypted volume at that.

This was my firm so I couldn't very well fire myself. Backups worked (*), but we were out many man-hours of work.

But I was late on a deliverable and had to tell the customer it was because I had personally screwed up.

Causative factors: impatience, overconfidence, lacking a questioning attitude. Performing a rather aggressive admin action on a production system. dd is a fairly blunt instrument, could have chosen a better tool.

Things that went well: Having a comprehensive, tested backup. Honesty with customer and staff paid off in the long run.

(*) Wish I had made a binary image of the boot sector and anti-forensic stripes of the encrypted volume key store though, might have been able to save some information

21 0 Reply

Monday 3rd December 2018 20:02 GMT Anonymous Coward

Re: My worst config error?

IMHO, the OS should protect you from that by refusing to permit writes to a device that's mounted. There's no possible scenario where this could be useful.
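
Short of the OS enforcing it, a wrapper script can at least check before writing. The sketch below is an assumption-laden illustration for Linux only: it parses text in the format of /proc/mounts, treats a disk as in use if it or one of its numbered partitions is a mount source, and the function names and sample table are invented. It does not resolve symlinks or device-mapper layers (so an encrypted volume mounted via /dev/mapper would slip past it), and it errs on the side of refusing.

```python
import re

# Made-up mount table in /proc/mounts format, used here instead of the real file.
SAMPLE_MOUNTS = (
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
    "proc /proc proc rw 0 0\n"
    "/dev/sdb2 /data ext4 rw 0 0\n"
)


def mounted_sources(mounts_text):
    """Device paths that appear as mount sources in /proc/mounts-style text."""
    sources = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("/dev/"):
            sources.add(fields[0])
    return sources


def is_in_use(target, mounts_text):
    """True if `target` or one of its partitions (sdb1, nvme0n1p2, ...) is mounted."""
    partition = re.compile(re.escape(target) + r"p?\d+$")
    return any(src == target or partition.match(src)
               for src in mounted_sources(mounts_text))


def guard_write(target, mounts_text):
    """Raise instead of letting a raw write go ahead against a mounted device."""
    if is_in_use(target, mounts_text):
        raise PermissionError(f"refusing to write to {target}: it is mounted")
```

On a real system the second argument would be the contents of /proc/mounts, read immediately before invoking dd or a similar tool.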

5 0 Reply
Tuesday 4th December 2018 02:30 GMT Anonymous Coward

Re: My worst config error?

@Chairman of the Bored ".. dd'ing a hard drive image over a live, mission critical volume .."

Yea, if you had stuck with the industry standard Windows, this kind of thing would never happen.

0 5 Reply

Monday 3rd December 2018 13:28 GMT Anonymous Coward

Set the clock failed.

To this day, I work on a PC where the clock is 15 minutes ahead. And there is a command prompt on login saying something along the lines of "user has no permission" to set the machine clock. A Windows command prompt with NET TIME done with user's permissions that NEVER worked for anybody.

4 0 Reply

Monday 3rd December 2018 15:27 GMT Anonymous Coward

Re: Set the clock failed.

"And there is a command prompt on login saying something on the lines of "user has no permission" to set the machine clock."

Sounds like your computer is part of an AD. The workstations get their time sync'd from the AD servers so they must have the wrong time.

2 0 Reply
1. Monday 3rd December 2018 18:22 GMT Danny 14
  
  Re: Set the clock failed.
  
  GPO for NTP too. You can set the NTP server to be outside your own DC; that gets fun when the two are out of sync.
  
  1 0 Reply
  1. Tuesday 4th December 2018 02:28 GMT Trixr
    
    Re: Set the clock failed.
    
    Which is why your domain should be synced to a RELIABLE time source. And so too with any non-domain clients.
    
    If they're in the same network, the upstream timesource should be the same for the domain time source (the PDC Emulator) and non-domain clients. It's not rocket science.
    
    2 0 Reply
  2. Tuesday 4th December 2018 02:28 GMT Trixr
    
    Re: Set the clock failed.
    
    If you're in a domain, why on earth would you be setting a different NTP time source on your domain clients via GPO? I can't describe how poor a practice that would be.
    
    (The only excuse would be if you're not using Windows NTP client at all and you're using another NTP client with better precision. In which case you should sync your DCs from the same time source).
    
    2 0 Reply
    1. Tuesday 4th December 2018 02:45 GMT Anonymous Coward
      
      Re: Set the clock failed.
      
      @Trixr "If you're in a domain, why on earth would you be setting a different NTP time source on your domain clients via GPO?"
      
      Sometimes in reading Microsoft documentation, I get the feeling I'm reading from the secret scriptures of some obscure cult, that's cult with an ‘L’ :]
      
      3 0 Reply

Monday 3rd December 2018 13:31 GMT Anonymous Coward

zfs snapshots

if this was an oracle box, where are the snapshots?

0 0 Reply

Monday 3rd December 2018 14:29 GMT Stevie

Re: zfs snapshots

zfs? This is some new magic not available in solaris 9.

8o)

6 0 Reply

Monday 3rd December 2018 14:32 GMT Stevie

Bah!

“Overconfident Sun SA”.

Redundant phrasing, from my personal experience.

2 0 Reply

Monday 3rd December 2018 15:56 GMT Anonymous Coward

Ah, SUN's pizza boxes

In the very early years of the Net (think pre-URL) I was tasked with building and installing SUN based firewalls.

Now, I personally *loved* the beautiful engineering that could be seen inside the pizza box design, but it had one b*stard of a gotcha involving a connected terminal. If you would switch it off before you disconnected, it would issue a STOP instruction to the system so it would basically be off as far as functionality is concerned.

That's an *excellent* thing to forget when doing an install for which you have to drive 6 hours to get there, in the days when mobile phones were luxury items only given to directors who wanted to get into weightlifting. Oh, the joy of checking your email on return.

I don't know who dreamt that up, but he must have been the one to originate the BOFH DNA.

4 0 Reply

Tuesday 4th December 2018 00:21 GMT Down not across

Re: Ah, SUN's pizza boxes

Now, I personally *loved* the beautiful engineering that could be seen inside the pizza box design, but it had one b*stard of a gotcha involving a connected terminal. If you would switch it off before you disconnected, it would issue a STOP instruction to the system so it would basically be off as far as functionality is concerned.

Close but no cigar. Some, not all, serial terminals effectively send BREAK when powered off. This is usually caused by the combination of the RS-232 driver and power supply causing a logic low that appears as BREAK. Sun OBP goes into the PROM monitor on BREAK. You can recover by typing 'go' and the system should resume.

I don't know who dreamt that up, but he must have been the one to originate the BOFH DNA.

Hate to disappoint, but it is due to bad (cheap) design/engineering of terminals (and many terminal/console servers) and the way they have implemented RS-232. As an example, the Cisco 2511 would send break, whereas 26xx/36xx/28xx/38xx with NM or HWIC async cards IIRC don't. Likewise ISTR Cyclades mostly worked. Then there are some that send break when powered ON, just to be awkward.

3 0 Reply
1. Wednesday 5th December 2018 11:13 GMT Anonymous Coward
  
  Re: Ah, SUN's pizza boxes
  
  Ah, nice to know at last the detail.
  
  Yes, telling a client to type "go" was the cure, but it still was rather annoying. Lesson learned, though, also because customers sometimes couldn't resist switching the screen on (I think we mainly had WYSE terminals hooked up). They didn't know that the off switch was a tad too thorough, so that would result in a support call where, naturally, nobody would admit to having taken a peek..
  
  1 0 Reply

Monday 3rd December 2018 17:12 GMT Will Godfrey

Don't test it

Never try to test your own automation if it's marginally more than trivial. Get someone else to try it - with as little information as is reasonable. If it doesn't screw up you can be cautiously optimistic.

9 0 Reply

Tuesday 4th December 2018 14:07 GMT redwine

Wanna major problem everywhere very quickly?

... config management!

0 0 Reply

Tuesday 4th December 2018 23:11 GMT Anonymous Coward

We are safe in your hands?!

My goodness, most of you lot are making it up as you go along! Call this a profession?! Bloody dangerous school boys, no wonder IT is a mess these days!

1 0 Reply
