Everyone I speak to about system security seems to panic about malware, cloud failure, system crashes and bad patches. But the biggest threat isn't good or bad code, or systems that may or may not fail. It's people. What we call Liveware errors range from the mundane to the catastrophic, and they happen all the time, at all levels …
The documentation should point you to tested scripts or some other form of automation that makes the mundane repeatable, without the risk of fat fingers messing things up.
And this is also why monolithic, non-patchable/scriptable/integrable "my way or the highway and here is a support contract" so-called "applications" are the devil's work.
Lego bricks is the only way!
I was once put in charge of creating all the "How To Guide" manuals for the company I worked for at the time. I thought it a serious pain in the arse, but I was being paid by the hour so sat down to start writing. It took months to go through everything we did, how we did it, & create step by step, "If you do this instead then you'll get this", complete guides for everything. Including such mundane tasks as solving printer issues, email issues, "I can't connect to my network share!", and other such problems.
I was later told by a coworker that the guides were the biggest boost to productivity they had seen in years, because the staff could grab the appropriate binder, flip open to the right page, & fix the problem themselves in less time than it took to ring up the Help Desk, explain what was wrong, & listen to them walk you through their script... The same scripts that I had included in the guide so the folks didn't HAVE to call the Help Desk unless it failed.
About a month after I left the company I learned from a different (now Ex) coworker that Manglement had axed the guides I had so painstakingly created. Why? Because it made the Help Desk folks look bad "because they had nothing to do".
I still shake my head in disbelief at the stupidity of Manglement & their inability to figure out that the immediate drop in office productivity exactly matched the increase of Help Desk call volumes. Strange coincidence, no?
So I concur that the procedural guides can be a boon. They can help the normal folks to do their own troubleshooting BEFORE having to call up the Help Desk. If you've already wiggled all the cables, checked all the settings, & a reboot hasn't done the trick, you can tell the Help Desk person to skip past those steps in the script to save time. "I've already done steps one through seven to no avail. What's next?" tends to derail them, but it beats having to sit there & pretend to be doing what they ask while drumming your fingers on the desk waiting for them to catch up.
Guides should also say why things are done this way and the risks involved if they aren't, especially if regulatory or legal requirements are involved. Not only does it mean the readers have a better understanding of what they're supposed to be doing, it enables a review if circumstances change. It also pre-empts manglement's bright ideas - and is evidence to deflect the inevitable shit-storm when they ignore it.
"So I concur that the procedural guides can be a boon. They can help the normal folks to do their own troubleshooting BEFORE having to call up the Help Desk."
If you're working with intelligent people there's a lot to be gained by teaching them to be self-sufficient, both in terms of saving you time fixing routine problems, and in terms of making them feel more in control of their work environment.
And, in a perfect world, they quickly become the de facto Help Desk for everyone sitting within two cubicles of them!
Procedure guides. They are nice when they are up to date.
I once had to be rotated temporarily into a different unit to cover for staff being on maternity leave/quitting/being seconded elsewhere.
"But it's OK", I was told, "just follow these signed off Standard Operating Procedures"!
So I do. Until it turns out one is now out of date due to some system change and actually following it leads to silent data quality errors in a (random natch) small percentage of records.
Which of course was my fault, as I was the one who pressed the button, and obviously I should have known better than to follow that particular SOP.
".....I still shake my head in disbelief at the stupidity of Manglement & their inability to figure out that the immediate drop in office productivity exactly matched the increase of Help Desk call volumes....."

I had a similar experience, but the good work was undone by crafty consultants pulling the wool over the eyes of duh manijment. A colleague and I wrote up the procedures library over the course of a year, and it was much welcomed by the staff, reducing helpdesk calls and freeing up the IT staff's time for other work.

Then a well-known UK consultancy outfit sailed in and offered to provide a "one-stop shop" for support with an offsite (as in waaaaay offsite, in Bangalore) helpdesk, centralized remote builds, etc., at a bargain price. Our internal IT team was gutted to fund the deal.

The first thing the consultants did when they got the contract was delete all the help files from the desktop and server builds and remove access to the process library we had written. Now staff had to call their helpdesk for even the simplest of issues. The consultants' justification was that the staff were hired to do their jobs, not IT work, which sounded good to manglement. But the real reason was that the helpdesk contract had a threshold for call volume: removing all the help files and our process library pushed the volume of calls over the threshold and meant additional charges, eventually making the service cost almost twice what the old in-house IT team had.
Having more feeling of control is extremely important as I have found providing off-site IT services for numerous customers. Just the mere act of power-cycling a modem and firewall is often enough to not only reduce the calls but to make the customer feel like they are less dependent upon you.
I have heard in the past "I just didn't want to bother you" or some similar sentiment, but what is really being said is "I don't want to be forced to call you to free me from the shackles of technology every time some 'little' thing goes wrong." Some customers will feel that they are being held hostage, at the mercy of some outside contact with the keys to the kingdom, knowing it is an 80/20 gamble whether you answer right away or they have to wait 10 or 15 minutes for a return phone call -- when a simple reboot would have been enough to resolve the issue.
Really. Something as simple as "reboot the computer" not only empowers the customer or user to resolve many issues themselves, it also lessens the frustration of having a critical call to return, or of diverting from another job, only to find the solution was as simple as rebooting. Otherwise you end up with one customer or user waiting for you to return to them, and another who has had to wait for you.
Amazingly, a simple document with these lines is like gold:
"Problem: QuickBooks won't open
Error: QuickBooks cannot find the data file, or similar
Resolution: Check on Q: drive by clicking START then 'Computer.' If Q: drive is not present, restart the computer and try again. If Q: drive is present, please note if a red X is present on the drive before proceeding, then double-click on the drive. If the Q: drive opens and you can see files, close the window and open QuickBooks again.
If the error given is different than above, or any given step results in another error, please call xxxxxxxx."
Pictures help, too.
Of course, you will always have a user who just does not want to troubleshoot. Really, that is fine, too, as their job has other things on which to focus, and keeping those users to a small percentage makes life happier for all involved.
While customers like to know they can depend on you, most do not like being dependent upon you.
"If you're working with intelligent people there's a lot to be gained by teaching them to be self-sufficient . . ."
Ah, what a fine world that would be.
I completely agree that process guides save so much time, so it really does surprise me that no-one - especially management - seems to grasp the concept of creating and maintaining them. Yes, there does need to be someone designated as the owner/creator/updater for them, and it can be a chore if your library is large, but if the guides are well maintained and written clearly it is definitely time well spent.
Of course, the problem then is when someone comes along later and skims through one of the documents instead of actually reading it and misses out a critical step. You know, like skipping one of the numbered bullet points you've put in to make the process steps easy to follow.
Personally, I always try to read through a guide I've never seen before at least twice just so I can get my head around it before I even attempt any of the instructions contained in it.
"If you're working with intelligent people there's a lot to be gained by teaching them to be self-sufficient . . ."
I agree, except I deal with research scientists: so intelligent it's scary, but no common sense whatsoever.
I had one who thought that because IPA doesn't contain water, it was totally OK to "sanitise" a two-day-old laptop, and couldn't understand why, after cleaning the chiclet laptop keyboard, all the keys were blank... I gave them a sheet of A4 with the alphabet and the numbers 0 to 9 on it, a pair of scissors and a Pritt Stick.
"No, I can't call that in to Dell for repair... you on drugs? That's not accidental damage, that's user retardation!"
Another blinder is our on-market support scientist - the guy should be at NASA. His record is a week with a new laptop before he bricked the LAN port (snapping the connectors inside the port... the mind boggles!). Then, a week after the backplane was replaced, he managed an Ollie 720 and landed the laptop on its lid, powered on, onto the hard lab floor... He thought he had put it on his lab bench properly, but had only put a third of the laptop on it and walked away... BANG! WTF!
they brighten up my system support job in a biomedical research company daily with their clear lack of common sense. :D
"The information must have been important as he kept that disk for years – just in case."
But not important enough to have two physically separate copies, it seems.
Back when I was working as a contractor at IBM, watching others (and ourselves) train up overseas replacements, we called it "OUTSOURCEware".
Because management would literally terminate the knowledgeable people's contracts when 60-70% of the knowledge transfer had been done, instead of when the outsourced replacements were actually functionally competent.
Cue the accidental destruction of mission critical telco (main national Oz carrier) database. More than once (different databases though fortunately). The shit show in penalty costs for them would have nuked all savings.
But that's IBM for you, where actual competence in staff is optimised out.
Especially if the costs can be pinned on someone else's department.
We actually called it "meatware" when I was at Berkeley and Stanford. One of the professors suggested that that name wasn't conducive to the reality of funding. So we butted heads over pizza & beer and came up with "wetware". Probably 1979.
I have been very fortunate to have only been bitten twice with mis-clicking errors of a monumental scale:
The first involved an outdated backup procedure we ran twice a week. A manual process between two machines. Take a database snapshot of the production database, verify it, copy it to the shared storage. Switch to the backup database server (stupidly on the same KVM). At this point I got called away to deal with a fault.
Upon return I merrily followed the next step - to type those famous words "Drop Database". Just before hitting enter I saw the desktop wallpaper. Someone had switched the KVM back to the production server while I was away! This has obviously bitten someone before as the backup has plain blue wallpaper, whereas the production server has bright red wallpaper with pictures of bombs on it! Somewhere in the region of 35,000 asset records saved. Luckily it was only a short(!) time before the obsolete, slow, clunky dust-puppy nests were replaced with a new pair of servers which could be driven by the automated nightly backup system.
NAS with a tabbed web management interface.
The backup NAS I'm cloning TO decides to go away, leaving the main production NAS tab on top.
Click clone, yes I'm sure, yes this will erase the target, yes I know that's what I'm trying to do....
.... ooops ....
Oh god yes. The number of times I've mistyped something critical is fairly low.
The number of times I've deliberately killed something critical because I thought I was looking at something else? That is definitely an embarrassingly higher number.
Protecting users from themselves is a lofty goal, but the most important user to protect is YOU.
Visual cues are a valuable help - different wallpapers, different coloured terminals, a change in text colour when you log on as superuser ... anything to say think twice.
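One way to wire those visual cues into the shell itself is a colour-coded prompt. This is only a sketch for a `.bashrc`: the "prod" hostname prefix is an invented convention, so substitute whatever naming scheme your site actually uses.

```shell
# Pick a prompt colour from who you are and where you are.
# Red = danger (root anywhere, or any "prod*" box); green = safe.
prompt_for() {
    # $1 = numeric uid, $2 = hostname; prints a colour-coded PS1 string
    RED='\[\e[1;31m\]'; GREEN='\[\e[1;32m\]'; RESET='\[\e[0m\]'
    if [ "$1" -eq 0 ]; then
        printf '%s' "${RED}\u@\h:\w# ${RESET}"     # superuser: always red
    elif [ "${2#prod}" != "$2" ]; then
        printf '%s' "${RED}\u@\h:\w\$ ${RESET}"    # production box: red
    else
        printf '%s' "${GREEN}\u@\h:\w\$ ${RESET}"  # anywhere else: green
    fi
}

PS1="$(prompt_for "$(id -u)" "$(hostname)")"
```

The point is not the particular colours but that the "think twice" signal appears in every single command line, not just the wallpaper you may never see over a full-screen terminal.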
How about the following - bank submits many network user account delete requests daily. Requests submitted in a common format where the requester name is formatted in the exact same way as the deletee name in the request.....
Yes - I learned that one a long time ago (NT 3.5?) - always have a red background on the production system.
This one *did* catch me out.
We had a system outage on the automation system. I work in television, so the system failing to run a programme - especially a Soap Opera - gets the Points of View mailbag bulging. Needless to say we went into manual fairly quickly once it became clear the fault affected both main and backup transmission systems and ran the programme from tape.
I was still dealing with some of the fallout, making sure other channels were not going to suffer a similar fate, when I got someone who should know better demanding to know what happened. It was obvious he wasn't going to leave Mission Control until he had an answer so, a little flustered, I went to the automation logs and opened the verbose logfile (which goes down to keystroke granularity). Unfortunately, I missed out a crucial step - copy it to an offline terminal first. On the offline terminal we have a tool which allows you to open the file without bringing the machine to a juddering halt.
In my desire to get rid of this person, I accidentally double-clicked the file. Cue the server attempting to open a 3-gig logfile in NOTEPAD. I couldn't even get Task Manager open to try and kill it. A few calls on talkback to warn everyone we were about to fall off the air, and I could almost hear my P45 coming out of the printer in HR.
In the Incident Report I put it down to human error and held my hand up, as it would be pointless to mount a full investigation and waste a day, just to find the obvious. Cue some 'suits' descending and tearing a strip off me in front of my team. "How could a Senior be so stupid?!" etc.
Needless to say MY boss was none too pleased and had "a quiet word" with them, along the lines that he would deactivate their passes for Mission Control if they did that again, as it was unprofessional on their part. He also explained that had they bothered to listen, I had already changed the system so that if you tried to open a log on the live server you got a dialog box instead, telling you that you couldn't.
A simple fix for the memory-intensive text file in Notepad:
Change the default program for opening .txt and .log files to Large Text File Viewer.
You protect the machine from blowing its memory, and you can still 'right click > open with' if Notepad is needed, but you have to think about it. Better still, Notepad++ for editing.
I think Notepad++ struggles with files of even a few hundred megabytes.
OTOH I think it now has a "tail" mode i.e. when the file grows on disk, its view in NPP is updated.
Alternative suggestion though - have your routine editor be one that quickly fails out, SAFELY, on oversized files. MS-DOS EDIT or EDLIN may qualify, may not.
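For anyone dealing with the same problem on a unix-like box, the standard tools already stream rather than load the whole file. A stand-in log is generated here so the commands have something to chew on; on a real box you'd point them at the actual logfile.

```shell
# Generate a throwaway stand-in for the 3-gig log.
log=$(mktemp)
seq 1 100000 > "$log"

head -n 5 "$log"     # first five lines, without reading the rest
tail -n 5 "$log"     # last five lines, ditto
# tail -f "$log"     # follow the file as it grows (like NPP's tail mode)
# less "$log"        # full pager, reads on demand; press q to quit

rm -f "$log"
```

None of these will bring the machine to a juddering halt, because none of them tries to hold the whole file in memory at once.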
backup everything, preferably frequently and automatically.
manglement always disputes the cost of a backup system until they actually need it.
No amount of money thrown at the problem AFTER you've lost the data will bring it back quickly.
If there are scratch disks on individual systems you can guarantee that despite any amount of warnings these are not backed up, someone _will_ put critical data on them and then demand they be restored when the drive goes toes up. This happens regularly where I work.
Background: the scratch disks were supposed to be NFS cache disks for a fairly slow NFS server, but rhel cachefilesd didn't work, so manglement decided in their infinite wisdom that they should be put to use as scratch. Bad BAD BAD idea - what has been done is hard to undo, even when the NFS server is now significantly faster than local spinny disks.
Ah, it's not just about cost, I just wish manglement would listen and show some common sense sometime...
Even after paying an agency to come up with a 'Disaster Management Strategy' which mentioned things like 'maintaining offsite backups', manglement did sweet Fanny Adams to implement them.
So, off my own bat, I did, told a.n.other where the offsite backups were held, and an unofficial system was thus in place - and I was a lot happier.
Fast forward about 20 months: manglement find out about my unofficial offsite backups, thanks to the brown stuff hitting the rotatey thing when a couple of important files go amiss. They're not on the normal backups for some reason, so I go and get the offsite backups (full dumps and incremental changes over the 20 months), restore the files - then get it in the neck for actually performing said backups in the first place. Despite saving their arses by recovering these bloody files, deleted 'by accident' by a soon-to-be-ex member of staff (admin rights really should be pulled well before end-of-contract).
You can't win..
Most of my fsckups have been with backups
A RAID system where the management console numbers the drives physically and an OS that numbers them logically. Physical drive 0 fails, OS boots from drive 1. OS asks if you want to rebuild the 2nd drive from the first?
I'll write a little shell script if I'm doing something that might be risky, like deletion or configuration changes that can't easily be undone. Add in something that makes you hit a key to continue after showing the state of things for multi-step processes.
This lets you test things by making the risky statements print the command line they would execute rather than executing it, for a dry run - important if you are using variables or loops, to ensure what you expect to happen actually happens.
The extra time required to write the script forces you to figure out exactly what it is you're trying to do, preventing the fat finger or 'in too much of a hurry' type of errors.
Obviously writing a script is a bit much if you are just going to delete one directory, but it is still a good idea to replace 'rm' with 'ls' and try that first, just to confirm what you are deleting. If you are using 'rm -rf' and expecting to delete a couple dozen files and see screen after screen scrolling by you'll be saved from a potentially costly mistake.
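A minimal sketch of that pattern - the paths are made-up examples. Set DRY_RUN=1 to print each risky command instead of running it; unset, it shows the command and waits for a keypress first.

```shell
#!/bin/sh
# Wrap every dangerous command in run(): dry-run prints, live mode
# pauses for a human sanity check before executing.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "WOULD RUN: $*"                  # dry run: show, don't do
    else
        printf 'About to run: %s\n' "$*"
        printf 'Press Enter to continue... '
        read -r _                              # the hit-a-key checkpoint
        "$@"
    fi
}

DRY_RUN=1                                      # demo in dry-run mode
run rm -rf /srv/app/old-cache
run cp -a /srv/app/config /srv/app/config.bak
```

Because the same `run` lines serve both modes, what you rehearsed in the dry run is byte-for-byte what executes for real - which is the whole point.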
Another favorite habit was aliasing dangerous commands, like reboot. I'd alias it to echo "use reboot`hostname`" and alias reboot`hostname` to the full path of the reboot command. That prevents accidental reboots of the wrong server (this can be a problem in a major rollout where you are doing a lot of active work on some servers while developers are already working furiously on the test/dev/GM servers)
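Rendered as actual rc-file lines, the trick above looks something like the following sketch - hedged in two places: a hyphen is used rather than direct concatenation, and /sbin/reboot is assumed to be the real binary's path.

```shell
# "reboot" alone only tells you the safe spelling; the hostname-suffixed
# alias does the real thing, so a command pasted at the wrong terminal
# falls harmlessly flat.
alias reboot='echo "use reboot-$(hostname) if you really mean this machine"'
alias "reboot-$(hostname)"='/sbin/reboot'
```

On a box called `web01` you would have to type `reboot-web01` to actually reboot it, which is exactly the speed bump you want during a multi-server rollout.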
Type 'm superfluous_thing', proof-read it, then '<home>r<enter>' - i.e. write the command with a harmless leading 'm', check it, and only turn it into 'rm' once you're sure.
A script with echo in front of anything dangerous. Run the script, then remove the echos.
Finally: restore from backups regularly.
Two days work in 6502 assembler on someone else's computer. Tested, working, and saved twice to 5¼" floppy disks (IAVO) on Friday afternoon ready for demonstration to the customer on Monday. Clean up everything on the borrowed computer, then find both floppy disks are unreadable. Suddenly I was not looking forward to the weekend any more. I have not lost data since then.
No project is complete until it has been restored from backups, preferably twice, the second time by someone you trust to deal with problems while you are on holiday.
At $50 a TB you don't need to. Just mark it as not of interest at this level. Indexes are cool!
At some point, the information in the documents is historical at best. Either archive it or delete it - not so much to make space as to save having to keep track of it.
We are no CERN but a TB of experimental data is still not that hard to produce...
There is a huge difference between the scale of data humans can produce themselves and data than can be acquired by some automated process.
Sometimes, responsible use of data includes deleting it when you don't need it any more, such as when it's the law regarding personal data or credit card numbers. Keeping what you shouldn't keep means it also can be stolen and misused and it's your fault.
Keeping everything is a terrible idea. It doesn't matter how good your search is, trying to find what you need is more difficult the bigger the haystack. Especially if you have dozens of different versions of the same dataset but only care about a few of them.
Already the volume of data is becoming a bigger and bigger problem because of attitudes like yours...
Keeping everything IS a terrible idea, but not as terrible as deciding which files might need recovering and which files won't -- especially before the urgent need for recovery occurs to grant resources needed to make all those very many decisions.
If everything is kept, it is then a simple matter for the owner to decide whether or not it is worth trawling through everything.
The problem with keeping everything is that you're sometimes not allowed to.
For example, the data protection laws covering personal data (both those that have been around for years and the new GDPR ones) make clear that you are OBLIGED to delete personal data when you no longer have a legitimate purpose for keeping it.
Defining and agreeing a good retention policy is a pain in the nuts, but it's a pain worth enduring. If someone complains that you deleted something three years ago that they now need, that's tough - because more often than not they WANT the information rather than NEEDING it and may well not have a truly legitimate reason to be using or processing it.
We hold on to data because most of the time we are obliged to keep it for AT LEAST a given period (e.g. keeping tax-related information for six years). It's easy to forget that there is often an upper limit to how long we're allowed to keep stuff for, whether it's a static measure of time or it's in the context of "the requirement has gone away".
Salvage on Novell is all well and good, but if i remember rightly it took two keystrokes to delete a volume and there was definitely no "are you sure" window.
When the teams are already understaffed, nobody has the time to write them...
So no documents available for new people starting after a staff member left because of overwork, not even the list of basic permissions to grant to a new team member.
Which leads to frustration, overwork for the other people in order to explain everything to the new hire, then another member leaving, and the cycle repeats.
But with management not interested in the day-to-day operations, only in big projects providing lots of notoriety, it is not a problem.
Up to the moment the users start billing back IT for lost time...
Nobody bothers to read them when the shit hits the fan....
I remember documenting a procedure that had a flaw if the overnight processing went past 07:00 (the "start of a new day" in the scheduling system). There was a documented and simple process (one "ad-hoc" program needed to be run after overnight processing was completed) to recover the situation.

Sometime after I was let go from that job I was catching up with the people over drinks, and they related an issue that occurred during the End of Financial Year processing, where this process failed because the overnight processing ran late and no one was able to get it restarted (no one read the doco). The night shift person came in and was advised of the situation. He told them "there is a process for that in the documentation", opened the documentation, ran the job and it all started working. It had been down for over 15 hours because no one read the documentation that we had put together.
So why did you not automate running the process when the conditions were met? You know computers are good at that, right?
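A hedged sketch of what that automation could look like, based only on the story above: `run_adhoc_recovery` is an invented placeholder for the real ad-hoc program, and 07:00 is the "new day" boundary from the scheduling system.

```shell
#!/bin/sh
# Run the documented recovery step automatically whenever the overnight
# batch finishes after the scheduler's 07:00 rollover, instead of
# waiting for a human to remember the documentation.

needs_recovery() {
    # $1 = hour (00-23) at which the overnight batch finished.
    # At or past 07:00 the scheduler has rolled into a "new day",
    # so the documented ad-hoc fix must be run.
    [ "${1#0}" -ge 7 ]
}

finish_hour="$(date +%H)"
if needs_recovery "$finish_hour"; then
    echo "overnight run finished after 07:00 - triggering recovery"
    # run_adhoc_recovery    # invented placeholder for the real program
fi
```

Hooked onto the end of the overnight job, this turns a 15-hour outage into a log line nobody ever needs to read.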
The new management was just as bad as the old.
An "Are you sure?" message might save the day.
But when the original error is unrecognised or the user was confused the probability of pressing either key becomes 50%.
The only dialogue worth having is the one that says "If you continue the entire contents will be wiped completely. Are you sure you want to wipe this file/disc/hard drive?"
Followed by "Are you really sure? Really really?"
And even then a "Would you like to think about this?" wouldn't hurt.
Confirmation is (or was) not enough to protect you from me.
Back in the days of floppy disks, I was learning how to use the DOS commands by doing. Deleting multiple files using wild cards (like file*.* or file*.dat) was handy until I wanted to delete a number of files ENDING with the same characters.
I told the OS to delete *file.* and was asked to confirm.
Yes, dammit, that's what I typed, wasn't it?
... and the disk was empty.
I would have liked a couple of those "Are you really, really ..."
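The trap was in how old DOS matched filenames: anything after a `*` in the name part was ignored, so `del *file.*` was effectively `del *.*`. Modern shells expand globs properly, but the defensive habit suggested earlier in the thread still applies - preview the pattern with a harmless command before handing it to the delete. A small demonstration in a throwaway directory:

```shell
# List what the glob matches before giving the same pattern to rm.
tmp="$(mktemp -d)"
cd "$tmp"
touch datafile.txt oldfile.dat notes.txt

ls -d *file.*        # step 1: preview - shows only the two *file.* names
# rm *file.*         # step 2: same pattern, only after reading the preview

cd /
rm -rf "$tmp"
```

In a modern shell the preview shows `datafile.txt` and `oldfile.dat` and leaves `notes.txt` alone; under the old DOS rules the equivalent `dir` would have listed everything, and you'd have had your warning before the disk went empty.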
> An "Are you sure?" message might save the day.
But not with the genius Windows 7 folder re-selection 'feature' which changes the selected folder to the containing folder without any visual clue and you don't know what's happened until it asks 'are you sure you want to delete desktop.ini' just as it refreshes the explorer window to show there is nothing else there now.
And yes, this is the kind of crap that happens as you are sorting what needs to be backed up (method now changed, too late for vanished data) just like the disk that failed the night before that stack of blank DVDs was due to be delivered. Though that last one in particular is a clear demonstration of the temporal and contextual awareness of devices that exists at the most fundamental level of reality, and even CERN's toasted weasel only barely scratches the surface.
the genius Windows 7 folder re-selection 'feature' which changes the selected folder to the containing folder without any visual clue
Been there. Done that. Watched /Documents and Settings/ disappear in front of my eyes. Downloaded Explorer ++ and never used the Windows one again.
'Followed by "Are you really sure? Really really?"'
Followed by "We're recording that it's you who's doing this. If this is a mistake it'll be on your head."
..is a problem which plagues me now we all have to use cheapo keyboards which too easily fill up with crud.
e.g. "rm *.txt" mutates into "rm *>txt" (ditto "mv").
Not so many years ago, it was a final warning, if not a sackable offence, to eat or drink near a computer/PC/workstation. Nowadays, keyboards, when tipped over, could just about feed the 5,000. The hardware is cheap so no one cares any more. They forget how valuable the data is. Cheap, nasty keyboards don't help.
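One defensive wrapper for this class of slip - a sketch only, and "saferm" is an invented name, not a standard tool. It echoes the fully expanded argument list and demands a typed "yes" before deleting, so a pattern that expanded to far more than you expected gets caught while it is still harmless.

```shell
# Confirm-before-delete wrapper: show exactly what the shell expanded
# the glob to, then require an explicit "yes" to proceed.
saferm() {
    printf 'About to delete %d item(s):\n' "$#"
    printf '  %s\n' "$@"               # the glob, fully expanded
    printf 'Type yes to proceed: '
    read -r answer
    if [ "$answer" = "yes" ]; then
        rm -- "$@"
    else
        echo 'Nothing deleted.'
    fi
}
```

Run `saferm *.txt` and you see every file the pattern matched before anything happens; any answer other than a literal "yes" leaves them all alone.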
Just starting as a full-time programmer. So I created some stock libraries and parked them in my share. The other programmers could and did, grab one as needed and move it to their share to modify or compile as needed.
Cue the new hot-shot coming in 6 months later. Grabs my disk read and fill 10 pointers and uses it in one of his programs. However, he only changed my library in my share and then compiled his program. A week later, someone asked if I could speed the program I'd written up a bit. So, I go change the number of pointers to be filled on each disk read to 20, compile and test. Next thing I know, the entire day's work for the company (a massive customer database) is disappearing...
Hindsight... the young lad had changed the line for my memory release for the pointers to "delete file" in a directory his program created. He hadn't copied it over to his share as was standard practice (and he damn well knew about that) but had just changed the file in my share. I should have rechecked that library, but several hundred lines of code for various functions just wasn't about to get reviewed every time I compiled. Luckily, the nightly backup from the night before was still on site, and the incremental back-ups were set to "write but never delete". Instead of taking several days to recover, we were able to recover overnight.
After that, we (with manglement's approval) locked down the shares such that they could be read but not changed by anyone other than the share owner.
Biting the hand that feeds IT © 1998–2018