Why did he try a brand new monitor if his own known-to-be-working monitor didn't work when connected?
It’s Friday at long last and that means it’s time for On Call, our regular trip down readers’ memory lanes. But for a bit of a change, this week we’ve got two tales of tech support head-scratchers, where the problem involved thinking outside the box. First we have Patrick, who told us about an incident that happened during …
Because "elimination" is not in most IT guys' diagnostic process.
Yes, it drives me mad too.
The other is when they "eliminate" something, then for some mysterious reason proceed to return to it and eliminate it several more times after exhausting themselves on other things because they don't have the nous to go further down the line and/or imagine a test that would isolate the cause.
"X isn't on the network".
Okay... ping it. Is it turned on? Is it cabled in? Is the cable in the wall? Is the cable good? Is the wall cabling good? Is the wall socket good? Is the other end patched in? Is the switch working? Is that switch connected upstream? Is that switch port configured properly (e.g. VLANs, MAC filtering, etc.)? Is that actually the IP assigned to the device? ...
All these things are "simple" and obvious for an IT guy, or should be. But I've watched supposed IT professionals stare mystified because "obviously the wall cabling must be good" despite the fact that they haven't bothered to test any of it by even the simple precept of putting something else on the cable.
I've literally sent technicians back repeatedly for nearly 6 hours straight because a device wasn't online despite being powered up and working... only to then have to go do it myself and discover that the cable between the device and the wall was faulty. Replace the cable, everything came up. They literally didn't bother to eliminate along the path, instead stabbing at random at causes, rebooting switches, etc. The reason I kept sending them back was to teach the lesson - you can waste an entire day just stabbing at causes and making yourself look an idiot... or you can apply a proper diagnostic process in a linear fashion until you find the cause (or, even, multiple causes).
The value of diagnostic thinking is greater than you think.
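For anyone who wants the "eliminate along the path" idea spelled out, here is a minimal Python sketch. The check names and their hard-coded results are hypothetical stand-ins for real tests (swapping the patch lead, putting a known-good device on the wall port, and so on); the point is only that the checks run in path order and stop at the first one that fails.

```python
from typing import Callable, List, Optional, Tuple

# A check is a (description, test) pair; the test returns True when that link is good.
Check = Tuple[str, Callable[[], bool]]


def first_failure(checks: List[Check]) -> Optional[str]:
    """Run the checks in path order and return the first one that fails."""
    for name, test in checks:
        if not test():
            return name
    return None  # everything along the path tested good


def demo_checks() -> List[Check]:
    # Hypothetical results for illustration only: the patch lead is the fault.
    return [
        ("device powered on", lambda: True),
        ("cable between device and wall", lambda: False),
        ("wall cabling", lambda: True),
        ("switch port configuration", lambda: True),
    ]


if __name__ == "__main__":
    print(first_failure(demo_checks()))  # cable between device and wall
```

Once the failing link is fixed, running the same list again from the top is what catches a second, independent cause further along the path.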
> I've literally sent technicians back repeatedly for nearly 6 hours straight because a device wasn't online despite being powered up and working... only to then have to go do it myself and discover that the cable between the device and the wall was faulty.
A former colleague of mine had the most perfect copperplate handwriting I have ever seen. It was almost font quality. When he first started work at 16, he was down the mines and had to walk 1.5 miles each way to the coal face, write down some information and walk back to the foreman. Every time it was not absolutely perfect, the paper was torn up and he was sent back again....
Clearly the above commentators have never had the fun* of having their 'known good' hardware killed by whatever was causing the original problem.
I've had a power supply go bad in such a way that it killed a motherboard, which would then destroy any other power supply it was tested with. We got through four PSUs and three motherboards before we worked that one out.
* actual fun may vary
Because monitors (particularly of that generation - it's not explicitly mentioned, but refers to CRTs) occasionally die through movement. Work fine on a desk for years, then you move it to another desk and it's dead.
I've had it happen to me, so I'm sure others have seen the same.
LCDs are slightly more robust, even if they don't feel it.
In normal circumstances I can find a fault with a binary search but some clients like to help by asking a continuous stream of "Is the printer broken? Is the computer broken? Is it the printer cable? ..." three seconds after I enter the room. Applying chresmomancy may lead me to suspect the problem is a Letter sized document sent to a printer loaded with A4 but that requires typing in my password to check which can be a little tricky if I get too much help.
It's a bit like when I go to the fridge looking for, I don't know, a packet of ham or something. Often I'll just close the door, and then check again (or my wife will check for me) and without doing anything else... Voila! It's appeared right there on the shelf in plain sight, when it definitely wasn't there before.
"Clearly the above commentators have never had the fun* of having their 'known good' hardware killed by whatever was causing the original problem."
So you mean.... you change one part. No change. Then you have to move your diagnosis further up the chain, until you find a dodgy item (i.e. a PSU that doesn't work on TWO motherboards, or that you swap out and it powers up), or until you *test* an item more than a quick check of "does it function immediately and perfectly in all regards"... by, say, sticking a PSU tester on it.
I have in fact had MUCH more complex diagnoses than that (recently someone put a digger through a 450KW supply cable and blew up £20,000 of hardware, that we restored by diagnosing and replacing only £3000 of parts that were ACTUALLY faulty). And you ALWAYS start with the same diagnosis. In that scenario, I wouldn't have got through more than two motherboards or PSUs before I suspected a much more serious problem. In fact, likely one PSU and maybe two motherboards - when the known-good MB is replaced on the same PSU and doesn't work, I'd suspect the PSU, replace that, and then when that didn't work, I'd go once up the chain (checking the power sockets and cables by using a known-good one of those).
Then when you know that it's the MB/PSU end / combination that's at fault... pull both, check one level up to make sure the power isn't blowing the PSUs, put something else in place, allow the user to continue work, and then carry on your diagnosis of the faulty parts back in the IT office (e.g. with a £10 PSU tester) before you do any more damage. In fact, at that point, I'd put the previously-known-good MB back on a known-good PSU, realise that it must have actually BROKEN during testing despite being known-good, and ditch the PSU that did it, testing it only for curiosity.
Four PSUs and three motherboards again reeks of "I didn't narrow it down sufficiently and just kept guessing / throwing hardware at the problem".
Honestly, the second the "obvious" swaps don't work, I'm replacing the entire kit to shut the user up, breaking out multimeters and testers back in the office (where there's an isolated mains and network circuit, because you are playing with 240v and PSUs!). There's a reason I have a drawer full of nothing but cheap PoE testers, mains socket testers, multimeters, PSU testers, network cable testers, battery testers, discriminating continuity testers, telephone line testers, etc. And that drawer cost me an awful lot less than even the price of the cheapest replacement motherboard. (I am not one of these people who wants/needs £1000 high-tech testers... if it doesn't pass a basic test I don't want it, and if it needs £1000 of tester to tell you if it works, but £50 to replace it, I'll just replace it.).
P.S. Yes, we do all our own cabling. We manage and repair all the PCs and devices on-site. Hell, we do the CCTV, access control, and everything else you can imagine ourselves. We do not have a huge stock of spares (currently about 0.1% of the deployed hardware) or parts. I don't have a huge test suite or dozens of techs - 1 per 500 devices. I don't have a stupendous budget, or warranty support etc. on anything but the server-side. The way we cope (more than comfortably) is by proper diagnosis.
Often I'll just close the door, and then check again (or my wife will check for me) and without doing anything else... Voila! It's appeared right there on the shelf in plain sight, when it definitely wasn't there before.
That's a temporal continuity error, it was there before you opened the door and after you closed the door but not while the door was open the first time.
There is a full explanation of these in The Twilight Zone - A Matter of Minutes
Clearly the above commentators have never had the fun* of having their 'known good' hardware killed by whatever was causing the original problem.
Which was actually the case here, to some degree. Though the thing to have done would be to plug the 'known good' monitor back into the setup where it had been good, rather than continue killing (or at least incapacitating) monitors.
More monitor fun. Logged into another workstation in our room a while back to help troubleshoot something about the network (a case of "I'm Spartacus!" it turned out). We got to the bottom of it, but while doing this I noticed the colours looked a bit washed out, somehow paler and not quite right. Loaded up Gimp or something, and found out green (I think it was green) was displaying as white. My desktop is mainly blue and sandy yellow, login screen is blue and red, so it wasn't obvious from a distance. Broken monitor, video card, weird driver bug? Infrequently used machine and not particularly my problem, but I asked the person who did use it. They hadn't really noticed anything, but did agree it looked a bit funny. Made a mental note to try at least replacing the DVI cable, though didn't see how it could be that, surely you'd have a channel missing.
A few days later: had a free moment and remembered about this, so got hold of a new cable, tried plugging it in, met resistance. Pulled the machine around to have a look and saw the plastic around one of the holes in the socket had been distorted, blocking the adjacent hole. Took a look at the plug on the old cable, one of the small DVI pins was bent into contact with another. Prised it into a roughly normal position, fiddled about a bit and got it back in. RWB back to RGB. The computer had been moved a couple of months before and whoever did it had somehow managed to fail to put a DVI plug in correctly, not noticed the force they presumably needed to apply to get it to that state and then not noticed that somehow they'd left green behind after the move.
This is a situation where hard-earned experience comes into play. If I suspect a motherboard failure I will never put a new one on the existing PSU. If I had absolutely no possibility of a replacement PSU, I'd link out the soft start, and power the PSU up by itself with dummy loads (even once resorted to using car tail and brake lamps as loads).
You don't last long in this profession if you don't learn proper troubleshooting techniques and eliminating possible causes of a problem is a pretty vital technique. I know some people who have never managed to get it down, but in my experience they don't last more than a few years.
> when the known-good MB is replaced on the same PSU and doesn't work, I'd suspect the PSU, replace that, and then when that didn't work, I'd go once up the chain
when the known-good MB is replaced on the same PSU and doesn't work, I'd suspect the PSU to have just blown a second MB
Amen. Been there, pulled my hair out, and then ran the testing myself with the "tech" (loosely applied) watching and hopefully learning. I usually took the "tech" and had him/her do it with me watching and walking them through the procedure.
FTR, I learned troubleshooting on aircraft in the military. They teach this method and hammer it into those who do the troubleshooting. Logic seems to be a lost art these days, sadly.
"But I've watched supposed IT professionals stare mystified because "obviously the wall cabling must be good" despite the fact that they haven't bothered to test any of it by even the simple precept of putting something else on the cable."
My personal bugbear is the "cable tester says it's OK" when the port(s) are clearly not fine for the intended purpose.
Oh, and the fact that the company network support has been outsourced to will insist on 3-4 visits to "test" (at $120 a time) before they will accept they need to fix it ($200 fee).
I started just requesting a re-cable each time the cabling was fuxed. Saved time and money, although I got rapped on the knuckles until someone from Accounting asked us why we were spending only 30% of our repair budget....
I've also had more than one personal job where someone who should know better (such as an engineer or techy) has buggered up their home network, but admitting so to the wife/husband/kids is too embarrassing. So I come round for a "social" call, and restore things to sanity. For the usual fee, plus a discretion bonus payable in whisky.
" whoever did it had somehow managed to fail to put a DVI plug in correctly, not noticed the force they presumably needed to apply to get it to that state"
Reminded me of the ancient times when I worked in mainframe testing. Had just helped some new hires track down a bent pin shorting out two signals. So I stated to the group that if they ever have extra trouble putting cards in that they may bend some pins if they force it. So a light bulb goes on for one of them and he grabs the card puller and pops another card out. We shine a light in and see he got the right card as there is a bent pin behind that one too.
I was taught fault finding on airborne TX/Rx's in the early sixties. Box didn't work in the Vampire, into the radio bay, then inject the correct signal in the right place and the fault was isolated to half the box. Repeat until fault identified, valves, discrete components made it simple. Smell and signs of burning also were useful indicators.
Fast forward 50 odd years (some very odd) and can I use that training to fix my PC? No bloody chance. Event Viewer is next to useless (for me); it just tells me there's a hardware fault somewhere. Start swapping major components; luckily the PS is the first I try (the cheapest) and it fixes it. I don't envy current techs with the sort of faults mentioned earlier.
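The inject-a-signal, isolate-to-half-the-box, repeat routine described above is a half-split (binary) search. A rough Python sketch, assuming a single fault in a linear signal chain, so that every stage before the fault gives good output and the faulty stage and everything after it gives bad output. The stage names and the probe function are made up for illustration.

```python
from typing import Callable, List, Sequence


def half_split(stages: Sequence[str], output_ok: Callable[[int], bool]) -> str:
    """Return the first stage whose output is bad.

    Assumes exactly one fault: output is good at every stage before it
    and bad at the faulty stage and at every stage after it.
    """
    lo, hi = 0, len(stages) - 1  # the fault is somewhere in stages[lo..hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if output_ok(mid):
            lo = mid + 1  # good up to mid, so the fault is later
        else:
            hi = mid      # bad at mid, so the fault is mid or earlier
    return stages[lo]


# Hypothetical six-stage receiver with the fault planted in stage 3.
stages = ["antenna input", "RF amp", "mixer", "IF amp", "detector", "audio amp"]
faulty = 3
probes: List[int] = []


def probe(i: int) -> bool:
    probes.append(i)   # record each measurement taken
    return i < faulty  # output is good only before the faulty stage


print(half_split(stages, probe), "found in", len(probes), "probes")
# IF amp found in 3 probes
```

It only works when you can measure at intermediate points and there is one fault, which is roughly why it suited discrete-component radios better than a modern PC.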
Just a guess, but I don't think you're paying the going rate for good technicians.
Over twenty years ago I used to be a Unix & NetWare admin earning a decent wage (I've since switched to software development), but have noticed how "commoditised" desktop support has become.
Big firms pay peanuts for monkeys whose only technical abilities are turn it off and on again and, if that fails, re-image the boot drive (which is great when it takes a day and a half to install all the tools needed to do your job)
If you want non-simian staff you have to pay for them.
I do field service work. A lot of calls read something like "repair/replace the cabling to X". It usually isn't the cable. When it is, it's often the jack, particularly in the deli department or the meat department prep room -- anywhere that's moist or gets washed down. The (supposedly) gold-flashed contacts in the network jack turn green, or the punchdown contacts do. Replace the offending jack, and the problem is solved for a few years.
Some of the companies that hire my services treat me like a monkey with a screwdriver, not trusting my diagnostic skills. Being directed by levels 1, 2, AND 3 of contracted tech support from "Bob" in Bangalore (or wherever) to replace the cable from the modem to the router, and try it again, and again, when it is clear to me that the router's WAN port was fried by the electrical discharge that caused the ISP to replace their modem the previous day, is an exercise in frustration. The restaurant manager said that he'd lost printers and a dishwasher to the same storm. I had the fault diagnosed within 20 minutes of arrival, but it took 4-1/2 hours to persuade their Level 3 that the magic smoke had been let out of the WAN interface, and they needed to send a replacement router. I console myself with the fact that I am paid hourly for these gigs.
I carry a $50 cable mapper that will detect some of the gross faults. High-end cable certifiers are pricey and hard to justify. But I have a laptop with a Broadcom network interface, and a copy of Broadcom Advanced Control Suite (BACS). This software uses the NIC's inherent diagnostic capability to perform cable analysis, I assume through time-domain reflectometry. It will tell you the length of each pair in meters (the measured lengths vary due to different number of twists per foot from one pair to the next, to reduce crosstalk). It also detects crossed pairs (such as when you misread the colors and exchange the white-of-blue for the white-of-green, for example). It seems to be accurate to within a meter or so.
It's not calibrated, certainly not traceable to NIST standards, but it will give me enough indication that the fault is at the near or the far end of the cable (or, in one case, 50 feet from one end) that I bring it to all the gigs. It is quite persuasive to be able to say "My laptops connected to each other at 1Gb/s over that cable, and cable analysis didn't show any faults".
I've made adapters to be able to test telephone cables terminated in 6-position jacks, and even CCTV coaxial cable, using it. The velocity factor of other cable types may differ from Cat5, throwing the length measurement off. But it's certainly good enough to tell you whether it's the near, or the far, end of the coax that has the badly-crimped connector. If you care to know the actual length, use BACS to analyze a known length of the cable in question, and divide the measured by the actual length to get a correction factor.
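The length correction above is just a ratio. A small Python sketch of the arithmetic, with made-up figures (a 20 m reference length of coax that the tool reports as 25 m); the function names are mine, not anything from BACS.

```python
def correction_factor(measured_m: float, actual_m: float) -> float:
    """Measured length divided by actual length, from a known reference cable."""
    if actual_m <= 0:
        raise ValueError("actual length must be positive")
    return measured_m / actual_m


def corrected_length(reading_m: float, factor: float) -> float:
    """Convert a raw reading on the same cable type back to a real length."""
    return reading_m / factor


# Hypothetical calibration: a known 20 m piece of coax reads as 25 m.
factor = correction_factor(measured_m=25.0, actual_m=20.0)
print(factor)                          # 1.25
print(corrected_length(50.0, factor))  # 40.0
```

The factor only holds for the cable type it was measured on, since it is standing in for the difference in velocity factor.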
So give that old Dell D610 a new life as a cable analyzer, or get a more modern (and portable) device such as the Pockethernet, so you can have some more substantial evidence to point to and get those outsourced cabling guys to do their job right the first time.
Lee, this is more prevalent than I wish it to be. I keep trying to educate my techies to work from the lowest layer up (including the layer between chair and keyboard), instead of having them do turn-off turn-on stuff and "see if it works now". To no avail, mostly.
I remember a callout at 2am Saturday morning where "the hub has failed, we replaced it, but systems are still down." Strange, as we don't have any hubs in the complex.
So, I go over there, even after enjoying a good Friday "activity", which is a 2km walk, driving was out of the question...
Only to find the "hub" being a c3650, which they slapped in without thinking of configuring.
Even if you think you've covered it all, things like that will make you go.. hmmmm
Assumption is your enemy when troubleshooting.
Oh, systems were back up and working in 15 minutes. Is configuring-while-intoxicated a crime?
Far too many people cannot comprehend the possibility that the root cause of an issue might be physically located more than about 10cm from the visible symptom.
When Jeremy Clarkson's Ford GT wouldn't start, he and James May actually removed and disassembled the Start button. A clear example of this 'Proximity Focus'.
The hero of our IT story here didn't even think of checking the PC next? Oh, because it's a couple of feet away from the problem. Third monitor was dumb. (But can't blame him for the magnets...)
We've come a long way, much deeper and ever further in A Matter of Minutes, Johndoe888.
And Current Programs are Registering Alien Information for Advanced IntelAIgent Machinery Use Deploying Future Pictures for SMARTR Development.
Is that a Stop Press, Universal Scoop for El Reg? SomeThing for the Creation of Empires/Dream Lands?
Of course IT is. You surely all know us by now and I/We Kid U Not.
The logical progression of that information is the immediate regression of military weaponry into obsolescence for such has no chance of defeating or defending against Immaculate Sources and Alien Forces .... you know, the ones Donald Rumsfeld was aware and even wary of .....“Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know."
Are you prepared to know what you do not know and what you may be being denied by 0thers? Or are you being Prepared with an Awareness of Registered Alien Information?
"I've had a power supply go bad in such a way that it killed a motherboard, which would then destroy any other power supply it was tested with. We got through four PSUs and three motherboards before we worked that one out."
And then there's the, admittedly quite rare, DOA parts right out of the box.
"Big firms pay peanuts for monkeys who's only technical abilities are turn it off and on again and if that fails re-image the boot drive (which is great when it takes a day and a half to install all the tools needed to do your job)"
There are millions of firms out there who need support, but far fewer people who are actually capable of providing it. Often the monkeys are all you can find, and some of them aren't even cheap.
And then you have the problem of Microsoft constantly pushing their products as "easy to use", despite the fact that to actually use and manage them properly requires more knowledge and skill than even the supposedly "hard to use" products they're competing with. This marketing then convinces companies that hiring the cheap monkeys is perfectly ok, and that hiring properly competent people would be a waste of money.
The end result is instability and security breaches.
I recently bought some very expensive, very special, bolts that hold the brake calipers on a classic vehicle I am restoring. When I came to using them, could I find them? I turned the garage upside-down, nowhere to be seen, turned the house upside-down, still no bolts. Eventually used some inferior standard bolts as a temporary fix just so I could complete the next part of the rebuild, and ordered a new set. Next day I went into the garage and found the original four bolts in plain sight on the bench. I KNOW they weren't there the previous day, as I had swept the bench and vacuumed the garage floor, but - hey ho! - there they were. I now have a new set of expensive bolts to sell - anyone interested?
" ..a Letter sized document sent to a printer loaded with A4 but that requires typing in my password.."
More often than not it simply requires pressing OK on the printer to override.
I wrote a wee sed script to mangle "Letter" to "A4" on our print server. A lot of problems stopped after that.
"recently someone put a digger through a 450KW supply cable and blew up £20,000 of hardware, that we restored by diagnosing and replacing only £3000 of parts that were ACTUALLY faulty"
I know people who do that with lightning strikes. The problem is that over the next few weeks/months, the components which were overstressed but didn't _quite_ initially fail will decide they're really pining for the fiords and shuffle off their mortal coil. End result is a bunch of callbacks.
It depends on the circumstances, but loss of customer goodwill and labour costs usually outweigh any advantage on trying to cheap out on "got too many volts up the wrong end" type of repairs. You'll usually find that attempting to charge for revisits and extra parts is a non-starter too, unless you like your name being Mud.
"My personal bugbear is the "cable tester says it's OK" when the port(s) are clearly not fine for the intended purpose."
Mine too. Most cheap cable testers run at DC.
Ethernet runs at a few hundred MHz. You have to have the mindset that these are fancy (and sensitive) antenna cables, with an RF approach required for best results.
You can do a _basic_ test with a cheap tester but it's not the be-all and end-all. On top of that, badly wired connections will frequently pass on the expensive testers if they're less than 100 metres. (The most common failing is tails which are _way_ too long, or rotten IDC punchdowns.)
It's cheaper to toss a dodgy connector than to rewire it.
I've had (very low end) video cards pull that stunt on me as well; not a problem with memory, or the GPU, but the analog circuitry pushing the signal out the VGA cable had gone far enough out of spec that it showed up as a major color distortion on the monitor. (this was after swapping the damn CRT, obviously.)
There was also the time that some gorilla in a telco office managed to plug in a Juniper line card upside down and forced it in; trashed a line card costing over 120K and badly damaged the backplane on the router chassis as well (which was another 50K easily). My boss at the time was Not Amused.
'I wrote a wee sed script to mangle "Letter" to "A4" on our print server. A lot of problems stopped after that.'
Perhaps to be replaced by other problems?
Thank you for the A4 you sent me yesterday. Though I am left wondering if it is following the spirit of the law, or the A4 of the law?
I have a similar problem at the moment.
I have a monitor (and it is definitely not the cable, the video card or the drivers, as it happens wherever you plug the monitor in). White (or actually a shade of grey, which Windows likes to use in its system shadings) is decidedly green. Factory re-sets on the monitor, nothing. Playing with the colour balance on the display almost fixes it, but then throws out other colours.
Return to manufacturer, gets "tested" and returned to us with no fault, we then plug it back in to a random PC with a new cable, same fault. (Ba*ds)
I used to work in an environmental lab, and one of the computers stopped working one day. I couldn't get it to do anything, so I popped the case, and found that all the acid in the air (due to the specific testing they did in this particular lab area) had eaten a lot of the solder, and many of the components weren't attached to anything anymore...
We drastically increased ventilation and fume extraction after that...
"all the acid in the air (due to the specific testing they did in this particular lab area) had eaten a lot of the solder"
We had carbon dating samples which were acid washed. The steady deterioration in the efficiency of the drying oven was due to the HCl eating the fan blades.
Biting the hand that feeds IT © 1998–2018