I learned a long time ago that generating random numbers (really, truly random numbers) is a non-trivial exercise. However, I completely failed to apply that computer science lesson to the real world of computing and continued to believe that events in the Newtonian world could happen without a cause. Such a belief system is not …
I used to be responsible for an old NetWare server that had run continuously and relatively painlessly for five years in conditions that were definitely not ideal. It started its life on the dusty floor of a garage that had been converted into a makeshift studio for the new community radio station it was serving. It endured several moves and took a few good knocks in its time. Summers in that town were hot and humid and there was never any air conditioning. The six full-size SCSI hard drives would literally burn my finger if I accidentally touched them.
After five years of faithful and mostly trouble-free service in these appalling conditions, it eventually started crashing (ABENDing, in NetWare-speak) randomly, very much like in the article, which obviously wasn't very desirable considering it was serving up live audio for the radio station. After many late-night callouts I eventually deduced by experimentation that the SCSI controller was on its way out, and took the opportunity to convince the bosses to replace all the SCSI drives with a pair of mirrored IDE drives. By that stage IDE had caught up with SCSI in terms of performance and capacity, and was significantly cheaper.
We eventually also installed air conditioning in the server room but, having no budget for pretty much anything, we couldn't even install a water pump to pump away the condensation. So for a couple of years the aircon was draining into a large barrel, and there was a staff member who had to remember to empty it every evening. Occasionally they forgot... which is why the server was finally raised off the floor and got a space all of its own.
The lesson here is:
That NetWare servers don't crash at random.
Fire icon for obvious reasons.
I had a 'faulty' ceiling mounted projector in a Primary school classroom. Often it would run for the whole morning, then switch itself off at lunch. Sometimes it would switch itself off after a couple of hours. Sometimes (as far as I knew) it would work fine all day. It always started up immediately after switching itself off and had no unusual settings. I tried changing the bulb, but that didn't help.
Unfortunately I didn't have a spare at the time, but the holidays were coming up. The following week, I swapped the projector over with another of the same model. The problem stayed with the room, and not the projector. So, it must be the mains - but everything else in the room was fine, and not cutting out. The ceiling socket looked fine, and Estates knew of no fault with the electrics in the building.
OK, so more investigation. I connected an anglepoise lamp to the projector and left the room for half an hour. When I came back I peered in through the classroom window, and the light had gone out. I walked back into the room, and, as if by magic, the lamp was back on! Then a moment of inspiration. In the corner of the room was what looked very much like some kind of motion sensor. I left the room for half an hour, waited for the light to go out, then walked to the doorway and stuck my arm in. The light came on.
After Estates took a look, it turned out that the electricians who had put the extra ceiling socket in for the projector had tapped it off a mains power cable running along the top of the room. This cable was from an old lighting circuit connected to a motion sensor that everyone had forgotten about. The teacher of the classroom had her pupils working so quietly that sometimes the motion sensor thought the room was empty and so switched off the projector.
I was thinking of visiting thedailywtf.com but then came here instead, only to read your story.
It's as if the IT web press is reading my mind...
Hot stuff always causes lateral [horizontal] thoughts. :-)
Some would posit that nothing is random, Mark. And waste no time or effort in arguing the point. Life is just too short to argue about IT.
Spookily enough, that would then render random number generation for security, a bit of an impossibility.
Thanks for the heat, Paris.
So the server was randomly crashing and it took you how long to consider it might just be overheating?
Cause and Effect
Reminds me of a network glitch I investigated 20-odd years back.
Freakishness on a cross-site link between 08:30 and 08:40 Wednesdays/Fridays. UARTs locked up in utter confusion; X.25 protocol errors for 10/15-second bursts.
Much time was spent hunched over a Hewlett-Packard 4951A "electric handbag" protocol analyser.
Turned out that the site had one employee who only worked Wednesdays-Fridays.
Her partner was a taxi-driver.
He'd drop her at work - stopping right outside the data-centre - then use his radio to call the taxi-despatch for the first job of his day.
1980s-era synchronous datacomms often had poor RF-immunity!
Is it just me or is this not one of the first things you would look at with a randomly crashing server?
Once Operating Environment anomalies are discounted by checking the installed base, environment is the primary area of investigation in any such escalation, followed quickly by human factors.
Any server worth its salt would warn that it was getting too hot. Even PCs do that nowadays!
Dude, I use this "it's random" argument sometimes when I want to get the boss off my back. Don't give him ideas about why saying "It's random" is bullshit. I need that excuse to give me more time to fix something before he calls me on it and demands that I explain it in terms other than "it's random", at which point I have to say,
"I don't know, errr, hey, look at that hot chick on the second floor with the nice boobies!"
and hope he just forgets about it!
Ruining my day man! STFU!
When I was a student, the house phone started behaving oddly. This was about six months or so after changing from BT to NTL (against my wishes!). Sometimes the phone would chirrup and there would be someone at the other end, sometimes people would complain that they had been ringing us all night with no answer, and other times we would pick up the phone and someone would be there, but it hadn't been ringing. All this while the phone was, occasionally, appearing to work normally. NTL visited about three or four times over a three-month period until I got fed up and demanded that BT be put back in.
The BT guy came round and asked where we wanted their box. I told him, and he asked if we minded if he used the old NTL wire across the house, which we would no longer be needing. I said that was fine. About twenty minutes later he said that something was wrong with the wire and he needed to go out to the van to get some test kit. Upon connecting the tester he pinpointed the problem as being behind a radiator. The NTL guys had put a metal staple through the cable.
The phone was fitted in the summer, when there was no need to use the radiator. When we turned the heating on, it caused expansion which made a short; when the radiator was off, everything worked fine.
Ah, sounds familiar...
I had a machine that suffered from sunburn too..
My favourite weird one was the machine that would never crash when left on its own.
I could access it remotely, and run its little CPU at 100% 24/7, no problem at all.
But if I sat down at the desk with it, within an hour it would crash.
Until the one day it didn't... The one day I used it for hours whilst sitting at the desk, and it didn't put a foot wrong. The one day when my mobile phone was off being repaired!
Since then I've never owned a PC with a plastic case.
Many years ago, I gleaned from a far-away operator that the kit only went wrong when the sun came out and she had the covers off.
The kit was controlled by photo-sensors....
Sorry, meant waste of time to read
It's amazing how simple problems seem difficult if you don't think.
... you should have persuaded her to play badminton instead?
Or maybe not...?
When you find a correlation, it's often just the discovery that your measurements reflect the assumptions inherent in your model, thereby seeming to confirm it when in fact it directed them. Or in other words:
>"So the server was crashing when the weather was good. [ ... ] it gets hotter. Hmm.
Servers don't like heat. Where is the server? Sitting on a bench. In front of a south facing window [ ... ]
So, Rosanne plays tennis when the weather is good, the sun shines and it's cooking the server. The Newtonian world is back in balance, yin has a yang and effect does have a cause."
... thereby completely blinding you to the reality that in fact, every time Rosanne goes out to play tennis, the juniors (she being supervisor and all) decide to slack off and fire up a game of network quake on the server!
A new lease of life for an aging processor
I've had similar experiences with a desktop machine which I converted to a web and file server for development purposes at home.
This particular machine (an AMD XP 3000+) was forever failing on me. First it told me that the hard drives had gone. Not having the cash to buy new drives, I was in a bit of a panic about this until for some unknown reason I decided to try replacing the IDE cable instead. Funnily enough, this solved the problem of the drives having seemingly packed up. Never did work out why I decided to try changing the IDE cable but I'm glad I did. A cable is a damn sight cheaper than 2 new hard drives and the frustration of restoring 200GB of data from backup.
Around the same time as this happened the system started to shut itself down, freeze, reboot and do all kinds of peculiar things. At first it would only do these things when it was hot. I was forever cleaning it out from dust and grime (I was living in a pretty grotty hole when this started). Sometimes it would run for weeks or months on end without incident and at other times it would fail 8 or 9 times a day.
My initial thoughts were that it was likely to be an overheating problem. Most of the time it played up was during hot or humid weather, although the fans never seemed to be working unduly hard. With this in mind I began to suspect that perhaps the temperature sensor on the MB had packed up or was in the process of failing.
This problem has been going on for about 3 years (yes, I'm still so skint I can't afford a new system yet), although since March it had been getting much, much worse, to the point where I couldn't actually keep the system up for more than an hour at a time, or only until I tried to open any applications.
About 2 months ago I decided to look more closely at the matter.
After careful thought I concluded that the system only crashed on me during hot weather or when I was placing undue load on the system. This narrowed it down to one of two things. Either memory or the CPU. Seeing as the memory had been upgraded, I wondered about the CPU.
Rather than replace the CPU itself (again through lack of money) I thought to try throttling it back. The base clock defaults to 166MHz. I brought this down to 100MHz and it was stable over the weekend. Wanting to find out how much the system could handle, I took it up to 150MHz and the system was stable overnight but died in the morning with a BIOS error relating to clock frequency. I've now dropped it down to 140MHz and it's been stable for just under 2 months, and has taken everything I've thrown at it so far, including indexing over 110GB of audio tracks, a feat that has not been completed in a single session in nigh on 3 years.
So now it's time for a new processor, although this one is currently performing quite nicely even though it is on its last legs.
I probably could have solved this a long time ago but to be honest, until March I wasn't all that bothered. I rarely use it as a desktop itself, preferring to do most of my work from my laptop and just use it as a server, and whilst it was frustrating at times, I could live with the occasional 5 minutes downtime whilst it reset and cooled enough to boot up again.
Nightmare on Elm St.
Many years ago when I was "last level" support for a hardware supplier, we got a call from a disti who had installed a network for a high-profile client. It suffered from occasional, but disastrous networking problems. Back then, networking was a black art (remember those old, thick, yellow networking cables?). After piling in their own people, analysers, reflectometers, tracing software and everything else they could throw at it, they finally called us in. I sat in a small cubby room for days (on charge, natch') squeezed in with seized/evidence equipment - yes it was No. 10 Elm St. - and all the disti's kit, and nothing happened. After a while, the account manager thought the problem must've somehow fixed itself and was considering declaring the problem "solved". You guessed it - massive packet loss, collisions, machines crashing. The TDR showed up which cabling segment was at fault and off we went to find out what was going on.
It turned out that the cable was running under a staircase - with one rocky step, just by a window. When a courier had made a delivery, or a pickup, he would call in to get the next job. It turned out that mobile reception in the building was lousy and the only place his phone would work was near that particular window. There he'd stand for a couple of minutes, squashing the 10BASE2 cable and causing networking ructions. While he was not the only person to use that staircase, no-one else loitered on that particular step for any length of time, so the problems from people passing by were too small to notice.
It took you that long?
How long have you been in the industry?
Paris, because you seem to have little more IT clue than she does.
Heard a better one
Can't give any specifics, but where I used to work there was a story going around about a network connection failing every time the toilet was flushed.
Dodgy wiring to the water pump.
Solving another intermittent problem
When I was at Uni studying Engineering some years ago, we had a Professor who did a bit of external consultancy. He would occasionally tell us about some of the jobs, and this one was my favourite.
He was called to a paper mill to try and find out why just every now and then the paper was coming out with uneven thickness and, at considerable cost, had to be thrown away.
He looked around at the operation and asked them a few questions. He then said that he could fix it and named (by his own admission !) a hefty price. They were so keen to get it resolved they went for it.
He said "See that window up there - put some blinds on it". He'd worked out that when the sun was in a certain position it was falling on just part of the mill rollers and they were expanding with the heat !
I do hope there are still guys like him teaching ....
Which reminds me of this one...
GM had a similar problem a number of years ago. Have a look here:
laser links failing in the early morning
Long time ago a laser link between two sites kept failing every morning near sunrise. The lasers, though, weren't pointing towards the sun. Network engineers had to stake out the roofs of the buildings and watch for sunrise to see what was going on.
Turns out flocks of birds would take off en masse during sunrise, obscuring both sets of lasers!
Friend of mine's uni had a building that was joined up to the main hub building via a line of sight microwave link. Same time every day, for the same length of time (a couple of hours or something), the link went down. After several months of trying all sorts of things they did a path analysis on the beam. A tree that had overgrown thanks to the council was being pushed into the beam's path by the dew on its leaves weighing its branches down first thing in the morning; then, as the day warmed up and dried the dew, the branches and leaves would rise back up again and stop blocking the beam! A stern talking-to for the council from a uni bigwig later, the tree was pruned heavily.
Flames cos of council stupidity interrupting students dossing *cough* learning sorry online.
never trust control panels
One place I worked had a room full of VAXes (ooo, 18 years ago). One day I'm round the back of the big one and it feels a bit warm. I tell the sysadmin; he wanders over to the air con control panel: no error lights, everything's OK. "Yes, xxxx tends to get a bit warm, it's a big machine."
First really hot day of summer, mid-morning, I can't log into any of the VAXes. Helpdesk reports come in saying the same. I go over to the other building to see what's going on, only to find the loading doors wide open and the sysadmin powering down the (19"/340MB) hard drives.
The five machines' log printers were going mental because, being in a cluster, everything got reported across the whole suite. One goes "I've lost connection to A!" with everything else going "B has lost connection to A!" as well as reporting their own problems.
What had happened was that 2 out of the 3 aircon units had failed but, due to a fault, the fail lights weren't working on the panel. First hot day, the last working one gives up the ghost, the machines overheat and it's goodbye productivity for 1000 people.
I can see why pruning trees so they don't collide with double decker buses (for example) is the council's job, but I can hardly see how it is the council's job to ensure that line of sight RF links don't get obstructed by trees. The council isn't the stupid entity in your story, rather it's the university. That is, if your story isn't apocryphal...
A Bit Harsh
Come on, guys, we all have to learn some of them painfully. NetWare servers are old enough that it was probably many years ago, when overheating wasn't so much of a problem, especially in Scotland. I've had occasion to impress the management a few times when a problem appeared during a customer demo (actually a spurious signal on a bit of radar test kit) which I solved by turning off the nearby monitor. All of a sudden the 18kHz spurious disappeared. What they didn't know is that a few months before that, I'd spent most of a day trying to find out why one of my test circuits, which had been working happily the day before, seemed to be oscillating at about 21kHz. Then near the end of the day someone who'd been using the department computer (in the days when EGA was the bee's knees and an IBM AT cost seven grand) shut it down, the monitor went off and my problem disappeared.
Some problems are only obvious now because of hard, painful experience or, if you're lucky, a good tale of woe from a colleague down the pub who had the experience.
"So the server was randomly crashing and it took you how long to consider it might just be overheating?"
It was in Scotland! Trust me, overheating is not the first thing you would think of!
>I was forever cleaning it out from dust and grime (I was living in a pretty grotty hole when this started).
>After careful thought I concluded that the system only crashed on me during hot weather or when I was placing undue load on the system. This narrowed it down to one of two things. Either memory or the CPU. Seeing as the memory had been upgraded, I wondered about the CPU.
PSU, either dodgy or needs cleaning, in a dusty environment probably the latter.
Cash only please
"Often it would run for the whole morning, then switch itself off at lunch. Sometimes it would switch itself off after a couple of hours. Sometimes (as far as I knew) it would work fine all day."
Was that the projector or the teacher?
Mine's the gown and mortar board.
More please ... it's amusing
Good article and comments. I have nothing to add to these, the bloody things should just damn well work. Like a car ... innit ? If my car threw a "I can't let you do that Dave" moment, you just get the hammer out.
A stern talking to a council actually had an effect besides being passed to someone else time and again? Wonders will never cease. Must have finally spoken directly to the council's tree pruning chap/chapess.
I remember a network hub that was run by our engineers (before standardisation and telling them IT kit was our role only) that would shut down at "random".
We eventually found it connected up to a power switch that they had taken off the same line as the nearest hand dryer.
When the senior engineers used that one rather than the usual toilet ones it would spike the power and kill the hub.
@The lesson here is:
Nope, the lesson here is that someone needs to get better sys admins. Nothing is random and when it appears so, heat related issues (usually bloody CPU fans if present) are often the culprit, closely followed by dodgy PSUs. All pretty basic stuff I'm afraid :(
Er - shouldn't you blame the dumbass who positioned a line of sight without proper clearance?
I am **so** glad I wasted 5 minutes reading about how an old processor needed to be underclocked to stop it crashing. Thanks!
Please take your coat and leave.
Oh! You haven't got one? Well then no reason to hang around then, bye!
Three biggest causes of failing systems
Faulty Power Supply.
Check those first and 9/10 you'll fix the problem.
Vacuum cleaners and processors that are afraid of the dark
Had a PC installed at a customer's site once (a prototype system, which is why we developers were looking after it) that started rebooting for no apparent reason at night. But not every night. Turned out that the cleaning staff had decided that the power outlet the PC was plugged into was more convenient than the one just outside the door that they were supposed to use, and simply unplugged the machine when they wanted to clean the floor.
And then there was the mainboard that was afraid of the dark. The PC failed one day - just wouldn't start, no lights, nothing. Onto the bench, case off - works fine. Reassemble - won't start. After a couple of cycles of this (making sure that no connectors were getting disturbed during reassembly) we decided to reassemble under power to see at what point it failed. Simply putting the cover on caused it to fail. Lifting the back of the cover to let a bit of light in - starts working again. One of the guys came to the conclusion that the machine was afraid of the dark ;-) ... Turned out that an LED in the front panel had got one of its legs bent at some stage, and the insulation had gradually chafed away. Unfortunately that leg of the LED carried the 5V rail of the mainboard and a short to the case took out the whole system, but the short was only present when the cover was fully pushed home.
..doing an installation in a quarry of an IBM cluster controller with a few screens 'n printers linked to a S/36 up the way over a BT leased line with BT sync modems.
Couldn't get the bloody thing to work at all. To add insult to injury, every time I reported the line faulty to BT, they insisted that they'd tested it and it was OK. The only evidence of something being up was a set of cable clips running up both sides of a dividing wall and ending at a hole that didn't go through the (rather thick) wall.
Eventually, I managed to get BT to get their engineer at the Exchange to call me directly while I was on-site. He was seriously pissed off 'cos he'd tested the same line repeatedly for the thick end of a month now and he normally only worked nights (big clue here). "It's up and fine, I can see carrier from both ends", he says. "So it is" say I. A couple of moments later, "It's bloody down now", I say. "F*** me! So it is." he exclaims. This up 'n down process continues for a while and he agrees to send an engineer on site.
The clue was the cable clips. That was where the original installing engineer had found that his drill wouldn't reach through the wall (not even half-way through). He'd then noticed a phone in the back office where the modem was required and "borrowed" a couple of free pairs off the existing cable. One of said pairs was a tad dodgy and every time a truck went over the weighbridge outside, the resulting vibrations caused an intermittent short and the line would drop.
J Random Crasho
Back when dinosaurs strolled through WC2, I was a BOFH on a site with a PDP 11/44, which would fall over with depressing frequency. Out would come the Field Circus bods, who would yank the machine out of its home in the rack and prod it, and poke it, and leave it hanging out of the rack for twenty-four hours, and nothing *ever* happened.
"No problem" said the Field Circus bods. They closed the lid, pushed it back into the rack and went down the pub. Half an hour later the wretched thing would crash again.
After roughly four months of this, with the WP department threatening to quit en masse, a Field Circus bloke slightly sharper than average figured that it *only* crashed when the lid was down and the machine was ensconced in the rack.
He wedged a matchstick between a pair of neighbouring boards and the thing ran happily for the rest of my time working there.
Only on hot Days
I had one of these "at random" problems once. And it too only happened on hot days. But only if a certain member of staff was not at work. An AppleTalk network between three Macs drifted in and out of connection. I was up there staring at the screen and sure enough watched the Macs appear and disappear on the network. As I was watching it, thinking how could this be, I leant back and noticed an oscillating fan moving at the same speed that the computers were blinking on and off. The fan was on the other side of the desk from where an employee usually sat if he was at work. He wasn't that day! Aha. I noticed that a plastic wall plate holding the network socket was lifting slowly whenever the fan blew air at it, unobstructed by the missing employee. I redid the dodgy network connection and put the wall plate back on the wall and voila!
flames - because it's only ever on a hot day!
Had a similar one with a monitor
Many years ago, I was installing PC-based EPOS systems in various shops in Ireland. One customer in Dublin called saying that their monitor would "go blank" for an hour during their busy lunchtime period, but only on random days. Simple things like "turn it all off & back on again" had no effect. As I was working nearby that week, I asked them to call me next time it happened. Sure enough, they called a day or two later & I rushed over to witness the strange phenomenon for myself.
It turned out that their description of the screen "going blank" wasn't entirely what I was expecting. More accurately, when the sun was shining, it would bounce off the mirrored windows of a nearby office block and overcome the screen's ability to compete with direct sunlight. As the sun moved, the light would move off the screen & presto, it could be seen again. Re-positioning the counter solved the problem...
My random error story
Back in the 80's I had to fly out to Hong Kong (yeah, life's a bitch) to sort out a problem on a network monitoring system we had installed. It too would crash "randomly", but this was in a huge air-conditioned computer suite on a UPS. We knew it wasn't really random as it only happened at night or early in the morning. As the computer room was freezing cold, no one wanted to sit up all night waiting for it to happen. Several days of scouring logs didn't help as each time it was a different bit of code executing when it crashed.
However, putting a mains monitor on the supply showed that there were some mains spikes about the time of the crash. The Operations Manager said it was impossible because they had a fantastic UPS that stopped all that sort of thing.
Eventually we decided that one of the PFYs on night shift would sit shivering by the machine and wait for it to crash and see whatever else was happening at the time. Got it first night: at about 5 am, the cleaner came in, plugged her old Chinese hoover into a wall socket, switched on - Crash!
She was plugging into a UPS-fed socket, and because the socket was near our server, we were getting the full mains spike; the UPS was designed to stop spikes from outside coming in, not internally generated ones. I guess UPSs and server power supplies are better now at suppressing mains spikes, but this was over 20 years ago.
No one had thought of her because she was one of the army of invisible people that work when the rest of us are asleep.
Still I got to sight-see in Hong Kong and eat lots of great Chinese food, the cleaner got a new hoover and instructions about which sockets not to use, and the crashes stopped, so everyone was happy.
It sounds like an urban legend...
...but it really happened.
Way back, the company I worked for had an NCR mini-tower server in the main office of a local theme park. We'd get regular weekend call-outs from staff based in other offices because the database had crashed. The server was fine, sitting there waiting for logins but the database process had terminated. Easy enough to restart the database but it was annoying as the software (Progress) was usually pretty stable.
Didn't take long to check the server logs and find that it usually went down on the Friday evening, so I asked one of the guys in the main office to work late on Friday and see if anything happened.
Six o'clock and the cleaning lady came in and asked if he minded her doing the office while he was there. "No", he replied, "go ahead". So first thing she does is unplug the server and plug her hoover into the socket.
Problem solved for the cost of a sticker with "DO NOT UNPLUG" written on it.
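For anyone wanting to do the same log check without eyeballing it, here is a minimal sketch of bucketing crash times by weekday and hour so that a "Friday evening" pattern stands out. The timestamps and the `crash_buckets` helper are made up for illustration; they are assumptions, not anything from the server in the story.

```python
from collections import Counter
from datetime import datetime

# Fixed English names so the result does not depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def crash_buckets(timestamps):
    """Count crashes per (weekday name, hour of day) bucket."""
    counts = Counter()
    for ts in timestamps:
        when = datetime.strptime(ts, "%Y-%m-%d %H:%M")
        counts[(WEEKDAYS[when.weekday()], when.hour)] += 1
    return counts

# Hypothetical crash times, invented purely for this example.
sample = [
    "2008-05-02 18:05",
    "2008-05-09 18:10",
    "2008-05-16 18:02",
    "2008-05-21 11:30",
]

# Print the most common buckets first.
for (day, hour), n in crash_buckets(sample).most_common(3):
    print(f"{day} {hour:02d}:00  {n} crash(es)")
```

A bucket that dominates the list is only a hint about when to go and watch, of course; it took someone sitting in the office on a Friday to spot the hoover.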
i got a stupid one for you
5 or so years ago in my last year at university i was trying to compile some java code in a class and it wouldn't compile didn't show any errors just a run time problem. i'd compiled the code before and it should have worked fine. i called my teacher over and we spent a while playing with it but it didn't work. so i compiled it on someone else's machine showed him got my marks. but being engineers most of the class spent the next hour sitting in class trying to figure out why this one machine wouldn't compile my code.
the machines are reimaged weekly and the machine should have been identical to its working neighbours.
i later found out why turning up to lecture a day later early the teacher told of a grad student having a fit apparently he'd been "issued" the machine i'd been working on for his software programming masters project. and he'd loaded his whole project on to the machine and left it there.
his project an enhanced version of the compiler the university uses( was actually built by a student as a project) with its own libraries
he'd set it to use the same commands as the old compiler. the problem for him was my lecturer had asked for the machine to be rebuilt as the compiler didn't work .
(should probably explain before someone asks why not just use the jdk compiler the compiler in question was also a dev tool it was a gui that let you model the program in uml and then add code to the model then compile and run all in one. but you had to open it from the command line)
AC in case that student reads the reg (was popular with the final year network and sys students don't know about the programmers though)
I am shure I typed
must be wed morning
@AC - i got a stupid one for you
Are you sure you finished your university course? I thought someone who took, or at least seems to have taken, a computer science degree should know how to use punctuation and the shift key. The fact I had to parse the text two or three times to understand it means it clearly isn't any good!
Random number generator
Come on all of you. All you need is a Bambleweeny 57 Sub-Meson Brain attached to an atomic vector plotter suspended in a strong Brownian Motion producer, say, a really hot cup of tea! (RIP Douglas)
I had an experience with an educational computer-controlled robot arm that used IR sensors to make optical shaft encoders for the motors (it was a really good design of arm that did not use stepper motors, as was the rage at the time, but proper electric motors, so was much faster and more impressive, and with six separate independent movements). It worked really well, but unfortunately the IR emitter/detectors were covered in translucent plastic, which when used in direct sunlight caused ALL of the active motors to run to the end-stops of their respective movements. The whole arm contorted, and dumped itself off the bench, and led to red faces and a difficult-to-justify repair bill!
@ Pooper Scooper
At least us old "fokes" can spell folk's!