Two and a half days in hell

As sysadmins, we have to test before we deploy. We need to test before even upgrading a driver. We should test absolutely everything before a major deployment. It seems obvious. It is obvious. You should certainly need to test everything before doing what I did: throwing ten times the normal I/O and processing load at your …

COMMENTS

This topic is closed for new posts.
  1. Anonymous Coward
    Pint

    If it were up to me...

    I'd have NCR build or source all of the stuff I had to use. I do occasional support for a company that's too cheap to upgrade their NCR POS equipment. Some of that stuff has been running for more than 12 years, 24/7 with little maintenance, all fans are long since tied up, and it runs in a hellish environment of airborne grease, 110 degree+ ambient temperature. The enormous passive cooling on the processor is one reason for the success I'm sure. Amazingly, some of them are still on the original hard drives too, running an obsolete version of Linux or (shudder) SCO UNIX.

    I've always liked ASUS, but haven't played much with their server stuff.

    1. Mayhem

      NCR?

      Wincor Beetles are equally bombproof, but we're not exactly talking modern systems here.

      Our beetles ran a highly customised install of '98, and still used BNC networking as the shielding of the cables was useful to compensate for the environment they were in.

      You think getting DIMMs is hard? Try finding new ISA BNC network cards. We had to get the damn things custom made in batches.

  2. Tom7
    Gates Horns

    Oh.

    And from the title I thought this was about using TFS...

  3. Anonymous Coward
    FAIL

    Not always?

    Vendors can't always be trusted? Vendors can NEVER be trusted. I'm amazed you only realized this now.

    I'm a bit curious though about your Doomsday Weekend. Could you specify what you were trying to accomplish, what exactly failed and why? I assume you're not in a small environment: if I assume 8 DIMMs per server, you've got about 75 production servers running.

    I honestly fail to see the point of this entry. No details whatsoever, apart from the failure rate of server DIMMs (what server vendor?) and the benefits of using a 20-something year old keyboard.

    1. Trevor_Pott Gold badge

      @Mosquito

      It is the introduction; first part of a set. There will be additional articles with details as to what was attempted and what we succeeded/failed at.

    2. xj25vm

      Insufficient info

      I second that. This article has left me with more questions than answers. I too am a bit puzzled as to what the details of the 'hellish' weekend are. Very few hard facts or useful info.

  4. Gordan

    Misplaced Moaning

    Out of 600 DIMMs, one failing every 2 months is 6 DIMMs/year. 1% failure rate doesn't seem outrageous. With ECC memory, the chances are that you'll get some warning in time to sort things out before the machine suffers total failure. And all your production servers are redundant, of course, and if one fails you have others that will transparently take over its workload - right?

    If you are bemoaning a 1%/year failure rate on DIMMs, it makes me wonder what your failure rate on disks is. I know that my %/year failure rate of disks is several times that of memory modules.

    Re: fan failure rate - fans fail. Even decently branded ball bearing ones will get loose after a couple of years of 24/7 operation - cheap bundled fans much sooner than that. Having variable speed fans that slow down when the temperatures drop also helps to prolong their life. And you do have lm_sensors' sensord monitoring fan rpm rates and alerting you when they drop below a healthy minimum, so you can act on it before the machine overheats and crashes - right?
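
A minimal sketch of the kind of fan check described above, assuming a Linux host that exposes fan speeds (in RPM) through the kernel's hwmon sysfs files at `/sys/class/hwmon/hwmon*/fan*_input`. This is not sensord's own configuration; the 1,000 RPM threshold and the function names are illustrative assumptions.

```python
from pathlib import Path

MIN_HEALTHY_RPM = 1000  # assumed threshold; tune per fan model


def read_fan_rpms(root="/sys/class/hwmon"):
    """Return {sysfs path: rpm} for every fan input found under root."""
    readings = {}
    for sensor in Path(root).glob("hwmon*/fan*_input"):
        try:
            readings[str(sensor)] = int(sensor.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor vanished or returned garbage; skip it
    return readings


def slow_fans(readings, min_rpm=MIN_HEALTHY_RPM):
    """Return the sorted names of fans spinning below min_rpm."""
    return sorted(name for name, rpm in readings.items() if rpm < min_rpm)


if __name__ == "__main__":
    for fan in slow_fans(read_fan_rpms()):
        print(f"WARNING: {fan} below {MIN_HEALTHY_RPM} RPM")
```

Run from cron or a monitoring agent, a check like this flags a fan that is slowing down before it stops outright. A stalled fan reads 0 RPM and is flagged as well.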

    And LaCie NAS, you say? That well known enterprise brand?

    Save the moaning - it sounds like the energy would be better spent on better forward planning, redundancy and monitoring. If those aren't covered, you only have yourself to blame.

    1. Trevor_Pott Gold badge

      @Gordan

      I can RAID drives. DIMMs, not so much.

      As to redundancy...difficult to do on an SME budget, but I'd like to think we don't do too badly. I've learned the hard way some failure points, while others I was fortunate enough to see coming. The LaCie isn't something I would /ever/ use as a primary device. It was hoped however that the thing would serve as a temporary storage point during the move. Not so much, apparently.

      As to the fan issue: good fans will give me three years. The fans that don't give me a single year irk me greatly. We can't afford IT bodies in all sites, so we are reliant on yearly preventative maintenance. Fans that refuse to give me even that year...I'll not be buying from that company again.

      Any decent sysadmin plans for redundancy within the budgetary parameters given them. There is that point however where you do have to trust that some of your components will stay up some of the time; a minimum service life if you will. If only because the primaries have to be operational whilst you are upgrading the backups.

  5. anonoomouse
    Flame

    Can there be a sulphurous nook

    where the person(s) from Western Digital that did the calculations for their own built in obsolescence has bad stuff happen to them for the remainder of eternity?

  6. Pavlov's obedient mutt
    Thumb Down

    edited

    This article reads like it underwent some serious legal editing

    It started out so well, and then fizzled into a dreary whine.

    Upshot - vendors can't be trusted.

    newsflash! Foxes can't be given keys to the hen house!

    1. Trevor_Pott Gold badge

      @Pavlov's obedient mutt

      I can't speak to the "legal" part of the editing...but it definitely reads significantly different than written. My editor tries to compress my articles into as small a space as possible; I fear I have not yet learned the art of telling my tale in few enough words. ("Be ruthless, you know it makes sense!") Overall I think he's an excellent editor. I like to think that any improvement at all in my writing style is due to his hard work and the kind advice of several folks at El Reg.

      I was hoping that this article would serve as an introduction to a short series of articles based on a particularly hellacious network migration I recently underwent. Unfortunately, that introduction contained a lot of lead-in and ended up being about three hundred words over limit. I can’t really comment on the resultant article; it would be the ultimate height of arrogance to comment on one’s editor’s style. Frankly, I’m new enough not to have much experience in what “sells,” other than I write (and talk) way too much.

      The ultimate measure though is whether the readers here like my articles or not. While for all other articles I’ve merely been posting links to them on my personal blog, I am allowed to post the full unedited articles there after a few weeks have passed since their publication on El Reg. It might be an interesting experiment to post the “Doomsday Weekend” articles in full and have some folk read the edited and non-edited versions.

      That, and my editor has kindly offered to take me through editing one of my articles step by step. With luck, from both I will learn what people like and don’t like. Stylewise, my articles (when they reach my editor) read no different than my comments. I have a lot of learning to do before I can pare my writing down enough to produce articles that don’t need editing at all…but I’m learning!

  7. Peter2 Silver badge

    you put your model M through the dishwasher?!

    And there was me thinking I was the only person still using a Model M from a 286 bought *ahem* a couple of years ago. I don't put mine through the dishwasher though. I know it's meant to be safe to do, but it still seems a bit dodgy for a bit of electronic equipment. It's the sort of thing users do before logging a call saying that it "isn't working".

    Speaking of gripes; mine is VOIP systems. Why the hell can Cisco not produce a system with the functionality AND reliability of a 20 year old PABX? Every VOIP installation I've seen or heard of has users used to more downtime a week than I expect in a year.

    1. Trevor_Pott Gold badge

      @Peter2

      I have three Model Ms. Every one of them gets a good dishwashering (NO SOAP!) followed by a three day air-dry. Take the caps off the keys first, or you'll spend the next few days picking them out of the dishwasher.

      Works like a charm. There was a rumour about the nets that this was the recommended way of cleaning the things in their original manual. I thought "what the hell, worth a try" and haven't looked back since. More than two decades of this with my first one, and well over a decade with the other two. No problems so far. Of course, the water in my city is very soft...I have no idea what a dishwasher with a lot of minerals in the water would do.

    2. John Smith 19 Gold badge
      Stop

      @Peter2

      "Speaking of gripes; mine is VOIP systems."

      I used a small spur site (<8 lines) to a main site using Cisco kit.

      Phones rebooted about once a week.

      "Why the hell can Cisco not produce a system with the functionality AND reliability of a 20 year old PABX? Every VOIP installation I've seen or heard of has users used to more downtime a week than I expect in a year."

      *Very* simply put I'd say it's because they are used to developing embedded software like PC software (although presumably a lot of it is sitting on top of some form of Linux).

      PBX mfg don't think a reboot is *ever* the right answer to a malfunctioning call.

      And now you know why MS's attempted domination of the PBX market fell rather flat. Try telling telecoms managers with 100s to 1000s of lines "With our latest Windows whatever you'll only have to re-boot on average once a week"

      VoIP (even from big name mfg). Not quite fully cooked IMHO.

    3. xj25vm

      Voip downtime

      Can't comment directly on the Cisco voip stuff. I use Asterisk - and indeed, I had my share of problems with it. However, once all of them are ironed out - it tends to be pretty reliable. It is software under continuous development - with new things added all the time - lots of them very useful. I tend to think of the occasional problems as the price to pay for the amazing amount of flexibility and great number of features it provides. And it is free :-)

  8. Peter Gathercole Silver badge

    Model M and failure rates

    You really put your model M in the dishwasher. Wow. I religiously strip all the keycaps off and wash them, and then apply a stiff brush and wet-wipes to the rest. Your solution sounds much quicker. How long do you leave it to dry?

    When it comes to large numbers of similar devices, you need to look at the MTBF figures. The more of a particular device you have, the more frequently you will see one fail. I would have to look up the exact maths, but I don't think it's a simple ratio. Where I am, we have over three thousand 300GB disks, and we lose a couple every month. This does not cause a problem, because they are in a large number of separate raid arrays with two hot spares per 10 disk array (=12 disks total). We could still be operating with three disks down in an array.
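
On the maths: under the common simplifying assumption of a constant failure rate (exponentially distributed lifetimes), the expected number of failures in a fleet is a simple ratio, N × t / MTBF, while the chance of seeing at least one failure is 1 − exp(−N × t / MTBF), which is not linear. A rough sketch using the figures above (3,000 disks, about two failures a month); the 30-day month and the function names are assumptions for illustration only.

```python
import math

HOURS_PER_MONTH = 30 * 24  # assumed 30-day month


def implied_mtbf_hours(devices, failures, hours):
    """MTBF implied by observing `failures` across `devices` over `hours`."""
    return devices * hours / failures


def expected_failures(devices, mtbf_hours, hours):
    """Expected failure count; linear in fleet size and elapsed time."""
    return devices * hours / mtbf_hours


def prob_at_least_one(devices, mtbf_hours, hours):
    """Chance of one or more failures, assuming a constant failure rate."""
    return 1.0 - math.exp(-devices * hours / mtbf_hours)


if __name__ == "__main__":
    mtbf = implied_mtbf_hours(3000, 2, HOURS_PER_MONTH)
    print(f"implied MTBF: {mtbf:,.0f} hours")
    p = prob_at_least_one(12, mtbf, HOURS_PER_MONTH)
    print(f"P(at least one failure in a 12-disk array per month): {p:.4f}")
```

Real drives do not fail at a constant rate (early-life and wear-out failures cluster), so treat the output as a planning estimate rather than a prediction.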

    Memory, on the whole, seems reasonably reliable, but we have multi-bit parity on the systems, together with bit-steering (the joys of Power6 systems). This means that it is not the built-to-a-budget memory that most people put in their Wintel servers. That price premium must really buy you something.

    1. Anonymous Coward
      Anonymous Coward

      Model M dishwashing.

      My understanding is that the model M can be dishwashed, but doing so does lead to contacts getting wet and I've heard that it can lead to those contacts going rusty.

      I wouldn't want to do it without disassembling it to dry bits off, which defeats the object. Otherwise you'll probably be knocking a fair bit off the expected in service lifetime.

      It might only last 50 years instead of a hundred. :) (well hey, mine's at ~25 years at the moment) I have visions of still using my indestructible model M when everybody else is using some advanced holographic keyboard along the lines of a VKB in 2100. Assuming I live that long.

    2. Anonymous Coward
      Anonymous Coward

      P-Series Reliability

      I would have shared your sentiment regarding premium hardware up until the last few weeks. We've had a bad run of late with our P-series boxes. In a way they remind me of a Wintel server vendor I ran across years back - they had triple redundant power supplies, triple redundant fans. Wintel servers might crash more often, but P-Series crash more spectacularly due to the critical loads we tend to pile onto them.

      All the virtualization and premium hardware in the world will not replace physical redundancy as a core architectural principle for "critical" systems (which are not properly identified in most environments - if everything is critical, nothing is). I had one client (you know their name but that's all I'll say) that actually ran "redundant" application clusters on different partitions on the same physical box. I had another that put an entire n-way redundant farm on the same SAN. I had another that back-ends *every* server in their environment with SAN - anyone care to guess what happened when someone accidentally bridged the A and B fabrics?

  9. Anonymous Coward
    WTF?

    DIMMs failing every 2 months??

    Wow. I'm looking after ~700 servers with several thousand DIMMs and have seen only 1 die in 3 years, and even then it's a minor issue as the server disabled it and we will replace it when we have a maintenance window.

  10. Anonymous Coward
    FAIL

    Is that it?

    LaCie, ASUS fans and don't trust vendors.

    I was just set to click on the next page to find out *what actually happened* --- and there wasn't one.

    By the way: wetware?

    Oh never mind....

  11. The Cube

    DIMM failure

    Regarding your DIMM failure rates there are a few things to consider.

    First, if you have one failure every two months out of 600 in service your (very rough due to small sample size) MTBF is in the region of (600*30*24*2) ~ 800,000 hours. That would be OK for a hard disk (if only they still actually made million hour MTBF hard disks and not those offensive lumps of crap that get sold now).
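
Spelled out, the estimate above is device-hours in service divided by failures observed. The 30-day month comes from the comment's own formula; the 8,760-hour year is an added assumption, used only to convert the result into a yearly rate.

```latex
\[
\text{MTBF} \approx \frac{600 \times (2 \times 30 \times 24\ \text{h})}{1\ \text{failure}}
            = 864{,}000\ \text{h},
\qquad
\text{annual failure rate} \approx \frac{8{,}760\ \text{h}}{864{,}000\ \text{h}} \approx 1.0\%
\]
```

That is the same figure as six failures a year out of 600 DIMMs, the roughly 1% rate quoted earlier in the thread.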

    I very much doubt that dodgy power is doing anything to your DIMMs, they are behind both the server PSU and an onboard voltage regulator, any noise that gets through that will cause bigger problems than failing DIMMs and you would be losing server PSUs at the sort of rate the Vatican has to blame homosexuals.

    From experience what you are probably seeing is that the memory in your servers is not well matched. Modern memory is very, very sensitive to timing, and that timing will change slowly over time, generally on an exponential curve. When your memory is matched into sets (first for the DIMMs, and then sets of DIMMs for the server), not only does the timing need to match when they are grouped up, but the decay rate needs to be matched too. If they have come off a single, well controlled line and are of the same age then this will be the case. If not then your "matched server memory" will fairly quickly un-match itself and you will end up with what appears to be a failed DIMM; when this goes back to the vendor they will not be able to find any fault with it.

    Try buying some decent quality memory for some of the servers and see if that fails too, if it does then maybe you have some other problem. I don't know whose servers you are buying and don't expect the lawyers to let you say but if you have fallen for the whole "high density" gag then you may well be toasting the innards of your 1U servers so that you can have most of the rack empty... (oh and no, high density is not high efficiency, it is quite the reverse)

  12. Slay
    FAIL

    I agree with Mosquito

    There was no point to this entry. Not without some decent examples.

    What you basically said was, "Manufacturers cannot be trusted. They don't make stuff like they used to. I bet you agree with me, right?"

    Unless you are a sysadmin who graduated from college this year, you already know this. It's not even a sysadmin thing, it's a consumer thing.

    It's like you wanted to write this nice detailed article, citing exact problems, which could help your readership, but some mate called you up for a beer, and you decided to not bother. Actually, I am with you on that one.

  13. xj25vm

    Memory reliability

    Can't talk about high-end type setups - or even servers in large numbers. But at desktop and small x86 server level, in all these years, problems with faulty memory have been minimal. Nowhere near the number of faulty hard-disks or power supplies. Or motherboards. On the whole, I would say memory tends to be a pretty reliable piece of hardware.

  14. FozzyBear
    Thumb Down

    @ Pavlov's obedient mutt

    Yep that's what I got from this article. Not a whole lot except a longwinded monologue on a model M keyboard and the best way to clean it. This could have been thrown under the “this old box” section rather than a sysadmin blog.

    Anyone that has ever dealt with vendors knows you can’t trust them. The only thing that differs between the vendors is the amount of BS they serve up to you in trying to sell their crap.

  15. Goat Jam
    WTF?

    WTF?

    What was the point of this article again? I'm sure there was one but I must have missed it.

    Just when I thought you were past the whining part and into the details you were winding it up.

  16. dreamingspire

    Keyboard memories

    In the 80s I used to make rugged custom keyboards for financial dealing rooms. Never tried one in a dishwasher, but the rule was: if you pour coffee or coke in it, own up immediately and holler for the on-site engineer. Disconnect it ASAP (to stop corrosion on the PCB from electrolysis), tip it on edge to drain for a minute (it had drain holes), then swill it under the tap, drain again, leave over an aircon outlet for 24 hours to dry.

  17. This post has been deleted by its author

  18. alphawave
    Thumb Down

    Oops

    So, cheap hardware (of all kinds, it appears), big bang upgrade, no proper testing, no planning, trusting vendors -

    I'm impressed you are willing to fess up in public and hope you post under an alias (most recruiters can do a google search...)

    1. Trevor_Pott Gold badge

      @alphawave

      I use no alias. I make mistakes, same as any other human being. I will fess up in public in the hopes that others can learn from my mistakes. The fact that I have made these mistakes is called “experience.” You screw up in some way once and then tend not to do so again. For the record though: I don’t happen to work in one of those lovely environments with unlimited budgets, massive amounts of free time and adequate manpower for every task. Quite the opposite: I am regularly tasked with doing the impossible on a shoestring budget. Sometimes I make it…many times I don’t.

      It certainly doesn’t make for bragging rights. I don’t get to stand up in front of all the commenters on El Reg and proclaim “here’s how I pulled off the perfect everything with no mistakes whatsoever.” What it does is keep me humble. With luck, it will help a junior sysadmin or two avoid the same errors that I have made. Additionally it gives some of the commenters here on El Reg a reason to feel superior: many of you have avoided the mistakes I have made. Some have avoided these issues due to superior foresight, some due to superior experience and many due to superior availability of resources.

      Whatever the case…I don’t try to hide my faults, or my mistakes. I would rather honestly make a mistake (and own up to the consequences) than bury the truth and be hired/respected/whatever on false pretences.

      My blogs are my experiences in IT. The good, the bad, the hideously ugly. If it means that an IT recruiter looks at my articles and believes that I am completely incompetent then so be it. That is a consequence of my choice to try to pursue writing. I have often been told “write what you know.” Much of my life has been devoted to IT. Combine that with a personal philosophy of never sugar coating anything and you get El Reg’s Sysadmin blog.

      Warts and all.

  19. ehin

    Dodgy DIMMS

    Perhaps Static Discharge levels and conversely, equipment grounding issues should be suspect -- your failure rates are quite high ....
