3,500 servers go down – so my FIRST AID training kicks in

Welcome again to On-Call, our getting-slightly-more-regular look at Reg readers' professional escapades at odd times of the day or night, usually in odd places. This week's tale comes from Pat Dufresne, who tells us he spent two years on the night shift at a hosting company in Montreal, Canada, and rates the gig “the best two …

  1. frank ly

    Sounds like 'fun'

    "... when the power was re-established, a lot of hard drives died. "

    Had they died because of the previous sudden loss of power while they were doing something? Were they all old? etc.

    1. Tom 7

      Re: Sounds like 'fun'

      A lot of disks that have been running continuously for ages will refuse to start up after a halt. I have spent many happy hours going along sliding disks out while still connected to the power and giving them a rapid rock backwards and forwards in the direction of spin to get them going again.

      1. Danny 14

        Re: Sounds like 'fun'

        yup, same here, a mild percussive maintenance will get some rotating too. Problem is when they decide to tell the controller they have died and you need to override the controller to get the array back.

    2. OMGROFLSKATES
      Angel

      Re: Sounds like 'fun'

      Hi all

      Pat here

      to clarify a few points

      - the DC that died, has 3 power zones, one of the 3 zones didn't switch, cue massive server death

      - the DC in question has about 10K servers, so we lost about 10% to hard drive failures

      - the senior that left as I walked in was doing more of a transfer of shift than anything else. Having 2 seniors on shift may have helped, but what I needed were people to take the tickets and link them to the master outage ticket so the DC could do their work. Also, to be honest, I didn't need a jittery and otherwise nervous and tired CSR senior working alongside already stressed, overwhelmed and tired front line agents. Calmness and cool heads prevail; we can't make rash, disconnected decisions that lead to further mistakes. In a situation like this, having one head instead of 2, like a chain of command in the army, is the preferred way of dealing with a crisis while keeping the communications open, efficient and flowing in a straight line, up and down the chain.

      - Management was contacted right away and they were more focused on getting the DC back online. What I omitted (since this is a customer contact center-centric story, as that's where I worked) was that at the DC, there was a small army of techs and management working the floor to get things back online. The customer contact center was in an entirely different part of the city. Since the phones were being answered, management understandably focused on where the real work was needed. Long live front line!

      - Regarding my/the social scene of an overnight shift. Montreal has a very vibrant Rave scene. EDM for the rest of you. If you're not a raver, then sadly, night shift kills your social life. If you are into the EDM/Rave scene, all night dance parties are the way to go. I've met so many people at these parties and made great friends. It's better than leaving work at pub-o'clock, leaving for the club at 21h, and returning home at close, which is around 3AM for most. All told, this makes for almost a 20h day once you come home from the party/club/pub. Whereas if your party life follows night shift hours, you wake up at the same time you normally do and go to bed when you normally do, without having to have a massive sleep debt. I found I had less stress and I was getting 10h of sleep on a regular basis. For "day walkers" (read: everyone else in a 9-5) this is something people cannot understand because to them a "social life" ends at 3AM, whereas mine only starts around then!

      - finally, no mental scars from the event. Other than fatigue, I genuinely enjoyed the experience. I have a strong leadership drive that kicks into autopilot when the poo hits the fan. Plus with first aid training, you're given a heavy dose on how to deal with stressful and traumatic experiences. Anyone here who knows an EMT/Ambulance Tech can attest to the training they get.

  2. Tezfair
    Thumb Up

    feel the pain

    Now granted, I deal with 'small business' servers, but many a time I have been sat on the floor in front of a dead server at O'mygod O'clock wishing I was working as a checkout operative at Aldi (pay better than Tescos apparently).

    1. Little Mouse

      Re: feel the pain

      Agreed - In my book these incidents only tend to qualify as “by far the most fun I've had working as a sysadmin” once the resulting mental scars have healed.

  3. Joe Drunk

    This week's tale comes from Pat Dufresne, who tells us he spent two years on the night shift at a hosting company in Montreal, Canada, and rates the gig “the best two years I've spent working in IT”.

    Conversely, the years I spent on-call were my worst years in IT. The ten percent pay differential was irrelevant. It interfered with my social life to the point where it was practically non-existent. The irregular sleep cycles had an adverse effect on my concentration, which consequently affected my job performance. I went from drinking coffee because I enjoyed the taste to becoming a hard core caffeine addict.

    things seldom went awry..

    So that's the secret. He didn't work in a dynamic, fast-paced matrix environment. In my experience, major things seldom went awry. It was the minor things perceived as major things that led to the knee-jerk phone calls at odd hours followed by escalations and conference calls to determine on whom to place blame for making that red alarm go off in the NOCC. You come to work Monday morning like a zombie due to tossing and turning all night, stomach in a knot fearing that you or some member of your team will be sacked over this. Turns out that red alarm was for some service that was discontinued and naturally, the only one who knew anything about it was not available to answer that question until now. He forgot to mention that service during all our change management reviews, since, naturally, it was discontinued. Shame on myself and my team for not doing our due diligence. Unfortunately this is a true story and a scenario that would play itself out on many occasions during my time as on-call and one of the main reasons why I loathe IT as a career.

    1. The Axe

      Career in IT

      And you think things will be different with a career in a sector other than IT? Management are mostly bollocks in all walks of life.

      1. Joe Drunk

        Re: Career in IT

        Oh I know other career paths suck too. I have friends in medical, legal and education. I have heard their horror stories; the only ones that sucked worse than mine or my colleagues' came from the EMT (Emergency Medical Technician) guys.

        Management is bollocks, I can deal with that, you have to no matter what career. It's having to deal with it 24/7, 365 days a year that's unbearable. Right now I only have to deal with it 40 hours a week. It's like heaven compared to on-call. Social life again. Mmm regular sleep schedule interrupted only by social life, as it should be. Quality of life high again.

        1. Danny 14

          Re: Career in IT

          the best work I had was in a failing company where everything was belt and braces. Nothing was thrown away, everything was repaired. Money was exceedingly tight, and I gained lots of experience fixing, bodging, repairing, getting by on the bare minimum. Stressful, yes, but a job like that sets you up with a mindset for solving problems!

    2. OMGROFLSKATES

      Hi there

      Pat here

      @ Joe Drunk:

      I guess your sleep habits aren't the same as mine. I can sympathize with the effect it had on your body and mind. Not everyone is built the same.

      I hate to tell you this regarding your dynamic, fast-paced environment. I work 24/7/365 and the reason things seldom went awry is simple: I'm damn good at my job, I hate it when things go wrong and I hate surprises. So if your experience has been one of high stress, blame and finger pointing, it is unfortunately something I cannot relate to, as my experience in the same "dynamic and fast paced environment" has been one where I fix problems and people come to me because I'm the last line of defense, and if I can't fix it, it cannot be fixed. Like that day the Dutch Consulate website (which was hosted on a machine at the hosting company I worked for in this story) went down because a borked Parallels update hit the server so badly that all that was left after the kernel panic was a blinking white cursor of death in the upper left corner after POST, because the RAID that WAS the system drive no longer existed. Cue 8-hour rebuild of the RAID metadata, RAID, EXT3 FS and Parallels container (read OpenVZ) object. My boss was so impressed, his only comment was "if I had gotten that server, I would have sent it to be reinstalled and restored from backup". Not one iota of lost data, Dutch consulate website online and functional.

  4. Frank N. Stein

    In need of some fun? Try doing first level phone support for a "greeting card company" who hires inexperienced/non technical/untrained senior citizens who've never used a mobile device, to operate an Android mobile device as part of their day job. That's right. They just hand them an Android mobile device with the training manual on the home screen (which they don't read) and say- "Go". And of course, their inept Supervisors who don't train them tell them to call the Tech Support Line, if they have any questions. Thing is, device use training is not part of the tech support contract they signed up for (We were only to provide them with technical support, not user training). There are few things in life that are more fun than having angry/bitter/untrained/non-technical senior citizens calling your tech support line all day long when they aren't having any technical problems with their device. No error message. No typical Android issues. "How do I..." or "Can you tell me how to...". No, I cannot. If you aren't having an error message or a technical problem, I must refer you to your manager for training.... LOL!!

  5. Tom 7

    It's worth mentioning the party scene stuff to youngsters.

    Having spent a summer working nights I can attest to the power of night shift tolerance on party party party self destruction.

    Sunday nights were a bit twitchy, but it was a fantastic summer!

  6. ken jay

    This is exactly what you look for, not only in a sysadmin but in any job where you prepare for the worst. It's brilliant for testing out your disaster recovery techniques. But mainly, stay calm: there's nothing you can do to speed up a fix.

    A lot of sysadmins only know failure because that's how we learnt to do what we do.

    1. tfewster
      Alert

      Crisis? What crisis?

      In a crisis situation, I'll do whatever needs to be done - whether it's taking the lead to prioritise efforts, making a brew to keep everyone else going or fielding phone calls to keep manglement / customers off the techs' backs. Even if it means neglecting my own work "for the greater good".

      Of course, the first few calls are to wake management up :-) If they don't come in, they can't complain about how we handled the crisis. And if they do come in, they can always help make the tea!

      1. Anonymous Coward
        Pint

        Re: Crisis? What crisis?

        There are two ways to handle a crisis, and which one we used really depended on whether there was a particular individual who could take the lead, in which case we gathered around to receive direction, or no one to lead on it, in which case we divided the problem and handled our own pieces committee-like (but more intelligent). Frequently I'd be doing nothing more than making coffee and referring to the various engineering manuals (my forte), trying to puzzle out what the symptoms were telling me about what failed. Those and the trivial: keeping management up to date, getting any repair parts immediately, and especially handling all the phones.

        I happen to love crises! They were always something new since we already had recovery procedures for anything we'd done in the past. [I'd computerized that right after I arrived on station for quick search.] Hell, sometimes they'd ask if I'd come in and help out and I did. Fun, if you've got a really odd definition of fun.

        Icon 'cause one of those times they called me in, I'd just polished off the last beer of a twelve pack (MGD, 3 hours) and I went. I was told later (I have no recollection of what I did) that "I was the best technician they'd ever seen even three sheets to the wind." The problem was a failed radar antenna, something hardly ever seen in this gear. Nice to hear I can party at work ;-).

      2. Robert Helpmann??
        Childcatcher

        Re: Crisis? What crisis?

        Of course, the first few calls are to wake management up...

        Yeah, I noticed that was left out, too. Along with the bit of the previous shift manager saying jack before heading for the hills, that got my attention. If a 24/7 shop is going to function reasonably well, there need to be open lines of communication between shifts. In places I have worked where that was not the case, we functioned poorly as a whole. In places where we had good turnover procedures, things have run much more smoothly.

        Also, management should, whenever possible, work the same hours as the rest of staff. It is very hard to know what is going on with your staff in many cases if you aren't there with them at least some of the time. Doing this also helps to prevent them from being relegated to the status of second or third class employee.

        1. Danny 14

          Re: Crisis? What crisis?

          I read a similar thing. Previous shift guy should have been sacked. I assume it was for comic effect as in a situation where you have just lost a third of your 10000 servers there would be a pull of staff from wherever to get it running quickly as you KNOW there will be a lot of nasty hitting customers affected.

  7. Anonymous Coward
    Anonymous Coward

    Late shift running off?...

    I'm sure the story has been edited, but where I work, if I arrive on shift and the shit is hitting the fans, rarely do the 'finishing' staff run away asap - and I would not rate any of us as particularly loyal to the company or scared of being fired or similar?..

    Interesting to read what others encounter though.

    Anon for various reasons.

    1. John Brown (no body) Silver badge
      Flame

      Re: Late shift running off?...

      "the shit is hitting the fans, rarely do the 'finishing' staff run away asap"

      Yes, that surprised me in the article. I'd expect the contract to include "working through" when there's an emergency, not pissing off home on the clock. If that is genuinely what happened rather than the story being badly edited, I'd expect most organisations to have sacked that guy.

      1. Anonymous Coward
        Anonymous Coward

        Re: Late shift running off?...

        The senior guy is the one you'd think needs to stay, especially since it sounds like all the juniors stayed. Things like this are "all hands on deck". I hope when he cowardly ran off like that he was given the cowardly treatment he deserved and was fired by text message!

      2. I Am Spartacus
        FAIL

        Re: Late shift running off?...

        Oh, I don't know. I work in a trading company. One of our key decision making tools, that lets the trader know what is going to happen, was down. At 11:30 AM the lead DBA decided to go to the gym. When he got back an hour later, we were still struggling with the database. At 4:30pm, with the system still not running he decides to reboot the database server and go home.

        I went off like a rocket when I found out. Unfortunately I was not given permission to apply the firing, or even severe boll***ing he deserved. Got to love management.

  8. Number6

    I remember the IT department in California installing an upgrade at midnight their time and then going home, only to be hauled out of bed by an irate CEO because the UK office, which was just starting its day, had lost all connectivity. When you're a multinational, there is no quiet time in which to install upgrades and it's therefore probably better to do them at the start of your shift (unless the overtime rates are good) so you've got maximum time to shovel the shit when it goes wrong.

  9. CAPS LOCK

    1000 out of 3500?

    Really?

    1. Danny 14

      Re: 1000 out of 3500?

      If the generator power doesn't switch then you are at the mercy of UPS priorities. Which ones can shut down nicely and which ones won't. Which SANs will spin up again afterwards and which drives decide to stick. Nothing beats 1000 machines loading your switches (with that failing fan that was scheduled to be replaced in the next window) simultaneously to just overheat enough to cut out.

      Then of course your nodes decide to run updates because they have just powercycled.

      1. I Am Spartacus
        FAIL

        Re: 1000 out of 3500?

        He was lucky. In the 87 Hurricane I was stuck at home, but still had a phone working. I called the office, and yes, they had lost power. I talked to the data centre manager and told him, very clearly, to flick the power switches on the disk packs to off. Then, when the power came back, switch them on one-by-one.

        He ignored me. When the power came back on, 40+ disk packs all tried to power up at once. The mains switch gear objected to the sudden surge in current demand and decided to leap two foot from the wall with what I am told is the loudest bang anyone could remember.

        They were out for two months before they could get a new set of switch gear.

  10. Anonymous Coward
    Anonymous Coward

    Being picky....

    ...but surely this was about night shift working, not on-call work...

    Just saying.

  11. rhydian

    The 'fun' of disaster management

    I can understand where Pat is coming from regarding the "fun" of handling a major manure-fan interface.

    For me, it's mainly the challenge of getting whatever has gone bang back up ASAP, with the added bonus of being able to "bend" rules and procedures that you know add nothing and simply cause delay. It's the challenge of focusing on the problem and finding a workable (as opposed to the "right") solution fast that gives job satisfaction for me.

    As for management, yes they should be kept informed, but it's usually during such events as these that a "less is more" attitude to management input works best. Especially if you have non-technical management involved.
