"I still recommend those guys today."
So their investment in goodwill has more than paid off then.
Nice when that happens.
Jim Thompson got in touch with The Register about the mother of all On-Call stories, recalling the time he received a message asking him to come back to New Orleans because a storm called “Hurricane Katrina” was on its way and looked bad. It's nearly 10 years since Katrina raged, so Jim kindly retrieved his jottings on the …
Make sure you have a plan to communicate with people. During a serious regional disaster you will not be able to call anyone with a phone in the affected area code.
Can't agree more with this, although some might think I take this to extremes. (Hey, it's good exercise and it's fun hearing people's responses from 1000km away, "Did you say bicycle mobile? Wow!") In short, it has to be a BAD disaster for me to not be contactable by someone.
In short, don't assume the usual telecommunications services will be operating normally. They may be up, or they may be heavily overloaded. If they are working, do everyone a favour and keep your calls short. Text messaging might be worth using instead, since it doesn't need real-time network delivery. It may also be worth investing in a few CB sets; they may not reach many kilometres, but it beats yelling!
Two years after Katrina, I found myself on an extended (it ended up being a total of six months) business trip to Dallas. It was actually a very nice assignment - I really don't mind the summer extremes there, and you can get some pleasant weather in the winter. Anyway, guns are a serious business in Texas. The only rule at the office seemed to be that concealed firearms must be declared on arrival. But none of the pickup trucks in the car park had gun racks. All the guides clearly stated I should see pickup trucks with gun racks in Texas! I was actually a bit disappointed, and asked about this. I was told a lot of bad elements had come in amongst the Katrina refugees and, after a few nasty incidents, the locals voluntarily stopped putting their guns on show.
Per the Henry mentioned in the article, I was told various stories of law-abiding citizens having firearms confiscated around New Orleans. In an armed society, this meant the only people carrying were the bloody criminals.
My assignment in Dallas? Testing a massive, largely automated DR solution for one of my customers back in the UK.
Watch this documentary: Assaulted: Civil Rights Under Fire (http://www.imdb.com/title/tt2520398/), produced by Ice-T.
It has a section on Katrina that is very scary. Only the criminals had guns when the authorities were done.
The rest of it is very interesting.
This is why all my tech guys have laptops, mobiles with modem tethering, and remote access from the laptop, with Citrix access from any PC to remote support VMs as backup. In addition we run dual data centres with key infrastructure duplicated and replicated between sites. Multiple ISPs and telcos are used for network resilience. Huge UPS with massive generator backup to survive power outages. Every single bit is backed up to tape cross-site.
So the only concern is MTTR (mean time to recovery). We have so much data and too few spare servers to put it on, mitigated by the fact that we could use VMs. It would be slower than I'd like, but we'd not lose any data and would be able to get every service back in time.
"That is fine but you need to find a consensual female.
My approach is duplication, redundancy and documentation stating what to do in each scenario."
Shirley (for it is she who consented) would prefer spontaneity rather than a documented checklist of what to do. Duplication might be ok, but I'm not so sure redundancy is required. What's that? You took a banana AND a cucumber?
Last year, I was involved with a disaster; not on the IT side though. Some of my thoughts.
If you think that the authorities have planned for BC / DR, think again. Most of them will have had numerous meetings and discussions, but none of this will be in the slightest bit relevant when it all goes belly up. Most of their staff will not have read any plans and will mostly stay in offices hiding; those that go out to meet the public will stand around looking bewildered.
This includes the police, who will possibly have some junior staff standing around being visible to "prevent civil unrest / looting" but actually achieving very little. They will probably have a silver & gold command structure, which will prove to be as useful as the proverbial chocolate teapot. They won't talk to fire or ambulance service, so expect fire engines and ambulances to end up in the wrong places.
The army do sometimes get called out, but they are very few in number these days. Being cynical, I might suggest that they'll be around whilst TV cameras are there, so that the general public can be assured that "something is being done"; but as soon as the cameras get switched off, the army will be on their way back to barracks.
Communication seems to be a bad word; none of the organisations that should be talking to one another or to the public will actually have a clue how to get information out. If any government agencies are involved, they will actually make matters worse by passing incorrect / out of date / irrelevant / misleading information to the media.
National news media will cover the story as long as there are potentially "serious consequences". After that, you're yesterday's news. Local media tend to do a better job of getting their facts right and of keeping on top of the story.
In the end, more will get done by volunteers, or by individuals helping themselves out.
Afterwards, all of the relevant public bodies will tell everyone what a great job they did; and senior people will receive awards for their contributions. But at the same time, they will all bemoan the fact that they don't get enough funding and will spend more time in the press telling everyone of this than they did during the disaster in providing information that was relevant to the people affected.
"They won't talk to fire or ambulance service, so expect fire engines and ambulances to end up in the wrong places."
Considering that and the type of (mis)communications, poor communications, ambiguous communications and outright wrong communications that travel (or not!) around our company, it terrifies me what sort of DR plans they may (or don't) have. And we don't have anywhere near the level of committees and self-important people that local authorities are encumbered with.
The primary concern for government officials is human safety. Following that is safety of property. Business Continuity comes in a distant third. And, given the funding levels that most government agencies have to work with, it's rather amazing that they do as well as they do. Further adding to the problem is that a lot of the citizens think that, in any disaster, the government will step in and fix everything, thus requiring them to do no preparation for any type of disaster at all. Unfortunately, that's usually the farthest thing from the truth.
As for my background, not only have I worked for 32+ years as a technical professional (mostly in IT), but I also spent 8+ years as a volunteer on a local governmental emergency response team. The first disaster I worked involved a semi-truck spilling 5000 gallons of an unknown chemical into a local creek, which supplied water for a large number of towns and cities. We had over 21 different agencies responding to that event, everyone from county constables up through the federal EPA. And it seemed that every agency had their own idea of how to handle the situation. Heck, some of them became belligerent, and there was some danger of fist-fights breaking out between various agency officials. Fortunately, cooler heads prevailed, and we were able to get the various agency heads to sit down and come up with a unified response plan. (Hint: It helps to ask "Who's going to pay for this?" It also helps to know the various statutes regarding who is in charge of a scene, and the statute that allows for unhelpful people to be detained or arrested.)
There was also a lot of misinformation that flowed from this event, despite the best efforts of the public information officer (PIO) to release the true facts, at least as best as were known at the various times. Compounding this were problems with various reporters penetrating the site to get their own stories, and, in the process, becoming contaminated with the unknown chemical, and spreading it around, thus making the cleanup effort worse.
Our main concern, during this event, was the safety of people who derived their water supply from the stream in question. We really didn't have the resources to devote to business continuity efforts until well after the event (e.g., many days later). We did attempt to notify local farmers of the problem, so that their livestock herds weren't affected.
We did have communication problems, since the various agencies did not have a common radio frequency which could be used to communicate between them. This was over 20 years before Katrina, so it has been an ongoing problem. Steps were supposedly taken following Katrina to alleviate this problem, but these solutions require money, and, while there have been some constructive efforts, the emphasis on such communications coordination seems to be waning again.
So, the net result is that the governments will do what they can to assist following a disaster, but, for the most part, businesses, as well as the citizens, will be on their own for quite some time. Thus, it makes sense for everyone, both businesses as well as citizens, to adequately prepare for disasters. That means developing a plan, taking actions before the event, and then testing the preparations periodically. Such actions are not cheap, but they are a lot cheaper than not performing them.
But I just can't extend sympathy to the guy; sure, heroics were involved in recovering from the disaster of Katrina.
But somehow, no.
Because I was in New Orleans when Ivan came barrelling through, unable to get out as every last aircraft seat had been booked, and the best thing I could do was hunker down with friends, share the drinks and hope the storm moved east of its predicted track (It did).
But Ivan should have been a huge wake up call for everyone who lived and worked in Southern Louisiana.
Instead we had the long week of phoning what numbers worked and saying "have you seen xxxx?" when Katrina struck, while the disaster played out on our TV screens.
And yes, my friends were ok, although their area did take a bit of a pounding.
No sympathy needed. Ivan WAS a huge wake-up call for me and if Ivan had gone the way Katrina did then I doubt that the company would have survived. As it turned out, while I wasn't able to do anywhere near as much as I wanted to, it was enough to get us through.
I tried to present this story as it happened, without trying to cover up my mistakes. But then, that's the point: we don't learn anything unless we focus on our errors. An AAR (after-action review) with all 'sustains' and no 'improves' is a crap AAR.
Back when this happened, not only didn't you have reasonable DR in place... you also didn't have reasonable backups in place. :(
Saying that because you point out several times you weren't sure what info your company needed, so therefore weren't sure what should be transferred to the remote site.
Figuring out what info is needed by your company is the foundation of a backup regime (let alone DR). If you had a solid backup plan in place, there's no way you could be unsure about what was needed. :(
"Solid backup plan" also means "one that gets tested" (preferably well documented) so that when Shit Hits The Fan there's minimal guesswork to be done.
That being said... we all have learning experiences. If you still do important SysAdmin stuff to this day, hopefully you verify/test your backups (and any DR) you do these days. :)
Jim said he was on the way out the door because he was the only sysadmin. I don't know the size of the company - sounds like he could have used more help, maybe much more.
He probably put out the fires he needed to, and could. It's a failure of the company, not Jim, that he had too many fires.
His plan of leaving the company would have communicated that to them more clearly than any emails or performance reviews. When they can't hire any single person to take Jim's spot, because the candidates all run away, they will think in terms of multiple sysadmins.
80+ people distributed across the country. I was also a billable asset and spent a small amount of time in the field and time supporting other consultants.
When you're up to ass in alligators it's hard to remember that your primary mission is to drain the pool.
As it turns out, they had no trouble hiring someone to replace me. There's always some idiot out there that thinks they can do it all. Just look at me :)
I'm much happier working in a place with 40 SA's across three shifts. I'm too old and bitter to go back to a situation where it's just me and Google running everything from printers to servers.
Yeah, I didn't have the time to do it right. But even if I could have found the time I didn't have the experience. I inherited most of those systems and didn't understand what was needed to recover or rebuild them. There was no documentation or change control. The tape backups were good but no decrypt keys meant that they were functionally useless. Realistically, even if I'd had the keys I wouldn't have been able to restore them as they were going onto new hardware, and, having never tested... You get the idea.
As to today? On each of the ~1600 machines I'm responsible for there's a job that runs three times a day to validate that files are being backed up to the master catalog server. We've got a dedicated enterprise backup team, but I still make sure that I've done everything I can to ensure things are working. Every year we have a DR exercise, and every year I volunteer to be on the team.
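A validation job like that can be sketched in a few lines. This is a hypothetical illustration, not the poster's actual job: it assumes the backup catalog can be queried for a last-successful-backup timestamp per file, and it flags anything missing from the catalog or older than the allowed window.

```python
from datetime import datetime, timedelta

def find_unprotected(expected_files, catalog, now, max_age_hours=8):
    """Return files absent from the catalog or backed up too long ago.

    expected_files: iterable of paths that must be protected
    catalog: dict mapping path -> datetime of last successful backup
    """
    cutoff = now - timedelta(hours=max_age_hours)
    problems = []
    for path in expected_files:
        last = catalog.get(path)
        if last is None:
            # Catalog has never seen this file at all.
            problems.append((path, "never backed up"))
        elif last < cutoff:
            # Catalog entry exists but is older than the allowed window.
            problems.append((path, f"stale, last backup {last:%Y-%m-%d %H:%M}"))
    return problems

if __name__ == "__main__":
    now = datetime(2015, 8, 29, 12, 0)
    catalog = {
        "/etc/passwd": datetime(2015, 8, 29, 8, 0),
        "/var/db/app.sqlite": datetime(2015, 8, 27, 23, 0),
    }
    expected = ["/etc/passwd", "/var/db/app.sqlite", "/etc/ssh/sshd_config"]
    for path, why in find_unprotected(expected, catalog, now):
        print(f"ALERT {path}: {why}")
```

The useful part is the alerting on absence: a backup system that only reports what it did back up will happily stay silent about a file it never knew to protect.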
"If you ever find yourself a refugee and have no idea if all of your worldly possessions have been destroyed, have lots of sex. Seriously, nothing beats refugee sex. I can't recommend it enough."
Any hints on where to find refugees then? :p
(the quoted bit could have been worded better)
Years back I worked for a company that did, among other things, a form of DR evaluation. Quite a few companies have backups that may look far away on paper, but are vulnerable to the same disasters. (Think backup and primary along the same fault line in an earthquake, or vulnerable to flooding from linked water sources.)
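The "far away on paper" point can be made concrete with a quick great-circle distance check between site coordinates. A minimal sketch (the coordinates are illustrative, not from any company mentioned here); note that raw distance is necessary but not sufficient, since shared fault lines, flood plains, power grids, and telcos matter just as much:

```python
from math import radians, sin, cos, asin, sqrt

def km_apart(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres (haversine)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))  # 6371 km = mean Earth radius

if __name__ == "__main__":
    # Illustrative: a New Orleans primary vs a Baton Rouge "DR" site.
    # Roughly 120 km apart, yet both sat inside Katrina's impact zone.
    d = km_apart(29.95, -90.07, 30.45, -91.19)
    print(f"{d:.0f} km of separation")
```

A DR evaluation would pair a check like this with the hazard maps: two sites 500 km apart on the same river system can still flood together.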
Awesome story, just having a laugh with my Mrs about the meeting through adversity piece.
Surprised you stuck it out, but I get the impression through the story, that money may not have been an object for you :) hence the 1st class flight back, etc. Hats off to Dell - although you were in their backyard.
Hope you don't have to face the same again anytime soon.
enjoyed your recount....
was part of a team who provided laptops/desktops for those folks pouring into the DFW area to check email, find loved ones..etc..... and even amongst a close knit usergroup....that was a cluster!
Now living on the coast of TX and DR is part of everyday life during hurricane season !!!!
Ten years ago my local town had the worst flooding in more than a century. Just after the civil disasters control centre had moved to brand new offices - by the river.
Kudos to my ISP at the time, though. They spent most of the day moving servers onto stacked tables to keep them out of the water.
My Katrina story would be much longer but the things I remember most are: 1) my driver's license expired on Sept 3 (which I didn't notice). Somehow they let me on a plane from New Orleans a couple weeks after the storm but they wouldn't let me back on in Atlanta for the return flight until the crowd behind me in the TSA line pretty much forced them to. I'll never forget the lady telling me I didn't have a valid ID while the TV on the wall next to her was showing my local DMV under 10 ft of water. 2) the entire 504 area code switching station was underwater - not just out, but destroyed under salt water. Anyone with a 504 phone number was out of reach for several days until Nextel rigged up a bunch of portable cell towers - still not sure exactly how they did it. It was fun trying to explain to the CTO(!) why the new VoIP system they had just put in was useless since it was still routed thru the 504 exchange.
I set up an unmanned DR center. It was a lot of work; take your workload and multiply it by three, and management doesn't get it. Documentation needs to be very, very good and kept up to date. Make sure the same tape drives are in the DR center as in your production facility; I have always worried about encrypting tapes. KISS method: take images of your servers and put the image where the server is not. I could take my data volumes and send the changed files on demand with DRBD to the DR site. It will use the bandwidth available and trickle the changes to the remote DR site as it can. If you add OpenVPN you can compress and encrypt as well. DRBD (Distributed Replicated Block Device) is a huge savior for getting data across the links. THANK YOU Linbit!
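For illustration, a minimal DRBD 8.x-style resource definition for asynchronous replication over a WAN link might look like the following; the hostnames, devices, and addresses are invented, not from the original post.

```
# Sketch of a DRBD resource for async WAN replication to a DR site.
resource r0 {
  protocol A;                 # asynchronous: tolerates slow/laggy WAN links
  on primary-host {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.1:7789;
    meta-disk internal;
  }
  on dr-host {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.1.1:7789;
    meta-disk internal;
  }
}
```

Protocol A acknowledges writes as soon as they hit the local disk and the send buffer, which is what lets changes "trickle" to the remote site as bandwidth allows, at the cost of possibly losing the last in-flight writes in a disaster.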
Other than the "have offsite backups, have a backup data center", I think almost all disaster recovery plans are more a matter of trying to keep a data center running in the face of, perhaps an ice storm or wind storm knocking power out for a while. More serious issues than this are simply not usually planned for, and may be impossible to properly plan for (for instance, who would really expect ALL phones *from* an area to fail when you are completely out of the area?)
"his is why all my tech guys have laptops, mobiles with modem tethering, remote access from the laptop, with citrix access from any PC to a remote support VMs as back up."
Wouldn't have worked in this case. 1) The cell sites in the wider area were all down. 2) The device wouldn't have worked anyway. When you got somewhere with a functional cellular network and tried to make a call or data session, the cellular network would check with (I think) the HLR (Home Location Register) from your area to make sure your device is valid and paid up - and get no reply, because the HLR would have been under water by then. Most likely, calls and data sessions would have been physically routed through the switch for your home area, which was also underwater.
" In addition we run dual data centres with key infrastructure duplicated and replicated between sites. "
This would be the key, as long as both data centers weren't in the same city.
"Multiple ISPs and telcos used for network resilience."
Wouldn't have helped in this case. All ISPs and telcos failed.
"Huge UPS with massive generator backup to survive power outages. Every single bit is backed up to tape cross site."
In this case, sites with generators were unable to get fuel, so it helped them run longer, although if their internet providers had already failed it was a moot point. Backing everything up is of course key to recovery.