Reply to post: my money would be on bad management

GitLab.com melts down after wrong directory deleted, backups fail

Nate Amsden


It seems like their setup was rather fragile. I'd put my money on not having enough geek horsepower to do everything they wanted to do, having been in that situation many times myself. Even after a near disaster with lots of data loss (and close to a week of downtime on backend systems), the company at the time approved the DR budget, only to have management take the budget away and divert it to another underfunded project (I left the company weeks later).

One place I was at had a DR plan and paid the vendor $30k a month. They knew even before the plan was signed that it would NEVER EVER WORK. It depended on using tractor trailers filled with servers, and on having a place to park them and hook up to the interwebs. We had no place to send them (the place the company wanted to send them flat out said NO WAY would they allow us to do that).

We had a major outage there with data loss (maybe 18 months before that DR project). They had been cutting costs by invalidating their Oracle backups every night, opening them read-write to use for reporting/BI. So when the one and only DB server went out (storage outage) and lost data, they had a hell of a time restoring the bits of data that were corrupted, because the only copy of the DB was invalidated every night (they knew this in advance, it wasn't a surprise). ~36 hours of hard downtime there, and we still had to take random outages to recover from data loss every now and then for at least a year or two afterward. They never once tested the backups (and the only thing that was backed up was the Oracle DB, not the other DBs, or web servers, etc.). Ops staff were so overworked and understaffed, with major outages constantly because of bad application design.

Years later, after I left, I sent a message to one of my former teammates and asked him how things were going; they had moved to a new set of data centers. His response was something like "we're 4 hours into downtime on our 4 nines cluster/datacenter/production environment" (or was it 5 nines, I forget).

I've never been at a place where even, say, annual tests of backups were done. Never the time or resources to do it. I have high confidence that the backups I have today are good, but less confidence that everything that needs to be backed up is being backed up, because in the past 5 years I am the only one that looks into that stuff (and I am not a team of 1); nobody else seems to care enough to do anything about it. Lack of staffing, too few people doing too many things... typical I suppose, but it means there are gaps. Management has been aware, as I have been yelling about the topic for almost 2 years, yet little has been done. Though progress is now being made, ever so slowly.

The place that had a week of downtime, we did have a formal backup project to make sure everything that was important was backed up (there was far too much data to back up everything, and not enough hardware to handle it, but much of it was not critical). So when we had the big outage, sure enough people came to me asking to restore things. In most cases I could do it. In some cases the data wasn't there -- because -- you guessed it -- they never said it should be backed up in the first place.

I've been close to leaving my current position probably a half dozen times in the past year over things like that (backups are just a small part of the issue, and not what has kept me up at night on occasion).

I had one manager 16 years ago who said he used to delete shit randomly and ask me to restore it, just to test the backups (they always worked). That was a really small shop with a very simple setup. He didn't tell me he was deleting things randomly until years later.
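That kind of restore drill can be sketched as a tiny script: "lose" a known file, restore it from backup, and verify the checksum matches. A minimal self-contained sketch, using a plain tar archive as a stand-in for whatever backup system you actually run (the file name and paths here are purely illustrative):

```shell
#!/bin/sh
# Restore drill sketch: back up a file, delete it, restore it, verify checksum.
set -eu

WORK=$(mktemp -d)
echo "important data" > "$WORK/important.conf"

# "Backup": a simple tar archive standing in for your real backup system.
tar -C "$WORK" -cf "$WORK/backup.tar" important.conf

# Record the checksum before the drill.
before=$(sha256sum "$WORK/important.conf" | awk '{print $1}')

# Simulate the accidental deletion.
rm "$WORK/important.conf"

# Restore from the backup and verify the checksum matches.
tar -C "$WORK" -xf "$WORK/backup.tar" important.conf
after=$(sha256sum "$WORK/important.conf" | awk '{print $1}')

[ "$before" = "$after" ] && echo "restore OK"
```

The point isn't the tooling, it's that the drill exercises the whole restore path end to end, which is the only way you find out the backups actually work before you need them.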

It could be the geeks' fault, though. But as a senior geek myself, I have to put more faith in the geeks and less in the management.
