GitLab.com melts down after wrong directory deleted, backups fail

Source-code hub GitLab.com is in meltdown after experiencing data loss as a result of what it has suddenly discovered are ineffectual backups. On Tuesday evening, Pacific Time, the startup issued a sobering series of tweets we've listed below. Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had …

  1. Anonymous Coward
    Anonymous Coward

    Great

    I won't even connect to our gitlab.

    So I guess we lost all our synced files.. GREAT.

    I have a copy of everything on local.. and 3 backups. Tomorrow I will see if my work colleagues are as paranoid as I am.

    1. Anonymous Coward
      Anonymous Coward

      Super! Re: Great

      I assume you've tested your backups.

      Unlike Gitlab.

      1. Charles 9

        Re: Super! Great

        Hasn't it been said you can't really practice for an emergency without an emergency, in which case Murphy will get you either way?

        1. Jenny with the Axe

          Re: Super! Great

          Actually, I've worked in places that did emergency testing. They did things like "let's kill all the connections through one datacenter and see if our customers can still access their stuff." Also "let's restore the backups to this test system and see that it works". As Gitlab has now discovered, backups without restore testing are not backups....

          1. Charles 9

            Re: Super! Great

            You're lucky to have the budget to do it. Many times, people only have ONE live system (all they can afford) which MUST remain up 24/7, so no way to do a test. No test system to try the restore on (and besides, it's different from the live system, so things can still go wrong in the real event), and no way to really test for emergencies because they depend on things that ONLY occur in real emergencies, such as power going out not just to the floor but to the whole building (and perhaps next door as well, just to be sure nothing was plugged into a jury-rigged supply).

            1. Anonymous Coward
              Anonymous Coward

              Re: Super! Great

              Maybe most people shouldn't have nice things if they can't afford to maintain them?

            2. Mark 78

              Re: Super! Great

              "You're lucky to have the budget to do it. Many times, people only have ONE live system (all they can afford) which MUST remain up 24/7, so no way to do a test."

              Maybe you should rethink that..... Are you lucky enough to have the budget to NOT run a restore and risk the loss of everything? The cost of a properly tested restore system should be a vital part of any project budget. Your lack of a test/restore system is basically saying to your customers that if things go wrong we are bankrupt and possibly they are as well (if they are external customers).

              1. asdf

                Re: Super! Great

                >Many times, people only have ONE live system (all they can afford) which MUST remain up 24/7, so no way to do a test.

                Sounds to me like the failure is in the business model of the company. Those generally are the type of companies that are one recession or self created disaster away from administration.

                1. Charles 9

                  Re: Super! Great

                  "Sounds to me like the failure is in the business model of the company. Those generally are the type of companies that are one recession or self created disaster away from administration."

                  That's why it's called living on the razor's edge, where margins are close to zero all the time. You'd be surprised how many firms HAVE to live like this because they flip between profit and loss every month. You're floating in the ocean and you barely have the stamina to tread water. Sometimes, that's all you're dealt. All you can do is hope for shore or some flotsam.

                  1. asdf

                    Re: Super! Great

                    >Sometimes, that's all you're dealt. All you can do is hope for shore or some flotsam.

                    That's fine if you are a brave entrepreneur who has few limits on what he/she can reap if they succeed, but taking a job at one of those companies is another matter (especially without a big ownership stake). A big part of job interviewing from the view of the interviewee is figuring out if the company is one of those companies. If you do take the job then it probably means you need to do a better job researching companies or you need to increase your skills and experience so you don't have to work for those type of companies for long if at all.

                    1. asdf

                      Re: Super! Great

                      Forgot to expand on the whole 'hoping to cash in on the ground floor of a startup' angle, which again is fine I suppose if you are young or aren't the only income earner in your family, but it still probably won't end up being one of your wiser choices. If you are lucky you might get to keep the actual pets.com puppet after everything goes sideways, though.

                    2. Charles 9

                      Re: Super! Great

                      "A big part of job interviewing from the view of the interviewee is figuring out if the company is one of those companies. If you do take the job then it probably means you need to do a better job researching companies or you need to increase your skills and experience so you don't have to work for those type of companies for long if at all."

                      Or it simply means you're out of options. If they're the ONLY opening, then as they say, "Any port in a storm."

                      1. asdf

                        Re: Super! Great

                        >Or it simply means you're out of options. If they're the ONLY opening, then as they say, "Any port in a storm."

                        Which is fine unless you spend decades in that situation and then turn around and blame globalization for all your problems. Not you per se of course, but a significant number of people.

                        1. Anonymous Coward
                          Anonymous Coward

                          Re: Super! Great

                          Ever thought for some people it's absolutely true?

            3. Doctor Syntax Silver badge

              Re: Super! Great

              "Many times, people only have ONE live system (all they can afford) which MUST remain up 24/7"

              If you have only one system which must remain up 24/7 you only have two choices, a huge budget or eventual failure.

            4. Tikimon
              FAIL

              Re: Super! Great

              If your system is set up such that your disaster remediation cannot be tested, it is set up WRONG. There, I said it.

              I've shared my personal example before. At a previous job, my so-called supervisor was messing with SQL queries and ran a Delete thinking it would clear his query. Blew away the whole database with a single click. They never let me test the backups, claiming "24/7, can't be down for testing!". Instead we were down for THREE DAYS while an SQL consultant helped rebuild from scratch, and we never recovered all the data.

              Disaster testing always creates some inconvenience, but that's no excuse to skip it. A smart captain never complains about the lifeboat drills.

            5. Anonymous Coward
              Anonymous Coward

              Re: Super! Great

              "Many times, people only have ONE live system (all they can afford) which MUST remain up 24/7"

              Words fail me. I would characterize that as planning to fail - which is pretty expensive also.

            6. theblackhand

              Re: Super! Great

              "Many times, people only have ONE live system (all they can afford) which MUST remain up 24/7"

              Is that 24/7 as in always operational or 24/7 in that if something goes wrong IT won't be in until the morning to fix it so someone will complain?

      2. Halfmad

        Re: Super! Great

        This sort of situation is why I constantly nag my IT department about testing backups; the backup product saying the backup verified OK is not the same as doing the occasional bare-metal restore just to check.

      3. Anonymous Coward
        Anonymous Coward

        Re: Super! Great

        Yes, I have tested all my backups, and we lost 0 data.

        An untested backup does not exist in my book.

    2. Anonymous Coward
      Anonymous Coward

      Re: Great

      No repository data was lost, it was *ONLY* the database that got rolled back 6 hours. Your code is safe.

      1. Anonymous Coward
        Facepalm

        @brodrock

        "No repository data was lost"

        YET.

  2. kain preacher

    How would you like to be the one to tell your boss "I think I accidentally deleted the company"?

    1. Anonymous Coward
      Anonymous Coward

      Maybe the company should be deleted. If you can't handle your own backups without a "cloud" service, just how non-functional are you at the center?

    2. Voland's right hand Silver badge
      WTF?

      Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.

      A company running a production service with this backup strategy in place should be deleted.

      1. Adam 52 Silver badge

        All those rushing in to criticise, how confident are you in your processes?

        Before the event these guys were very confident. They had LVM snapshots, DB backups, Azure snapshots and even a copy on S3 in case Azure goes down. I bet they were even tested in the first place.

        Even after all this very public embarrassment they do still have a copy from 6 hours earlier.

        How many of you could suffer multiple failures and still be able to do a same day recovery?

        1. codejunky Silver badge

          @ Adam 52

          Having read your comment you have restored my faith in people. On topics of data loss where the backup failed I usually see comments that 1 backup is none and 2 is 1, etc, so these guys had 4 by that rule and still got stung. I feel sorry for whoever nuked the data and for the people responsible for making the various backup scripts. I know if it was me I would be feeling bad and wondering what I could have done different, and in future probably becoming more paranoid of all the systems I have in place.

          I wish these guys well and hope they don't get too much stick, just the opportunity to fix it.

          1. Peter2 Silver badge

            Re: @ Adam 52

            Mmm.

            There is an old tale called the Tao of Backup, written in the days when NT4 was the shiny new thing people aspired to and the press was talking about the upcoming Win98. It's still available online.

            http://www.taobackup.com/

            Skipping the first four points we get to :-

            5. Testing

            The novice asked the backup master: "Master, now that my backups have good coverage, are taken frequently, are archived, and are distributed to the four corners of the earth, I have supreme confidence in them. Have I achieved enlightenment? Surely now I comprehend the Tao Of Backup?"

            The master paused for one minute, then suddenly produced an axe and smashed the novice's disk drive to pieces. Calmly he said: "To believe in one's backups is one thing. To have to use them is another."

            The novice looked very worried.

            It then links to this page, which lists some of the things that can go wrong through no fault of your own.

            http://www.taobackup.com/testing_info.html

            Gitlab didn't have backups. They had backup scripts that didn't work. Having those sorts of problems happens. Not testing your backups and discovering those problems before an emergency hits is inexcusable.

            1. TVU Silver badge

              Re: @ Adam 52

              "Gitlab didn't have backups. They had backup scripts that didn't work. Having those sort of problems happens. Not testing your backups and discovering those problems before an emergency hits is inexcusable."

              ^ I fully agree with this, and the question then becomes, "Will Gitlab learn from this experience?". I hope that they do, and now set up and test suitable backup systems.

              Examples of how not to do things include TalkTalk: multiple breaches, they do nothing, bad things happen again, and then customers leave.

              1. Adam 52 Silver badge

                Re: @ Adam 52

                "Not testing your backups and discovering those problems before an emergency hits is inexcusable"

                How do you know they didn't test? I guess they did test, but the Postgres version changed and the database backup stopped working.

                So they didn't test regularly enough. What is regularly enough? And how many people actually do that, rather than take a holier-than-thou attitude on a forum?

          2. Keith Langmead

            Re: @ Adam 52

            Completely agree with both of you. There's no doubt they made mistakes; backups, replication etc should have been tested properly and clearly weren't. But do you know what really hit me reading about this? How open they've been about the whole thing. From the initial issues through to publishing the live notes for the restoration attempts, I'm personally very impressed that they've been open and honest about what's happening. So many other companies would and have hidden behind vague "we're working on it" responses, and while it doesn't undo their failure to test, I think their honesty does need to be acknowledged and commended.

            1. Peter2 Silver badge

              Re: @ Adam 52

              I have to agree with that. It is refreshing to see "Mea Culpa", rather than an extract from "Yes Minister".

          3. Oh Homer
            Childcatcher

            Re: "what I could have done different"

            They could start by delegating someone to be responsible for checking the results of their backup plan, on a daily basis, to ensure that backups are actually being made and are valid.

            As a bare minimum they should have scripted this to fully test the output (Is the target accessible, does it contain a backup, is the backup valid?) then email a warning in the event of failure. Or better yet, use the dead man's switch method by emailing the successful results of a fully verified backup, then have a warning issued on the admin's local system in the event that no such message is received, to account for not only backup failures but communication failures too, and beyond that have the admin manually check for such messages on a set schedule, in case the local warning system (cron et al) doesn't work for any reason.
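
            A minimal sketch of that kind of check - entirely illustrative, with made-up paths, thresholds and mail recipient rather than anything GitLab actually ran - might look like:

              #!/bin/bash
              # verify_backup.sh - run after the nightly backup job (illustrative paths throughout)
              BACKUP_DIR=/backups
              LATEST=$(ls -t "$BACKUP_DIR"/*.tar.gz 2>/dev/null | head -n 1)

              fail() { echo "$1" | mail -s "BACKUP FAILED" admin@example.com; exit 1; }

              # Is the target accessible, and does it contain a backup at all?
              [ -n "$LATEST" ] || fail "No backup file found in $BACKUP_DIR"

              # Is the backup a plausible size? (files of a few bytes = the GitLab failure mode)
              SIZE=$(stat -c %s "$LATEST")
              [ "$SIZE" -gt 1048576 ] || fail "Backup $LATEST is only $SIZE bytes"

              # Is the backup valid? (a tar archive can be listed without extracting it)
              tar -tzf "$LATEST" > /dev/null || fail "Backup $LATEST is not a readable tar archive"

              # Dead man's switch: mail the successful result; silence means trouble.
              echo "Backup $LATEST verified, $SIZE bytes" | mail -s "Backup OK" admin@example.com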

            Beyond that minimum effort, they should also have had full server replication on a hot spare, not only for failover but also for bare metal restore testing, which is ultimately the only way you can be sure your backup process really works, as opposed to "seems to work", "completed without error" and "says it verified", all of which amount to exactly nothing. Until you have successfully performed an actual restore, you simply don't have a backup.

            The most shocking part of this is not that backups failed, but that nobody noticed, at least not until it was already too late.

            Essentially they had no process in place, they had a token effort that was cobbled together then not even verified as working. That's the unforgivable bit.

        2. Michael

          Disaster recovery

          It is all relative.

          For example, in our small team, we destroy our entire staging environment and perform a complete restore from backups once every month or two. We also restore the production backups to a separate environment to verify that they are working. I deliberately get someone that hasn't done it recently to check the documentation is correct and everything works. Given that we also have the git repos and source on multiple dev machines for everything, the worst case data loss is from a dev not committing regular changes. But everything is tested, and I believe regularly enough for a team of 10.

          If I was hosting thousands of people's data, I'd expect more regular testing of the backup and recovery procedure. They can never have tested it with production data as they have no working backups. That is just incompetent.

        3. Ogi

          > All those rushing in to criticise, how confident are you in your processes?

          Very confident in our backup system. Quite frankly because I insist on regular restore testing from backups, precisely for this reason. A habit I got into many years ago when I worked for a big company where we stored very sensitive personal data.

          Until you have done a successful restore, you don't have a verified working backup system. This is sysadmin 101. It isn't something exotic or really hard to wrap your head around.

          There was the joke that you could back everything up to /dev/null just fine. However the restores would be much harder. Sounds like the gitlab guys inadvertently used a similar concept as their backup strategy.

          I can understand the accidental rm -rf, mistakes happen, we are human after all. However backing up (presumably for years) without ever even bothering to check the size of your backup is just gross negligence. Surely a quick check of the backups, noticing all the files are suspiciously 512 bytes in size, would have been enough to draw attention that maybe something wasn't working in the backup system.

          If what they say is true and they essentially have no backup, then this was a completely avoidable situation that was of their own doing.

        4. anothercynic Silver badge

          Amen, @Adam 52! Well said!

        5. Anonymous Coward
          Anonymous Coward

          All those rushing in to criticise, how confident are you in your processes?

          You're right, although I'm old school - where I come from, backups aren't as important as restores, so a backup wasn't a backup until you could prove you could actually use it to recover.

          That said, the first ever from-the-metal-upwards restore test I did was the single most nerve-wracking thing I've done in my career, despite the fact that I had two separate backups and a hot standby site to take over if recovery took more time than the available test window.

          That said, you haven't earned your sysadmin badge if you haven't learned the "rm -rf" lesson the hard way...

          1. Doctor Syntax Silver badge

            "That said, the first ever from-the-metal-upwards restore test I did was the single most nerve-wracking thing I've done in my career"

            I've worked in two places where we had DR contracts including rights to run practice restores. They can be learning experiences, especially the first one. /etc was the last directory on the last dump tape. We had to sit twiddling thumbs waiting to get a system we could log into & then ran out of time before we had a restored system. It ensured the dumps were better organised for the next pass.

        6. Doctor Syntax Silver badge

          "How many of you could suffer multiple failures and still be able to do a same day recovery?"

          Wrong question.

          How many of you could have five non-functioning backup mechanisms and not be aware that none of them worked because you hadn't tested them?

        7. Anonymous Coward
          Anonymous Coward

          "All those rushing in the criticise, how confident are you in you processes?"

          Confident enough to know it works through testing.

          I could defend an honest mistake, not a short-sighted lazy one. It could be argued that they were not lazy (who knows, maybe their testing matched their design), but they were clearly short-sighted. Either they were or they were not designing to be bullet-proof; which was it?

          All they had to do was test for the worst. This isn't an after-school special; not everybody is a winner, not everybody deserves a reward. At least they owned it. They could have said "You're holding it wrong."

        8. DJ Smiley

          At my 2nd day of a job, I deleted the entire stack of the test system with a misplaced rm -rf.

          I crapped myself thinking I'd be instantly fired. My boss made some 'angry' sounds, then told me it wasn't the biggest issue as they needed to try a fresh install of the new version anyway (as that's how the new version would be rolled out in production, rather than upgrading, which is what they normally did on the test servers).

          This also allowed them to fully test the backups, pulling the older data from the production backups, anonymising it as required and also finding some faults with various processes that were included but didn't work after the upgrade. In all the test system was down for about 4 days instead of 1, but the fixing of the systems to allow it to get the go-ahead in production took a month or more. If I'd not 'slipped up' then they wouldn't have known about these issues until trying to go live in production, and if so, it would have been a very long night of around 6-8 hours reinstalling the older version back into production (after the 6-8 hours of installing and testing the new version).

          This attitude of 'we can't afford to test it' is utter bollocks. You fire up as many VMs as required in the cloud, and you at least verify the _data_ is there, even if the functionality isn't. It's bad to find the code for the production system isn't backed up as much as you think it is; it's unrecoverable to find out the data is gone.

          These guys got lucky: if he hadn't taken that copy 6 hours before, they'd be dead in the water and the company would be gone.

          1. Doctor Syntax Silver badge

            "In all the test system was down for about 4 days instead of 1"

            You could even look on it as an unplanned 4 day test because, as you described it, that's what it was.

        9. Tikimon
          FAIL

          How confident are we? VERY.

          We test our backup processes quarterly. Real data is deleted and restored from backup, the files opened to check integrity. Failover servers are shut down and the handoff to partner server checked.

          Can we guarantee nothing will go wrong? Of course not, such a thing is not possible. Sometimes when a ship sinks it lists too far and half the lifeboats cannot be used. That's no excuse to skip lifeboat drills, nor is "the crew can't stop what they're doing to run safety drills."

          So with apologies to Adam and his admirers, the Gitlab geeks did not do their jobs properly. They admit that the backup files were too small to be believable. That alone should flag their systems as not working, with no shutdown testing required. They missed several chances to spot this.

          1. Adam 52 Silver badge

            Re: How confident are we? VERY.

            "We test our backup processes quarterly"

            ...and you're confident? Then you are far too complacent. You could easily lose 3 months of data if something's gone wrong. Think about all the things that could have failed between your last test and now - the tapes, the tape drives, the disks, the software, a permissions issue, a version mismatch, a CRC algorithm change, new directories, new servers, credentials, sneaky ransomware, ... the list goes on.

            I test weekly. And I still think my backup process is worse than the one these guys had.

      2. dyioulos

        YP should ask JN where those backups are stored. Or, they should both head for the nearest bar. If I did something like this, that's where I'd be headed so that I could work on my resume.

    3. swm

      When I was teaching at a college, the account containing all of the homework submissions was deleted by a rogue script. Next day in class I used this as an example of the necessity of having good backups. I saw many smirks around the room until I pointed out that we took backups seriously (with weeklies and incrementals) and that the account was fully restored, losing about 4 hours of work that the students could easily resubmit (unless they had deleted their submissions, which is unlikely). We probably could have restored more but didn't want to restore from after the point the rogue script started deleting files.

      On another note, we had an IT professional from Paychex (a company that prints checks for small businesses) give a talk on something or other, and after the talk I asked how the company survived a state-wide power failure lasting several days. His answer was that they keep 6 copies of their essential data spinning and online. During the power failure all UPS systems worked perfectly and they lost nothing. However, their customers could not download their information because they did not have power, so Paychex had to rent trucks to cart around all of the checks their customers couldn't print. The point is that the entire supply chain needs to be considered.

  3. DryBones

    So I'm having....

    a lot of schadenfreude right now. All of that boils down to one single sentence:

    Nobody ever test-restored a backup.

    1. kodykantor

      Re: So I'm having....

      I work at Veritas and I always have trouble explaining the need for good backup/recovery and DR solutions to my young friends. This is a nice link to send them in the future. It hurts to see this happen to a group of people, but hopefully it will lead to others testing (or implementing :) ) their strategies!

    2. MacroRodent

      Re: So I'm having....

      Nobody ever test-restored a backup.

      That is a step too often skipped, because you don't want your test to overwrite live data, so you would temporarily need as much space elsewhere as the restoration takes. In fact, you'd better have a complete spare system to test that you can make everything work from the backup. That may be difficult to arrange.

      1. Doctor Syntax Silver badge

        Re: So I'm having....

        "In fact, you better have a complete spare system to test you can make everything working with the backup."

        As per another post, a DR contract can give you rights to run practice drills. That's your test opportunity.

    3. Anonymous Coward
      Anonymous Coward

      Re: So I'm having....

      The issue with a real restore from backup is that when an issue strikes you try these things:

      1) Attempt to fix the issue, as you don't want to load stale data or risk overwriting your live system. You think it'll take 1 hour, but you've invested so much effort that after 6/12/24 hours you are almost there, still hitting snags but close enough not to turn to the backup (all the while getting a call every half hour asking for updates)

      2) Eventually you decide you'll have to go to backup - data is now even older. You start the restore with very little idea how long it will take but estimate 1 hour. As it starts restoring and the progress bar whizzes along and tells you 2 hours to go, you feel hopeful. The progress bar gradually slows down and the time starts showing 2 hours, 6 hours, 12 hours to go. After 4 hours you think something is wrong with the restore so you abort it and decide to manually copy the files over and restore from a local mount point. Repeat the issue above with copy times getting gradually longer. You start looking at jumbo frames and data transfer graphs, and can proficiently convert bits to bytes to MBs to MiBs in your head.

      3) You update everyone that it is actually going to take about 24 hours to recover that much data and go home, waking every 30 minutes to remote in and check progress.

      4) After three days the few TB has copied back but the restore fails with an error 00x0ffx0f00075844 Unspecified Error. Possibly something to do with merging the incrementals.

      5) You reach for your last full backup, data that is now 7 days old. You wait three days for a full restore again. The restore is a success but none of your DB-based products work. SQL won't start, Exchange won't mount any DBs; you've got files restored but a number of iSCSI links are broken on some apps.

      6) You spend the next hour/6 hours/3 days trying to work out how to cleanly mount a DB with partially corrupted logs, spending more time on various forums than you care for, and eventually bring most things back to life, although Jim from finance still can't access any e-mails, the Finance system has rolled back a month of transactions due to consolidation errors and no-one can access the intranet anymore.

      7) You feel relieved to have made it through what feels like a war zone single-handedly, but then feel angry that you had no real support and no-one seems to understand the mix of emotions - from anxiety, to panic, to fear, to relief and back to helplessness - that you have just been through.

      8) You go to speak to your manager and they tell you they are bringing in a consultant.

      9) You get fired.

      1. Calleb III

        Re: So I'm having....

        That's why you set up multiple work streams, working on alternative solutions in parallel.

    4. Just Enough
      Thumb Down

      Re: So I'm having....

      >a lot of schadenfreude right now.

      Really? You're deriving pleasure out of seeing this failure? How very cold-hearted of you.

      Any time I read these kind of stories I'm filled with relief that I'm not in the team that has to fix the mess, and have nothing but sympathy for them. We've all experienced screw ups like this, we all know what a stressful experience they are.

      Yes, they messed up and should have tested their backups. But I take no pleasure out of seeing others' work go tits-up or lost.

      1. Doctor Syntax Silver badge

        Re: So I'm having....

        "Really? You're deriving pleasure out of seeing this failure? How very cold-hearted of you."

        Maybe he works for Github?

  4. Infernoz Bronze badge
    Facepalm

    It sounds like the sysadmin did not have a proper plan, was probably tired, preoccupied or bored, and rushed things, and someone did not put in place enough script logs/failure-alerts and backup verification logs/alerts, or do regular checks of both, to ensure that the backups worked...

    I do wonder why the database wasn't mirrored to reduce downtime for upgrades or other failures like this. Some redundancy should be compulsory for all professional systems.

    Maybe GitLab could use OpenZFS with regular dataset filesystem snapshots, for rapid rollback to a snapshot before damage/deletion of files/data occurred, or (read-only) mount the snapshot and use some files/data off it to fix the active filesystem or make consistent backups; I've found the latter very useful when I've accidentally deleted stuff e.g. on FreeNAS.
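
    For anyone who hasn't used it, the workflow is roughly as follows (a sketch with illustrative pool/dataset names, not GitLab's actual layout):

      # take a snapshot of the dataset before risky maintenance (names are made up)
      zfs snapshot tank/pgdata@before-maintenance

      # list the snapshots that exist for that dataset
      zfs list -t snapshot tank/pgdata

      # option 1: browse the read-only snapshot and copy individual files back
      ls /tank/pgdata/.zfs/snapshot/before-maintenance/
      cp /tank/pgdata/.zfs/snapshot/before-maintenance/some_file /tank/pgdata/

      # option 2: roll the whole dataset back (discards every change made since the snapshot)
      zfs rollback tank/pgdata@before-maintenance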

    1. Steve Aubrey

      "Some redundancy should be compulsory for all professional systems."

      Ya, but who's going to be the adult in this situation? Obviously one was lacking.

      Regardless of how cloudy, trendy, and hipstery your company is, hire at least one adult. The one who knows the hard questions, and will ask them.

      1. Doctor Syntax Silver badge

        "The one who knows the hard questions, and will ask them."

        You can get fired for asking questions like that. Not a team player.

        1. Steve Aubrey

          Doc - agree about the possibility of being fired for asking hard questions. But maybe a blessing in disguise?

          1. Doctor Syntax Silver badge

            "But maybe a blessing in disguise?"

            Yup.

      2. Anonymous Coward
        Anonymous Coward

        hire at least one adult. The one who knows the hard questions, and will ask them.

        It's not enough to be able to ask the hard questions if you can't come up with some of the answers too. Sniping from the sidelines is easy, but being able to assist in making sure the answers are right is where the skill lies - it's the old adage: "don't bring me problems, bring me solutions".

        1. Doctor Syntax Silver badge

          "It's not enough to be able to ask the hard questions if you can't come with some of the answers for it too."

          You need to let them flounder a bit first, otherwise they won't be ready to grasp your solution.

          1. Anonymous Coward
            Anonymous Coward

            If they're floundering like that, they probably wouldn't be able to grasp your solution even if you spelled it out for them.

            1. Doctor Syntax Silver badge

              "If they're floundering like that, they probably wouldn't be able to grasp your solution"

              Flounder in terms of not having been able to answer the hard questions.

    2. A Non e-mouse Silver badge

      Some redundancy should be compulsory for all professional systems.

      Some working redundancy should be compulsory for all professional systems.

      FTFY ;-)

      From the article, they had lots of fancy redundancy: They just never checked that the redundancy was working.

    3. Marc 13

      Yup, love ZFS snapshots (but only as an add-on to other backup methods - it's nice to have choices!) - simple to mount and restore from.

    4. Zippy's Sausage Factory
      Coat

      Some redundancy should be compulsory for all professional systems.

      Or, in this case, whatever the Netherlands equivalent of a P45 / Pink Slip happens to be...

  5. jamesb2147

    Speaks to a fundamental problem

    IT is hard.

    Backups are a pain in the ass, for exactly the reasons mentioned here. All ye who apply a rigorous and robust backup policy, I applaud you, but I doubt that a single one of my employer's clients falls into that category, and we have many, many clients.

    Anyone know of a product that you can point at a database, provide it credentials, and it handles all the rest, including test restores with error messages on failures? That's not even getting into file backup, but file backup is notably simpler in many ways, especially with the right tools (ask any ZFS admin).

    1. Anonymous Coward
      Anonymous Coward

      Re: Speaks to a fundamental problem

      Used to have an EMC SAN with RecoverPoint that took an Oracle DB block-level clone once a night, broke the clone, remounted it, renamed it and replayed it so you could use it in the morning. Still not all the way there though, as that didn't work for creating offsite backups that you knew worked. Just that we could happily restore to any point during the last 24 hours.

      There's a lot of pontificating here, but I wonder how many really have battle-tested their backups enough to be so sure.

  6. Richard 12 Silver badge

    At least it's git

    That means there shouldn't be much real data loss at the final count, as the important things pushed there will have a backup in the place that pushed the last commit.

    Pain in the proverbial for all the project leads to push up all the lost branch tips again, but at least it's only mostly dead and not dead-dead.
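
    In practice that re-push is only a couple of commands per up-to-date clone, something like this (remote name is illustrative):

      # push every local branch and tag back to the freshly-restored remote
      git push --all origin
      git push --tags origin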

    1. macjules

      Re: At least it's git

      And a REAL pain in the butt when you have an organisation like GDS with hundreds of git forks, more tramlines than Zürich Central and several hundred developers working remotely. Thank God GitHub never goes down ... well, almost.

      1. Doctor Syntax Silver badge

        Re: At least it's git

        "And a REAL pain in the butt when you have an organisation like GDS with hundreds of git forks"

        I think it's GDS that's the real pain.

  7. Anonymous Coward
    Anonymous Coward

    CloudFog...

    The Cloud... Someone else's computer....

  8. A Non e-mouse Silver badge

    Who's to blame?

    The sysadmin who accidentally nuked the live data reckons "it’s best for him not to run anything with sudo any more today."

    Whilst the sysadmin screwed up and deleted the wrong directory, the bigger screw-up was the company not having any tested DR systems.

  9. Anonymous Coward
    Anonymous Coward

    It's pretty easy to screw up even when you have a working backup.

    Recently had a situation where a client had messed up a data upload, and we needed to fix the data. Our software being fairly forgiving, the plan was to download the automated backup, then back up the live dataset, attach the downloaded files alongside the live system database, and repair the damaged data with the values from the backup.

    It would have worked great, except that in between this work and our last test, the backup software vendor had changed the recovery process to include a helpful feature: namely, if you don't expand the file tree fully, it downloads the files and then automatically and silently overwrites the live database...

    Needless to say, the client lost the 2 days' worth of data that we'd fully expected to be preserved.

    We've since rewritten the procedures to move the manual backup to before downloading the files, but even with the best-laid plans, backup software can be a tricky proposition!

    1. ecofeco Silver badge

      Seriously, why are backup vendors such morons? It's been a long time since I've seen backup software that was easy and reliable.

  10. Anonymous Coward
    Mushroom

    And this is why...

    I'm not that thrilled about anything cloud based and prefer to host my own repositories. And here's one of the many reasons why. For starters: I actually check my backups on a regular basis, even when I don't need them.

    I'm not even going to bother commenting any further because this is simply too big a fail. Makes you wonder what kind of geniuses work there. And what they're doing all day.

    1. Doctor Syntax Silver badge

      Re: And this is why...

      "I actually check my backups on a regular basis, even when I don't need them."

      That's the thing about backups. You hope you never need them.

  11. Norman Nescio Silver badge

    Backup is hard. Doesn't mean it should be ignored.

    There is little excuse for being unable to restore from backup. "The dog ate my backup tape" doesn't cut it. Having to deal with a full-scale civil insurrection that has trashed your off-site storage location might give you a pass.

    The thing is, it is human nature to put off something difficult that has no immediately apparent consequences, so you need a martinet in charge of backup and restoration services, because your business could well live or die as a result.

    I worked for an organisation that did annual full-scale disaster recovery exercises, which were instructive. The chap in charge of them was quite sanguine about failures: his point was that he far preferred to find things didn't work during an exercise than during the real thing. As this is about backup, I won't go into some of the more interesting organisational failure modes we found, but the backups caused problems.

    1) The off-site storage vendor couldn't locate some of the backup tapes. It turned out that nobody had ever audited their retrieval performance. Talking to the operators, it was found that it was common for tapes to go missing. All the operators did was use a new blank tape when it was time in the backup cycle to re-use one that couldn't be found. The organisation used enough tapes in day-to-day work that the small number of new tapes used to replace the missing backup tapes wasn't noticed. The operators' job was to do backups, which they were doing.

    2) At least one of the tape heads was misaligned. It would quite happily write out a backup, which could be read with no problems on that drive, and that drive only. When the time came to ship the backup tape to the disaster-recovery location, no tape drive there could read tapes written by the original drive. Lesson learned was to make sure you can read backups on different equipment than the kit they were made on.

    3) It turned out that referential integrity is important. Who knew? Backing up files while they were in use, then trying to use the restored backup caused all sorts of problems. This was before the days of journalling file systems and snapshots. The application developers had failed to appreciate that backing up a large file took an appreciable amount of time, and in that time many records would be changed. An update that added or modified records both near the beginning and the end of the file would end up with only some of the updates recorded on the backup. That was solved initially by having a backup window where no updates were allowed to the file.

    These days, some of the problems are solved by backing up to 'the cloud' and having the ability to snapshot databases, but doing proper backups is a neglected art. There is still a point at which it is not economic to back up over a network and you have to plan for moving physical media about, and then life starts to get interesting.

    Doing backups and testing restoration procedures is a huge time hog, and it isn't sexy. But it is important. I hope these guys get themselves out of the hole they just dug for themselves. Some people probably have unhealthy stress levels right now.

    1. A Non e-mouse Silver badge

      Re: Backup is hard. Doesn't mean it should be ignored.

      I worked for an organisation that did annual full-scale disaster recovery exercises, which were instructive. The chap in charge of them was quite sanguine about failures: his point was that he far preferred to find things didn't work during an exercise than during the real thing.

      People don't appreciate that failures are a wonderful learning experience. In my line of work, I've learned a lot more from unpicking a failure than working on a fault-free system.

      I've also heard several instructors across different areas say that they often prefer pupils who appear to make lots of mistakes as the pupils learn a lot more from the mistakes than those who do things right every time.

      1. Charles 9

        Re: Backup is hard. Doesn't mean it should be ignored.

        "People don't appreciate that failures are a wonderful learning experience."

        Because in many people's personal experience, people who fail (at all) don't survive for very long.

    2. Paul Crawford Silver badge

      Re: Backup is hard. Doesn't mean it should be ignored.

      When the time came to ship the backup tape to the disaster-recovery location, no tape drive there could read tapes written by the original drive.

      I have also seen this with optical media - readable (probably just) on the original drive, not on another. Probably not after several years either.

      As you mention, snapshots are a brilliant idea - an instant copy of a whole file system for backing up, so (mostly) no inconsistencies, and with copy-on-write like ZFS you only need space for the changes, so having many per day is not a high cost. However, as you mention, in some cases the on-disk file is not in a consistent state while a process is using it, so having time to take a snapshot with no modifications is also good.

    3. Anonymous Coward
      Anonymous Coward

      Re: Backup is hard. Doesn't mean it should be ignored.

      "The dog ate my backup tape" is at least understandable, if unacceptable.

      That they have five different backup solutions and all of them failed for various reasons is astonishing.

    4. Doctor Syntax Silver badge

      Re: Backup is hard. Doesn't mean it should be ignored.

      "t turned out that referential integrity is important. Who knew? Backing up files while they were in use, then trying to use the restored backup caused all sorts of problems. This was before the days of journalling file systems and snapshots."

      Don't roll your own encryption and don't roll your own database.

      This was a solved problem years ago without depending on journalling file systems and snapshots. Use a proper database engine that does this for you.

      1. Norman Nescio Silver badge

        Re: Use a proper database engine

        This was before the days of journalling file systems and snapshots."

        Don't roll your own encryption and don't roll your own database.

        This was a solved problem years ago without depending on journalling file systems and snapshots. Use a proper database engine that does this for you.

        You are, of course, completely right. However...

        The system used multiple files that were a kind of ISAM-type file*, and had been optimised to hell and back. It did its job, and fast. Several attempts, lasting many years each with large teams, were made to replace it with a 'proper DB', all of which failed to achieve the necessary performance. Half the application operated for decades** before being replaced by an entirely different system; the other half is still going, although its functionality is gradually being replaced by other systems, so eventually it will be sufficiently obsolete to decommission.

        I had a lot of conversations with the DBAs of proper databases also in use within the organisation, and the solutions proposed involved throwing a great deal of very expensive hardware at the problem. The business had a very simple question: "Why do we need to spend N times more money on the proper DB to achieve exactly what we are doing now for far less?". Having a backup window during which updates were blocked was a pragmatic (and thankfully, workable) solution.

        Times have changed a great deal, and what was once very expensive hardware is now available very cheaply - Gigabytes of fast RAM, much faster processors with multiple cores, huge RAM-disks, Terabytes of spinning rust (and now, SSDs), so if you were starting again, you would simply throw enough (relatively) cheap hardware at the problem so you could run one of the newer databases. It wasn't an option then.

        It's quite interesting how, once a system is up and running, it is often cheaper to continue with it than build a more modern replacement - until a compelling event occurs - and as a result, you can find some quite remarkably old business-critical applications and systems in use. When you find you are forced to buy your spares on Ebay, that is probably a good signal that moving to a newer approach is a good idea. That doesn't stop some people, though.

        Sorry for mansplaining. Please don't take this as criticism.

        *I'm deliberately not going into detail so it is not identifiable

        **I'm carefully not saying exactly how long.

  12. Anonymous Coward
    Anonymous Coward

    Babkup Audit

    Anonymous just in case anybody might stand a chance of recognising the companies involved

    I had 6 months of easy but boring contract work in 1999 for a company who had the foresight to have there backups audited. They discovered that none of their off-site backups were valid. The process was that backups would be made to on-site tapes, then these would be cloned to off-site tapes. Problem was that the backup window was too small and the clone job was set as low priority.

    The on-site backup was working fine but the clone jobs never got to finish because the tape drives were re-allocated to higher priority jobs.

    I got to re-jig the backup, keeping 2 of the drives under manual control so that I could check that the clone jobs were all finished before releasing the drives to other tasks.

    Then we recalled all of the off-site tapes and I had the great job (when the drives were less busy in the afternoon) of feeding batches of the off-site tapes into the library to redo all of the clone jobs manually.

    It was shocking how many of the tapes arrived back from the vault company with physical damage and even some which could not be found at all.

    Some years later at another job, the same vault company was used. I made myself very unpopular with them by regularly requesting a random tape to make sure that it was available, intact and readable. There was a provision that in an emergency they would have any tape available for physical collection within 30 minutes. As I had to virtually drive past the vault on the way to one of our datacentres, I would sometimes drop in on the way and request a tape. After a while the receptionist would silently moan when I walked in.

    This was no guarantee that the backups would work, but I had the reassurance that we were taking steps to maximise the chances.

    The contract with our customer called for annual DR tests. In 8 years they agreed to 1 very limited test. They did not want to disrupt there important business activities.

    1. Doctor Syntax Silver badge

      Re: Babkup Audit

      Upvoted but...

      s/there/their/g

  13. CAPS LOCK

    All of the above notwithstanding, it's a bit hard to understand use of rm -rf ...

    ... at the command line?

    1. Alan_Peery

      Re: All of the above notwithstanding, it's a bit hard to understand use of rm -rf ...

      After you're done with the copies of data you're holding in a temporary filesystem, you clean out the temporary filesystem.

      Just make sure you're in the right filesystem... :-(

    2. Ogi

      Re: All of the above notwithstanding, it's a bit hard to understand use of rm -rf ...

      > it's a bit hard to understand use of rm -rf ... at the command line?

      Nobody ever done "rm -rf . /tempfolder"? One typo, and hell to pay if you don't notice it.

      As a junior SA I once did that on the root box of the SAN, in the company root directory. I blew away the entire company's data at 6pm, when I was tired and in a bit of a rush to go home. Wanted to clear out some temporary dirs I had created for testing. Hit enter and started packing to leave. Only when the monitoring went haywire did I log in again and realise what I had just done. Hundreds of millions of files across god knows how many divisions were deleted.

      Spent all night till 3am restoring everything, and writing scripts to pull fresh data, and then setting the permissions just right.

      The next day only 5 people noticed discrepancies in their data, which was a phenomenal result, but it was a life lesson as well. Despite managing to recover almost all the data, I was "asked to leave" shortly after (can't say I blame them).

      Thank god for ZFS snapshots and a verified backup system, otherwise the company could well have ended up having to cease trading. Or at least losing untold millions and millions before they could start to function again.

      I also am really really careful around "rm -rf" commands as root on machines now.

      1. John H Woods Silver badge

        Re: All of the above notwithstanding, it's a bit hard to understand use of rm -rf ...

        "Nobody every done "rm -rf . /tempfolder"?" -- Ogi

        30 long years ago I over-lingered on the SHIFT key, rm -r *.o became rm -r *>o and left me with a single file containing a single byte. Now I usually put an -i in, and when it seems to be right, exit and edit the command line to remove just the -i before setting it off in anger.

        But for specific critical folders you could use cp -al DoomedFolder/ QuickSnapshot/ ... Now, QuickSnapshot contains hardlinks to all the files and folders in DoomedFolder, but because you haven't copied any file data you don't need much space (just a bit for the new directory entries) or much time to do it. Now you can rm -r DoomedFolder and you've still got a second chance.
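
        Spelled out, using the same folder names, the sequence is:

          # hard-link "snapshot": no file data is copied, only new links and directories
          cp -al DoomedFolder/ QuickSnapshot/

          # delete the original; the file contents survive via the hard links
          rm -r DoomedFolder/

          # oops, needed it after all - put it back (or just rename the snapshot)
          mv QuickSnapshot/ DoomedFolder/

        (It only guards against deletion, mind - hardlinked copies share the same data, so an in-place edit or corruption hits both.)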

    3. Androgynous Cupboard Silver badge

      Re: All of the above notwithstanding, it's a bit hard to understand use of rm -rf ...

      Pray tell, how else would you have us delete a directory?

      1. petur
        Facepalm

        Re: All of the above notwithstanding, it's a bit hard to understand use of rm -rf ...

        ctrl-a and then shift-delete

        bonus: on windows it would have taken ages to delete 300GB

        1. Androgynous Cupboard Silver badge

          Re: All of the above notwithstanding, it's a bit hard to understand use of rm -rf ...

          I tried that, but it just moved the cursor to the start of the line

      2. CAPS LOCK

        Pray tell, how else would you have us delete a directory?

        I was suggesting the use of a shell script. Perhaps I should have been more explicit?

        Quite apart from that I've written my own 'del' command using 'mv' instead of 'rm' for use at the command line.

        Creating such a script is left as an exercise for the reader. Write on only one side of the intertubes.
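
        For the impatient, a minimal sketch of that kind of wrapper (hypothetical - a ~/.trash directory stands in for wherever you'd actually park things) might be:

          #!/bin/bash
          # del - move targets into a trash directory instead of deleting them outright
          TRASH="$HOME/.trash"
          mkdir -p "$TRASH"
          STAMP=$(date +%Y%m%d-%H%M%S)
          for target in "$@"; do
              mv -v -- "$target" "$TRASH/$(basename "$target").$STAMP"
          done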

        1. Doctor Syntax Silver badge

          Re: Pray tell, how else would you have us delete a directory?

          "Quite apart from that I've written my own 'del' command using 'mv' instead of 'rm' for use at the command line."

          I've had an accident with mv and managed to move /bin further down the hierarchy. It should have been possible to recover by booting from the distribution disks but the vendor had omitted the driver disk for the SCSI controller. It was the following afternoon when the controller vendor finally emailed us a driver.

          1. Anonymous Coward
            Anonymous Coward

            Re: Pray tell, how else would you have us delete a directory?

            Reminds me of the time a colleague of mine, a few years back, was making some changes to an MQ submission script that simply polled directories defined in a configuration file, pushed the files onto the queue, and deleted the files afterwards (they should have already been archived by that point in the process).

            It took them a while to notice what was going on, but the system basically ended up eating itself! Submitting all the files it was meant to, then working back up the directory tree, submitting and deleting the next directory's contents it found, and so on.

            Thankfully permissions meant the process could only really 'eat' data and a few test scripts, rather than system files, and this wasn't a production environment, although it was in use for UAT at the time! Cue questions from the client: "My data's gone, but I can't find it where it should be! Does anyone know where it went?"...

    4. CAPS LOCK

      Three thumbs down....

      ... that'll lern yer ...

  14. David Austin

    Refreshingly Honest

    That must have been a very painful document to write, but it's a great real-life scenario and a future test case - how many people have screwed up backups and kept quiet and vague about it for operational reasons or pride?

    Hopefully, someone will learn a lesson from this, but I won't hold my breath.

  15. Anonymous Coward
    Anonymous Coward

    Customers don't want backups....

    They want restores

    1. ecofeco Silver badge

      Re: Customers don't want backups....

      Magically, too!

  16. ScriptFanix
    FAIL

    Oops

    I feel for that sysadmin...

  17. Anonymous Coward
    Coffee/keyboard

    They at least have a backup backup strategy

    In that all the GitHub users will probably have local copies of their content they can helpfully send back to GitHub. Hopefully.

    1. Doctor Syntax Silver badge

      Re: They at least have a backup backup strategy

      "In that all the GitHub users will probably have local copies of their content they can helpfully send back to GitHub."

      This was GitLab.

      1. Anonymous Coward
        Anonymous Coward

        Re: They at least have a backup backup strategy

        Craps... I will bring forward my annual optician visit. Or maybe my brain is rotting.

    2. alisonken1

      Re: They at least have a backup backup strategy

      One question that I have about the local git repo - does it also contain the bug list that's kept at GitLab as well? That would be another interesting exercise.

  18. wolfetone Silver badge

    I've done this. Ran an "rm -rf" command on a production server due to an email system creating a huge log file that brought the whole server down. Early in the morning, noisy open-plan office, I run rm -rf forgetting I'm in the root directory. It wasn't until Linux started saying "/boot/ could not be removed" that I noticed, and I'm thinking "Why is there a /boot/ directory in this folder?". I cancelled what I could, but it was too late.

    The server was off for 36 hours because the great guys at Rackspace tried to restore a 120GB backup to the 60GB drive that was unaffected by my mistake.

    But hey, it was the first time in my career I did that and so far it's been the last time.

    1. Androgynous Cupboard Silver badge

      Back in the days before package management I was upgrading some libraries including ld.so - the dynamic library loading library. I moved or deleted the old one, and the next command to run was "mv newlibrary.so ld.so". But of course "mv", along with every other command on the OS, was dynamically linked. It didn't end well, although I did learn my lesson.

  19. simpfeld

    It's too easy to blame the sysadmin

    I'm sure they'll get all the stick, and there are certainly failings.

    But management often tends not to be so interested in DR until something like this happens - especially at companies that are running just to keep up with constrained resources.

    I have seen IT departments want to test DR many times, and management will not provide the resources (equipment and/or staffing) to do it. Nor will they accept any interruption to production systems to test a DR solution.

  20. Anonymous Coward
    Anonymous Coward

    It even has its own hashtag

    #RMRFocalypse.

    Pretty bad, but hopefully they will learn from this £xp£ri£n$e and test their backups properly.

    I've lost data before, though not quite on this scale. Lesson learned: lock your PC, especially when some random work-experience drone comes along with his two friends and goes "Oh lookie, a hex editor"... Facepalm!!!

  21. clocKwize

    The only way to be confident in your backup plan is to have tests to make sure it's working.

    If you back up nightly, you could automate grabbing the latest backup, restoring it to a throwaway instance, and ensuring that it completed properly by checking record counts in various tables. You could run that every other day - or better yet, run it as soon as your backup process has completed, to verify that it has indeed worked.
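
    Something along these lines, for illustration - the database and table names here are placeholders, not anyone's real setup:

      # restore last night's dump into a scratch database and check that a
      # couple of key tables actually contain rows (names are examples)
      createdb restore_test
      pg_restore -d restore_test /backups/latest.dump
      rows=$(psql -At -d restore_test -c 'SELECT count(*) FROM projects;')
      [ "${rows:-0}" -gt 0 ] || { echo "restore check FAILED" >&2; exit 1; }
      dropdb restore_test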

    You could still get caught out in many ways but verification to some extent would give you more confidence.

    I can understand how this happened though; start-ups are not the same as large corporations with the resources to have people spend a long time ensuring backups are rock solid and testing disaster recovery monthly. In an ideal world that'd be quite high on the agenda, but realistically, breaking even is the first hurdle and you don't (technically) need a backup plan for that, so it gets put to the bottom of the list.

    1. Doctor Syntax Silver badge

      "realistically, breaking even is the first hurdle"

      And not breaking is the zeroth hurdle.

    2. Charles 9

      That's if you can afford a spare instance or some other failover. Many CAN'T. Yes, it's stupid, but if you're stuck in the middle of the ocean with nothing but a piece of flotsam, what options do you have besides exhausting yourself treading water?

      As said, breaking even is priority one because you're obligated to your investors first. If they don't agree with you about long-term investments, then you're stuck again, because they can pull out, killing you BEFORE the disaster hits.

  22. cuddlyjumper

    Livestreaming

    They are actually live-streaming the rebuild here, in case anyone is interested:

    https://www.youtube.com/watch?v=nc0hPGerSd4

    It might seem gimmicky, but it does at least offer a seemingly unparalleled level of transparency in such a situation.

    1. wolfetone Silver badge

      Re: Livestreaming

      It's funny that you can see all of these gamer profiles coming on and asking what GitLab is.

  23. Mage Silver badge
    Facepalm

    GitLab last year decreed it had outgrown the cloud

    Irrelevant,

    Either way, it's only a shared service to allow collaboration. EVERYONE should have their own complete backups.

    Gitlab itself is surely a "Cloud" service?

  24. Matt Siddall

    Saw this last time backups were in the news - seems relevant

    Yesterday,

    All those backups seemed a waste of pay.

    Now my database has gone away.

    Oh I believe in yesterday.

    Suddenly,

    There's not half the files there used to be,

    And there's a milestone hanging over me

    The system crashed so suddenly.

    I pushed something wrong

    What it was I could not say.

    Now all my data's gone

    and I long for yesterday-ay-ay-ay.

    Yesterday,

    The need for back-ups seemed so far away.

    I knew my data was all here to stay

    Now I believe in yesterday.

    1. wolfetone Silver badge

      Re: Saw this last time backups were in the news - seems relevant

      Na, na na

      NA NA NA NA,

      NA NA NA NA,

      Back ups

  25. Anonymous Coward
    Anonymous Coward

    @Doctor Syntax

    I grovel most apologetically. Silly mistake on my part. That is normally something which makes me wince when other people do it.

    Have an upvote for your trouble.

  26. This post has been deleted by its author

  27. kalman

    Barman

    So it seems:

    1) They are not using Barman for backup management in PostgreSQL

    2) They thought they could reclaim the space with a vacuum (it needed to be a VACUUM FULL, otherwise the space reclaim they needed wasn't happening)

    3) They don't have PITR ready to be used

    This is how IT works today: "We need a database." "Sure." <after googling> "apt-get install postgresql". "Done." (Sometimes you even find a ready-made Docker image...)
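
    For the record, point-in-time recovery on stock PostgreSQL 9.6 is little more than WAL archiving plus a base backup. A rough sketch (the paths and target time are made up, and this is not GitLab's actual configuration):

      # 1. in postgresql.conf, ship every WAL segment somewhere safe:
      #      wal_level = replica
      #      archive_mode = on
      #      archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'
      # 2. take a base backup to replay that WAL against:
      pg_basebackup -D /backup/base -Ft -z -X fetch
      # 3. to recover, unpack the base backup and add a recovery.conf:
      #      restore_command = 'cp /backup/wal/%f %p'
      #      recovery_target_time = '2017-01-31 22:00:00'

    Barman essentially automates that dance, and its checks complain when the WAL archive stops receiving segments.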

  28. sillyfudder
    Facepalm

    not just rm

    I created typo-geddon once using chown.

    As root, I tried to give my user ownership of files from my current dir down (./) and instead put in the space of doom, changing ownership from the root directory on down.
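
    For anyone who hasn't met the space of doom, the difference is a single character (the username is a placeholder):

      # intended: recursive chown from the current directory down
      chown -R myuser ./
      # the space of doom turns that into "chown -R myuser . /" -
      # i.e. the current directory AND the root of the filesystem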

    I realised the command was taking too long after about 3 seconds and hit ctrl-c.

    I genuinely thought I'd got away with it at first until the machine had to reboot, and a lot of the fundamental stuff that ran the machine (IBM AIX) got read from disk again.

    The box itself could be done without for a while so my boss at the time got me to mount the drives and undo a lot of the damage manually to reinforce the lesson (which I've never, yet, had to relearn).

  29. OliP

    1. - They were honest

    2. - They are now live streaming the recovery process via YouTube.

    https://www.youtube.com/watch?v=nc0hPGerSd4

    New gold standard in dealing with customers after a mammoth fuck up, if you ask me.

    Still - shouldn't have happened in the first place, but compare this to other companies and I'm not sure I could ask for more.

    *Not a GitLab customer

  30. Alistair
    Windows

    ick

    This is just one of those horribly ugly situations. I feel for the SA that hit the wrong command in the wrong place at the wrong time. We are all human, and I'm sure anyone who's been in the business long enough has done something near enough identical to feel for this individual.

    I've found that being in that place, *THE* most critical thing to do right then and there is stand up and tell those that absolutely need to know that you've buggered up. And if you know what you can do to recover, lay out those options (Please, note the plural there, you should have more than one option). Otherwise bellow for assistance. Seems this SA at least hit that set of rules.

    Backups. Snapshots. Copies. etc.

    They can *all* fail at different times for different reasons.

    This sounds to me like a case of too many disconnects between groups as to which is what and who owns what.

    I've written DR plans. I've executed them. I've audited DR execution. I've fixed DR plans after the test. I've tried. Really, I have tried. But unless your DR process is part of your day-to-day execution, those plans turn to crap every six months or so, since the apps and systems you're restoring change pretty damned rapidly nowadays.

    Now, I'm gonna go back to trying to figure out why 6 tape drives on a sun box have crossed up data and control path device files.

  31. creepy gecko
    Facepalm

    Oops!

    I feel sorry for the sysadmin, but the failed backups are almost beyond comprehension.

  32. Diginerd
    Alert

    Two Words - CHAOS MONKEY

    https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey

    Testing [RECOVERY] in production is like parachuting without a safety chute...

    ...if things go truly pear-shaped you're only gonna do it once.

    This little guy will suffice as the adult in the room. ;-)

    1. Charles 9

      Re: Two Words - CHAOS MONKEY

      Right, but what if that's your ONLY unit?

      1. Diginerd

        Re: Two Words - CHAOS MONKEY

        1oz of prevention > 1lb of cure.

        1. Charles 9

          Re: Two Words - CHAOS MONKEY

          But sometimes, you're not even allowed the ounce. What then?

  33. ecofeco Silver badge

    $20 Million?

    Too bad they didn't spend enough to hire an admin who knew what they were doing. Or the accountant. Or the boss.

    There's an old saying: It's good to save money in business, but you can save yourself right out of business.

    1. theblackhand

      Re: $20 Million?

      Given the recovery process, it looks like the DBA is pretty competent - he may have made a huge mistake (in my experience, competent people can hugely misjudge the risk of their actions) and is fixing it. It's not a perfect fix, but being able to recover all but six hours of data, and to quantify what was missing, in under a day isn't bad given the number of issues found.

      It looks like the root cause was an attempt to get replication working from live to staging, which broke the db1-to-db2 replication process - the issue may have been related to performance limits in the staging environment. There was then a period of high DB utilisation that may have partially contributed to the replication problem, either directly or indirectly by distracting the DBA. While I can understand the thought process behind deleting the db2 replica and starting again, there was a risk in those actions that was unfortunately realised. At that point, things started to go horribly wrong as all the backup issues were discovered.

      The bit that is missing is why all the backups failed. I suspect the backups and backup process had been tested in the past against the earlier DB versions. PostgreSQL 9.6 is reasonably new (September 2016), so they may have had a working backup strategy up until at least then - and, arguably, based on their issue tracker, until mid-December 2016.

      Why is this important? Read through the comments about testing backups and ensuring high availability. They probably had both until last month when they upgraded the database...
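
      If so, it's the kind of thing a dumb assertion in the backup job would have caught - for example (placeholder database name and path, not necessarily how GitLab runs it):

        # an older pg_dump refuses to dump a newer server, and under cron
        # nobody sees that unless the exit code is checked
        pg_dump --version
        psql -At -c 'SHOW server_version;'
        pg_dump -Fc mydb > /backups/mydb.dump || { echo 'pg_dump FAILED' >&2; exit 1; }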

      1. David Roberts

        Re: $20 Million? - no testing or complacency?

        @theblackhand I was going to post much the same.

        The backup plan was so broad reaching that it is very unlikely that it was never tested.

        The article includes a bit about using outdated versions which "failed silently".

        My suspicion is that the backup strategy was tested so comprehensively and had so many fail safes that everyone assumed that they were covered and neglected to check on a regular basis because it was "too good to fail".

        All those posting that it was obviously never tested: reveal your position as an insider, provide other verifiable proof, or STFU.

  34. Nate Amsden

    my money would be on bad management

    It seems like their setup was rather fragile. I'd put my money on not having enough geek horsepower to do everything they wanted to do, having been in that situation many times myself. Even after a near disaster with lots of data loss (and close to a week of downtime on backend systems), the company at the time approved the DR budget, only to have management take it away and divert it to another underfunded project (I left the company weeks later).

    One place I was at had a DR plan and paid the vendor $30k a month. They knew even before the plan was signed that it would NEVER EVER WORK. It depended on using tractor trailers filled with servers, and having a place to park them and hook up to the interwebs - and we had no place to send them (the site the company wanted to use flat out said NO WAY would they allow us to do that). We had a major outage there with data loss (maybe 18 months before that DR project): they had been cutting costs by invalidating their Oracle backups every night, opening them read-write for reporting/BI (they knew this in advance; it wasn't a surprise). So when the one and only DB server went out (storage outage) and lost data, they had a hell of a time restoring the bits that were corrupted, because the only copy of the DB had been invalidated. ~36 hrs of hard downtime there, and we still had to take random outages to recover from data loss every now and then for at least a year or two afterwards. The backups were never once tested (and the only thing that was backed up was the Oracle DB - not the other DBs, or the web servers, etc). Ops staff were so overworked and understaffed, and there were major outages constantly because of bad application design.

    Years after I left, I sent a message to one of my former team mates and asked him how things were going; they had moved to a new set of data centers. His response was something like "we're 4 hours into downtime on our 4-nines cluster/datacenter/production environment" (or was it 5 nines, I forget).

    I've never been at a place where even, say, annual tests of backups were done - never the time or resources to do it. I have high confidence that the backups I have today are good, but less confidence that everything that needs to be backed up is being backed up, because in the past 5 years I am the only one that looks into that stuff (and I am not a team of 1); nobody else seems to care enough to do anything about it. Lack of staffing, too few people doing too many things... typical, I suppose, but it means there are gaps. Management has been aware, as I have been yelling about the topic for almost 2 years, yet little has been done. Though progress is now being made, ever so slowly.

    At the place that had a week of downtime, we did have a formal backup project to make sure everything that was important was backed up (there was far too much data to back up everything, and not enough hardware to handle it, but much of it was not critical). So when we had the big outage, sure enough people came to me asking to restore things. In most cases I could do it. In some cases the data wasn't there -- because -- you guessed it -- they never said it should be backed up in the first place.

    I've been close to leaving my current position probably half a dozen times in the past year over things like that (backups are just a small part of the issue, and not what has kept me up at night on occasion).

    I had one manager 16 years ago say he used to delete shit randomly and ask me to restore just to test the backups (they always worked). That was a really small shop with a very simple setup. He didn't tell me he was deleting shit randomly until years later.

    It could be the geeks' fault though. As a senior geek myself, I have to put more faith in the geeks and less in the management.

  35. Cynic_999

    Hindsight is a wonderful thing

    I am reading a lot of sanctimonious comments from people explaining how this could never happen to them because they always test everything and are well-prepared for a failure event. I'd like the people making those comments to honestly answer the following questions:

    1) Do you presently have a spare can of fuel in your car?

    2) Do you have a spare can of water in your car?

    3) Do you have a torch (flashlight) in your car?

    4) Do you carry warm clothes and/or blankets (in case you get stuck in a traffic jam etc. overnight)?

    5) How regularly do you check the air pressure in your spare tyre?

    6) When did you last check that your brake lights were working OK?

    1. Herby

      Re: Hindsight is a wonderful thing

      Yes, it is, but sometimes you need to understand the risks of doing too much.

      Sometimes you need to just rely on your design and, after proving you have made it as good as possible, let it go. One example of this is the ascent stage of the lunar lander. That rocket was only fired ONCE, for the takeoff from the moon. It was NEVER tested, since the act of testing it with the fuels/oxidizers involved degrades/destroys the engine itself. They built it to be as bulletproof as it could be and over-engineered it a bit more. It used a hypergolic fuel mixture and simplified fuel flows (I believe they used gas pressure to empty the tanks), and it had only one speed (ON!). Guess what: it worked EVERY time. As for the vehicle that I use every day:

      1) Do you presently have a spare can of fuel in your car?

      No, but I do watch my gas gauge, and if I forget, I have a AAA (US; AA in the UK) card that will get me some.

      2) Do you have a spare can of water in your car?

      No, but the one time the cooling system failed (it was a couple of months ago), I could pull over, park, and wait for a tow.

      3) Do you have a torch (flashlight) in your car?

      Yes, it is only common sense. This is a small device that takes up little space, and has other benefits.

      4) Do you carry warm clothes and/or blankets (in case you get stuck in a traffic jam etc. overnight)?

      No, but in the cases where this might have been a problem I was traveling to a ski area overnight, and DID have some warm clothes - I was actually wearing them.

      5) How regularly do you check the air pressure in your spare tyre?

      While not on my vehicle, automatic pressure telemetry is now required on new vehicles. I do get my tires rotated on a regular basis (5,000 miles) and it is checked there.

      6) When did you last check that your brake lights were working OK?

      Thankfully the vehicle's electronics DO check this (modern cars!). As for older vehicles, no brake lights will usually get you rude warnings (horn honks) from people behind you. Good practice to check every so often when servicing.

      So while you do bring up valid points, overthinking things like this can get too extreme. Thankfully the faults described do not cause my vehicle to spontaneously destroy itself, whereas lack of a proper computer backup can be catastrophic (to say the least).

    2. G2

      Re: Hindsight is a wonderful thing

      7) do you have all of the above and a SPARE CAR?

      8) do you keep the car engine running (or at least run it a few hours per day), drive it a few miles, and keep it fuelled all the time? (= live backup system, just in case... )

      9) do you have all of the above in a third spare car that's kept running, fuelled and road-worthy all the time on the other side of the continent? (= live backup data being kept in multiple locations)

      and so on... the logistics of these things keeps getting more complex.

  36. Anonymous Coward
    Anonymous Coward

    Dumb

    How can you not notice backups only being a few bytes of data?

    I get database backups can be tricky, but C'MON MAN!!!

    I think there is a job opening for a database admin...
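
    To be fair, noticing requires someone (or something) to actually look. A few lines in the backup job would do it - the path and threshold here are just examples:

      # refuse to call a backup "done" if the newest dump is suspiciously small
      latest=$(ls -t /backups/*.dump 2>/dev/null | head -n 1)
      size=$(stat -c %s "$latest" 2>/dev/null || echo 0)   # GNU stat
      if [ "${size:-0}" -lt 1048576 ]; then                # under 1 MB is suspect
          echo "backup check FAILED: ${latest:-none} is only ${size} bytes" >&2
          exit 1
      fi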

  37. Herby

    Option for 'rm'??

    Maybe if 'rm' is invoked as root (maybe by any user?) with '-rf' in the arguments, it should count the number of files it might delete and say:

    Wow over 1000 files, are you sure?
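
    There's no such stock option, of course, but a bash wrapper along these lines (purely illustrative) gets most of the way there:

      # count the blast radius before handing the arguments to the real rm
      rm() {
          if [ "$1" = "-rf" ]; then
              count=$(find "${@:2}" 2>/dev/null | wc -l)
              if [ "$count" -gt 1000 ]; then
                  printf 'Wow, over 1000 files (%s) - are you sure? [y/N] ' "$count"
                  read -r answer
                  [ "$answer" = "y" ] || return 1
              fi
          fi
          command rm "$@"
      }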

    Me? Typically I do it without the 'f' option and see how it progresses, then abort and re-do with the added '-f' option as needed. I get very careful with recursive descents (with good reason!).

  38. Uplink

    Wipey

    YP.... Wipey... He'll never outrun his name now.

  39. jonfr

    Backup of my backups

    I don't have a company, and I have backups of my backups. I never know when a hard drive will fail. I only back up important stuff that I cannot replace from elsewhere.

    I would like to have doubly or triply redundant backups elsewhere, but I have a limited budget at the moment. I'll just work with what I've got.

    As for the company in question, I think lack of experience produces this type of error, resulting in large-scale problems like this one. Also, poor attention in school when people learn about computers and how they actually work.

    1. wstewart

      Re: Backup of my backups

      Have you tested the backup of your backups? Backing up a corrupt/bad backup will get you exactly where GitLab is. This story is pretty funny though. I'd expect better from a company with their name recognition. That rm -rf that was mistakenly run as part of a replication process is one of the main reasons for automation. Also, I can't stop laughing at "The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented" and "Our backups to S3 apparently don’t work either: the bucket is empty".

      1. jonfr

        Re: Backup of my backups

        The second layer of backups is a cloud service (no good way to test it, but the reported size is correct); the primary backups are fine. I always test them, and the reported hard drive usage is as expected.

  40. Hot Diggity

    Expensive Mistakes

    I'm reminded of someone who made a mistake that cost a company a large amount of money.

    The person concerned was called to the CEO's office.

    "I suppose that you want me to leave the company" he said shame-facedly.

    "Leave? We just spent over $1 million on your education. Just don't do it again!"

  41. dmacleo

    irony...

    GitHub (last time I looked, months ago) hosts backup programs that most likely would have worked on the GitLab database....

    *************************************************************

    Shoot, disregard this - I forgot they were separate entities.

    Left my stupid comment up to make this comment make more sense.

  42. petef

    DVCS

    Nobody seems to have mentioned that this is git. Every checked-out repo has the full history, so the code is intrinsically backed up.
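
    Which is worth spelling out: any clone can repopulate a fresh remote (the URL below is made up), though it only covers the repository itself - issues and merge requests live in the database, which is exactly what got hit.

      cd my-project
      git log --oneline | wc -l        # the full commit history is already local
      git remote set-url origin git@new-server.example.com:team/my-project.git
      git push --all origin            # every branch
      git push --tags origin           # and the tags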
