GitLab.com melts down after wrong directory deleted, backups fail

Source-code hub GitLab.com is in meltdown after experiencing data loss as a result of what it has suddenly discovered are ineffectual backups. On Tuesday evening, Pacific Time, the startup issued a sobering series of tweets we've listed below. Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had …

  1. Anonymous Coward
    Anonymous Coward

    Great

    I won't even connect to our gitlab.

    So I guess we lost all our synced files.. GREAT.

    I have a copy of everything on local.. and 3 backups. Tomorrow I will see if my work colleagues are as paranoid as I am.

    1. Anonymous Coward
      Anonymous Coward

      Super! Re: Great

      I assume you've tested your backups.

      Unlike Gitlab.

      1. Charles 9

        Re: Super! Great

        Hasn't it been said you can't really practice for an emergency without an emergency, in which case Murphy will get you either way?

        1. Jenny with the Axe

          Re: Super! Great

          Actually, I've worked in places that did emergency testing. They did things like "let's kill all the connections through one datacenter and see if our customers can still access their stuff." Also "let's restore the backups to this test system and see that it works". As Gitlab has now discovered, backups without restore testing are not backups....

          1. Charles 9

            Re: Super! Great

            You're lucky to have the budget to do it. Many times, people only have ONE live system (all they can afford) which MUST remain up 24/7, so no way to do a test. No test system to try the restore on (and besides, it's different from the live system, so things can still go wrong in the real event), and no way to really test for emergencies because they depend on things that ONLY occur in real emergencies, such as power going out not just to the floor but to the whole building (and perhaps next door as well, just to be sure nothing was plugged into a jury-rigged supply).

            1. Anonymous Coward
              Anonymous Coward

              Re: Super! Great

              Maybe most people shouldn't have nice things if they can't afford to maintain them?

            2. Mark 78

              Re: Super! Great

              "You're lucky to have the budget to do it. Many times, people only have ONE live system (all they can afford) which MUST remain up 24/7, so no way to do a test."

              Maybe you should rethink that..... Are you lucky enough to have the budget to NOT run a restore and risk the loss of everything? The cost of a properly tested restore system should be a vital part of any project budget. Your lack of a test/restore system is basically saying to your customers that if things go wrong we are bankrupt and possibly they are as well (if they are external customers).

              1. asdf

                Re: Super! Great

                >Many times, people only have ONE live system (all they can afford) which MUST remain up 24/7, so no way to do a test.

                Sounds to me like the failure is in the business model of the company. Those generally are the type of companies that are one recession or self created disaster away from administration.

                1. Charles 9

                  Re: Super! Great

                  "Sounds to me like the failure is in the business model of the company. Those generally are the type of companies that are one recession or self created disaster away from administration."

                  That's why it's called living on the razor's edge, where margins are close to zero all the time. You'd be surprised how many firms HAVE to live like this because they flip between profit and loss every month. You're floating in the ocean and you barely have the stamina to tread water. Sometimes, that's all you're dealt. All you can do is hope for shore or some flotsam.

                  1. asdf

                    Re: Super! Great

                    >Sometimes, that's all you're dealt. All you can do is hope for shore or some flotsam.

                    That's fine if you are a brave entrepreneur who has few limits on what he/she can reap if they succeed, but taking a job at one of those companies is another matter (especially without a big ownership stake). A big part of job interviewing from the view of the interviewee is figuring out if the company is one of those companies. If you do take the job then it probably means you need to do a better job researching companies or you need to increase your skills and experience so you don't have to work for those type of companies for long if at all.

                    1. asdf

                      Re: Super! Great

                      Forgot to expand on the whole 'hoping to cash in on the ground floor of a startup' angle, which again is fine I suppose if you are young or aren't the only income earner in your family, but it still probably won't end up being one of your wiser choices. If you are lucky you might get to keep the actual pets.com puppet after everything goes sideways, though.

                    2. Charles 9

                      Re: Super! Great

                      "A big part of job interviewing from the view of the interviewee is figuring out if the company is one of those companies. If you do take the job then it probably means you need to do a better job researching companies or you need to increase your skills and experience so you don't have to work for those type of companies for long if at all."

                      Or it simply means you're out of options. If they're the ONLY opening, then as they say, "Any port in a storm."

                      1. asdf

                        Re: Super! Great

                        >Or it simply means you're out of options. If they're the ONLY opening, then as they say, "Any port in a storm."

                        Which is fine unless you spend decades in that situation and then turn around and blame globalization for all your problems. Not you per se of course, but a significant number of people.

                        1. Anonymous Coward
                          Anonymous Coward

                          Re: Super! Great

                          Ever thought for some people it's absolutely true?

            3. Doctor Syntax Silver badge

              Re: Super! Great

              "Many times, people only have ONE live system (all they can afford) which MUST remain up 24/7"

              If you have only one system which must remain up 24/7 you only have two choices, a huge budget or eventual failure.

            4. Tikimon
              FAIL

              Re: Super! Great

              If your system is set up such that your disaster remediation cannot be tested, it is set up WRONG. There, I said it.

              I've shared my personal example before. At a previous job, my so-called supervisor was messing with SQL queries and ran a Delete thinking it would clear his query. Blew away the whole database with a single click. They never let me test the backups, claiming "24/7, can't be down for testing!". Instead we were down for THREE DAYS while an SQL consultant helped rebuild from scratch, and we never recovered all the data.

              Disaster testing always creates some inconvenience, but that's no excuse to skip it. A smart captain never complains about the lifeboat drills.

            5. Anonymous Coward
              Anonymous Coward

              Re: Super! Great

              "Many times, people only have ONE live system (all they can afford) which MUST remain up 24/7"

              Words fail me. I would characterize that as planning to fail - which is pretty expensive also.

            6. theblackhand

              Re: Super! Great

              "Many times, people only have ONE live system (all they can afford) which MUST remain up 24/7"

              Is that 24/7 as in always operational or 24/7 in that if something goes wrong IT won't be in until the morning to fix it so someone will complain?

      2. Halfmad

        Re: Super! Great

        This sort of situation is why I constantly nag my IT department about testing backups; the backup product saying the backup verified OK is not the same as doing the occasional bare-metal restore just to check.

      3. Anonymous Coward
        Anonymous Coward

        Re: Super! Great

        Yes, I have tested all my backups, and we lost 0 data.

        An untested backup does not exist in my book.

    2. Anonymous Coward
      Anonymous Coward

      Re: Great

      No repository data was lost, it was *ONLY* the database that got rolled back 6 hours. Your code is safe.

      1. Anonymous Coward
        Facepalm

        @brodrock

        "No repository data was lost"

        YET.

  2. kain preacher

    How would you like to be the one to tell your boss "I think I accidentally deleted the company"?

    1. Anonymous Coward
      Anonymous Coward

      Maybe the company should be deleted. If you can't handle your own backups without a "cloud" service, just how non-functional are you at the center?

    2. Voland's right hand Silver badge
      WTF?

      Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.

      A company running a production service with this backup strategy in place should be deleted.

      1. Adam 52 Silver badge

        All those rushing in to criticise, how confident are you in your processes?

        Before the event these guys were very confident. They had LVM snapshots, DB backups, Azure snapshots and even a copy on S3 in case Azure goes down. I bet they were even tested in the first place.

        Even after all this very public embarrassment they do still have a copy from 6 hours earlier.

        How many of you could suffer multiple failures and still be able to do a same day recovery?

        1. codejunky Silver badge

          @ Adam 52

          Having read your comment you have restored my faith in people. On topics of data loss where the backup failed I usually see comments that 1 backup is none and 2 is 1, etc, so these guys had 4 by that rule and still got stung. I feel sorry for whoever nuked the data and for the people responsible for making the various backup scripts. I know if it was me I would be feeling bad and wondering what I could have done different, and in future probably becoming more paranoid of all the systems I have in place.

          I wish these guys well and hope they don't get too much stick, just the opportunity to fix it.

          1. Peter2 Silver badge

            Re: @ Adam 52

            Mmm.

            There is an old tale called the Tao of Backup, written in the days when NT4 was the shiny new thing people aspired to and the press was talking about the upcoming Win98. It's still available online.

            http://www.taobackup.com/

            Skipping the first four points we get to :-

            5. Testing

            The novice asked the backup master: "Master, now that my backups have good coverage, are taken frequently, are archived, and are distributed to the four corners of the earth, I have supreme confidence in them. Have I achieved enlightenment? Surely now I comprehend the Tao Of Backup?"

            The master paused for one minute, then suddenly produced an axe and smashed the novice's disk drive to pieces. Calmly he said: "To believe in one's backups is one thing. To have to use them is another."

            The novice looked very worried.

            It then links to this page, which lists some of the things that can go wrong through no fault of your own.

            http://www.taobackup.com/testing_info.html

            Gitlab didn't have backups. They had backup scripts that didn't work. Having those sorts of problems happens. Not testing your backups and discovering those problems before an emergency hits is inexcusable.

            1. TVU Silver badge

              Re: @ Adam 52

              "Gitlab didn't have backups. They had backup scripts that didn't work. Having those sort of problems happens. Not testing your backups and discovering those problems before an emergency hits is inexcusable."

              ^ I fully agree with this, and the question then becomes, "Will Gitlab learn from this experience?". I hope that they do, and now set up and test suitable backup systems.

              Examples of how not to do things include TalkTalk: multiple breaches, they do nothing, bad things happen again, and then customers leave.

              1. Adam 52 Silver badge

                Re: @ Adam 52

                "Not testing your backups and discovering those problems before an emergency hits is inexcusable"

                How do you know they didn't test? I guess they did test, but the Postgres version changed and the database backup stopped working.

                So they didn't test regularly enough. What is regularly enough? And how many people actually do that, rather than take a holier-than-thou attitude on a forum?

          2. Keith Langmead

            Re: @ Adam 52

            Completely agree with both of you. There's no doubt they made mistakes; backups, replication etc should have been tested properly and clearly weren't. But do you know what really hit me reading about this? How open they've been about the whole thing. From the initial issues through to publishing the live notes for the restoration attempts, I'm personally very impressed that they've been open and honest about what's happening. So many other companies would and have hidden behind vague "we're working on it" responses, and while it doesn't undo their failure to test, I think their honesty does need to be acknowledged and commended.

            1. Peter2 Silver badge

              Re: @ Adam 52

              I have to agree with that. It is refreshing to see "Mea Culpa", rather than an extract from "Yes Minister".

          3. Oh Homer
            Childcatcher

            Re: "what I could have done different"

            They could start by delegating someone to be responsible for checking the results of their backup plan, on a daily basis, to ensure that backups are actually being made and are valid.

            As a bare minimum they should have scripted this to fully test the output (Is the target accessible, does it contain a backup, is the backup valid?) then email a warning in the event of failure. Or better yet, use the dead man's switch method by emailing the successful results of a fully verified backup, then have a warning issued on the admin's local system in the event that no such message is received, to account for not only backup failures but communication failures too, and beyond that have the admin manually check for such messages on a set schedule, in case the local warning system (cron et al) doesn't work for any reason.
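
            A minimal sketch of that kind of check - entirely illustrative, with made-up paths, thresholds and mail recipient rather than anything GitLab actually ran - might look like:

              #!/bin/bash
              # verify_backup.sh - run after the nightly backup job (illustrative paths throughout)
              BACKUP_DIR=/backups
              LATEST=$(ls -t "$BACKUP_DIR"/*.tar.gz 2>/dev/null | head -n 1)

              fail() { echo "$1" | mail -s "BACKUP FAILED" admin@example.com; exit 1; }

              # Is the target accessible, and does it contain a backup at all?
              [ -n "$LATEST" ] || fail "No backup file found in $BACKUP_DIR"

              # Is the backup a plausible size? (files of a few bytes = the GitLab failure mode)
              SIZE=$(stat -c %s "$LATEST")
              [ "$SIZE" -gt 1048576 ] || fail "Backup $LATEST is only $SIZE bytes"

              # Is the backup valid? (a tar archive can be listed without extracting it)
              tar -tzf "$LATEST" > /dev/null || fail "Backup $LATEST is not a readable tar archive"

              # Dead man's switch: mail the successful result; silence means trouble.
              echo "Backup $LATEST verified, $SIZE bytes" | mail -s "Backup OK" admin@example.com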

            Beyond that minimum effort, they should also have had full server replication on a hot spare, not only for failover but also for bare metal restore testing, which is ultimately the only way you can be sure your backup process really works, as opposed to "seems to work", "completed without error" and "says it verified", all of which amount to exactly nothing. Until you have successfully performed an actual restore, you simply don't have a backup.

            The most shocking part of this is not that backups failed, but that nobody noticed, at least not until it was already too late.

            Essentially they had no process in place, they had a token effort that was cobbled together then not even verified as working. That's the unforgivable bit.

        2. Michael

          Disaster recovery

          It is all relative.

          For example, in our small team, we destroy our entire staging environment and perform a complete restore from backups once every month or two. We also restore the production backups to a separate environment to verify that they are working. I deliberately get someone that hasn't done it recently to check the documentation is correct and everything works. Given that we also have the git repos and source on multiple dev machines for everything, the worst case data loss is from a dev not committing regular changes. But everything is tested, and I believe regularly enough for a team of 10.

          If I was hosting thousands of people's data, I'd expect more regular testing of the backup and recovery procedure. They can never have tested it with production data as they have no working backups. That is just incompetent.

        3. Ogi

          > All those rushing in to criticise, how confident are you in your processes?

          Very confident in our backup system. Quite frankly because I insist on regular restore testing from backups, precisely for this reason. A habit I got into many years ago when I worked for a big company where we stored very sensitive personal data.

          Until you have done a successful restore, you don't have a verified working backup system. This is sysadmin 101. It isn't something exotic or really hard to wrap your head around.

          There was the joke that you could back everything up to /dev/null just fine. However the restores would be much harder. Sounds like the gitlab guys inadvertently used a similar concept as their backup strategy.

          I can understand the accidental rm -rf, mistakes happen, we are human after all. However backing up (presumably for years) without ever even bothering to check the size of your backup is just gross negligence. Surely a quick check of the backups, noticing all the files are suspiciously 512 bytes in size, would have been enough to draw attention that maybe something wasn't working in the backup system.

          If what they say is true and they essentially have no backup, then this was a completely avoidable situation that was of their own doing.

        4. anothercynic Silver badge

          Amen, @Adam 52! Well said!

        5. Anonymous Coward
          Anonymous Coward

          All those rushing in to criticise, how confident are you in your processes?

          You're right, although I'm old school - where I come from, backups aren't as important as restores, so a backup wasn't a backup until you could prove you could actually use it to recover.

          That said, the first ever from-the-metal-upwards restore test I did was the single most nerve-wracking thing I've done in my career, despite the fact that I had two separate backups and a hot standby site to take over if recovery took more time than the available test window.

          That said, you haven't earned your sysadmin badge if you haven't learned the "rm -rf" lesson the hard way...

          1. Doctor Syntax Silver badge

            "That said, the first ever from-the-metal-upwards restore test I did was the single most nerve-wracking thing I've done in my career"

            I've worked in two places where we had DR contracts including rights to run practice restores. They can be learning experiences, especially the first one. /etc was the last directory on the last dump tape. We had to sit twiddling thumbs waiting to get a system we could log into & then ran out of time before we had a restored system. It ensured the dumps were better organised for the next pass.

        6. Doctor Syntax Silver badge

          "How many of you could suffer multiple failures and still be able to do a same day recovery?"

          Wrong question.

          How many of you could have five non-functioning backup mechanisms and not be aware that none of them worked because you hadn't tested them?

        7. Anonymous Coward
          Anonymous Coward

          "All those rushing in the criticise, how confident are you in you processes?"

          Confident enough to know it works through testing.

          I could defend an honest mistake, not a short-sighted lazy one. It could be argued that they were not lazy (who knows, maybe their testing matched their design), but they were clearly short-sighted. Either they were or they were not designing to be bullet-proof; which was it?

          All they had to do was test for the worst. This isn't an after-school special; not everybody is a winner, not everybody deserves a reward. At least they owned it. They could have said "You're holding it wrong."

        8. DJ Smiley

          At my 2nd day of a job, I deleted the entire stack of the test system with a misplaced rm -rf.

          I crapped myself thinking I'd be instantly fired. My boss made some 'angry' sounds, then told me it wasn't the biggest issue as they needed to try a fresh install of the new version anyway (as that's how the new version would be rolled out in production, rather than upgrading, which is what they normally did on the test servers).

          This also allowed them to fully test the backups, pulling the older data from the production backups, anonymising it as required and also finding some faults with various processes that were included but didn't work after the upgrade. In all the test system was down for about 4 days instead of 1, but the fixing of the systems to allow it to get the go-ahead in production took a month or more. If I'd not 'slipped up' then they wouldn't have known about these issues until trying to go live in production, and if so, it would have been a very long night of around 6-8 hours reinstalling the older version back into production (after the 6-8 hours of installing and testing the new version).

          This attitude of 'we can't afford to test it' is utter bollocks. You fire up as many VMs as required in the cloud, and you at least verify the _data_ is there, even if the functionality isn't. It's bad to find the code for the production system isn't backed up as much as you think it is; it's unrecoverable to find out the data is gone.

          These guys got lucky: if he hadn't taken that copy 6 hours before, they'd be dead in the water and the company would be gone.

          1. Doctor Syntax Silver badge

            "In all the test system was down for about 4 days instead of 1"

            You could even look on it as an unplanned 4 day test because, as you described it, that's what it was.

        9. Tikimon
          FAIL

          How confident are we? VERY.

          We test our backup processes quarterly. Real data is deleted and restored from backup, the files opened to check integrity. Failover servers are shut down and the handoff to partner server checked.

          Can we guarantee nothing will go wrong? Of course not, such a thing is not possible. Sometimes when a ship sinks it lists too far and half the lifeboats cannot be used. That's no excuse to skip lifeboat drills, nor is "the crew can't stop what they're doing to run safety drills."

          So with apologies to Adam and his admirers, the Gitlab geeks did not do their jobs properly. They admit that the backup files were too small to be believable. That alone should flag their systems as not working, with no shutdown testing required. They missed several chances to spot this.

          1. Adam 52 Silver badge

            Re: How confident are we? VERY.

            "We test our backup processes quarterly"

            ...and you're confident? Then you are far too complacent. You could easily lose 3 months of data if something's gone wrong. Think about all the things that could have failed between your last test and now - the tapes, the tape drives, the disks, the software, a permissions issue, a version mismatch, a CRC algorithm change, new directories, new servers, credentials, sneaky ransomware, ... the list goes on.

            I test weekly. And I still think my backup process is worse than the one these guys had.

      2. dyioulos

        YP should ask JN where those backups are stored. Or, they should both head for the nearest bar. If I did something like this, that's where I'd be headed so that I could work on my resume.

    3. swm

      When I was teaching at a college, the account containing all of the homework submissions was deleted by a rogue script. Next day in class I used this as an example of the necessity of having good backups. I saw many smirks around the room until I pointed out that we took backups seriously (with weeklies and incrementals) and that the account was fully restored, losing about 4 hours of work that the students could easily resubmit (unless they had deleted their submissions, which is unlikely). We probably could have restored more but didn't want to restore from after the point the rogue script started deleting files.

      On another note, we had an IT professional from Paychex (a company that prints checks for small businesses) give a talk on something or other, and after the talk I asked how the company survived a state-wide power failure lasting several days. His answer was that they keep 6 copies of their essential data spinning and online. During the power failure all UPS systems worked perfectly and they lost nothing. However, their customers could not download their information because they did not have power, so Paychex had to rent trucks to cart around all of the checks their customers couldn't print. The point is that the entire supply chain needs to be considered.

  3. DryBones

    So I'm having....

    a lot of schadenfreude right now. All of that boils down to one single sentence:

    Nobody ever test-restored a backup.

    1. kodykantor

      Re: So I'm having....

      I work at Veritas and I always have trouble explaining the need for good backup/recovery and DR solutions to my young friends. This is a nice link to send them in the future. It hurts to see this happen to a group of people, but hopefully it will lead to others testing (or implementing :) ) their strategies!

    2. MacroRodent

      Re: So I'm having....

      Nobody ever test-restored a backup.

      That is a step too often skipped, because you don't want your test to overwrite live data, so you would temporarily need as much space elsewhere as the restoration takes. In fact, you'd better have a complete spare system to test that you can make everything work from the backup. That may be difficult to arrange.

      1. Doctor Syntax Silver badge

        Re: So I'm having....

        "In fact, you better have a complete spare system to test you can make everything working with the backup."

        As per another post, a DR contract can give you rights to run practice drills. That's your test opportunity.

    3. Anonymous Coward
      Anonymous Coward

      Re: So I'm having....

      The issue with a real restore from backup is that when an issue strikes you try these things:

      1) Attempt to fix the issue, as you don't want to load stale data or risk overwriting your live system. You think it'll take 1 hour, but you've invested so much effort that after 6/12/24 hours you are almost there, still hitting snags but close enough not to turn to the backup (all the while getting a call every half hour asking for updates)

      2) Eventually you decide you'll have to go to backup - data is now even older. You start the restore with very little idea how long it will take but estimate 1 hour. As it starts restoring and the progress bar whizzes along and tells you 2 hours to go, you feel hopeful. The progress bar gradually slows down and the time starts showing 2 hours, 6 hours, 12 hours to go. After 4 hours you think something is wrong with the restore so you abort it and decide to manually copy the files over and restore from a local mount point. Repeat the issue above with copy times getting gradually longer. You start looking at jumbo frames and data transfer graphs, and can proficiently convert bits to bytes to MBs to MiBs in your head.

      3) You update everyone that it is actually going to take about 24 hours to recover that much data and go home, waking every 30 minutes to remote in and check progress.

      4) After three days the few TB has copied back but the restore fails with an error 00x0ffx0f00075844 Unspecified Error. Possibly something to do with merging the incrementals.

      5) You reach for your last full backup, data that is now 7 days old. You wait three days for a full restore again. The restore is a success but none of your DB-based products work. SQL won't start, Exchange won't mount any DBs; you've got files restored but a number of iSCSI links are broken on some apps.

      6) You spend the next hour/6 hours/3 days trying to work out how to cleanly mount a DB with partially corrupted logs, spending more time on various forums than you care for, and eventually bring most things back to life, although Jim from finance still can't access any e-mails, the Finance system has rolled back a month of transactions due to consolidation errors and no-one can access the intranet anymore.

      7) You feel relieved to have made it through what feels like a war zone single-handedly, but then feel angry that you had no real support and no-one seems to understand the mix of emotions - from anxiety, to panic, to fear, to relief and back to helplessness - that you have just been through.

      8) You go to speak to your manager and they tell you they are bringing in a consultant.

      9) You get fired.

      1. Calleb III

        Re: So I'm having....

        That's why you set up multiple work streams, working on alternative solutions in parallel.

    4. Just Enough
      Thumb Down

      Re: So I'm having....

      >a lot of schadenfreude right now.

      Really? You're deriving pleasure out of seeing this failure? How very cold-hearted of you.

      Any time I read these kind of stories I'm filled with relief that I'm not in the team that has to fix the mess, and have nothing but sympathy for them. We've all experienced screw ups like this, we all know what a stressful experience they are.

      Yes, they messed up and should have tested their backups. But I take no pleasure out of seeing others' work go tits-up or lost.

      1. Doctor Syntax Silver badge

        Re: So I'm having....

        "Really? You're deriving pleasure out of seeing this failure? How very cold-hearted of you."

        Maybe he works for Github?

  4. Infernoz Bronze badge
    Facepalm

    It sounds like the sysadmin did not have a proper plan, was probably tired, preoccupied or bored, and rushed things, and someone did not put in place enough script logs/failure-alerts and backup verification logs/alerts, or do regular checks of both, to ensure that the backups worked...

    I do wonder why the database wasn't mirrored to reduce downtime for upgrades or other failures like this. Some redundancy should be compulsory for all professional systems.

    Maybe GitLab could use OpenZFS with regular dataset filesystem snapshots, for rapid rollback to a snapshot before damage/deletion of files/data occurred, or (read-only) mount the snapshot and use some files/data off it to fix the active filesystem or make consistent backups; I've found the latter very useful when I've accidentally deleted stuff e.g. on FreeNAS.
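
    For anyone who hasn't used it, the workflow is roughly as follows (a sketch with illustrative pool/dataset names, not GitLab's actual layout):

      # take a snapshot of the dataset before risky maintenance (names are made up)
      zfs snapshot tank/pgdata@before-maintenance

      # list the snapshots that exist for that dataset
      zfs list -t snapshot tank/pgdata

      # option 1: browse the read-only snapshot and copy individual files back
      ls /tank/pgdata/.zfs/snapshot/before-maintenance/
      cp /tank/pgdata/.zfs/snapshot/before-maintenance/some_file /tank/pgdata/

      # option 2: roll the whole dataset back (discards every change made since the snapshot)
      zfs rollback tank/pgdata@before-maintenance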

    1. Steve Aubrey

      "Some redundancy should be compulsory for all professional systems."

      Ya, but who's going to be the adult in this situation? Obviously one was lacking.

      Regardless of how cloudy, trendy, and hipstery your company is, hire at least one adult. The one who knows the hard questions, and will ask them.

      1. Doctor Syntax Silver badge

        "The one who knows the hard questions, and will ask them."

        You can get fired for asking questions like that. Not a team player.

        1. Steve Aubrey

          Doc - agree about the possibility of being fired for asking hard questions. But maybe a blessing in disguise?

          1. Doctor Syntax Silver badge

            "But maybe a blessing in disguise?"

            Yup.

      2. Anonymous Coward
        Anonymous Coward

        hire at least one adult. The one who knows the hard questions, and will ask them.

        It's not enough to be able to ask the hard questions if you can't come up with some of the answers too. Sniping from the sidelines is easy, but being able to assist in making sure the answers are right is where the skill lies - it's the old adage: "don't bring me problems, bring me solutions".

        1. Doctor Syntax Silver badge

          "It's not enough to be able to ask the hard questions if you can't come with some of the answers for it too."

          You need to let them flounder a bit first, otherwise they won't be ready to grasp your solution.

          1. Anonymous Coward
            Anonymous Coward

            If they're floundering like that, they probably wouldn't be able to grasp your solution even if you spelled it out for them.

            1. Doctor Syntax Silver badge

              "If they're floundering like that, they probably wouldn't be able to grasp your solution"

              Flounder in terms of not having been able to answer the hard questions.

    2. A Non e-mouse Silver badge

      Some redundancy should be compulsory for all professional systems.

      Some working redundancy should be compulsory for all professional systems.

      FTFY ;-)

      From the article, they had lots of fancy redundancy: They just never checked that the redundancy was working.

    3. Marc 13

      Yup, love ZFS snapshots (but only as an add-on to other backup methods - it's nice to have choices!) - simple to mount and restore from.

    4. Zippy's Sausage Factory
      Coat

      Some redundancy should be compulsory for all professional systems.

      Or, in this case, whatever the Netherlands equivalent of a P45 / Pink Slip happens to be...

  5. jamesb2147

    Speaks to a fundamental problem

    IT is hard.

    Backups are a pain in the ass, for exactly the reasons mentioned here. All ye who apply a rigorous and robust backup policy, I applaud you, but I doubt that a single one of my employer's clients falls into that category, and we have many, many clients.

    Anyone know of a product that you can point at a database, provide it credentials, and it handles all the rest, including test restores with error messages on failures? That's not even getting into file backup, but file backup is notably simpler in many ways, especially with the right tools (ask any ZFS admin).

    1. Anonymous Coward
      Anonymous Coward

      Re: Speaks to a fundamental problem

      Used to have an EMC SAN with RecoverPoint that took an Oracle DB block-level clone once a night, broke the clone, remounted it, renamed it and replayed it so you could use it in the morning. Still not all the way there though, as that didn't work for creating offsite backups that you knew worked. Just that we could happily restore to any point during the last 24 hours.

      There's a lot of pontificating here, but I wonder how many really have battle-tested their backups enough to be so sure.

  6. Richard 12 Silver badge

    At least it's git

    That means there shouldn't be much real data loss at the final count, as the important things pushed there will have a backup in the place that pushed the last commit.

    Pain in the proverbial for all the project leads to push up all the lost branch tips again, but at least it's only mostly dead and not dead-dead.
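
    In practice that re-push is only a couple of commands per up-to-date clone, something like this (remote name is illustrative):

      # push every local branch and tag back to the freshly-restored remote
      git push --all origin
      git push --tags origin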

    1. macjules

      Re: At least it's git

      And a REAL pain in the butt when you have an organisation like GDS with hundreds of git forks, more tramlines than Zürich Central and several hundred developers working remotely. Thank God GitHub never goes down ... well, almost.

      1. Doctor Syntax Silver badge

        Re: At least it's git

        "And a REAL pain in the butt when you have an organisation like GDS with hundreds of git forks"

        I think it's GDS that's the real pain.

  7. Anonymous Coward
    Anonymous Coward

    CloudFog...

    The Cloud... Someone else's computer....

  8. A Non e-mouse Silver badge

    Who's to blame?

    The sysadmin who accidentally nuked the live data reckons "it’s best for him not to run anything with sudo any more today."

    Whilst the sysadmin screwed up and deleted the wrong directory, the bigger screw-up was the company not having any tested DR systems.

  9. Anonymous Coward
    Anonymous Coward

    It's pretty easy to screw up even when you have a working backup.

    Recently had a situation where a client had messed up a data upload, and we needed to fix the data. Our software being fairly forgiving, the plan was to download the automated backup, then back up the live dataset, attach the downloaded files alongside the live system database, and repair the damaged data with the values from the backup.

    It would have worked great, except that in between this work and our last test, the backup software vendor had changed the recovery process to include a helpful feature: namely, if you don't expand the file tree fully, it downloads the files and then automatically and silently overwrites the live database...

    Needless to say, the client lost the 2 days' worth of data that we'd fully expected to be preserved.

    We've since rewritten the procedures to move the manual backup to before downloading the files, but even with the best-laid plans, backup software can be a tricky proposition!

    1. ecofeco Silver badge

      Seriously, why are backup vendors such morons? It's been a long time since I've seen backup software that was easy and reliable.

  10. Anonymous Coward
    Mushroom

    And this is why...

    I'm not that thrilled about anything cloud based and prefer to host my own repositories. And here's one of the many reasons why. For starters: I actually check my backups on a regular basis, even when I don't need them.

    I'm not even going to bother commenting any further because this is simply too big a fail. Makes you wonder what kind of geniuses work there. And what they're doing all day.

    1. Doctor Syntax Silver badge

      Re: And this is why...

      "I actually check my backups on a regular basis, even when I don't need them."

      That's the thing about backups. You hope you never need them.

  11. Norman Nescio Silver badge

    Backup is hard. Doesn't mean it should be ignored.

    There is little excuse for being unable to restore from backup. "The dog ate my backup tape" doesn't cut it. Having to deal with a full-scale civil insurrection that has trashed your off-site storage location might give you a pass.

    The thing is, it is human nature to put off something difficult that has no immediately apparent consequences, so you need a martinet in charge of backup and restoration services, because your business could well live or die as a result.

    I worked for an organisation that did annual full-scale disaster recovery exercises, which were instructive. The chap in charge of them was quite sanguine about failures: his point was that he far preferred to find things didn't work during an exercise than during the real thing. As this is about backup, I won't go into some of the more interesting organisational failure modes we found, but the backups caused problems.

    1) The off-site storage vendor couldn't locate some of the backup tapes. It turned out that nobody had ever audited their retrieval performance. Talking to the operators, it was found that it was common for tapes to go missing. All the operators did was use a new blank tape when it was time in the backup cycle to re-use one that couldn't be found. The organisation used enough tapes in day-to-day work that the small number of new tapes used to replace the missing backup tapes wasn't noticed. The operators' job was to do backups, which they were doing.

    2) At least one of the tape heads was misaligned. It would quite happily write out a backup, which could be read with no problems on that drive, and that drive only. When the time came to ship the backup tape to the disaster-recovery location, no tape drive there could read tapes written by the original drive. Lesson learned was to make sure you can read backups on different equipment than the kit they were made on.

    3) It turned out that referential integrity is important. Who knew? Backing up files while they were in use, then trying to use the restored backup caused all sorts of problems. This was before the days of journalling file systems and snapshots. The application developers had failed to appreciate that backing up a large file took an appreciable amount of time, and in that time many records would be changed. An update that added or modified records both near the beginning and the end of the file would end up with only some of the updates recorded on the backup. That was solved initially by having a backup window where no updates were allowed to the file.

    These days, some of the problems are solved by backing up to 'the cloud' and having the ability to snapshot databases, but doing proper backups is a neglected art. There is still a point at which it is not economic to back up over a network and you have to plan for moving physical media about, and then life starts to get interesting.

    Doing backups and testing restoration procedures is a huge time hog, and it isn't sexy. But it is important. I hope these guys get themselves out of the hole they just dug for themselves. Some people probably have unhealthy stress levels right now.

    1. A Non e-mouse Silver badge

      Re: Backup is hard. Doesn't mean it should be ignored.

      I worked for an organisation that did annual full-scale disaster recovery exercises, which were instructive. The chap in charge of them was quite sanguine about failures: his point was that he far preferred to find things didn't work during an exercise than during the real thing.

      People don't appreciate that failures are a wonderful learning experience. In my line of work, I've learned a lot more from unpicking a failure than working on a fault-free system.

      I've also heard several instructors across different areas say that they often prefer pupils who appear to make lots of mistakes as the pupils learn a lot more from the mistakes than those who do things right every time.

      1. Charles 9

        Re: Backup is hard. Doesn't mean it should be ignored.

        "People don't appreciate that failures are a wonderful learning experience."

        Because in many people's personal experience, people who fail (at all) don't survive for very long.

    2. Paul Crawford Silver badge

      Re: Backup is hard. Doesn't mean it should be ignored.

      When the time came to ship the backup tape to the disaster-recovery location, no tape drive there could read tapes written by the original drive.

      I have also seen this with optical media - readable (probably just) on the original drive, not on another. Probably not after several years either.

      As you mention, snapshots are a brilliant idea - an instant copy of a whole file system for backing up, so (mostly) no inconsistencies, and with copy-on-write like ZFS you only need space for the changes, so having many per day is not a high cost. However, as you mention, in some cases the on-disk file is not in a consistent state while a process is using it, so having time to take a snapshot with no modifications is also good.

    3. Anonymous Coward
      Anonymous Coward

      Re: Backup is hard. Doesn't mean it should be ignored.

      "The dog ate my backup tape" is at least understandable, if unacceptable.

      That they have five different backup solutions and all of them failed for various reasons is astonishing.

    4. Doctor Syntax Silver badge

      Re: Backup is hard. Doesn't mean it should be ignored.

      "t turned out that referential integrity is important. Who knew? Backing up files while they were in use, then trying to use the restored backup caused all sorts of problems. This was before the days of journalling file systems and snapshots."

      Don't roll your own encryption and don't roll your own database.

      This was a solved problem years ago without depending on journalling file systems and snapshots. Use a proper database engine that does this for you.

      1. Norman Nescio Silver badge

        Re: Use a proper database engine

        This was before the days of journalling file systems and snapshots."

        Don't roll your own encryption and don't roll your own database.

        This was a solved problem years ago without depending on journalling file systems and snapshots. Use a proper database engine that does this for you.

        You are, of course, completely right. However...

        The system used multiple files that were a kind of ISAM-type file*, and had been optimised to hell and back. It did its job, and fast. Several attempts, lasting many years each with large teams, were made to replace it with a 'proper DB', all of which failed to achieve the necessary performance. Half the application operated for decades** before being replaced by an entirely different system; the other half is still going, although its functionality is gradually being replaced by other systems, so eventually it will be sufficiently obsolete to decommission.

        I had a lot of conversations with the DBAs of proper databases also in use within the organisation, and the solutions proposed involved throwing a great deal of very expensive hardware at the problem. The business had a very simple question: "Why do we need to spend N times more money on the proper DB to achieve exactly what we are doing now for far less?". Having a backup window during which updates were blocked was a pragmatic (and thankfully, workable) solution.

        Times have changed a great deal, and what was once very expensive hardware is now available very cheaply - Gigabytes of fast RAM, much faster processors with multiple cores, huge RAM-disks, Terabytes of spinning rust (and now, SSDs), so if you were starting again, you would simply throw enough (relatively) cheap hardware at the problem so you could run one of the newer databases. It wasn't an option then.

        It's quite interesting how, once a system is up and running, it is often cheaper to continue with it than build a more modern replacement - until a compelling event occurs - and as a result, you can find some quite remarkably old business-critical applications and systems in use. When you find you are forced to buy your spares on Ebay, that is probably a good signal that moving to a newer approach is a good idea. That doesn't stop some people, though.

        Sorry for mansplaining. Please don't take this as criticism.

        *I'm deliberately not going into detail so it is not identifiable

        **I'm carefully not saying exactly how long.

  12. Anonymous Coward
    Anonymous Coward

    Babkup Audit

    Anonymous just in case anybody might stand a chance of recognising the companies involved

    I had 6 months of easy but boring contract work in 1999 for a company who had the foresight to have there backups audited. They discovered that none of their off-site backups were valid. The process was that backups would be made to on-site tapes, then these would be cloned to off-site tapes. Problem was that the backup window was too small and the clone job was set as low priority.

    The on-site backup was working fine but the clone jobs never got to finish because the tape drives were re-allocated to higher priority jobs.

    I got to re-jig the backup, keeping 2 of the drives under manual control so that I could check that the clone jobs were all finished before releasing the drives to other tasks.

    Then we recalled all of the off-site tapes and I had the great job (when the drives were less busy in the afternoon) of feeding batches of the off-site tapes into the library to redo all of the clone jobs manually.

    It was shocking how many of the tapes arrived back from the vault company with physical damage and even some which could not be found at all.

    Some years later at another job, the same vault company was used. I made myself very unpopular with them by regularly requesting a random tape to make sure that it was available, intact and readable. There was a provision that in an emergency they would have any tape available for physical collection within 30 minutes. As I had to virtually drive past the vault on the way to one of our datacentres, I would sometimes drop in on the way and request a tape. After a while the receptionist would silently moan when I walked in.

    This was no guarantee that the backups would work, but I had the reassurance that we were taking steps to maximise the chances.

    The contract with our customer called for annual DR tests. In 8 years they agreed to 1 very limited test. They did not want to disrupt there important business activities.

    1. Doctor Syntax Silver badge

      Re: Babkup Audit

      Upvoted but...

      s/there/their/g

  13. CAPS LOCK

    All of the above notwithstanding, it's a bit hard to understand use of rm -rf ...

    ... at the command line?

    1. Alan_Peery

      Re: All of the above notwithstanding, it's a bit hard to understand use of rm -rf ...

      After you're done with the copies of data you're holding in a temporary filesystem, you clean out the temporary filesystem.

      Just make sure you're in the right filesystem... :-(

    2. Ogi

      Re: All of the above notwithstanding, it's a bit hard to understand use of rm -rf ...

      > it's a bit hard to understand use of rm -rf ... at the command line?

      Nobody ever done "rm -rf . /tempfolder"? One typo, and hell to pay if you don't notice it.

      As a junior SA I once did that on the root box of the SAN, in the company root directory. I blew away the entire company's data at 6pm, when I was tired and in a bit of a rush to go home. Wanted to clear out some temporary dirs I had created for testing. Hit enter and started packing to leave. Only when the monitoring went haywire did I log in again and realise what I had just done. Hundreds of millions of files across god knows how many divisions were deleted.

      Spent all night till 3am restoring everything, and writing scripts to pull fresh data, and then setting the permissions just right.

      The next day only 5 people noticed discrepancies in their data, which was a phenomenal result, but it was a life lesson as well. Despite managing to recover almost all the data, I was "asked to leave" shortly after (can't say I blame them).

      Thank god for ZFS snapshots and a verified backup system, otherwise the company could well have ended up having to cease trading. Or at least losing untold millions and millions before they could start to function again.

      I also am really really careful around "rm -rf" commands as root on machines now.

      1. John H Woods Silver badge

        Re: All of the above notwithstanding, it's a bit hard to understand use of rm -rf ...

        "Nobody every done "rm -rf . /tempfolder"?" -- Ogi

        30 long years ago I over-lingered on the SHIFT key, rm -r *.o became rm -r *>o and left me with a single file containing a single byte. Now I usually put an -i in, and when it seems to be right, exit and edit the command line to remove just the -i before setting it off in anger.

        But for specific critical folders you could use cp -al DoomedFolder/ QuickSnapshot/ ... Now, QuickSnapshot contains hardlinks to all the files and folders in DoomedFolder, but because you haven't copied any file data you don't need much space (just a bit for the new directory entries) or much time to do it. Now you can rm -r DoomedFolder and you've still got a second chance.
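
        Spelled out, using the same folder names, the sequence is:

          # hard-link "snapshot": no file data is copied, only new links and directories
          cp -al DoomedFolder/ QuickSnapshot/

          # delete the original; the file contents survive via the hard links
          rm -r DoomedFolder/

          # oops, needed it after all - put it back (or just rename the snapshot)
          mv QuickSnapshot/ DoomedFolder/

        (It only guards against deletion, mind - hardlinked copies share the same data, so an in-place edit or corruption hits both.)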

    3. Androgynous Cupboard Silver badge

      Re: All of the above notwithstanding, it's a bit hard to understand use of rm -rf ...

      Pray tell, how else would you have us delete a directory?

      1. petur
        Facepalm

        Re: All of the above notwithstanding, it's a bit hard to understand use of rm -rf ...

        ctrl-a and then shift-delete

        bonus: on windows it would have taken ages to delete 300GB

        1. Androgynous Cupboard Silver badge

          Re: All of the above notwithstanding, it's a bit hard to understand use of rm -rf ...

          I tried that, but it just moved the cursor to the start of the line

      2. CAPS LOCK

        Pray tell, how else would you have us delete a directory?

        I was suggesting the use of a shell script. Perhaps I should have been more explicit?

        Quite apart from that I've written my own 'del' command using 'mv' instead of 'rm' for use at the command line.

        Creating such a script is left as an exercise for the reader. Write on only one side of the intertubes.
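
        For the impatient, a minimal sketch of that kind of wrapper (hypothetical - a ~/.trash directory stands in for wherever you'd actually park things) might be:

          #!/bin/bash
          # del - move targets into a trash directory instead of deleting them outright
          TRASH="$HOME/.trash"
          mkdir -p "$TRASH"
          STAMP=$(date +%Y%m%d-%H%M%S)
          for target in "$@"; do
              mv -v -- "$target" "$TRASH/$(basename "$target").$STAMP"
          done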

        1. Doctor Syntax Silver badge

          Re: Pray tell, how else would you have us delete a directory?

          "Quite apart from that I've written my own 'del' command using 'mv' instead of 'rm' for use at the command line."

          I've had an accident with mv and managed to move /bin further down the hierarchy. It should have been possible to recover by booting from the distribution disks but the vendor had omitted the driver disk for the SCSI controller. It was the following afternoon when the controller vendor finally emailed us a driver.

          1. Anonymous Coward
            Anonymous Coward

            Re: Pray tell, how else would you have us delete a directory?

            Reminds me of the time a colleague of mine, a few years back, was making some changes to an MQ submission script that simply polled directories defined in a configuration file, pushed the files onto the queue, and deleted the files afterwards (they should have already been archived by that point in the process).

            It took them a while to notice what was going on, but the system basically ended up eating itself! Submitting all the files it was meant to, then working back up the directory tree, submitting and deleting the next directory's contents it found, and so on.

            Thankfully permissions meant the process could only really 'eat' data and a few test scripts, rather than system files, and this wasn't a production environment, although it was in use for UAT at the time! Cue questions from the client: "My data's gone, but I can't find it where it should be! Does anyone know where it went?"...

    4. CAPS LOCK

      Three thumbs down....

      ... that'll lern yer ...

  14. David Austin

    Refreshingly Honest

    That must have been a very painful document to write, but it's a great real-life scenario and a future test case - how many people have screwed up backups and kept quiet and vague about it for operational reasons or pride?

    Hopefully, someone will learn a lesson from this, but I won't hold my breath.

  15. Anonymous Coward
    Anonymous Coward

    Customers don't want backups....

    They want restores

    1. ecofeco Silver badge

      Re: Customers don't want backups....

      Magically, too!

  16. ScriptFanix
    FAIL

    Oops

    I feel for that sysadmin...

  17. Anonymous Coward
    Coffee/keyboard

    They at least have a backup backup strategy

    In that all the GitHub users will probably have local copies of their content they can helpfully send back to GitHub. Hopefully.

    1. Doctor Syntax Silver badge

      Re: They at least have a backup backup strategy

      "In that all the GitHub users will probably have local copies of their content they can helpfully send back to GitHub."

      This was GitLab.

      1. Anonymous Coward
        Anonymous Coward

        Re: They at least have a backup backup strategy

        Craps... I will bring forward my annual optician visit. Or maybe my brain is rotting.

    2. alisonken1

      Re: They at least have a backup backup strategy

      One question that I have about the local git repo - does it also contain the bug list that's kept at GitLab as well? That would be another interesting exercise.

  18. wolfetone Silver badge

    I've done this. Ran an "rm -rf" command on a production server due to an email system creating a huge log file that brought the whole server down. Early in the morning, noisy open-plan office, I run rm -rf forgetting I'm in the root directory. It wasn't until Linux started saying "/boot/ could not be removed" that I noticed, and I'm thinking "Why is there a /boot/ directory in this folder?". I cancelled what I could, but it was too late.

    The server was off for 36 hours because the great guys at Rackspace tried to restore a 120GB backup to the 60GB drive that was unaffected by my mistake.

    But hey, it was the first time in my career I did that and so far it's been the last time.

    1. Androgynous Cupboard Silver badge

      Back in the days before package management I was upgrading some libraries including ld.so - the dynamic library loading library. I moved or deleted the old one, and the next command to run was "mv newlibrary.so ld.so". But of course "mv", along with every other command on the OS, was dynamically linked. It didn't end well, although I did learn my lesson.

  19. simpfeld

    It's too easy to blame the sysadmin

    I'm sure they'll get all the stick, and there are certainly failings.

    But management often tends not to be so interested in DR until something like this happens - especially at companies that are running just to keep up with constrained resources.

    I have seen IT departments want to test DR many times, and management will not provide the resources (equipment and/or staffing) to do it. Nor will they accept any interruption to production systems to test a DR solution.

  20. Anonymous Coward
    Anonymous Coward

    It even has its own hashtag

    #RMRFocalypse.

    Pretty bad, but hopefully they will learn from this £xp£ri£n$e and test their backups properly.

    I've lost data before, though not quite on this scale. Lesson learned: lock your PC, especially when some random work-experience drone comes along with his two friends and goes "Oh lookie, a hex editor"... Facepalm!!!

  21. clocKwize

    The only way to be confident in your backup plan is to have tests to make sure it's working.

    If you back up nightly, you could automate grabbing the latest backup, restoring it to a throwaway instance, and ensuring that it completed properly by checking record counts in various tables. You could run that every other day - or better yet, run it as soon as your backup process has completed, to verify that it has indeed worked.
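
    Something along these lines, for illustration - the database and table names here are placeholders, not anyone's real setup:

      # restore last night's dump into a scratch database and check that a
      # couple of key tables actually contain rows (names are examples)
      createdb restore_test
      pg_restore -d restore_test /backups/latest.dump
      rows=$(psql -At -d restore_test -c 'SELECT count(*) FROM projects;')
      [ "${rows:-0}" -gt 0 ] || { echo "restore check FAILED" >&2; exit 1; }
      dropdb restore_test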

    You could still get caught out in many ways but verification to some extent would give you more confidence.

    I can understand how this happened though; start-ups are not the same as large corporations with the resources to have people spend a long time ensuring backups are rock solid and testing disaster recovery monthly. In an ideal world that'd be quite high on the agenda, but realistically, breaking even is the first hurdle and you don't (technically) need a backup plan for that, so it gets put to the bottom of the list.

    1. Doctor Syntax Silver badge

      "realistically, breaking even is the first hurdle"

      And not breaking is the zeroth hurdle.

    2. Charles 9

      That's if you can afford a spare instance or some other failover. Many CAN'T. Yes, it's stupid, but if you're stuck in the middle of the ocean with nothing but a piece of flotsam, what options do you have besides exhausting yourself treading water?

      As said, breaking even is priority one because you're obligated to your investors first. If they don't agree with you about long-term investments, then you're stuck again, because they can pull out, killing you BEFORE the disaster hits.

  22. cuddlyjumper

    Livestreaming

    They are actually live-streaming the rebuild here, in case anyone is interested:

    https://www.youtube.com/watch?v=nc0hPGerSd4

    It might seem gimmicky, but it does at least offer a seemingly unparalleled level of transparency in such a situation.

    1. wolfetone Silver badge

      Re: Livestreaming

      It's funny that you can see all of these gamer profiles coming on and asking what GitLab is.

  23. Mage Silver badge
    Facepalm

    GitLab last year decreed it had outgrown the cloud

    Irrelevant,

    Either way, it's only a shared service to allow collaboration. EVERYONE should have their own complete backups.

    Gitlab itself is surely a "Cloud" service?

  24. Matt Siddall

    Saw this last time backups were in the news - seems relevant

    Yesterday,

    All those backups seemed a waste of pay.

    Now my database has gone away.

    Oh I believe in yesterday.

    Suddenly,

    There's not half the files there used to be,

    And there's a milestone hanging over me

    The system crashed so suddenly.

    I pushed something wrong

    What it was I could not say.

    Now all my data's gone

    and I long for yesterday-ay-ay-ay.

    Yesterday,

    The need for back-ups seemed so far away.

    I knew my data was all here to stay

    Now I believe in yesterday.

    1. wolfetone Silver badge

      Re: Saw this last time backups were in the news - seems relevant

      Na, na na

      NA NA NA NA,

      NA NA NA NA,

      Back ups

  25. Anonymous Coward
    Anonymous Coward

    @Doctor Syntax

    I grovel most apologetically. Silly mistake on my part. That is normally something which makes me wince when other people do it.

    Have an upvote for your trouble.

  26. This post has been deleted by its author

  27. kalman

    Barman

    So it seems:

    1) They are not using Barman for backup management in PostgreSQL

    2) They thought they could reclaim the space with a vacuum (it needed to be a VACUUM FULL, otherwise the space reclaim they needed wasn't happening)

    3) They don't have PITR ready to be used

    This is how IT works today: "We need a database." "Sure." <after googling> "apt-get install postgresql". "Done." (Sometimes you even find a ready-made Docker image...)
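
    For the record, point-in-time recovery on stock PostgreSQL 9.6 is little more than WAL archiving plus a base backup. A rough sketch (the paths and target time are made up, and this is not GitLab's actual configuration):

      # 1. in postgresql.conf, ship every WAL segment somewhere safe:
      #      wal_level = replica
      #      archive_mode = on
      #      archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'
      # 2. take a base backup to replay that WAL against:
      pg_basebackup -D /backup/base -Ft -z -X fetch
      # 3. to recover, unpack the base backup and add a recovery.conf:
      #      restore_command = 'cp /backup/wal/%f %p'
      #      recovery_target_time = '2017-01-31 22:00:00'

    Barman essentially automates that dance, and its checks complain when the WAL archive stops receiving segments.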

  28. sillyfudder
    Facepalm

    not just rm

    I created typo-geddon once using chown.

    As root, I tried to give my user ownership of files from my current dir down (./) and instead put in the space of doom, changing ownership from the root directory on down.
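
    For anyone who hasn't met the space of doom, the difference is a single character (the username is a placeholder):

      # intended: recursive chown from the current directory down
      chown -R myuser ./
      # the space of doom turns that into "chown -R myuser . /" -
      # i.e. the current directory AND the root of the filesystem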

    I realised the command was taking too long after about 3 seconds and hit ctrl-c.

    I genuinely thought I'd got away with it at first until the machine had to reboot, and a lot of the fundamental stuff that ran the machine (IBM AIX) got read from disk again.

    The box itself could be done without for a while so my boss at the time got me to mount the drives and undo a lot of the damage manually to reinforce the lesson (which I've never, yet, had to relearn).

  29. OliP

    1. - They were honest

    2. - They are now live streaming the recovery process via YouTube.

    https://www.youtube.com/watch?v=nc0hPGerSd4

    New gold standard in dealing with customers after a mammoth fuck up, if you ask me.

    Still - shouldn't have happened in the first place, but compare this to other companies and I'm not sure I could ask for more.

    *Not a GitLab customer

  30. Alistair
    Windows

    ick

    This is just one of those horribly ugly situations. I feel for the SA that hit the wrong command in the wrong place at the wrong time. We are all human, and I'm sure anyone who's been in the business long enough has done something near enough identical to feel for this individual.

    I've found that being in that place, *THE* most critical thing to do right then and there is stand up and tell those that absolutely need to know that you've buggered up. And if you know what you can do to recover, lay out those options (Please, note the plural there, you should have more than one option). Otherwise bellow for assistance. Seems this SA at least hit that set of rules.

    Backups. Snapshots. Copies. etc.

    They can *all* fail at different times for different reasons.

    This sounds to me like a case of too many disconnects between groups as to which is what and who owns what.

    I've written DR plans. I've executed them. I've audited DR execution. I've fixed DR plans after the test. I've tried. Really, I have tried. But unless your DR process is part of your day-to-day execution, those plans turn to crap every six months or so, since the apps and systems you're restoring change pretty damned rapidly nowadays.

    Now, I'm gonna go back to trying to figure out why 6 tape drives on a sun box have crossed up data and control path device files.

  31. creepy gecko
    Facepalm

    Oops!

    I feel sorry for the sysadmin, but the failed backups are almost beyond comprehension.

  32. Diginerd
    Alert

    Two Words - CHAOS MONKEY

    https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey

    Testing [RECOVERY] in production is like parachuting without a safety chute...

    ...if things go truly pear-shaped you're only gonna do it once.

    This little guy will suffice as the adult in the room. ;-)

    1. Charles 9

      Re: Two Words - CHAOS MONKEY

      Right, but what if that's your ONLY unit?

      1. Diginerd

        Re: Two Words - CHAOS MONKEY

        1oz of prevention > 1lb of cure.

        1. Charles 9

          Re: Two Words - CHAOS MONKEY

          But sometimes, you're not even allowed the ounce. What then?

  33. ecofeco Silver badge

    $20 Million?

    Too bad they didn't spend enough to hire an admin who knew what they were doing. Or the accountant. Or the boss.

    There's an old saying: It's good to save money in business, but you can save yourself right out of business.

    1. theblackhand

      Re: $20 Million?

      Given the recovery process, it looks like the DBA is pretty competent - he may have made a huge mistake (in my experience, competent people can hugely misjudge the risk of their actions) and is fixing it. It's not a perfect fix, but being able to recover all but six hours of data, and to quantify what was missing, in under a day isn't bad given the number of issues found.

      It looks like the root cause was an attempt to get replication working from live to staging, which broke the db1-to-db2 replication process - the issue may have been related to performance limits in the staging environment. There was then a period of high DB utilisation that may have partially contributed to the replication problem, either directly or indirectly by distracting the DBA. While I can understand the thought process behind deleting the db2 replica and starting again, there was a risk in those actions that was unfortunately realised. At that point, things started to go horribly wrong as all the backup issues were discovered.

      The bit that is missing is why all the backups failed. I suspect the backups and backup process had been tested in the past against the earlier DB versions. PostgreSQL 9.6 is reasonably new (September 2016), so they may have had a working backup strategy up until at least then - and, arguably, based on their issue tracker, until mid-December 2016.

      Why is this important? Read through the comments about testing backups and ensuring high availability. They probably had both until last month when they upgraded the database...
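
      If so, it's the kind of thing a dumb assertion in the backup job would have caught - for example (placeholder database name and path, not necessarily how GitLab runs it):

        # an older pg_dump refuses to dump a newer server, and under cron
        # nobody sees that unless the exit code is checked
        pg_dump --version
        psql -At -c 'SHOW server_version;'
        pg_dump -Fc mydb > /backups/mydb.dump || { echo 'pg_dump FAILED' >&2; exit 1; }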

      1. David Roberts

        Re: $20 Million? - no testing or complacency?

        @theblackhand I was going to post much the same.

        The backup plan was so broad reaching that it is very unlikely that it was never tested.

        The article includes a bit about using outdated versions which "failed silently".

        My suspicion is that the backup strategy was tested so comprehensively and had so many fail safes that everyone assumed that they were covered and neglected to check on a regular basis because it was "too good to fail".

        All those posting that it was obviously never tested: reveal your position as an insider, provide other verifiable proof, or STFU.

  34. Nate Amsden

    my money would be on bad management

    It seems like their setup was rather fragile. I'd put my money on not having enough geek horsepower to do everything they wanted to do, having been in that situation many times myself. Even after a near disaster with lots of data loss (and close to a week of downtime on backend systems), the company at the time approved the DR budget, only to have management take it away and divert it to another underfunded project (I left the company weeks later).

    One place I was at had a DR plan and paid the vendor $30k a month. They knew even before the plan was signed that it would NEVER EVER WORK. It depended on using tractor trailers filled with servers, and having a place to park them and hook up to the interwebs - and we had no place to send them (the site the company wanted to use flat out said NO WAY would they allow us to do that). We had a major outage there with data loss (maybe 18 months before that DR project): they had been cutting costs by invalidating their Oracle backups every night, opening them read-write for reporting/BI (they knew this in advance; it wasn't a surprise). So when the one and only DB server went out (storage outage) and lost data, they had a hell of a time restoring the bits that were corrupted, because the only copy of the DB had been invalidated. ~36 hrs of hard downtime there, and we still had to take random outages to recover from data loss every now and then for at least a year or two afterwards. The backups were never once tested (and the only thing that was backed up was the Oracle DB - not the other DBs, or the web servers, etc). Ops staff were so overworked and understaffed, and there were major outages constantly because of bad application design.

    Years after I left, I sent a message to one of my former team mates and asked him how things were going; they had moved to a new set of data centers. His response was something like "we're 4 hours into downtime on our 4-nines cluster/datacenter/production environment" (or was it 5 nines, I forget).

    I've never been at a place where even, say, annual tests of backups were done - never the time or resources to do it. I have high confidence that the backups I have today are good, but less confidence that everything that needs to be backed up is being backed up, because in the past 5 years I am the only one that looks into that stuff (and I am not a team of 1); nobody else seems to care enough to do anything about it. Lack of staffing, too few people doing too many things... typical, I suppose, but it means there are gaps. Management has been aware, as I have been yelling about the topic for almost 2 years, yet little has been done. Though progress is now being made, ever so slowly.

    At the place that had a week of downtime, we did have a formal backup project to make sure everything that was important was backed up (there was far too much data to back up everything, and not enough hardware to handle it, but much of it was not critical). So when we had the big outage, sure enough people came to me asking to restore things. In most cases I could do it. In some cases the data wasn't there -- because -- you guessed it -- they never said it should be backed up in the first place.

    I've been close to leaving my current position probably half a dozen times in the past year over things like that (backups are just a small part of the issue, and not what has kept me up at night on occasion).

    I had one manager 16 years ago say he used to delete shit randomly and ask me to restore just to test the backups (they always worked). That was a really small shop with a very simple setup. He didn't tell me he was deleting shit randomly until years later.

    It could be the geeks' fault though. As a senior geek myself, I have to put more faith in the geeks and less in the management.

  35. Cynic_999

    Hindsight is a wonderful thing

    I am reading a lot of sanctimonious comments from people explaining how this could never happen to them because they always test everything and are well-prepared for a failure event. I'd like the people making those comments to honestly answer the following questions:

    1) Do you presently have a spare can of fuel in your car?

    2) Do you have a spare can of water in your car?

    3) Do you have a torch (flashlight) in your car?

    4) Do you carry warm clothes and/or blankets (in case you get stuck in a traffic jam etc. overnight)?

    5) How regularly do you check the air pressure in your spare tyre?

    6) When did you last check that your brake lights were working OK?

    1. Herby

      Re: Hindsight is a wonderful thing

      Yes, it is, but sometimes you need to understand the risks of doing too much.

      Sometimes you need to just rely on your design and, after proving you have made it as good as possible, let it go. One example of this is the ascent stage of the lunar lander. That rocket was only fired ONCE, for the takeoff from the moon. It was NEVER tested, since the act of testing it with the fuels/oxidizers involved degrades/destroys the engine itself. They built it to be as bulletproof as it could be and over-engineered it a bit more. It used a hypergolic fuel mixture and simplified fuel flows (I believe they used gas pressure to empty the tanks), and it had only one speed (ON!). Guess what: it worked EVERY time. As for the vehicle that I use every day:

      1) Do you presently have a spare can of fuel in your car?

      No, but I do watch my gas gauge, and if I forget, I have a AAA (US; AA in the UK) card that will get me some.

      2) Do you have a spare can of water in your car?

      No, but the one time the cooling system failed (it was a couple of months ago), I could pull over, park, and wait for a tow.

      3) Do you have a torch (flashlight) in your car?

      Yes, it is only common sense. This is a small device that takes up little space, and has other benefits.

      4) Do you carry warm clothes and/or blankets (in case you get stuck in a traffic jam etc. overnight)?

      No, but in the cases where this might have been a problem I was traveling to a ski area overnight, and DID have some warm clothes - I was actually wearing them.

      5) How regularly do you check the air pressure in your spare tyre?

      While not on my vehicle, automatic pressure telemetry is now required on new vehicles. I do get my tires rotated on a regular basis (5,000 miles) and it is checked there.

      6) When did you last check that your brake lights were working OK?

      Thankfully the vehicle's electronics DO check this (modern cars!). As for older vehicles, no brake lights will usually get you rude warnings (horn honks) from people behind you. Good practice to check every so often when servicing.

      So while you do bring up valid points, overthinking things like this can get too extreme. Thankfully the faults described do not cause my vehicle to spontaneously destroy itself, whereas lack of a proper computer backup can be catastrophic (to say the least).

    2. G2

      Re: Hindsight is a wonderful thing

      7) do you have all of the above and a SPARE CAR?

      8) do you keep the car engine running (or at least run it a few hours per day), drive it a few miles, and keep it fuelled all the time? (= live backup system, just in case... )

      9) do you have all of the above in a third spare car that's kept running, fuelled and road-worthy all the time on the other side of the continent? (= live backup data being kept in multiple locations)

      and so on... the logistics of these things keeps getting more complex.

  36. Anonymous Coward
    Anonymous Coward

    Dumb

    How can you not notice backups only being a few bytes of data?

    I get database backups can be tricky, but C'MON MAN!!!

    I think there is a job opening for a database admin...
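
    To be fair, noticing requires someone (or something) to actually look. A few lines in the backup job would do it - the path and threshold here are just examples:

      # refuse to call a backup "done" if the newest dump is suspiciously small
      latest=$(ls -t /backups/*.dump 2>/dev/null | head -n 1)
      size=$(stat -c %s "$latest" 2>/dev/null || echo 0)   # GNU stat
      if [ "${size:-0}" -lt 1048576 ]; then                # under 1 MB is suspect
          echo "backup check FAILED: ${latest:-none} is only ${size} bytes" >&2
          exit 1
      fi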

  37. Herby

    Option for 'rm'??

    Maybe if 'rm' is invoked as root (maybe by any user?) with '-rf' in the arguments, it should count the number of files it might delete and say:

    Wow over 1000 files, are you sure?
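
    There's no such stock option, of course, but a bash wrapper along these lines (purely illustrative) gets most of the way there:

      # count the blast radius before handing the arguments to the real rm
      rm() {
          if [ "$1" = "-rf" ]; then
              count=$(find "${@:2}" 2>/dev/null | wc -l)
              if [ "$count" -gt 1000 ]; then
                  printf 'Wow, over 1000 files (%s) - are you sure? [y/N] ' "$count"
                  read -r answer
                  [ "$answer" = "y" ] || return 1
              fi
          fi
          command rm "$@"
      }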

    Me? Typically I do it without the 'f' option and see how it progresses, then abort and re-do with the added '-f' option as needed. I get very careful with recursive descents (with good reason!).

  38. Uplink

    Wipey

    YP.... Wipey... He'll never outrun his name now.

  39. jonfr

    Backup of my backups

    I don't have a company, and I have backups of my backups. I never know when a hard drive will fail. I only back up important stuff that I cannot replace from elsewhere.

    I would like to have doubly or triply redundant backups elsewhere, but I have a limited budget at the moment. I'll just work with what I've got.

    As for the company in question, I think lack of experience produces this type of error, resulting in large-scale problems like this one. Also, poor attention in school when people learn about computers and how they actually work.

    1. wstewart

      Re: Backup of my backups

      Have you tested the backup of your backups? Backing up a corrupt/bad backup will get you exactly where GitLab is. This story is pretty funny though. I'd expect better from a company with their name recognition. That rm -rf that was mistakenly run as part of a replication process is one of the main reasons for automation. Also, I can't stop laughing at "The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented" and "Our backups to S3 apparently don’t work either: the bucket is empty".

      1. jonfr

        Re: Backup of my backups

        The second layer of backups is a cloud service (no good way to test it, but the reported size is correct); the primary backups are fine. I always test them, and the reported hard drive usage is as expected.

  40. Hot Diggity

    Expensive Mistakes

    I'm reminded of someone who made a mistake that cost a company a large amount of money.

    The person concerned was called to the CEO's office.

    "I suppose that you want me to leave the company" he said shame-facedly.

    "Leave? We just spent over $1 million on your education. Just don't do it again!"

  41. dmacleo

    irony...

    GitHub (last time I looked, months ago) hosts backup programs that most likely would have worked on the GitLab database....

    *************************************************************

    Shoot, disregard this - I forgot they were separate entities.

    Left my stupid comment up to make this comment make more sense.

  42. petef

    DVCS

    Nobody seems to have mentioned that this is git. Every checked-out repo has the full history, so the code is intrinsically backed up.
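
    Which is worth spelling out: any clone can repopulate a fresh remote (the URL below is made up), though it only covers the repository itself - issues and merge requests live in the database, which is exactly what got hit.

      cd my-project
      git log --oneline | wc -l        # the full commit history is already local
      git remote set-url origin git@new-server.example.com:team/my-project.git
      git push --all origin            # every branch
      git push --tags origin           # and the tags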
