'Urgent data corruption issue' destroys filesystems in Linux 4.14

A filesystem-eating bug has been found in Linux 4.14. First reported last week by developer Pavel Goran, the problem struck bcache, a tool that lets one use a solid state disk drive as a read/write cache for another drive. bcache is often used to store data from a slow disk on faster media. Goran noticed the problem after …

  1. Bronek Kozicki Silver badge

    Do not use ".0" release. And if you do, you should know what you are doing. So, hats off to Pavel Goran for volunteering to run it, and then identifying a serious filesystem bug.

    1. StephenTompsett

      Versions...

      General rule of software versions:

      Odd numbers good - probably fix bugs that have been reported since last release.

      Even numbers bad - introduce new features and bugs!

      1. Jonathan 27

        Yeah...

        If you believe that, I've got a bridge in Scotland to sell you. Version numbers are often arbitrary.

    2. gerdesj Silver badge
      Linux

      "Do not use ".0" release. And if you do, you should know what you are doing. "

      He's a Gentooer (like myself but far more knowledgeable). You don't run Gentoo and shy away from .0 software. To be honest you normally embrace pre-release, let alone released. That's how bugs get found.

      You have to repair your systems from time to time in new and amusing ways but Gentoo is great fun. In winter it will even keep you warm when you do an update so you can turn down the heating.

      1. Anonymous Coward
        Anonymous Coward

        "In winter it will even keep you warm..."

        perhaps... but paradoxically I think opening my Windows makes me warmer...

  2. David Roberts Silver badge
    Trollface

    Just when

    Linus had apologised for fucking swearing.

    Incoming in 3 2 1

    1. Gene Cash Silver badge

      Re: Just when

      Linus isn't going to be the only one swearing when they lose data...

      1. fajensen Silver badge
        Trollface

        Re: Just when

        Depends, Depends - what if the data is incriminating data? Maybe, I dunno, this is yet another way to slurp some cryptocurrency and not get nailed over it?

  3. J J Carter Silver badge
    Linux

    That's open source or you...

    Many eyes, but all looking at pr0n!

    1. CAPS LOCK Silver badge

      Re: That's open source or you...

      @Carter - really?, this again? Surely even you can recognise that this is a case of 'many eyes' finding and fixing the problem more-or-less instantly.

      1. Chris 155

        Re: That's open source or you...

        No it wouldn't.

        This would be the case of an early adopter getting their data munched. This would have been found just as fast in closed source.

        Possibly the cause was found earlier and a resolution released earlier because of open source. Possibly.

        This is the kind of shit that shouldn't get through at all though.

        1. CAPS LOCK Silver badge

          "This would have been found just as fast in closed source."

          Or would it? Here, for example, is a bug that affected me back when I used Win 7: https://social.technet.microsoft.com/Forums/windows/en-US/13a7426e-1a5d-41b0-9e16-19437697f62b/windows-7-64bit-corrupting-altering-large-files-copied-to-external-ntfs-drives?forum=w7itproperf

          It's hard to say how long the bug existed for before being found, but a fix was a long time coming, in fact I don't know if this was ever fixed. It's part of the reason I moved to Linux.

          1. Anonymous Coward
            Joke

            Re: "This would have been found just as fast in closed source."

            It's part of the reason I moved to Linux.

            Talk about throwing the baby out with the bath water.

            There's a work-around - use a 3rd-party tool to copy a file!!

            1. Captain DaFt

              Re: "This would have been found just as fast in closed source."

              It's part of the reason I moved to Linux.

              Talk about throwing the baby out with the bath water.

              There's a work-around - use a 3rd-party tool to copy a file!!

              So, your advice is to fix an OS problem by using something else?

              Uh, that's what he did, isn't it? ☺

              1. Anonymous Coward
                Anonymous Coward

                Re: "This would have been found just as fast in closed source."

                > Uh, that's what he did, isn't it?

                Guessing you missed the joke icon next to the post? :)

            2. the Jim bloke Bronze badge

              Re: There's a work-around - use a 3rd-party tool to copy a file!!

              People would consider file management to be a pretty fundamental part of the operating system - so you are saying windows OS is unfit for purpose..

              Couldn't agree more.

    2. Anonymous Coward
      Anonymous Coward

      Re: That's open source or you...

      This is another success story of open source.

      1. Sandtitz Silver badge
        WTF?

        Re: That's open source or you...

        "This is another success story of open source."

        Was that irony? If MS had this data corruption bug in Windows there would be dozens of commenters here telling how "Why isn't MS testing their crapware", or "MS is letting end users test their crap", with everyone upvoting each other.

        1. Anonymous Coward
          Anonymous Coward

          Re: That's open source or you...

          If MS had this data corruption bug in Windows

          I'm a customer of Microsoft, I give them money. My company (among 1000s of others) shovel money in Microsofts bank. They have $millions available for testing and QA.

          And, if the report came externally, it would have been ignored until 10 more people reported it. The patch (if any) would be applied next month.

          1. Michael Habel Silver badge

            Re: That's open source or you...

            That's why they let the Plebs with Free Win X do the testing for them...

        2. Anonymous Coward
          Anonymous Coward

          Re: That's open source or you...

          @Sandtitz

          So, you don't recognize that there are some big differences between the latest release of Linux and Windows? Linux does an excellent job of providing a variety of releases in various stages of development. Windows, the OS used on most business desktops/laptops it appears, has become a beta. If you use a significantly recent version of Linux you can expect similar. If you use a Linux release that is a little older you'll have greater stability. With Linux, whether you've chosen a newer or older release, you'll have software updates and bug fixes very regularly and for a whole range of things. With Windows, one gets the feeling that you get updates and bug fixes when convenient for Microsoft and sometimes, seemingly, only because of the embarrassment of OSS OSs having reported and fixed said issues already.

      2. Bitbeisser
        FAIL

        Re: This is another success story of open source.

        No, it isn't. A success story would have been if that bug had never made it out in the wild. Stuff like this should be tested and found before it ever gets released...

        1. Chairman of the Bored Silver badge

          Re: This is another success story of open source.

          @bitbeisser,

          Respectfully disagree here. Professional software designers do test extensively; and believe me - open or closed source the devs are pros who take pride in their work.

          Bugs in the wild though will happen due to the sheer complexity of the system - for any decently complex system a full factorial experiment of all potential decision paths is infeasible in any reasonable length of time. One is literally trying to prove a negative.

          Suggested link for starters: https://users.ece.cmu.edu/~koopman/des_s99/sw_testing/

          What separates the men from the boys is how you handle a bug or design flaw. Ten days cycle time on a single report is v good.

    3. Doctor Syntax Silver badge

      Re: That's open source or you...

      El Reg, please provide a broken record icon. J J Carter needs it.

    4. Anonymous Coward
      Anonymous Coward

      Re: That's open source or you...

      bcache isn't something the majority of users use (or even heard of)*

      It took ~10 days for this bug to be discovered, fixed, and released. I think that's pretty good.

      I'm not sure what you have against open source. If it's because you're a monkey-dancing Microsoft fan-boy then you should know Microsoft are github's largest org, with the most contributors. If there are other reasons, then please let us know.

      *it's used to turn an SSD into a cache for HDDs

    5. Michael Habel Silver badge
      Pint

      Re: That's open source or you...

      Yummm tasty North Sea Pr0nz!

  4. jake Silver badge

    Slackware-current ...

    ... fixed. That was quick, thanks Pat & Co.

    1. Voland's right hand Silver badge

      Re: Slackware-current ...

      This is one of those moments when you start admiring the perseverance of OpenWRT, Debian, etc. in staying with a kernel long term release for as long as possible.

      Living on the edge is for err... people who like living on the edge...

      1. disgustedoftunbridgewells Silver badge

        Re: Slackware-current ...

        You start admiring distros who stay with long term releases as soon as you need to do anything important.

      2. jake Silver badge

        Re: Slackware-current ...

        I live on -stable for important stuff, I'm not an idiot. That would be kernel 4.4.88-smp at the moment. But I also run -current on a couple of spare boxen, and report any errata I run across, along with workarounds/fixes if I can. It's called giving back. Try it, you might like it. Or you can just bitch about those of us who do, if that makes you feel good about yourself.

        1. Doctor Syntax Silver badge

          Re: Slackware-current ...

          "Or you can just bitch about those of us who do, if that makes you feel good about yourself."

          I didn't see any bitching, just a reminder to use the appropriate distro for the task in hand. Running a bleeding edge distro is a bit like running Windows Insider.

          1. jake Silver badge

            Re: Slackware-current ...

            I think that's the first time anyone has suggested that Slackware might be "bleeding edge" in about 20 years ...

          2. hplasm Silver badge
            Devil

            Re: Slackware-current ...

            "Running a bleeding edge distro is a bit like running Windows Insider."

            But without the dancing, high-fives and vacuous expressions...

            1. jake Silver badge

              Re: Slackware-current ...

              It occurs to me that folks might not know how Slackware does things. Essentially, there is an LTS version called slackware-stable, with a very stable, solid software package (if not the most modern), and a "work in progress" version called slackware-current that is a kind of rolling release, aiming to be the next -stable. More at slackware.com/info/ and slackware.com/changelog/ if you're interested.

            2. Def Silver badge

              Re: Slackware-current ...

              But without the dancing, high-fives and vacuous expressions...

              He said Windows, not MacOS.

              1. Anonymous Coward
                Anonymous Coward

                Re: Slackware-current ...

                Just dread, dissatisfaction, and vacuous expressions then.

              2. Anonymous Coward
                Anonymous Coward

                Re: Slackware-current ...

                "He said Windows, not MacOS." To be frank since winX the only difference is the size of their wallets

        2. Anonymous Coward
          Pint

          Re: Slackware-current ...

          It's called giving back

          Thanks.

          I'd dread to think what this industry would be like if it weren't for Linux (and OSS in general).

          1. FIA

            Re: Slackware-current ...

            I'd dread to think what this industry would be like if it weren't for Linux

            The horns would be less curly.

            (oh, and you'd get a free toasting fork too.)

  5. BinkyTheMagicPaperclip Silver badge

    Crap, I think I am using that..

    Not lost any data so far, but it's not that rare to see drives drop out of an mdraid for no discernible reason, including one time where a mirror completely failed to assemble despite one of the devices it used being a partition on the SSD the system had just booted from(!).

    You say Windows isn't so great, but I've found its software RAID to be absolutely rock solid with sensible defaults. Not so Linux RAID, it's a pain in the arse. Going to put my backend file server on FreeBSD/ZFS..

    1. Mike 'H'

      Re: Crap, I think I am using that..

      ZFS-on-Linux works pretty damn well in my production systems since 2012..

      1. BinkyTheMagicPaperclip Silver badge

        Re: Crap, I think I am using that..

        How are you finding performance? I did a little reading around and there were some concerns over maintaining kernels and the level of performance.

        I'm running Salix, so currently doing custom kernel builds..

        If it is ok it would make a lot of sense to move to it, as medium term I want to use FreeBSD as a base once the functionality I need is included. I've plenty of ECC memory to spare..

    2. LeeE Silver badge

      Re: Crap, I think I am using that..

      Yeah, I've had random problems with mdraid too. But then h/w raid is far from perfect these days, especially when rebuilding arrays after a failure, now that 'disks' have become so large; it's likely to take a couple of days and then fail to successfully complete anyway. I much prefer JBOD based systems now, and make multiple backup copies (using standard os tools - not proprietary packages) frequently.
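
A rough way to sanity-check the "couple of days" figure above is to divide the disk capacity by a sustained rebuild rate. The sketch below is only a lower bound under assumed numbers: the capacities and rates are illustrative, not measurements from any system in this thread, and a real rebuild on an array that is still serving I/O will usually be slower.

```python
def rebuild_hours(capacity_tb: float, rate_mb_per_s: float) -> float:
    """Lower bound on rebuild time: every byte of the replaced disk is rewritten once."""
    capacity_bytes = capacity_tb * 1e12      # decimal terabytes, as disk vendors count
    rate_bytes_per_s = rate_mb_per_s * 1e6   # sustained rebuild rate (assumed)
    return capacity_bytes / rate_bytes_per_s / 3600


# Illustrative (assumed) capacities and rebuild rates
for capacity_tb, rate in [(1, 100), (8, 100), (8, 40)]:
    hours = rebuild_hours(capacity_tb, rate)
    print(f"{capacity_tb} TB at {rate} MB/s: about {hours:.1f} hours")
```

With these assumed figures, a large disk rebuilt at a throttled rate does land in the multi-day range the comment describes.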

      1. BinkyTheMagicPaperclip Silver badge

        Re: Crap, I think I am using that..

        Thanks for confirming it's not just me (although I knew it wasn't hardware based, as I've seen the same symptoms on multiple systems..). One system I'm running on hardware RAID is on RAID50, so you can run the odds of losing one disk from each span..
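
"Running the odds" on RAID50 can be sketched with a little arithmetic. RAID50 stripes across several RAID5 spans, and each span survives one lost disk, so after a first failure the array dies only if the next failure hits the same span. The sketch below assumes failures are independent and equally likely on every remaining disk, which real arrays (same batch, same age, rebuild stress) often violate; the span layouts are hypothetical.

```python
from fractions import Fraction


def second_failure_fatal(spans: int, disks_per_span: int) -> Fraction:
    """Chance that a second disk failure lands in the already-degraded span.

    Assumes the second failure is equally likely on any surviving disk.
    """
    total_disks = spans * disks_per_span
    # The degraded span still has (disks_per_span - 1) working disks,
    # out of (total_disks - 1) working disks overall.
    return Fraction(disks_per_span - 1, total_disks - 1)


# Hypothetical layouts: (number of spans, disks per span)
for spans, per_span in [(2, 3), (2, 4), (3, 4)]:
    p = second_failure_fatal(spans, per_span)
    print(f"{spans} spans x {per_span} disks: fatal with probability {p} (~{float(p):.0%})")
```

More spans of the same size lower the chance that a second failure is fatal, at the cost of more disks spent on parity.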

  6. kryptylomese

    "with sensible defaults" :)

    1. BinkyTheMagicPaperclip Silver badge

      Yeah.. If you shut down a system with an mdraid RAID10, and on bootup one of the drives isn't there, it won't establish the RAID by default on the infinitesimal grounds that something might be corrupt, except for the fact it stopped and started in exactly the same state.

      How is that remotely sensible? Not to mention that there is some combination of operations where it's possible for a mirror not to start up at all (one member offline, the other member has a very brief blip perhaps). That's not even getting started on the ridiculous rebuild times when a disk magically becomes slightly out of sync.

      Windows is a little finicky about whether it does certain things on standard vs dynamic disks, but other than that it's quite straightforward.

      Still it's possibly better than ESXi which can't be arsed to implement any software RAID at all..

      1. Anonymous Coward
        Anonymous Coward

        @BinkyTheMagicPaperclip

        Have you reported these bugs?

        1. BinkyTheMagicPaperclip Silver badge

          The shutdown/startup issue is 'working as designed', there's a kernel boot parameter to allow an array to start up in a degraded state. I don't agree with this, but it's easy to work around.

          The long rebuild times, as far as I can make out, are normal.

          As to the occasional dropouts and the mirror failing to establish itself, the latter has only happened once - nothing much in the logs. Haven't looked at increased logging for the other case.

  7. pqa

    It doesn't appear as though the patch made it into 4.14.1.

  8. Anonymous Coward
    Anonymous Coward

    I have just encountered a simple fault on AWS. Maybe something related to this.

    I made a 1 character change to a file.The change was effective, and the timestamp for the file was correct. Went back the next day and it had reverted to the original file & timestamp. I have repeated this again today with the same result.

    1. kain preacher Silver badge

      Ouch, sounds like that can get nasty. Imagine if that was a config file.

  9. Henry Wertz 1 Gold badge

    difference here though...

    "If MS had this data corruption bug in Windows there would be dozens of commenters here telling how "Why isn't MS testing their crapware", or "MS is letting end users test their crap", with everyone upvoting each other."

    And rightfully so, usually; on Gentoo it's typical to run either very recent or bleeding edge versions of almost every package. You would not have seen this bug running any typical Linux distro. People tend to make comments regarding Microsoft's mistakes (and upvote them!) when people run the regular release of Windows, update it, and run into big problems; not when they are running something like the Win10 bleeding edge channel and run into problems.

  10. EnviableOne Bronze badge
    Go

    Maybe....

    If Linus spent more time looking at the code and less time apologising to Chris Kees, he might have spotted this one .....

  11. pqa

    The fix is now in 4.14.2.
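
Going only by the comments above (the patch missed 4.14.1 and landed in 4.14.2), a quick check of the running kernel's version string might look like the sketch below. The function name `is_affected` is made up for illustration, the check looks at the version number alone, and it says nothing about whether bcache is actually in use or whether a distribution has backported the fix.

```python
import platform
import re


def is_affected(release: str) -> bool:
    """True if the release string is a 4.14 kernel older than 4.14.2.

    Based solely on the version numbers reported in this thread.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        return False  # unrecognised format, e.g. a non-Linux platform
    major, minor = int(match.group(1)), int(match.group(2))
    patch = int(match.group(3) or 0)  # "4.14" is treated as 4.14.0
    return (major, minor) == (4, 14) and patch < 2


release = platform.release()  # e.g. "4.4.88-smp" on the Slackware box mentioned above
status = "predates the 4.14.2 fix" if is_affected(release) else "is outside the 4.14.0-4.14.1 range"
print(f"Kernel {release} {status}")
```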
