Microsoft foists fake file system for fat Git repos

To lighten the burden of massive Git source code repositories, Microsoft has created a virtualized file system that allows developers to interact with large codebases without sending excessive amounts of data across the network. Git (when not applied to people or animals) refers to a distributed version control system for …

  1. hellwig

    It's almost as if....

    The concept of "decentralizing" the repository has downsides for large projects. Not sure how it was ever seen as practical for something like the Linux kernel (or apparently Windows?). Isn't the point of technology these days to move everything into the cloud (or at the very least, centralize it)? Git sort of threw that philosophy away.

    At least this new GVFS is an OPTION to use the benefits of Git without the nasty details of storing all that data directly on your PC. But really, aren't there other CMS with this type of option baked in? Aren't they just using Git for the sake of using Git at this point?

    1. Anonymous Coward
      Anonymous Coward

      Re: It's almost as if....

      No.

      See: https://en.wikipedia.org/wiki/Git

      The whole point of the Git architecture is to make it faster, have higher integrity, and support distributed non-linear workflows, and it was built to support Linux. GVFS is produced by Microsoft because of the unwieldy size of their code base and their wish to use Git, AND not tie up their networks and desktops/servers with thousands of clones of the Windows source. Looks like a smart move all 'round, to me. The only time the upstream repo host is required is when you check in changes, or need to sync or grab a clone of what you're working on. If it goes down, that does not cause work to stop everywhere. It just keeps the project from getting updates, or a developer from getting a clone, not a show stopper by any means.

    2. Anonymous Coward
      Holmes

      Re: It's almost as if....

      "Isn't the point of technology these days to move everything into the cloud (or at the very least, centralize it)? Git sort of threw that philosophy away."

      No, let's put this headstanding nonsense back on its feet.

      The closer something is to your processor, the faster you can work on it. L1 cache speed is insane. That 8GB of main memory is pretty quick but the round trip will hurt. The hard disk (assuming it's a disk) is the slowest thing *in your machine*, which is why spilling into swap space is a total deal-breaker, makes a program take years instead of minutes so you go buy more RAM.

      But the slowest, most unpredictable thing above all-- that's the network.

      So you have caching. Local caching on your hard disk of stuff off the network, local caching in main memory of stuff off the hard disk, local caching in the CPU of the stuff in memory. Caches are smaller and faster and more expensive as you get closer to the cores, where all the work has to happen. We get by with using what we absolutely need for the thing or the few things that we're working on *right now*.

      Those nasty details you mention? A git repo is a cache, and it makes everything faster. Except a cache miss, which is always expensive, and in this case is also known as "the conditions immediately preceding 'git clone'".

      Apparently they made it so your own behaviour, unpredictable by the machine, is what decides which parts of the cache burn time getting populated. It makes sense. But you pay a different price: since you didn't sync everything at once, you can't get additional unexpectedly needed files unless you're still on the right network, and of course opening such a file still takes a tad longer than if it were already on disk.

      And the Godforsaken cloud... it's what, a big imaginary hard drive on the other side of the network? But it's even slower, and its reliability is every bit as imagined. The people running that way didn't utterly "throw away the philosophy" of getting more data closer to your CPU, because all these silly apps still have to live in *your* RAM and they still have to phone home, and they hopefully know all about syncing and caching and minimum necessary traffic and whatnot.

      To answer your question, no. The point of technology these days is NOT to move everything into the cloud or centralize it. The point of technology these days seems to be "make everything operate as much as possible simultaneously, in parallel non-interdependent tasks", and that means decentralizing a great deal.

      P.S. clouds are made of dihydrogen monoxide, so how about that?

      1. P. Lee

        Re: It's almost as if....

        >And the Godforsaken cloud... it's what, a big imaginary hard drive on the other side of the network? But it's even slower, and its reliability is every bit as imagined.

        > no. The point of technology these days is NOT to move everything into the cloud or centralize it.

        We're talking about two different areas here. One is coding, where people who know what they are doing, do stuff. The other is Vendorland, where everyone else needs to be turned into a dependent licensee who must be made to pay continually. In Vendorland, we try to stop customers doing things, because if they can do things for themselves, they don't need Vendors.

        The things which make it all work are the tax laws around opex/capex and managerial abstraction theory, which suggests that the "core business" of an organisation is much smaller than it really is.

        However, I think the OP was asking if the "modern" cloud way of doing things would be to give you a terminal to a machine which is closer to the git repo. Think of it like this: Photoshop (= local hardware) may be expensive, but it's worthwhile if your business is retouching photos; for the average Facebooker (low-value photo/hardware usage), it probably isn't a good deal.

        1. Lee D Silver badge

          Re: It's almost as if....

          To put this simply,

          I imagine "git clone" is incredibly fast.

          And then the first time you compile that codebase, it has to download everything that the git clone skipped for the sake of speed, anyway.

          It's a shortcut, a different way of working, that doesn't pay dividends in most things.

          To be honest, you really SHOULD NOT be git cloning huge monolithic projects. If they really have hundreds of gigs of code, they should be breaking that down.

          And then how many projects is one coder really working on at any one time? They are likely to suck down only a handful of active, smaller, git repositories to work on. Not one huge damn thing.

          The Linux kernel is tiny in comparison. But it doesn't contain thousands of tools that MS might be construing as "part of" Windows.

          And when it gets unwieldy, you're going to want to break the codebase down - knocking drivers out of the kernel repo and into their own, for example, so that you're NOT downloading everything every time you want to change one file, patch it, and submit the results.
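
          For what it's worth, a minimal sketch of that kind of split using git submodules - the repo names and URLs here are made up purely for illustration:

            # Hypothetical: move drivers/ out of the main repo into its own repo,
            # then reference it as a submodule so the main clone stays small.
            git clone https://example.com/kernel.git
            cd kernel
            git rm -r drivers
            git submodule add https://example.com/drivers.git drivers
            git commit -m "Split drivers out into their own repository"

            # In a fresh clone, only developers who actually need the drivers fetch them:
            git submodule update --init drivers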

          If MS really have one or even a handful of megalithic repos for something like Windows and Office, they are doing it wrong. And even then, many huge projects use only one repo and don't see the kinds of hits they're talking about here. And even if they were, there are better ways to fix them than what is no more than a UI trick - let's let the clone operation succeed quickly, but after that we still have to do the proper clone in the background as soon as any use is made of it, which would take longer anyway and require extra software, incompatible servers, etc.

          It makes me question Microsoft's coding practices, not Linux's or other projects that use git.

          1. Anonymous Coward
            Facepalm

            Re: It's almost as if.... huge monolithic projects...

            But single codebases are good for global refactoring, so the code doesn't uncontrollably drift.

            A real-world example: sometime between Windows 7 and 10, Microsoft had to change the word 'privacy' to 'user telemetry' and flip the meaning of all the opt out controls throughout the entire Windows codebase without any customers noticing. A single repository for Windows allowed Microsoft to respond in a suitably agile way.

          2. Anonymous Coward
            Anonymous Coward

            Re: It's almost as if....

            Perhaps their workflow is to have centralised build servers which would bypass this problem.

        2. Anonymous Coward
          Anonymous Coward

          Re: It's almost as if....

          I thought the OP was asking why we wouldn't just have a single-source-of-truth in the cloud, as if that would scale. The wording isn't perfect and neither is mine: of course "the point of technology" is to make up tools and toys that people need or want, but I think we both meant "the trend"-- where I meant to imply that in software design or engineering or whatever, the (goal? new normal? only way forward?) is having more cores working on immediately adjacent memory at the same time, and to draw a parallel (pun intended) with software development-- having more programmers working on same codebase-- which git is all about. As for the cloud being a trend: they can have it.

      2. Anonymous Coward
        Anonymous Coward

        Re: It's almost as if....

        A fine piece of Techsplaining there, dbtx. I imagine your coworkers queue up to ask you stuff.

        1. Anonymous Coward
          Anonymous Coward

          Re: It's almost as if....

          Keep imagining. I like to imagine that someday I'll once again have coworkers at all, having recently left behind the nice people who put the pizzas on the other end of the conveyors going through the oven.

          Sorry if the very idea that "taking responsibility for your information means carefully placing it inside a Somebody Else's Problem field" gets me going.

  2. 2Nick3

    GVFS eating its own tail??

    Assumedly the GVFS code is checked into Git on GVFS. What if there is a bug in the GVFS code that takes GVFS offline? How will they get to the code to fix it?

    Way too deep for a Friday afternoon - off to the pub!

    1. Anonymous Coward
      Anonymous Coward

      Re: GVFS eating its own tail??

      I know it's a joke, but just in case :)

      Without really looking at how it works, I would imagine that it's just checked into git like any other git project, stored in exactly the same way, and can be checked out with the normal git client.

      What this does is create a virtual file system on your own system and use extensions to git to get the metadata to populate that virtual file system.

      So it doesn't change how the data is stored in git.
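
      A quick sanity check, assuming the project lives at its public GitHub address: a stock git client can pull down the GVFS source with nothing special installed.

        # Plain clone of the GVFS client source; no GVFS client or virtual
        # file system is needed just to fetch it.
        git clone https://github.com/Microsoft/GVFS.git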

    2. Frumious Bandersnatch

      Re: GVFS eating its own tail??

      Why should it be turtles all the way down? Presumably the GVFS code isn't large enough, or doesn't have enough developers, to warrant being self-hosted.

      Even if it was self-hosted, there's nothing stopping you from making a full clone onto a non-GVFS disk. That's probably what people who have to work away from the office have to do anyway. I think that someone else made a comment about having a single point of failure, but realistically speaking you will have one or more backup clones. The cost of keeping them up to date (on non-virtualised storage) will be trivial. All this does is cut down the overheads for a horde of developers who would regularly clone full repos and not do much work on them.
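
      A minimal sketch of that kind of backup clone with stock git (the server and paths are placeholders):

        # One-off: take a full bare mirror of the repo onto ordinary storage.
        git clone --mirror https://example.com/bigrepo.git /backups/bigrepo.git

        # Periodically (e.g. from cron): fetch only what has changed since last time.
        cd /backups/bigrepo.git && git remote update --prune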

  3. Anonymous Coward
    Anonymous Coward

    Clever. For the interested amateurs among us, this remote work/lazy loading approach is effectively the same solution as Google applied for their internal p4 clone, while Facebook implemented a markedly different model of exploiting the filesystem to minimise the amount of diffing to be done. They seem to be solving different problems, so it'll be interesting to see if Facebook end up ditching hg in the end.

    1. Jim Mitchell

      ACM paper on Google's source code system:

      http://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext

      86TB in one repository

  4. ZenCoder
    Terminator

    Wow ... they must have a record of every change to every file that ever existed.

    Maybe someday machine archaeologists can study how they achieved intelligence partly by adapting a misplaced, badly malformed, and dyspeptic Microsoft Bob into a brain.

  5. This post has been deleted by its author

  6. aaaa
    Alert

    Proves Git is unsuitable for commercial dev work

    Linus was clear that he designed Git to take the pressure off him personally, by making more work for everyone else (https://marc.info/?l=linux-kernel&m=111288700902396) - it's great for him, and maybe for a lot of other open source 'community' projects.

    MS have bastardised Git so that it's no longer distributed, and therefore you don't get the performance or workflow benefits (after the initial overhead of a 'clone') of working 'offline'. The only reason to do this is because you are 'forcing' everyone to merge into a central source frequently anyway - which is what Linus was trying to avoid.

    OK - it remains 'compatible' so presumably someone can still use a 'normal' git client to do a 'full' checkout, but I wonder why MS don't just use TeamSystem - it's their own product after all, surely they would prefer to make changes to improve their own product? Is this a sneaky way of them warning the market that TeamSystem is going away?

    All distributed version control lacks fine-grained ACLs and fine-grained auditing (because commits can be made to 'local' repositories then merged up as 'fat blobs') and many other security and IP protections that commercial software development normally requires.

    1. Doctor Syntax Silver badge

      Re: Proves Git is unsuitable for commercial dev work

      "Is this a sneaky way of them warning the market that TeamSystem is going away?"

      Or is it an acknowledgement that git has become the predominant version control system?

    2. Frumious Bandersnatch
      Thumb Down

      Re: Proves Git is unsuitable for commercial dev work

      re the title: no it doesn't

      I clone various Linux kernel trees quite regularly. It can be a bit of a pain over slow links, but once I have the clone, I can pull updates with minimal hassle.

      MS isn't "bastardising" git, either. Neither is it forcing a centralised model on developers. It's using lazy fetches to minimise the amount of downloading that individual devs need to do before they can start bashing on the code. Granted, if they want to actually *compile*, they'll need to do more fetches, but not, one would assume, the full repo + history. Anyway, a few things:

      * The basic copy-on-write semantics are still there (developer's local edits are still local until pushed back and they still have to be merged back in in exactly the same way as before)

      * Nobody is forcing anyone to use this file system, since they can still use regular clone to a local, non-virtual disk

      * This is probably aimed at intranet deployment, where it should definitely help reduce unnecessary traffic (though I guess if it's well-designed, with well-thought out security, you could also use it on the wider net)

      It's a file system, not a fundamental change to git itself, hence it's not enforcing a centralised development model, nor proving that git is fundamentally flawed.

      1. Dan 55 Silver badge

        Re: Proves Git is unsuitable for commercial dev work

        There is some change to git; TFA says at the end that it requires a protocol change.

    3. Ian Ringrose

      Re: Proves Git is unsuitable for commercial dev work

      I expect MS wanted all the benefits of Git for the part of the source code the developer is working on, which could include files in many different folders. The build script will make use of prebuilt binaries for the 99.999% of Windows the developer is not changing.

      I expect that very quickly all the files the developer needs in order to work will be on their machine, giving the benefits of a distributed system while only distributing the subset of code the developer is using.

    4. Planty Bronze badge

      Re: Proves Git is unsuitable for commercial dev work

      No it doesn't. We use enterprise Git here (Atlassian Stash, AKA Bitbucket Server), and it's superb. It knocks the socks off anything else that we have used, including commercial products costing 20x the price.

      It also works really well simply because it's what everyone knows these days. The Git DVCS and branching model is what 99% of developers coming to us are familiar with.

  7. Ilsa Loving

    Why?

    If you're having to resort to these kinds of tricks with your repos, doesn't it kind of defeat the whole purpose of using git in the first place? You're now dependent on a central server repo.

    You may as well use SVN or Mercurial, which never had this issue.

    1. This post has been deleted by its author

    2. Anonymous Coward
      Anonymous Coward

      Re: Why?

      1) They don't define a central repo. Maybe there's one or five or whatever that each already "has" everything, but I'm sure that, just like everyone "has" their own work, they can pull permanent changes around between each other without pushing back to the mothership (see the sketch after this list), else it wouldn't be git.

      2) SVN does define a "center", and that's one of its main problems - write permissions and the associated politics, according to Linus himself, IIRC.

      3) Mercurial never had this issue? OK, but did it ever have such demands on it? Really, I'm asking. ISTR a fairly large cross-platform open source project chose Mercurial over git because of better disaster recovery and more & better tools available on Windows, at the time. But it wasn't this large.
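
      By way of illustration, point 1 with stock git - the remote name, host and branch are invented:

        # Add a colleague's clone as a remote and pull their branch directly,
        # without either side touching the "central" server.
        git remote add alice ssh://alice-box/home/alice/src/project.git
        git fetch alice
        git merge alice/feature-x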

    3. Anonymous Coward
      Anonymous Coward

      Re: Why?

      "...You're now dependent on a central server repo."

      Not entirely. You're still able to work on files you've previously touched or used. Given that you're likely to download everything related to what you're working on in order to compile and test your own work, the impact of repo downtime is minimal.

  8. Mark Simon

    Good thing Git is Open Source

    That Microsoft were able to adapt Git to suit their own needs is largely due to the fact that Git is not itself a Microsoft product.

  9. Adam Connelly

    Repository too large?

    Obviously it's impossible to say without actually knowing the structure of the repository, and exactly what's stored in it, but it basically makes me think that they're trying to bend git to be used in a way that it wasn't intended.

    Before git, it was a lot more common to have a single repository with all your code inside it, even if that meant you were storing multiple unrelated products in the same repo. With git, the approach tends to be an individual repo for a single self-contained component.

    Comparing the Linux kernel with Windows probably doesn't make much sense. It would make more sense to compare Windows to a particular Linux distribution. All the code that makes up a Linux distribution certainly isn't stored in a single source repository. I know that's partly because of the nature of open source development, but it also makes more sense to me to split up code into smaller, self-contained components like that. Hosting sites like GitHub basically encourage that by making it so easy to set up new repos.

    It sounds like a cool concept regardless, but to me one of the main advantages of git is that it allows me to work offline, and I'm also not worried about our central repo dying because every developer has a complete copy of the repo that's going to be at most a few days old. So if I'm understanding the way the lazy loading works, it's not really a feature I would actually want because then you end up with the appearance of having the code available, but as soon as you lose your network connection you're totally screwed.

    1. Ian Ringrose

      Not that bad.

      The lazy loading will very quickly get you all the code (and its history) you need for the task you are doing. At that point you can work offline. (Git also makes it easy to have 101 central servers that track each other, and you only need one of them to be working.)

      However, a larger benefit of git is being able to have local branches etc. and combine them with other people's changes without "publishing" them to a central server.

  10. JulieM Silver badge

    Turned Full Circle

    The whole idea of local copies of repositories was to keep everything decentralised and avoid single points of failure.

    There was a file system extension for the Amiga that allowed you to pretend to write to a CD-ROM, by having a hidden directory on the HDD (or even a floppy disk! Remember floppy disks?) where the changes were stored. I'd be surprised if Linux didn't already have something similar (you can install packages onto an Ubuntu Live system .....) that could be used to mount a network share read-only and store altered files locally.

    Resynchronising it all after the event is no less a nightmare, of course. Worse, if a remote file gets accidentally altered (this is Microsoft we're talking about ..... assume they would do that).

  11. stephanh

    ClearCase competitor?

    This seems very similar to ClearCase's MVFS (MultiVersion File System). Wonder if Microsoft actually wants to compete with ClearCase on this terrain.

  12. Frumious Bandersnatch

    Another option

    Just adding an observation: git allows for shallow clones with 'git clone --depth 1'

    For big projects, this won't have the same level of bandwidth saving as a custom lazy file system (as here) but it can still have huge savings over doing a full clone of a repo with a long history.
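
    For example, a shallow clone against an ordinary git host (the URL is just a placeholder):

      # Shallow clone: fetch only the most recent commit, not the full history.
      git clone --depth 1 https://example.com/bigrepo.git

      # If more history turns out to be needed later, deepen it incrementally:
      git fetch --deepen 100      # or: git fetch --unshallow for everything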

  13. bharry_msft

    Actually Team Foundation Server and Visual Studio Team Services both support Git hosting (and the GVFS protocol). So they aren't going away - just getting better.

    On the issue of removing the "D" from DVCS, I think we need to explain that better. I'll work on that, but I'll give a short answer here. We're not removing the "D" at all. On the server, at least, it's just Git (with a few very small restrictions like end-of-line translation). You can connect to it with any Git client and it has every single property Git has. That same repo that you can use a normal Git tool against, you can also use a GVFS-enabled client against. You choose. Now, if the repo is big enough, you might choose to use the GVFS client. It still retains all Git semantics - except, if you go offline, you only have access to what you've cached locally. If you've already cached everything you need, you are golden. Over time, I think we can look at adding tools for managing what is cached, even for people with really big repos who want to disconnect.

    Regardless of whether you disconnect or not, though, all the Git versioning behavior and the ability to create remotes to entirely unrelated repos, etc. are still there. It's just Git with the ability to download stuff on demand and a ton of performance improvements. And if you don't like it, don't enable GVFS and just use Git without GVFS. Your choice.

    Brian Harry

    Microsoft

    1. Ramazan

      I clone my repos by git+ssh. If I don't want to store 240GB locally, I just ssh to server and build it there. Or mount the repo via sshfs. There are also alternatives like CodaFS that can cache files locally, so...
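
      For instance, the sshfs route looks something like this (the host and paths are made up):

        # Mount the server-side checkout locally over SSH instead of cloning 240GB;
        # only files that are actually opened cross the network.
        sshfs builduser@buildhost:/srv/src/bigrepo ~/bigrepo -o reconnect

        # ...edit and build against ~/bigrepo, then unmount when done:
        fusermount -u ~/bigrepo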

      If I understand correctly, to effectively operate a GVFS repo you need the proprietary MS Team Services (WTF is that?) and an open-source GVFS-enabled git client.

      "GVFS relies on the GvFlt filter driver, available as a prerelease NuGet package with its own license."

      Another proprietary piece of shit?

  14. Anonymous Coward
    Thumb Up

    Welcome back Clearcase...

    ...if you ever really went away.

  15. batfastad

    MS

    I like the way they realised they've got a problem once a git checkout hits 3 hours. Presumably they were happy with 1-2hr checkout times?

    Maybe that's what Windows update has just been doing in the background all these years!

  16. batfastad

    Git Submodules?

    http://bfy.tw/4L93

    1. ThomH

      Re: Git Submodules?

      One suspects that the Windows repository has been ported from repository to repository going back to time immemorial and is not modular because an earlier system did not allow it to be modular, and it'd now be an unfathomable amount of work to refactor.

  17. Planty Bronze badge
    Stop

    Microsoft: you are doing it wrong

    A 270GB repo for Windows. That sounds like they have binaries in there, which is always bad news (Git LFS sorts that out).
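
    For reference, moving binaries to LFS is only a couple of commands (the patterns here are just examples):

      # Track binary file types with Git LFS so the repo itself stays small;
      # the large files are stored out of band and fetched on demand.
      git lfs install
      git lfs track "*.dll" "*.iso"
      git add .gitattributes
      git commit -m "Track binaries with Git LFS"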

    Rather than creating a software solution to a non-problem, why not just sack the clown in charge of your source code system?

    It's also very telling that they aren't using the TFS backend... That's the REAL story here, and it should raise major alarm bells for anyone else that does, or is planning to, use it.
