GVFS sounds super dumb
Help me out here, what's the point? It sounds like Git's antiparticle.
Microsoft has adopted Git to manage the vast collection of code that is Windows' source, and has shared performance issues it's had to fix along the way. The state-of-the-nation report for what Microsoft calls the “largest Git repo on the planet” follows on from its launch of the “fat Git repo” handler, the Git Virtual File …
There are tons of companies which store source to all their products, and presumably their tax returns and porn stash, in a single giant repo. Perforce and ClearCase actually encourage such a way of working.
Now those companies may want to use "git", but of course not to the extent that they would change their way of working and split up their repo a bit. So now they can buy "Microsoft git" which presumably has some token integration with actual git, but for the 30gb repo support you have to use Visual Studio and not the normal git client.
I suppose "embrace and extend" is still a thing at MS.
"but for the 30gb repo support you have to use Visual Studio and not the normal git client."
First of all, you can check out the gvfs source code yourself on github
Secondly, VS 2017 simply uses git.exe to do git related tasks. VS 2015 took a different approach, putting all the git functionality inside a dll, which was probably convenient at the time but ate quite a lot of memory (the VS team's biggest sin is ignoring 64-bit support for over a decade now).
AFAICT gvfs is simply a layer under git that allows the developers to avoid pulling in the entire repository. Few developers are likely to touch the entire code base, yet the build servers probably need the whole thing.
OTOH, according to the github page, the latest version of Windows 10 *is* a requirement. So some OS support seems to be needed for this to work. I have no idea if this can be ported to other operating systems.
Is it more feasible to force the build servers into pulling thousands of repositories at build time? It would surprise me if the answer is 'yes'.
Now oldcoder, your comment is really, really dumb.
git is running. All the time. All the change tracking is happening. If you ever need the information, it will be downloaded when you need it. When you don't need the information, it's not downloaded. It's still all there in git.
You have 100 teams working on different things. They all use one git repository. Everything any team member looks at is always there - only the things they don't look at are not downloaded.
Remember what you're dealing with...
C:\>cp file file.old
'cp' is not recognized as an internal or external command,
operable program or batch file.
cp exists in Powershell. If you're still using the DOS-like console in Windows in 2017, you deserve everything that's coming to you.
>Powershell is the bee's knees. I was late to come around to it, but I'll never use cmd.exe again.<
And look how many keystrokes I can save by typing 'cp' instead of some verbose COBOL crap like 'copy', intended to make scripting 'easy', so that 'you don't need to be a dev to do scripting'.
As if making readable scripts ever worked. That's the problem with readable scripts: it makes people think that anyone can do it.
Perforce and ClearCase actually encourage such a way of working.
That's because they can. Perforce was designed to scale, and has always scaled much better than Git. Splitting things up into multiple repositories is just admitting that Git doesn't scale while quietly pretending that the added inconvenience is somehow better.
To be honest, I still don't understand the fascination with Git. The only thing it has going for it (that I can see) is it's free. And at the end of the day, as with all things, you get what you pay for.
I use Perforce at home (after being introduced to it 12 years ago at work) and the number of times I've had to Google how to do something in all that time I can count on one hand. With Git (when it screws up - and it does screw up) I find myself Googling at least once a month (usually when it craps out during a rebase or merge and I need to retrace my steps and try again). The perceived cost versus the actual costs just don't add up in my mind.
To be honest, I still don't understand the fascination with Git. The only thing it has going for it (that I can see) is it's free.
Cult of St. Torvalds, I think.
My impression of git is that it feels like something a programmer whipped up in a week or so to scratch an immediate itch, without any thought to user-friendliness or scaling. Which is of course exactly how it originated. He needed to get off BitKeeper ASAP when their license terms became onerous, so he threw something together.
"My impression of git is that it feels like something a programmer whipped up in a week or so to scratch an immediate itch, without any thought to user-friendliness or scaling"
It scales better than SVN and the design is pretty neat - if you bother to understand it. Which takes some effort, as it is indeed quite unconventional (think two-dimensional hierarchy, where one dimension are files and other is commit history). However you make a good point that it was indeed whipped in a hurry, hence upvote.
Give the guy his due.
He wanted to continue using Bitkeeper. Lots of people in/around Linux used it and paid for it (even if they didn't always have to).
Then the owner of the company that makes Bitkeeper decided to be a twat because someone of Samba fame started to reverse-engineer its proprietary formats so they could integrate with it.
He pulled the rug, the software was made unavailable.
So Linus knocked up an alternative in a few days, that pretty much sent Bitkeeper scrambling and now even Microsoft use it, and Bitkeeper is nowhere to be heard of. Since the very early days, it's been almost entirely other people - including Microsoft - developing git, but you have to admire the way that was done.
"Okay, you won't play ball any more, despite it being nothing to do with us kernel developers at all? Okay, I'll write an alternative that's more focused on our process, better for us, and does things yours can't. Oh, look, there it is, done. Bye!".
There aren't many people who can re-write an independent implementation of a large commercial product overnight, that ultimately leads to nobody even touching the other software any more, and Microsoft basing product lines and their entire development process on it.
Here's an interview with Linus about it
I like git, mostly. I like that it tries to parallel the file system in its use. I like that it makes sense on the command line, doesn't _need_ a daemon and can just be moved by file copies. I am sure that some other version controls do some things better. But it's free, pretty good at what it does and allows you a lot of growth if you want to become expert at it. Subversion never clicked with me, Sourcesafe sucked and Clearcase makes me wonder how its creators feel about creating such a loathed piece of software. So best, in my limited experience, by a long shot
You are being daft here. There is no "Microsoft git". There is git, with all the git commands that you know, using a virtual file system on the client. I'm right now making a living using a git repository of 100 MB. These guys have a repository of 300 GB. Without that virtual file system, git can't handle it. I'll congratulate Microsoft for using a very smart approach to a difficult problem.
I too thought of ClearCase when I read this, with rather mixed memories. I do remember working on a large team where ClearMake really came into its own though, pulling in libraries on the fly that others had compiled. I wonder whatever happened to it... but not enough to google and find out.
I can think of several possible points:
1) If your git repository is 300GB (perhaps because you have several decades of spaghetti dependencies in there) then you don't want to pull it all in at once. The usual DVCS approach of "grab the repo and party on dude" doesn't scale. (Yes I've heard of re-factoring and technical debt. Apparently, despite re-writing Windows from the ground up with every major release, MS haven't.)
2) If your toolchain doesn't support git, you need to make it look more like a normal part of the file system, because everyone supports "normal files". So MS have written a filesystem driver that does that. (According to the blog, they intend to ditch this approach in the longer term, in favour of building git support into NTFS. What ... the ... fsck! Can you spell "retrograde"?)
3) Having done 1 and 2, your next problem is that you don't have all the files locally and still need wire access to the originals, so some kind of proxy might be nice.
I can see that purists might reckon that all this is solving the wrong problem, but if the Right solution is quickly re-factor 300GB of source code then I can also see that MS might be forgiven for pursuing this approach. When you are up to your nose in shit, opening your mouth to call for help isn't necessarily the thing you do first.
SQL Server for Linux is essentially running on a compatibility layer (like Wine, but not Wine) and Visual Studio for Linux is Xamarin Studio renamed. Microsoft isn't even really attempting to change, they're just pretending they are. Using Git is just the best option for them at the moment; they've used a lot of source control systems in the past.
".....despite re-writing Windows from the ground up with every major release, MS haven't."
I've never heard this from Microsoft. That would be like saying Linus writes Linux from the ground up with every major release. It's just utterly stupid and incorrect.
The Achilles heel for Git is that you must pull ALL the repository in order to use any of the repository. Various ways exist to work around this issue - shallow clones, submodules, subtrees, repo etc. but nothing is very good.
I suppose the idea for GVFS is that when you do a clone of Windows, you don't transfer 300GB of crap to your machine before you even start. Instead you "clone" and the filesystem looks like the files were fetched but the fs only fetches a file's contents on first read. So if you're working on one DLL with 100 files you don't need to download the gazillion other files in the codebase.
Clearcase (contender for the worst source control system ever invented) did this too with a thing called a dynamic view. The difference in Clearcase's case was the dynamic view could change while you were using it if someone else committed files to the same view. Enjoy trying to debug problems when header and sources keep changing underneath you.
At least GVFS would behave like Git in that what you see isn't going to change unless you pull / fetch / merge. I'd like to see how MS intend to open this up outside of themselves though.
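The fetch-on-first-read idea described in this comment can be sketched as a toy model. This is only an illustration of the concept, not how GVFS is implemented; the class name `LazyWorkingTree`, the in-memory `server` dict and the file paths are all made up for the example.

```python
class LazyWorkingTree:
    """Toy model of a virtualised checkout: every path is known up front,
    but contents are fetched from the 'server' only on first read."""

    def __init__(self, server_files):
        # server_files: dict mapping path -> bytes, standing in for the remote repo
        self._server = server_files
        self._local = {}          # paths whose contents have been fetched
        self.fetch_count = 0      # how many files were actually downloaded

    def listdir(self):
        # Listing is cheap: only names (metadata) are needed, not contents.
        return sorted(self._server)

    def read(self, path):
        if path not in self._local:
            # First read: "download" the contents and cache them locally.
            self._local[path] = self._server[path]
            self.fetch_count += 1
        return self._local[path]


# A pretend repository with 10,000 files; the developer touches only two.
server = {f"src/module{i}/file.c": b"int x;" for i in range(10_000)}
tree = LazyWorkingTree(server)

print(len(tree.listdir()))          # every path is visible
tree.read("src/module1/file.c")
tree.read("src/module2/file.c")
tree.read("src/module1/file.c")     # repeat read is served from the local cache
print(tree.fetch_count)             # only the files actually read were fetched
```

The whole tree looks present, yet the download cost tracks the files you open rather than the size of the repository.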
>Clearcase (contender for the worst source control system ever
Contender? You're being unduly generous and magnanimous. More like so far ahead that no one else is in the same game.
Add to it quite possibly the worst GUI ever inflicted on users. And the crappiest and flakiest backend Windows services.
Clearcase the worst? Pfeh, I see you never tried Visual SourceSafe. Combine the primitivity of RCS with the complexity of Clearcase - or perhaps it was just Microsoft's MFC-era designers' capacity to overcomplicate things by exposing the wrong things to the user - and you are close.
Full disclaimer: I have only ever encountered VSS briefly because, well... see point above. Clearcase might have caused more problems to the world by luring a team in until it is too late, but we are talking about the worst source control system, not the most evil.
No, it's a 300GB repository, not a 300GB code base - which includes all the branches, and large teams like those working on Windows usually make extensive use of branches, unlike most smaller projects, which mostly work on a single one and maybe use branches only to mark releases and some maintenance.
That's actually a bit unclear, if the total repo is 300gb or a single branch. But note that git pulls one branch at a time so the relevant number for scalability is the size of a single branch.
To put into perspective how ridiculously large this is: the source code for the entirety of Debian is about 270GB. And that contains a vast suite of applications: everything from EDA tools to several office suits to multiple browsers to compilers to FPS games. A total of 28 thousand different packages. Windows is big but not that big
Given this, it is almost a certainty that the 300GB is not just source code. Perhaps it contains the entire build chain. Perhaps they are storing build artefacts in the repo.
I guess it depends if you only do "source code" control or "whole version" control.
It seems likely that they hold everything in there so you can track the code, the compiler settings, the resources and of course the test results.
As others have noted there will likely be different branches for "Home" "Small Business" "Enterprise" editions as well
OTOH I'm not so sure that includes Office, Dynamics or the languages.
But Debian doesn't handle, for example, the whole Linux kernel repository, and all its commits/branches, I guess it just pulls some of it. The same is true for other projects, when they are hosted elsewhere and not directly by Debian.
I have some open source projects inside my VCS for libraries and applications where I need to build the latest versions, which are not directly supported by Debian - but I just pull the stable releases, not the whole commit history.
Inside the Windows repository there are probably all the versions of Windows they need to support (which may stretch down to XP, if it's still on paid support), the upcoming ones, plus the SDKs and related development and build tools.
Two different businesses, working in a different way.
"the source code for the entirety of Debian is about 270GB. And that contains a vast suite of applications: everything from EDA tools to several office suits to multiple browsers to compilers to FPS games."
I'd love to be able to modify the behaviour of certain office suits.
Where do I get the sources?
> But note that git pulls one branch at a time ...
Hmmm, not exactly. When doing a "git clone" (by default) it will grab the entire remote repository (all branches) and set that up locally.
"One branch" is only what gets pulled after that when you're updating things (e.g. "git pull"). And that's only if you've not set up further tracking between your local branches and remote ones (e.g. git checkout -b somebranch --track).
So, "one branch at a time" is kind of yes and no, but mostly not really. ;)
Yeah, I was about to comment on that myself.
I think it's a sad display if you're selling items and then don't use them for your own setup. I mean: doesn't that tell us something about the items you're trying to sell us? I'm always very keen on that myself.
Back in the iPaq days the CEO of Compaq would give speeches and all, and what was that one small detail which managed to catch my eye? He didn't use an iPaq, no way: he often used pen and paper to jot down notes. Errr, ok.... So it wasn't that revolutionary product which everyone could use after all, eh?
Microsoft, back in the days (1990 - 2000), relied on Unix (Sendmail) to handle all their e-mail. Because Exchange just couldn't handle it, rumor even has it that they had tried to implement it a couple of times but that Exchange completely crashed because it simply couldn't handle the load. Now: in all honesty we need to keep in mind that Exchange was more than an MTA alone, so my example is a little bit flawed, But even so...
And there are tons of examples. When a company tries to sell you a product and it then turns out that they're not using it themselves, I think something isn't quite right with the product ;)
I think it's a sad display if you're selling items and then don't use them for your own setup.
Well in all fairness the article does mention Azure being used for the storage, so it's more than likely they're actually using Team Foundation Server which happens to support Git as one of the underlying VCSs.
Re: so it's more than likely they're actually using Team Foundation Server which happens to support Git as one of the underlying VCSs.
The absence of any mention of TFS gives rise to concern: is this the first indication that MS will be discontinuing TFS support for non-Git VCSs, and does this mean that TFS will become even more of an MS shim over Git?
IIRC they also used to run an AS400 for their warehouse management, back when they were still monopolizing shelves with (mostly) empty boxes.
Of course now they've had 17 years to integrate the 2 software packages that make up MS Dynamics I'm sure it's up to the job
Crashed exchange servers were very common.
Specially when it ran out of storage. For some reason it never could seem to send a reject message when it had insufficient space for the message - and crashed instead.
Took out an entire organization's mail system for about a week with that one - 5 redundant servers, all crashed with the same message, and it only required about 15 minutes to do.
Got accused of attacking their servers... until it was pointed out that the message crashing them came from their own staff sent to our server to forward to theirs, and that the message was requested by one of the managers in their organization. (it happened to be an 8MB photograph of the staff).
Some exchange versions had the charming feature of being able to run out of storage space even if the disk WASN'T full. For example, the Small Business Edition of Exchange 4.0 was limited to 16 GB. If you hit that cap, the server simply keeled over, and you had to use external tools to remove enough messages to get the database down to size.
Microsoft, back in the days (1990 - 2000), relied on Unix (Sendmail) to handle all their e-mail.
It used to be good practice to put a separate MTA between Exchange and the outside world, because it really wasn't designed to cope with the wild world of the Internet. Exchange 4.0, for example, would only reject mail for invalid users *after* completing the SMTP session, making it a huge source of backscatter spam. I used to run an Exim server that would query the exchange server and then reject mail to invalid accounts before the SMTP session ended. It was also a handy place to do spam filtering.
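The difference between rejecting during the SMTP session and bouncing afterwards can be sketched roughly like this. It is a simplified simulation, not real Exim or Exchange configuration; the function names, the `VALID_USERS` set and the addresses are hypothetical. The 250 and 550 reply codes are standard SMTP.

```python
VALID_USERS = {"alice@example.com", "bob@example.com"}  # hypothetical mailbox list


def rcpt_reject_early(recipient):
    """Front-end MTA behaviour: check the recipient during the SMTP session."""
    if recipient in VALID_USERS:
        return "250 OK"
    # A 550 during the session leaves the failure with the sending server,
    # so this side never generates a bounce message.
    return "550 No such user"


def rcpt_accept_then_bounce(recipient, envelope_sender, bounces):
    """Behaviour the comment attributes to Exchange 4.0: accept first, bounce later."""
    if recipient not in VALID_USERS:
        # The mail was already accepted, so a bounce goes to the (often forged)
        # envelope sender. That bounce is the backscatter.
        bounces.append(envelope_sender)
    return "250 OK"


bounces = []
print(rcpt_reject_early("nobody@example.com"))
print(rcpt_accept_then_bounce("nobody@example.com", "victim@example.org", bounces))
print(bounces)
```

In the second case an innocent third party whose address was forged receives the bounce, which is why rejecting before the session ends matters.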
Visual SourceSafe was discontinued many years ago. Anyway, like the article says, it was never much used within Microsoft; SourceDepot was used instead. Anyway, in the late 1990s, VSS was still better than nothing - as I saw many teams working without a VCS at all... and its GUI was a good way to get people learning version control, instead of just using clumsy command lines, which usually led people to think version control was difficult, merging very dangerous, etc. etc.
Visual Sourcesafe got re-branded and hidden inside TFS. It's called TFSVCS..... It's still as broken and as bad as ever. It's slightly better at not corrupting itself, but it's got all the limitations it's always had...
If you are using TFS and didn't pick GIT backend, you are essentially running Sourcesafe with a new UI....
VSS was essentially a file-based solution that worked through shares - and that was one of its main weaknesses. It was also quite unusable and unreliable over slow connections for that reason.
It required a quite careful maintenance, avoiding large repositories, and a quite reliable network to minimize issues.
TFS is database-based, and works through HTTP. I wouldn't be surprised if they re-cycled part of the VSS code, though.
Not that CVS was an astounding piece of code back then, or lacked issues of its own. CVSNT was slightly better, but it was essentially a one-man product and delivered its share of trouble as well.
More expensive solutions like Perforce were better, but far less available, especially in small companies and small teams.
Anyway, back in those days even VSS or CVS (SVN would have been available only in 2004) were far better than *no VCS at all* - I saw more than one team just making copies of files to some server shares - the worst situation I encountered was when the shares were on the department manager PC, on a single disk, with no redundancy at all....
SourceSafe had the same issue as Access: Not being a real server, so relying on the network file system to handle multi-user concurrency - bad idea.
TFS on the other hand uses a proper SQL Server Database to store things. It has absolutely nothing to do with VSS anymore and in fact Git is now the preferred source control system in TFS. All in all TFS is probably the best and fastest developing ALM tool around today.
TFS needs to be the fastest, it's years behind other offerings like the atlassian suite of products.
Issue tracking in TFS 2017 is still horrible, and the TFS web UI for pull requests is clunky, ugly and unintuitive. The whole lot feels like a quickly lashed up product where the glue is falling apart.
A well set up JIRA/Bitbucket/Bamboo/Confluence suite, whilst it might be a little less integrated (although they do integrate pretty well considering they are all stand-alone), will still vastly outperform and outspec TFS, cost less, and (in our experience) have vastly superior uptime and far better user feedback.
"Are you sure of that? There's this little product called Team Foundation Server that is really VSS under the hood. Give me Bazaar, Subversion or if necessary Git over it any day."
VSS was discontinued, the last released version was 2005.
Saying that TFS is really VSS under the hood is like saying a Proton is a Lotus under the hood because some small parts are common.
For a start TFS is not just source control, secondly users can choose a source control provider for TFS (GIT is one of the options)
That aside TFS does include TFSVCS which probably does share some small parts with vss but to say it's the same is plain wrong.
"Anyway, in late 1990s, VSS was still better than nothing"
My mileage varied. Back in '98 a PHB decided to tidy up our VSS repo by deleting source to EOL products... Sadly he didn't know that deleting a file in VSS meant that the file, its entire history and all previous revisions were also deleted. CVS was (and still is) better than that and it costs nothing.
Possibly the *worst* source control package ever.
One thing which strikes me - as someone whose job has descended into asking awkward questions of the marketing brigade ...
Where is the "AI" fairy dust for source control ?
Even going back 15 years, I was looking for source control systems that understood the semantics of the source they were shepherding, and were able to think not in file terms, but module, procedure and function terms.
We need a new icon ... "the future has failed us"
But it does sound like they have some serious structure issues. Have they not heard of git submodules? https://git-scm.com/book/en/v2/Git-Tools-Submodules
For the real world, standard git is pretty fast, a 3m LOC project, 40 developers, 4 years worth of branch history, pure source, no binaries, a clone from fresh is about 30 seconds, branch switch is instant. This is luxury compared to our previous SCM IBM Synergy, where a clone would take about 3 hours (same network!!!), And a reconfigure -a branch switch (which would usually fail), about 20 minutes (when it failed , you deleted and spent 3 hours checking out a fresh copy).
GIT and an enterprise wrapper to provide user control, pull requests workflows and structure, it's light years ahead of other offerings. It's also very clear Microsoft are dumping TFSVC (which is essentially sourcesafe) as the backend and moving towards GIT as TFS favored backend.
"But it does sound like they have some serious structure issues. Have they not heard of git submodules? https://git-scm.com/book/en/v2/Git-Tools-Submodules"
You see, submodules are a hack that is needed because git cannot handle 300 GB repositories. (Actually, it can handle them just fine, but on my work machine I cannot even clone a 300 GB repository, and it would saturate the network for ages if I bought a bigger machine).
What Microsoft has done is just a clever trick to use the whole complete git, without having to do stupid things like submodules that you don't actually want. Why would you have thousands of developers learn about submodules, getting them right, when they can just work with the whole thing?
This must be some other linux from some other dimension as the current kernel branch is 184MB (compressed) 730MB uncompressed, not including all branch history, not including all the rest of the OS and all the applications and services and drivers that come with the OS.
"Linux repository is less than 200MB, that's all sourcecode and all branch history."
Latest kernel 4.11.3 takes about 650MB when unpacked. There are several kernel branches supported concurrently so that 650MB is still way off.
I'm sure just the Windows kernel itself would be in the same ballpark. That 300GB also contains the user interface, web browsers, all sort of applications that come with a Windows install and so forth. Maybe even bitmaps too. Does it also contain every supported Windows version as well (XP, 7, 8.1, 10) and the different branches for each one? The article didn't mention.
Perhaps a more proportional comparison would be to check the default installation of Ubuntu (or Mint or some other popular distro intended for general populace) and count the total size of the repos for all those programs, libraries and such.
For the real world, standard git is pretty fast, a 3m LOC project, 40 developers, 4 years worth of branch history, pure source, no binaries, a clone from fresh is about 30 seconds, branch switch is instant.
The fast switch works because you have to download the entire repository first, so it's really just shuffling files around on local disk. That's pretty brilliant for ~600 MB of kernel source code. It's less brilliant when that means downloading and storing 300 GB on each and every workstation.
My experience with 'git clone' on a project of any substantial size is that it's best to start it and then go do something else for a while.
All the branches are of course deltas.
Our codebase of 1 branch is about 300mb, the history for ALL the deltas for all the branches is about 70mb extra. That 70mb overhead allows me to sit on my boat and work, branch, merge, work totally offline, switch branches to my heart's content and basically at the end of the day push my work.
Cheers. Feel sorry for the suckers on TFSVCS in their hot cube farm... They have to deal with TFS crap model, and need a permanent connection to work!
"“O(modified)” – instead of the number of files read, key commands are proportional to the number of files a user has current, uncommitted edits on"
Well, that's what Perforce does. The server keeps track of what you have and what you change. It thus only transfers in new/changed files on sync, and only uploads changed files on checkin.
And guess what "Source Depot" is? MS bought a source license for P4 years ago, and SD is the result.
So, basically, having gone all agile-y and adopted Git, they are retrofitting it with all the grown-up features that P4/SD had.
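A rough sketch of what "proportional to the number of modified files" means in practice: record edits as they happen, so that status and check-in never have to scan the whole tree. This is a toy model only; the class `TrackedWorkspace` and its methods are invented for illustration and do not reflect the actual Perforce, Source Depot or GVFS implementations.

```python
class TrackedWorkspace:
    """Toy workspace where edits are recorded as they happen, so 'status'
    never has to walk every file in the tree."""

    def __init__(self, paths):
        self.files = {p: "" for p in paths}
        self.modified = set()     # paths with uncommitted edits

    def write(self, path, text):
        self.files[path] = text
        self.modified.add(path)   # record the edit at write time

    def status(self):
        # Cost depends on len(self.modified), not len(self.files).
        return sorted(self.modified)

    def commit(self):
        changed = self.status()
        self.modified.clear()
        return changed            # only these files would be uploaded


ws = TrackedWorkspace(f"file{i}.c" for i in range(100_000))
ws.write("file7.c", "int main(void) { return 0; }")
ws.write("file42.c", "/* edited */")
print(ws.status())
```

With 100,000 files in the workspace, `status()` still only looks at the two entries in the modified set.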
Some divisions of Microsoft are not part of this project. Based on a comment in an earlier posting about this migration, the Bing group does use git, but doesn't keep its code in this super-repo. Apparently, they use sub-repos in their setup, and again apparently it has been a major pain in the metaphorical ballsack to keep everything in sync.
This is the big problem with sub-repos; you end up having to manually keep things in sync, whereas a "one big repo" solution makes it a lot harder for a dev to commit something to the head of their component that won't build against the head of every other component.
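To make the sync problem concrete, here is a minimal sketch, assuming a parent repo that pins each sub-repo to a specific commit. The component names, commit ids and the helper `stale_pins` are all hypothetical.

```python
def stale_pins(pinned, heads):
    """Return components whose pinned commit differs from the sub-repo head.

    pinned: commit id the parent repo records for each component
    heads:  current head commit id of each component's own repo
    """
    return {name: (pinned[name], heads.get(name))
            for name in pinned
            if pinned[name] != heads.get(name)}


# Hypothetical state: 'shell' has moved on, but the parent still pins the old commit.
pinned = {"kernel": "a1b2c3", "shell": "d4e5f6", "browser": "0a0b0c"}
heads = {"kernel": "a1b2c3", "shell": "ffee00", "browser": "0a0b0c"}

print(stale_pins(pinned, heads))
```

In a single big repo there is only one head to build against, so this kind of bookkeeping does not arise in the same way.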
Biting the hand that feeds IT © 1998–2020