Recently I copied 60 million files from one Windows file server to another. Tools used to move files from one system to another are integrated into every operating system, and there are third party options too. The tasks they perform are so common that we tend to ignore their limits. Many systems administrators are guilty of not …
cygwin + bash + rsync
You could have set up cygwin with bash and rsync on the Windows machines as well. Rsync is better than cp when it comes to moving files between systems.
So you've tried this and it works with 60 million files have you?
"So you've tried this and it works with 60 million files have you?"
So you've tried this with 60 million files and it hasn't? Have you?
Nope, not tried it, that's why I'm not suggesting it as a solution ... What's your reason?
So you've tried this and it works with 60 million files have you?
I have used rsync extensively. One of the things I commonly use it for is duplicating entire system installs from one system to another: I can do an initial copy while the source system is live to get 99% of the data, and then only need to copy the differences once I have shut down the source system. I have a busy mailserver which uses the maildir format (one file per message), and that had more than 60 million small files on it when I migrated it to new hardware.
You're missing the important point
You've tried this in cygwin on a Windows box have you? And it works?
No one is disputing that rsync works on Linux. The article is about how to copy that many files from a Windows box. If someone suggests doing it in cygwin, it's a more useful suggestion if they actually know whether it works, rather than leaving someone else to do their work for them.
In theory any of the other tools discussed in the article should also work. In practice they didn't.
It has options for 'synchronising' two folders, including file permissions.
Dump the text output into a file then search for 'ERROR', or better, pass it through FINDSTR to filter out anything but lines beginning with 'ERROR'
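For anyone who would rather script the filtering step than pipe through FINDSTR, here is a minimal sketch of the same idea. It assumes the copy tool's output was saved to a text file and that error lines start with the word ERROR, optionally after a timestamp; that pattern and the function names are assumptions for illustration, not a documented log format.

```python
import re

# Assumed pattern: an optional "YYYY/MM/DD HH:MM:SS" timestamp, then the word ERROR.
ERROR_LINE = re.compile(r"^\s*(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\s+)?ERROR\b")


def error_lines(lines):
    """Return only the lines that begin with ERROR (optionally after a timestamp)."""
    return [line.rstrip("\n") for line in lines if ERROR_LINE.match(line)]


def error_lines_in_file(path):
    """Read a saved log file and return its ERROR lines."""
    with open(path, encoding="utf-8", errors="replace") as log:
        return error_lines(log)
```

Adjust the pattern to whatever your log actually looks like before trusting an empty result.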
Sure, it bombs out on folder/filenames longer than 256 characters, but those are an abomination and the users who made them (usually by using filenames suggested by MS Office apps) really need to be punished anyway.
Been pushing around a few million files this spring and summer...
Robocopy DOES support paths longer than 256 characters... In fact there's a flag ( /256 ) that *disables* "very long path (> 256 characters) support".
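Whichever tool you end up using, it can help to know in advance which paths are over the limit. A rough sketch, assuming a plain directory walk is good enough for your tree: `long_paths` is a made-up helper name, and the default threshold of 260 is the classic Windows MAX_PATH figure (the thread talks about 256), so pass whatever limit your tool really has.

```python
import os

MAX_PATH = 260  # classic Windows API limit; treat the exact figure as an assumption


def long_paths(root, limit=MAX_PATH):
    """Yield every file or directory path under root whose length is >= limit."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            if len(full) >= limit:
                yield full
```

On Windows the walk itself may stumble on over-long paths depending on the Python version and system settings, so treat a clean result with some suspicion.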
robocopy also has the /create option, which copies the files with 0 data. OK, a total operation takes 2+ passes, but it has several advantages:
1. 60 million files WILL cause the MFT to expand. If you are filling the disk with data as well as MFT, then you can end up fragmenting the MFT, which can lead to performance problems. If the disk is only writing MFT (such as during a create), then the MFT can expand to adjacent space.
2. Since there is no data, the operation completes in a fraction of the time, so if you log the operation, you can see where any failures will occur, and fix them for the final copy.
For planned migrations, you can run the same job several times (only run the /create the first time), and therefore the final sync should take a lot less time :)
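As a sketch of the multi-pass approach described above, the helper below only builds the two robocopy command lines; it does not run anything. `/E`, `/CREATE`, `/MIR` and `/LOG:` are real robocopy switches, but the function name, the log file names and the choice to script this from Python are assumptions for illustration.

```python
def robocopy_passes(source, destination, log_prefix):
    """Build the command lines for a /CREATE pass followed by a /MIR pass."""
    # Pass 1: directory tree plus zero-length files only, logged for later review.
    create_pass = [
        "robocopy", source, destination,
        "/E", "/CREATE", f"/LOG:{log_prefix}-create.log",
    ]
    # Pass 2 (repeatable): mirror the data. /MIR also deletes destination
    # files that no longer exist in the source, so point it carefully.
    mirror_pass = [
        "robocopy", source, destination,
        "/MIR", f"/LOG:{log_prefix}-mirror.log",
    ]
    return create_pass, mirror_pass
```

Each list can be handed to `subprocess.run` on a Windows box. Robocopy reports success with non-zero exit codes (values below 8 are not failures), so don't treat any non-zero return as an error.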
...use Cygwin? Gives a lot of the *nix style commands and abilities right within Windows.
If one is going to go to the bother of having a Linux box sitting about just to show the Windows servers how it's done, one has to question why one even has the Windows servers in the first place. :-)
Perhaps it's because the windows servers aren't used simply to copy files?
And you've tried copying 60 million files in cygwin have you? How did that go?
Easier: Use Services For Unix, that one uses an Interix subsystem to run all your UNIXy stuff.
Get a grip
I asked if "cygwin" had been considered, I did not say "Use cygwin! It's the wins! L0LZ!!11!" The author had looked at various tools and not mentioned "cygwin", so my question seems perfectly reasonable to me (others have asked the same question).
You seem to have taken umbrage at a few "cygwin" related posts. Do you have an issue with this tool? (I used it for some light-weight ssh and rsync work and really like it.) If you do, have you filed bugs, got involved?
Or do you know something about "cygwin" and large jobs? "Oh, you can't use cygwin for that because its job index will overflow, see bug-1234".
Or are you just some reactionary pillock who can't see through the Windows? "!Microsoft==Bad"
I know which conclusion I am drawing at the moment...
You're jumping to the wrong conclusions. The point is simply that before trying, you'd expect half the tools tried in the article to work on Windows. However, they didn't. So it's helpful to make clear whether you know the solution you're suggesting works or not. It sounds like you don't. Also, you didn't ask whether cygwin had been considered. Re-read what you posted. It's nothing to do with Windows / Linux / OS of choice. I work mostly on BSD and Linux derivatives for a living. Oh, and I like cygwin.
Pull out hard drive.
Move hard drive to other server.
Put in hard drive.
Or backup to tape and then restore.
Using something like tar or rsync may well have been better than cp
All well and good until you have a RAID array
and some added bonuses ...
+1 This has the most desirable side effect of stopping any bugger from trying to update the "wrong" file, or creating new ones while the copy is in progress.
p.s. Don't forget the second part of any professional data copying activity is to VERIFY that what you copied actually did turn out to be the same as what you copied from. Many a backup has turned out to be just a blank tape and an error message without this stage.
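A minimal sketch of that verification step, assuming both trees are reachable as ordinary paths from one machine: it hashes every source file and reports anything missing or different on the destination. The function names are made up for the example, and for 60 million files you would want something less naive than reading everything twice.

```python
import hashlib
import os


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(source_root, dest_root):
    """Return (relative_path, problem) pairs for files missing or different in dest_root."""
    problems = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, source_root)
            dst = os.path.join(dest_root, rel)
            if not os.path.isfile(dst):
                problems.append((rel, "missing"))
            elif file_digest(src) != file_digest(dst):
                problems.append((rel, "different"))
    return problems
```

An empty list means every source file has a byte-identical counterpart. It says nothing about permissions, timestamps or extra files on the destination.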
Did I mention that the systems were both live and in use during the copy?
Also, RAID card on the destination server was full.
And you do know the price of a tape drive, right? Company I worked for bought a pair of Ultrium-4 drives from IBM. The bloody thing itself costs a little over US$3000 a pop, and you need two. If you work in a company where accounts is a /b/tard, you'll know how painful it is to get them to approve the upgrade for a drive, let alone two drives.
Err, how do you do that with a SAN or NAS?
Clearly you've never worked in a decent sized enterprise.
There are LOTS of scenarios where you can't just move the disks or use backup hardware. Maybe not all the data on one disk is being moved. Maybe the data is on different SANs or NAS and can't be swapped or zoned. Maybe you can't attach the same type of tape to each server.
Most obvious conclusion ever
"So the best way to move 60 million files from one Windows server to another turns out to be: use Linux."
D'oh? On a serious note, you should also give rsync a go. And try also _storing_ the data on an OS that is not windows, you might find that you are then able to do things you need to do, like say copy 60m files, without resorting to booting virtual machines of another OS just to use that OS's utilities to manage your main OS. Just saying. Please don't flame me.
How about a straight cmd line like xcopy?
You're learning, Trevor!
> So the best way to move 60 million files from one Windows server to another turns out to be: use Linux.
It's only a matter of time before you see the light!
screen -R big-copy
rsync -avzP -e ssh user1@box1:/src-path /dst-path   # run this on box2: rsync cannot copy between two remote hosts
Go for a beer or 500 ;-)
Not easy once you try doing a file transfer via rsync through an ssh tunnel, like you're suggesting, but the destination server isn't running an ssh server... let alone use / as a path convention.
Wrong ... you mean, it's running Windoze.
"Not easy once you try doing a file transfer via rsync through an ssh tunnel, like you're suggesting, but the destination server isn't running an ssh server... let alone use / as a path convention."
Well, if the target system is running Linux, then turn on its sshd service!
If it's running Windoze ... well, borrow another PC, boot an appropriate Linux live CD, mount the MS Shared folder as CIFS, start the ssh daemon, and rsync through the temporary Linux system to the MS system.
Linux: the system that provides answers and encourages creativity.
Windoze: the system that erects obstacles and encourages stupidity.
large file transfers are always a challenge
Personally, I swear by Directory Opus, by GPsoftware.
I can attest to its incredibly reliable performance, error handling and insanely flexible advanced features.
Aside from being able to copy vast quantities of data, handle errors, log all actions, migrate NTFS properties, automatically unprotect restricted files and re-copy files if the source is modified, it also has built-in FTP, an advanced synchronisation feature (useful for mopping up failed files after you've fixed the problem that stopped them being copied), and a truly unparalleled batch renaming system which, among other things, can use Regular Expressions.
It also has tabbed browsing (you can save groups of tabs), duplicate file finding, built in Zip management, custom toolbar command creation, file/folder listing and printing....
Strangely, not a lot of sysadmins know about DOpus. I learnt of it during my Amiga days, in what seems like a lifetime ago. I always have a copy installed on my workstation, and on at least one of my servers.
Directory Opus is great ...
... but I doubt it would copy 60 million files.
I second (third? fourth?) the vote for cygwin.
Rsync is your friend, especially when you've got the ssh tools loaded so you can use scp. :)
Come to the light, Trevor! :)
Good story - quite an eye opener
I like it! I'm often frustrated dealing with 10,000 small files in Windows, never mind 60m! On a desktop Windows 7 is painfully slow displaying 2000 photos in a single folder. It shows them straight away (detailed view, not thumbs) but then takes 20 secs to sort them by date modified! Aah!
But why do you have 60m files? Could you store that data in a better way? Could it be put into a database for example?
why have 2000 in a single folder
Surely you could come up with some kind of organisation: holidays, family etc. The file system is hierarchical for a reason!
Are you one of those people who has hundreds of files dropped directly into the root of c: too?
Or cluttered the desktop with icons and icons and more icons?
Are you one of those people who wastes your life endlessly taxonomising?
And hierarchical taxonomies are largely at odds with reality. Here is a photo of an interesting building I saw on holiday. Does it go in the "building" folder, or the "holiday" folder?
Store 60m files in a database? SharePoint perhaps? Not quite as easy to access/control/backup as an NTFS storage tree. Sorry.
Why? Because.
> Here is a photo of an interesting building I saw on holiday.
> Does it go in the "building" folder, or the "holiday" folder?
Nah. It goes in the 2010-08-24 folder, tagged with 'holiday' and 'architecture'
2000 in one folder
Because 2000 can quite easily fit on one memory card these days, and a long enough holiday can also generate that many.
From Windows to Windows I'd use robocopy.
For Unix to Unix use rsync.
Both of these are very fast and support all sorts of failures where a copy is restarted (the existing files are skipped).
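To illustrate what "the existing files are skipped" buys you on a restart, here is a minimal sketch of a resumable tree copy. It is not how robocopy or rsync are implemented; it just applies a similar quick check (same size and same modification time means skip) using hypothetical helper names.

```python
import os
import shutil


def _same(src, dst):
    """Quick check: destination exists with the same size and whole-second mtime."""
    if not os.path.isfile(dst):
        return False
    s, d = os.stat(src), os.stat(dst)
    return s.st_size == d.st_size and int(s.st_mtime) == int(d.st_mtime)


def resumable_copy(source_root, dest_root):
    """Copy a tree, skipping files already copied. Returns (copied, skipped)."""
    copied, skipped = 0, 0
    for dirpath, _dirnames, filenames in os.walk(source_root):
        rel_dir = os.path.relpath(dirpath, source_root)
        target_dir = os.path.normpath(os.path.join(dest_root, rel_dir))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            if _same(src, dst):
                skipped += 1
                continue
            shutil.copy2(src, dst)  # copy2 also carries over the modification time
            copied += 1
    return copied, skipped
```

Run it a second time after an interruption and only the files that are missing or that changed in the meantime get copied again.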
From the article:
I wanted to give several command-line tools a go as well. XCopy and Robocopy most likely would have been able to handle the file volume but - like Windows Explorer - they are bound by the fact that NTFS can store files with longer names and greater path than CMD can handle. I tried ever more complicated batch files, with various loops in them, in an attempt to deal with the path depth issues. I failed.
robocopy can handle large filenames
....which makes me very interested in what he was doing, and why robocopy wasn't an option
I've managed to use robocopy to create files I couldn't delete from windows before (because I'd gone from c:\ to d:\somefolder\someotherfolder and that pushed the bottom of the folders past the filepath limit)
I've used Robocopy for 60m files
Many times, it is fast and the /MIR option is awesome since a failure for any bizarre under the hood reason can be left and corrected at the end. Also if users are updating files whilst you copy, the final run picks up all those changes.
I'd be tempted to use rsync - its fallover behaviour is more reliable than cp's.
It can copy over ssh (or rsh), its own network protocol, as well as between local directories (or NFS / Samba mounted shares).
Ability to copy file and folder names exceeding 256 characters — up to a theoretical limit of 32,000 characters — without errors
It's all about preferences?
These things tend to end up as personal preferences so whether Linux or Windows my favourite tool is rsync and if I'm copying more than a single file within a Linux box I choose it over using cp any day.
I use a tool called Richcopy. It can do multi-threaded copying (copying a couple of files at a time), supports x number of retries, and gives you a handy readout at the end on what files failed/had an issue/etc. It also has a compare feature to compare the two locations once you're done.
You did read the article, did you? Richcopy is not only mentioned, it was the only tool that worked under Windows for Trevor.
"Richcopy can handle large quantities of files, but can multi-thread the copy, and so is several hours faster than using a Linux server as an intermediary"
"Richcopy is not so fast that there is time to defragment an NTFS partition with 60 million files on it before CP would have finished."
Richcopy copies more than one file a time.
If your destination is a Windows server, then doing this causes massive fragmentation. So the total time to finish the copy is "time to copy" + "time to defrag". The goal is not just to get files from server A to server B, but to get them there in a fashion that ensures that server B is ready for prime time.
Pretty simple. Richcopy's multi-threading approach is faster but results in a massively fragmented filesystem which you then have to defrag. So in total, the Richcopy route is slower, because Richcopy + defrag takes longer than cp.
I don't know what the state of rsync servers on windows is, but on unix systems it's the one true way for copying files over a network.
It's fast, handles failure well, doesn't get its knickers in a twist when doing large recursive copies, and will run over ssl with a bit of work. It can also give you plenty of feedback about progress if you need it.
The idea of copying that many files using something like a file manager or cp, no matter how good, just fills me with horror.
Roadkil's Unstoppable Copier
Is great too.