From: Sam Vilain (sam_at_vilain.net)
Date: Fri 06 Aug 2004 - 01:53:43 BST
Jörn Engel wrote:
>>There is vunify which is part of util-vserver package. What is
>>better for general usage is Sam's unify-dirs script. It is located
>>Just use this without the -l or -i options and it will just do hard
>>links, which is what you want for the current implementation of cow
>Darn! Doesn't work for me yet.
>One personal problem is a slow (USB1) 300GB hard-drive that contains
>some identical files. I was thinking about hashing only the first 4k
>or so of each file and then do a direct comparison in case of hash
>collision. Even with sha1 over the complete file, there is no
>guarantee that a hash collision means two identical files.
The chances of bits on your hard drive platter randomly losing their
magnetism or capacitors in your RAM losing charge and changing are
probably higher than two different files having an SHA1 collision :-).
Hey, maybe *that's* why I get those random reiserfs corruptions!
Hashing only the first block of the file as an optimisation is a
The script could be easily modified to do this as a separate step,
however bear in mind that it will only even consider checking the file's
contents if the files already have the same owner/group/permissions,
relative path and file size. My assumption was that if these all match,
the files are probably going to be the same anyway.
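
As an illustration of the staged check discussed above, here is a minimal Python sketch. It is not the unify-dirs code; the function names (stat_key, head_hash, find_identical) are made up for this example. It assumes the caller passes only regular files, and it ignores the relative-path condition for brevity. Files are first grouped by cheap metadata, then by a SHA1 of the first 4k, and finally confirmed with a full byte-by-byte comparison, so a hash collision can never cause a wrong match.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def stat_key(path):
    """Cheap key: permissions, owner, group and size (no contents read)."""
    st = os.lstat(path)
    return (st.st_mode, st.st_uid, st.st_gid, st.st_size)


def head_hash(path, nbytes=4096):
    """SHA1 of only the first nbytes of the file."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read(nbytes)).hexdigest()


def find_identical(paths):
    """Return lists of paths whose metadata and contents are identical."""
    by_stat = defaultdict(list)
    for p in paths:
        by_stat[stat_key(p)].append(p)

    groups = []
    for candidates in by_stat.values():
        if len(candidates) < 2:
            continue  # nothing to compare against, so no file is read
        by_head = defaultdict(list)
        for p in candidates:
            by_head[head_hash(p)].append(p)
        for remaining in by_head.values():
            # A matching head hash is only a hint; compare whole files.
            while len(remaining) > 1:
                first, rest = remaining[0], remaining[1:]
                same = [first] + [
                    p for p in rest if filecmp.cmp(first, p, shallow=False)
                ]
                if len(same) > 1:
                    groups.append(same)
                remaining = [p for p in rest if p not in same]
    return groups
```

On a slow disk the point of the middle step is that most non-identical candidates are rejected after reading 4k rather than the whole file.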
>Also, I want a database with all already known files. Ultimately this
>could be turned into a daemon that watches the complete fs tree for
>changes and turns new files into cowlinks shortly after creation.
>With such a daemon, "cp -r" will temporarily flush part of the page
>cache, but have the same result as "cowcopy -r".
Nice idea, but I think on UNIX that's pretty much a can of worms with no
easy answer. You'd need something in the kernel that notifies userland
when any inode on a filesystem changes. Have a look at the intermezzo
module if you want to go down that path. If you can provide the kernel
half, I'll be more than happy to extend unify-dirs to work with it :).
Failing active monitoring, as a simple compromise there's no reason that
unify-dirs couldn't optionally store its internal inode/stat/SHA1 hash
cache in a Berkeley database, and run the script every hour or so via
cron. It would certainly prevent the copious stat()'ing that the script
does, at the expense of not noticing unlikely unification situations
until the DB cache entries expire.
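
A rough Python sketch of such a persistent cache follows. It uses the standard dbm module as a stand-in for a Berkeley database, and the names cached_sha1, max_age and the stored record layout are assumptions for this example, not anything unify-dirs actually does. While an entry is younger than max_age it is trusted without touching the filesystem at all, which is where the saved stat() calls come from; once it expires, the file is stat()'ed again and rehashed only if inode, size or mtime changed.

```python
import dbm
import hashlib
import json
import os
import time


def sha1_of(path):
    """SHA1 of the whole file, read in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def cached_sha1(db, path, max_age=3600, now=None):
    """Return the SHA1 of path, using db (a dbm-style mapping) as a cache."""
    now = time.time() if now is None else now
    key = os.fsencode(path)
    entry = json.loads(db[key]) if key in db else None

    # Fresh entry: trust it, with no stat() and no read.
    if entry is not None and now - entry["checked_at"] < max_age:
        return entry["sha1"]

    # Expired or missing: stat once, rehash only if the file changed.
    st = os.lstat(path)
    fingerprint = [st.st_ino, st.st_size, st.st_mtime_ns]
    if entry is not None and entry["stat"] == fingerprint:
        digest = entry["sha1"]
    else:
        digest = sha1_of(path)

    record = {"stat": fingerprint, "sha1": digest, "checked_at": now}
    db[key] = json.dumps(record).encode("utf-8")
    return digest
```

A cron-driven run would open the cache with something like dbm.open("/some/cache/path", "c") and call cached_sha1 for each candidate file; the path here is purely illustrative.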
Of course, it would still absolutely hammer the VFS every time it runs
with readdir() calls and find all those glorious reiserfs corner case
bugs, but in my experience with a "handful" (say, 30) of vservers that
are already mostly unified the script completes in under a minute when
unifying just the OS (e.g., /usr, /lib, /sbin and /bin).
Who knows, maybe there are other optimizations possible - like only
stat()'ing the leaf directories in the hierarchy, to see if any files
have been added or removed before actually using readdir() to read them.
Again this will not catch some unlikely unification situations until
full stat()'ing happens.
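
The directory-level check could look roughly like the Python sketch below. The helper names snapshot_dirs and dirs_to_rescan are hypothetical. It relies on the usual POSIX behaviour that a directory's mtime changes when an entry is added, removed or renamed in it, but not when an existing file is rewritten in place, which is exactly the kind of change this shortcut misses until the next full scan.

```python
import os


def snapshot_dirs(root):
    """Full walk: remember the mtime of every directory under root."""
    snapshot = {}
    for dirpath, _dirnames, _filenames in os.walk(root):
        snapshot[dirpath] = os.stat(dirpath).st_mtime_ns
    return snapshot


def dirs_to_rescan(snapshot):
    """One stat() per known directory; return those that look changed."""
    stale = []
    for dirpath, old_mtime in snapshot.items():
        try:
            mtime = os.stat(dirpath).st_mtime_ns
        except FileNotFoundError:
            stale.append(dirpath)  # directory vanished since the snapshot
            continue
        if mtime != old_mtime:
            stale.append(dirpath)  # entries were added, removed or renamed
    return stale
```

Only the directories returned by dirs_to_rescan would then need readdir() and per-file stat() calls on the next run.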
Vserver mailing list