From: Christian (chth_at_gmx.net)
Date: Fri 15 Nov 2002 - 10:57:40 GMT
> second, do not let you stop by anything, or anybody!
> (the vserver list seems good at stopping people from
> doing useful things ...)
This is one of the friendliest lists I am on.
> and finally I've come to the conclusion, we are both
> (or at least I am) bad in explaining things, so I will
> try to shed light from different angles ...
Me too :)
> How I see the process:
> a) selecting files on an appropriate basis
> (including find syntax, file lists, patterns, etc)
> b) comparing/sorting the files per size
> (this might be done with your bucket algorithm)
> c) comparing files of equal size against each other
> (you can't assume that files of equal size are
> the same, so you'll have to compare them)
> d) unifying all files found to be equal
a) recursing through the filesystem (while filtering) and storing the
file data in an ordered set with the size as key, so we have a size-sorted
set of candidate files after this step
b) not necessary, it is done in a)
c) comparing files of equal size against each other .. that's what the
bucket algorithm does, in a smarter way than 'each file against each other'
d) each bucket with more than one file can be unified; it will rather be
integrated into c) to save on memory requirements
> Some ideas I had regarding this process:
> - why a brand new selection/pattern syntax if
> find probably already does what you want?
find cannot select on file attributes. Someone probably doesn't want to
unify files which are marked 'immutable_file' instead of 'immutable_link'.
> - what about external knowledge, in the form of
> include/exclude lists?
Maybe later. I am thinking about external config files which can contain
include/exclude patterns and other options.
But at the current point I would simply like it to work first.
> - why not generate some hash value for the files
> (in step c), so they could be compared instead
> of the files ...
Generating a hash: iterate through a file and do a moderately expensive
computation on every block.
My bucket algo: iterate through a file and do a cheap comparison.
We need to iterate through the entire file anyway, and that is the most
expensive part. Hashes have a (microscopically) small chance of failure
while being more expensive to compute than the bucket thing.
I don't intend to use hashes.
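The trade-off can be made concrete with a small sketch (illustrative Python, function names are mine): a hash must always read and digest the whole file, while a direct comparison can return as soon as two blocks differ:

```python
# Contrast: full-file hash vs. early-exit block comparison.
import hashlib

def hash_file(path, blocksize=64 * 1024):
    """Always reads the whole file, plus SHA-1 work on every block."""
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(blocksize), b''):
            h.update(block)
    return h.digest()

def files_equal(a, b, blocksize=64 * 1024):
    """Plain memcmp-style check; stops at the first differing block,
    and cannot give a false positive the way a hash collision can."""
    with open(a, 'rb') as fa, open(b, 'rb') as fb:
        while True:
            ba, bb = fa.read(blocksize), fb.read(blocksize)
            if ba != bb:
                return False
            if not ba:
                return True
```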
> - maybe one can store the hashes of once unified
> files (together with the file name, location,
> creation time, etc) and reuse this information
I earlier mentioned using db3 as temporary file-data storage; this could
be extended into a persistent store, but I don't see much use in that:
since I don't keep hashes, and you need to scan the filesystem anyway to
find modified files, you would need to recalculate the hashes of the
modified files anyway.
> Some (might be) useful information:
> - be careful about filesystem change (-xdev)
> - avoid/block recursive/broken links
Links (and other special files) will be completely ignored; only
directories are used for the recursion and plain files for unification.
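That traversal rule can be sketched like this (illustrative Python, my own naming): stat without following symlinks, recurse only into real directories, and collect only regular files:

```python
# Follow only real directories, collect only plain files,
# and never follow symlinks -- so link loops cannot occur.
import os
import stat

def walk_plain_files(root):
    for entry in os.scandir(root):
        st = entry.stat(follow_symlinks=False)
        if stat.S_ISDIR(st.st_mode):
            yield from walk_plain_files(entry.path)
        elif stat.S_ISREG(st.st_mode):
            yield entry.path
        # symlinks, devices, fifos, sockets: ignored entirely
```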
> - do not modify/touch the files (timestamps)
Maybe I will (optionally). Example: I have 2 debian/woody installations
which are currently not unified and are updated independently; when this
new vunify is finished I want to unify them!
> - do not assume (virtual) memory is unlimited
mmap only consumes address space (which is limited too); the only big
amount of memory I need is for the file-metadata storage, which might grow
to some tens or hundreds of megabytes. That's where I'm thinking about
db3, if problems show up. But I think no one has a server so big with so
little memory that it will be a problem ...
> - if you want it fast, code in C
This thing is so much affected by disk I/O that it will spend most of its
time waiting for disks; I guess only 0.5-1% is spent in the program
itself. So even a very bad language which is 10 times slower than C would
only make the program 5-10% slower.
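The arithmetic behind that estimate can be checked quickly. Using the mail's own numbers (0.5-1% CPU time, 10x language penalty; these are estimates, not measurements), the total runtime grows by roughly 4.5-9%:

```python
# If only a fraction p of wall-clock time is CPU, a 10x slower
# language stretches only that fraction; the I/O part is unchanged.
def slowdown(cpu_fraction, language_penalty=10):
    io = 1.0 - cpu_fraction
    return io + cpu_fraction * language_penalty  # new total runtime

# p = 0.5% .. 1% CPU  ->  total runtime grows by ~4.5% .. 9%
assert abs(slowdown(0.005) - 1.045) < 1e-9
assert abs(slowdown(0.01) - 1.09) < 1e-9
```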