From: Christian (chth_at_gmx.net)
Date: Thu 14 Nov 2002 - 23:38:41 GMT
On Thu, 14 Nov 2002 10:32:06 +0100
Herbert Poetzl <herbert_at_13thfloor.at> wrote:
> On Thu, Nov 14, 2002 at 01:19:45AM +0000, ragnar_at_this.is wrote:
> > Hi,
> > > > because you might run in an O(n^2) issue ...
> > >
> > > I don't really care this is not a performance important task you
> > > might run
> > > it once a month and it can take many hours, no problem and the above
> > > algorithm might be somewhere in O(n) maybe little worse.
> > I do not see a performance problem.
> somebody will care, if a memory mapping process
> goes wild on thousands of files, making millions
> of comparisons for several hours *G*
Huh? There are not many files which become unify candidates by matching all
filters and having exactly the same size.
Also, I said 'mmap() reasonably many' files; I am thinking of "map as many
files as comfortably fit in the free address space", which might not be
much more than 1 GB on a normal x86 ... see below.
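As a rough illustration of what I mean by "map as many as fit" (a minimal
sketch, untested; map_file is a helper name I made up for this mail, not
existing code):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Map one candidate file read-only; returns NULL when it cannot
     * be mapped, e.g. when the free address space is used up (ENOMEM).
     * The caller would then compare the files mapped so far, munmap()
     * them, and continue with the next batch.  len must be > 0. */
    static void *map_file(const char *path, size_t len)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return NULL;
        void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);              /* the mapping stays valid after close() */
        return p == MAP_FAILED ? NULL : p;
    }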
> how to compare 5 files?
> - compare 1st with number 2-5
> - compare 2nd with number 3-5
> - compare 3rd with number 4-5
> - and compare the last two
> makes a total of (4+3+2+1)=10 comparisons
Algorithm (just from scratch, hopefully no one has patented it):
We put all same-size candidate files together in one bucket and
start an iterator from 1 to size_of_files.
At each step we compare the byte at the iterator position of the first
file in the bucket with the byte at the same position of all other files.
Each time we find a value different from that of the first file in the
bucket, we check whether there is another bucket we 'just created' with
the same value at that position; if so, we move the file there, otherwise
we create a new bucket, move the file there, and remember this bucket as
'just created'. The 'just created' list is reset each time the iterator
is incremented. (A sketch in C follows below.)
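Here is a minimal sketch of that in C (untested, written just for this
mail; it uses plain stdio streams instead of mmap() so the lockstep byte
reads are obvious, and all struct and function names are made up):

    #include <stdio.h>
    #include <stdlib.h>

    struct file {
        FILE *fp;           /* open stream; all files have size 'size' */
        struct file *next;
    };

    struct bucket {
        struct file *files;
        struct bucket *next;
    };

    static struct bucket *new_bucket(struct bucket **list)
    {
        struct bucket *b = calloc(1, sizeof *b);
        if (!b) { perror("calloc"); exit(1); }
        b->next = *list;
        *list = b;
        return b;
    }

    /* Split a list of buckets (all files of equal size 'size') into
     * buckets of byte-identical files.  Files are assumed to be exactly
     * 'size' bytes, so fgetc() never hits EOF inside the loop. */
    static struct bucket *split_identical(struct bucket *buckets, long size)
    {
        for (long pos = 0; pos < size; pos++) {
            struct bucket *fresh = NULL;  /* buckets created at this pos */

            for (struct bucket *b = buckets; b; b = b->next) {
                /* 'just created' table, reset for every bucket at
                 * every position (i.e. each iterator increment) */
                struct bucket *just_created[256] = { NULL };

                /* a bucket with 0 or 1 files needs no comparing */
                if (!b->files || !b->files->next)
                    continue;

                int ref = fgetc(b->files->fp);  /* first file's byte */
                struct file **pp = &b->files->next;
                while (*pp) {
                    struct file *f = *pp;
                    int c = fgetc(f->fp);       /* same pos, lockstep */
                    if (c == ref) {
                        pp = &f->next;
                        continue;
                    }
                    /* different value: unlink f and move it into the
                     * 'just created' bucket for byte value c, creating
                     * that bucket on first use at this position */
                    *pp = f->next;
                    if (!just_created[c])
                        just_created[c] = new_bucket(&fresh);
                    f->next = just_created[c]->files;
                    just_created[c]->files = f;
                }
            }

            /* splice the fresh buckets in only after the position is
             * done, so they are not scanned twice at the same offset */
            if (fresh) {
                struct bucket *tail = fresh;
                while (tail->next)
                    tail = tail->next;
                tail->next = buckets;
                buckets = fresh;
            }
        }
        return buckets;  /* every bucket now holds identical files */
    }

Each file's stream only ever moves forward, one byte per position, so
every byte of every candidate is read at most once.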
So we need to scan each file only once; I would call that O(n). The
comparison cost is almost nothing compared to disk access, and it is even
faster than calculating any checksum. There are very few cases where not
all files map at once, and a few optimizations apply (buckets with only
one file don't need to be compared with anything else...). mmap was my
intention for easy file handling, which is more of an implementation
detail; reading char- or block-wise, or mmap'ing blocks instead of whole
files, will do just fine. (Well, I must admit that accessing many files
in parallel will be worse than accessing many files in serial order, if
the files are not fragmented.)
> cksum (for example) would map each file only once,
> and produce a hash value, which could be compared
> in the same way as planned, but much much faster,
> equally fast like the size sort/comparison ...
> please correct me if I'm wrong ...
Please correct me if I'm wrong ... make it better. I want a) a solution
for unifying vservers and b) a good one. That doesn't imply that mine will
be the best ... but someone has to start. Well, there is the Perl version;
I'll check it later, but it looks much more limited than my proposal since
it has no file selection at all (and can run into symlink races).
I'll try to start coding on it within the next few days; let's see then.