
From: Jörn Engel (joern_at_wohnheim.fh-wedel.de)
Date: Thu 14 Oct 2004 - 13:16:14 BST


On Wed, 13 October 2004 22:29:35 +0200, Herbert Poetzl wrote:
> On Wed, Oct 13, 2004 at 06:22:37PM +0200, Jörn Engel wrote:
> > On Wed, 13 October 2004 16:29:05 +0200, Herbert Poetzl wrote:
> > > On Tue, Oct 12, 2004 at 06:12:57PM +0200, Jörn Engel wrote:
> >
> > New-variant cowlinks are closer to symlinks than anything else. Like
> > symlinks they allocate an extra inode per link. Like fast symlinks
> > for ext[23] they store the link information in the inode itself.
> >
> > Still, don't think of it as a symlink, it's not. Close, but
> > different.
>
> would the following setup
>
> H1,H2 -----> C1 --,-> I1
> H3,H4,H5 --> C2 -´
>
> (see I'm capable of ASCII art too ;)
>
> have a) one, b) two or c) three inodes in
> the inode cache after H1-H5 has been read from?

c)

> > Cowlinks have a link count of n for n hard links to them.
> >
> > The link count to the real inode could simply be ignored, but it can
> > also be used for optimization. It's n for n cowlinks pointing to the
> > inode. (Assuming no direct hard links to the file. Direct hard links
> > would have interesting properties as well, but they complicate things
> > for now.)
> >
> > Doing the cow_copy on a regular file takes two steps:
> > 1. Turn the file into a cowlink/inode aggregate.
> > 2. Add a second cowlink to the picture.
> >
> > With above ascii-art, it should look like this:
> > Before:
> > H1 -----> I1
> > After 1:
> > H1 -----> C1 -----> I1
> > After 2:
> > H1 -----> C1 ---+-> I1
> > H2 -----> C2 --/
> >
> > I1 has a link count of 2.
> >
> > Writing to H2 will cause the cowlink to be broken, so we have this:
> > H1 -----> C1 -----> I1
> > H2 -----> C2 -----> I2
> >
> > So now we end up with two distinct files with no relationship
> > whatsoever. I1 and I2 both have a link count of 1 to reflect that
> > fact. Therefore we can optimize above picture a bit:
> > H1 -----> I1
> > H2 -----> I2
>
> hmm, okay, but when would this optimization be done?
> a) - for I2 at the time we write to it?
> b) - when the last file handle is closed?
> c) - by some userspace daemon who scans the fs?

Not sure. I added letters to your list for easy reference.

a) would hurt if soon after the write another cow_copy is done. Step
1 of cow_copy would simply revert the "optimization", so overall we
simply burned cpu (and possibly io) cycles.

b) can have the same problem. Imo a) seems more natural than b), so
I'd either pick a) or none of the two.

c) makes sense, simply because cowlinks could use a userspace daemon
anyway. Unaware users, security updates on n servers and pure luck
can all create identical files that are not cowlinked. Detecting them
is best done in userspace, so we have a daemon. It could easily do
this optimization as well.

Said daemon isn't simple, though. First off, it needs notification
for any file changes on the whole fs. Not sure if inotify is
sufficient to do this. Second it needs a database of all known files.
With huge amounts of files, this database could grow pretty large
itself. And third, there is no race-free way to combine two identical
files to cowlinks in userspace. So a kernel helper function is
needed - yet another syscall. For above optimization the same is
true. Not sure how happy the kernel gods would be about all this.
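
The detection half of such a daemon can be sketched in userspace. The
following is a minimal illustration, not part of any existing tool: the
names find_identical_files and sha256_of are hypothetical, and it assumes
that equal size plus equal SHA-256 digest is good enough to nominate
candidates. It only finds candidates; merging them into cowlinks would
still need the kernel helper mentioned above, because files can change
between the scan and the merge.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_identical_files(root):
    """Return sorted groups of paths under root with equal size and digest.

    Hypothetical helper: paths that are already hard links to one inode
    are not filtered out here.
    """
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Only regular files are considered; symlinks are skipped.
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            by_digest[(size, sha256_of(path))].append(path)

    return sorted(sorted(group) for group in by_digest.values() if len(group) > 1)
```

Grouping by size first avoids hashing files that cannot have a twin,
which matters on a filesystem with a huge number of files.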

> > Going back to the direct hard link issue, Ted had some interesting use
> > for this. Maybe you can comment on it:
> > H0------------\
> > H1 -----> C1 --\
> > H2 -----> C2 ---+-> I1
> > H3 -----> C3 --/
> > H4 -----> C4 -/
> >
> > Above is a scenario where four virtual servers share the same file,
> > each having a cowlink to it so noone steps on the others toes. File
> > in question is called "sshd". Now an exploit for sshd becomes known
> > and the developers release a security fix.
> >
> > Traditionally all four servers need to get an update. This would
> > break all cowlinks, so space is used four times. Worse, it requires
> > the administrator(s) to do the update four times. Ted sees an
> > alternative to this.
> >
> > Instead, the direct hard link "H0" can be used as a backdoor to update
> > all copies of sshd simultaneously and keep the space benefit.
>
> not a good idea IMHO, it might sound nice in this
> example but usually things are more complicated:
>
> - you do not update ssh but the sshd package
> - the new sshd requires a new ssl
> - scripts have to be run to create the new xsa key
>
> and the fact that you do it four times is not really
> an issue, provided that you can reunify the result
> over all vservers (which is how it is currently done
> with linux vserver's iunlink)

So the idea is officially shot down. Good, keeps things simpler.

> > > > o Writing to any hard link will write to all hard links, as always.
> > >
> > > but it will also break the COW link into
> > > two segments, right?
> > >
> > > example:
> > >
> > > H1,H2 --> C1 --> I1
> > > H3,H4 --> C2 --> I1
> > >
> > > write to H3 will result in
> > >
> > > H1,H2 --> C1 --> I1
> > > H3,H4 --> C2 --> I2
> > >
> > > what about link count of H1 before and after?
> >
> > H1 doesn't have any link count. Only C* and I* do.
>
> yeah, sorry that was what I meant
>
> > Before: C1:2, C2:2, I1:2
> > After: C1:2, C2:2, I1:1, I2:1
>
> okay, that's what I thought ...
>
> so an open without MAY_WRITE would actually get
> inode I1, where one with will get C1 or C2
> or am I wrong here?

Not wrong, just confused. This is the most complicated part of it
all. ;)

Let me start with something simpler: fstat(). It returns a struct
with 13 fields. For regular files on a real unix filesystem, all 13
fields come from the inode. For cowlinks, they come from two separate
inodes.

st_ino, st_mode, st_nlink, st_uid, st_gid, st_atime, st_mtime and
st_ctime all come from C1.

st_size and st_blocks come from I1.

st_dev, st_rdev and st_blksize could come from both.

So in the end, I1 knows about the file _contents_. Size and number of
blocks belong to the contents. C1 knows about the file _status_, i.e.
its owner, group, access mode, access times etc.
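
The split described above can be written down as a small model. This is
only a sketch in Python, not kernel code: the classes CowlinkInode and
ContentInode and the function cow_fstat are hypothetical names, and the
three fields that "could come from both" are passed in as per-filesystem
values with made-up defaults.

```python
from dataclasses import dataclass


@dataclass
class ContentInode:
    """I1: knows about the file contents."""
    st_size: int
    st_blocks: int


@dataclass
class CowlinkInode:
    """C1: knows about the file status and points at the contents."""
    st_ino: int
    st_mode: int
    st_nlink: int
    st_uid: int
    st_gid: int
    st_atime: int
    st_mtime: int
    st_ctime: int
    target: ContentInode


def cow_fstat(c, st_dev=0, st_rdev=0, st_blksize=4096):
    """Assemble the 13 stat fields from a cowlink and its content inode."""
    i = c.target
    return {
        # Status fields: taken from the cowlink inode.
        "st_ino": c.st_ino, "st_mode": c.st_mode, "st_nlink": c.st_nlink,
        "st_uid": c.st_uid, "st_gid": c.st_gid, "st_atime": c.st_atime,
        "st_mtime": c.st_mtime, "st_ctime": c.st_ctime,
        # Content fields: taken from the shared content inode.
        "st_size": i.st_size, "st_blocks": i.st_blocks,
        # Could come from either inode; assumed filesystem-wide here.
        "st_dev": st_dev, "st_rdev": st_rdev, "st_blksize": st_blksize,
    }
```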

For open I'm not sure yet. I think an ro open should get I1 and an
r/w open should get I2. Simply because after the file is opened,
no one cares about owner, group etc. anymore. If there is some
remaining corner case I missed, it will have to be more complicated.

This, btw, brings us to the worst part of cowlinks. Take this simple
example, using your enhanced ascii art. ;)

    C1 ---,-> I1
    C2 --´

A program has this example pseudo-code:
1: fd1 = open(C2, ro);
2: fd2 = open(C2, rw);
3: write(fd2, "foo");
4: fd3 = open(C1, rw);
5: write(fd3, "bar");
6: read(fd1, ...);

Obviously since fd1 and fd2 point to the same file, we expect to read
"foo", not "bar". But let's go through step by step.

Before:
    C1 ---,-> I1 ""
    C2 --´
    fd1 -> empty
    fd2 -> empty
    fd3 -> empty

After 1:
    C1 ---,-> I1 ""
    C2 --´
    fd1 -> I1
    fd2 -> empty
    fd3 -> empty

After 2:
    C1 -----> I1 ""
    C2 -----> I2 ""
    fd1 -> I1
    fd2 -> I2
    fd3 -> empty

After 3:
    C1 -----> I1 ""
    C2 -----> I2 "foo"
    fd1 -> I1
    fd2 -> I2
    fd3 -> empty

After 4:
    C1 -----> I1 ""
    C2 -----> I2 "foo"
    fd1 -> I1
    fd2 -> I2
    fd3 -> I1

After 5:
    C1 -----> I1 "bar"
    C2 -----> I2 "foo"
    fd1 -> I1
    fd2 -> I2
    fd3 -> I1
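
The trace above can be replayed with a toy model. This is a sketch under
stated assumptions, not real VFS behaviour: Inode, Cowlink, naive_open and
replay_naive are hypothetical names, a descriptor is simply the content
inode it was bound to at open time, and a write replaces the whole
contents.

```python
class Inode:
    """Content inode (I1, I2, ...): data plus the number of cowlinks to it."""
    def __init__(self, data=b""):
        self.data = data
        self.nlink = 0


class Cowlink:
    """Cowlink inode (C1, C2, ...): points at one content inode."""
    def __init__(self, target):
        self.target = target
        target.nlink += 1


def naive_open(cowlink, writable):
    """Bind the descriptor to a content inode once, at open time."""
    if writable and cowlink.target.nlink > 1:
        old = cowlink.target
        old.nlink -= 1
        cowlink.target = Inode(old.data)  # break the cowlink: private copy
        cowlink.target.nlink = 1
    return cowlink.target


def replay_naive():
    i1 = Inode()
    c1, c2 = Cowlink(i1), Cowlink(i1)
    fd1 = naive_open(c2, writable=False)  # 1: fd1 -> I1
    fd2 = naive_open(c2, writable=True)   # 2: cow break, C2 -> I2
    fd2.data = b"foo"                     # 3
    fd3 = naive_open(c1, writable=True)   # 4: I1 is no longer shared
    fd3.data = b"bar"                     # 5
    return fd1.data                       # 6
```

In this model replay_naive() returns b"bar": fd1 stays bound to I1 even
though C2, the file it was opened through, has moved on to I2.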

At this point, 6: would read "bar". Clearly we did something wrong.
But how would it be done correctly? There are two options:

a) The open in 4: would again cow the file away:
After 4:
              I1 ""
    C1 -----> I3 ""
    C2 -----> I2 "foo"
    fd1 -> I1
    fd2 -> I2
    fd3 -> I3
We end up with a dangling inode I1. It would have to get killed after
fd1 is closed. If the system isn't shut down cleanly, we'd still have
allocated space for I1 on the fs, so an fsck should be done now and
then. That's all nice. But reading from fd1 still doesn't give us
"foo", it gives us the old contents "".

b) fd1 needs to keep some knowledge about C2. When C2 is COWed away,
all pages mapped into fd1 need to be discarded and paged in again.
When reading from fd1, we notice that it's now I2 to read from and
successfully read "foo".
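
Option b) can be illustrated with a toy model in which a descriptor
remembers the cowlink it was opened through and resolves the content
inode on every access. This is only a sketch: FileDesc, Cowlink, Inode and
replay_option_b are hypothetical names, and resolving at read time stands
in for the page discard and re-read, which is not modelled.

```python
class Inode:
    """Content inode: data plus the number of cowlinks pointing at it."""
    def __init__(self, data=b""):
        self.data = data
        self.nlink = 0


class Cowlink:
    """Cowlink inode: points at one content inode."""
    def __init__(self, target):
        self.target = target
        target.nlink += 1

    def break_cow(self):
        """Give this cowlink a private copy if its inode is shared."""
        if self.target.nlink > 1:
            self.target.nlink -= 1
            self.target = Inode(self.target.data)
            self.target.nlink = 1


class FileDesc:
    """Descriptor that keeps the cowlink, not the content inode."""
    def __init__(self, cowlink, writable):
        self.cowlink = cowlink
        if writable:
            cowlink.break_cow()

    def read(self):
        return self.cowlink.target.data  # resolved at access time

    def write(self, data):
        self.cowlink.target.data = data


def replay_option_b():
    i1 = Inode()
    c1, c2 = Cowlink(i1), Cowlink(i1)
    fd1 = FileDesc(c2, writable=False)  # 1
    fd2 = FileDesc(c2, writable=True)   # 2: C2 -> I2, fd1 follows C2
    fd2.write(b"foo")                   # 3
    fd3 = FileDesc(c1, writable=True)   # 4: I1 is no longer shared
    fd3.write(b"bar")                   # 5
    return fd1.read()                   # 6
```

Here replay_option_b() returns b"foo", the expected result, because fd1
follows C2 to I2 instead of staying pinned to I1.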

So it looks as if b) is the way to go. But that means that I was
really confused before and open needs to know about the cowlink. It
also means that I have to fiddle in mm/. So now I don't only get Al
Viro mad at me, Andrea, Rik and the other vm gods will start to curse
my name as well. Time to order a new asbestos suit. ;)

Comments on this?

Jörn

-- 
To my face you have the audacity to advise me to become a thief - the worst
kind of thief that is conceivable, a thief of spiritual things, a thief of
ideas! It is insufferable, intolerable!
-- M. Binet in Scaramouche
_______________________________________________
Vserver mailing list
Vserver_at_list.linux-vserver.org
http://list.linux-vserver.org/mailman/listinfo/vserver


Generated on Thu 14 Oct 2004 - 13:16:47 BST by hypermail 2.1.3