RE: [vserver] Capabilities in Vserver Kernels

From: Joe Gooch <mrwizard_at_k12system.com>
Date: Mon 16 Jun 2008 - 23:59:52 BST
Message-ID: <A8B39189E548604792E5213D18A9DB62190F539F@deagol.win.k12system.com>

> -----Original Message-----
> From: Daniel Hokka Zakrisson [mailto:daniel@hozac.com]
> Sent: Monday, June 16, 2008 6:05 PM
> To: Joe Gooch
> Cc: vserver@list.linux-vserver.org
> Subject: RE: [vserver] Capabilities in Vserver Kernels
>
> Joe Gooch wrote:
> > See inline.
> >
> > Joseph Gooch
> > Sapphire Suite Product Manager
> > K12 Systems, Inc.
> > (866) 366-9540
> >
> >
> >> -----Original Message-----
> >> From: Daniel Hokka Zakrisson [mailto:daniel@hozac.com]
> >> Sent: Monday, June 16, 2008 4:27 PM
> >> To: Joe Gooch
> >> Cc: vserver@list.linux-vserver.org
> >> Subject: Re: [vserver] Capabilities in Vserver Kernels
> >>
> >> Joe Gooch wrote:
> >> > I have diagnosed what I believe to be an incorrect use of capabilities
> >> > in virtual contexts. I believe the vx_mask_cap_bset function in
> >> > kernel/vserver/context.c should mask based on vx_bcaps, not vx_cap_bset.
> >> >
> >> > Rationale:
> >> > vx_cap_bset is the capabilities set from /proc/sys/kernel/cap-bound,
> >> > which is global for the system, not by VM. Using vx_bcaps allows it
> >> > to be set by VM.
> >>
> >> Uh, no, it's the other way around. vx_bcaps can only be set from the
> >> host.
> >> The purpose of vx_cap_bset is to allow the guest to set
> >> /proc/sys/kernel/cap-bound on its own, independently from the host
> >> (obviously not to raise it).
> >>
> >> (Also, VM is really not the right term. It's not virtual, and it's
> >> not a machine. :-))
> >
> > VS then. :)
> >
> > OK, well if that's the intention, it's still not working.
> >
> > Host# cat /proc/sys/kernel/cap-bound
> > 128
> >
> > (128 equates to CAP_SETUID)
> >
> > Host# vserver support exec cat /proc/sys/kernel/cap-bound
> > 128
> >
> > Host# cat /etc/vservers/support/bcapabilities
> > ~CAP_SETUID
>
> That is perfectly fine. We don't mask vx_bcaps until it's
> used, which means software like BIND9 works without
> modifications in a guest, as well as gives you the ability to
> dynamically change the capabilities assigned to a guest. So
> that guest won't be able to do anything requiring CAP_SETUID...

Ok.. That's fine. But my patch accomplishes the same thing.
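
To make the "masked only when used" point concrete, here is a minimal sketch in Python, not kernel code. The bit numbers are the ones from linux/capability.h (CAP_SETUID is bit 7, so a cap-bound of 128 is exactly CAP_SETUID). The helper names decode and guest_capable are hypothetical, and the check is only my reading of the behaviour described above: the per-process set is left alone and vx_bcaps is ANDed in at permission-check time.

```python
# Capability bit numbers as defined in linux/capability.h (subset only).
CAPS = {
    "CAP_CHOWN": 0,
    "CAP_SETGID": 6,
    "CAP_SETUID": 7,
    "CAP_NET_BIND_SERVICE": 10,
    "CAP_SYS_BOOT": 22,
    "CAP_LEASE": 28,
    "CAP_AUDIT_WRITE": 29,
}

ALL = (1 << 31) - 1  # bits 0..30


def decode(mask):
    """Return the known capability names whose bits are set in mask."""
    return sorted(name for name, bit in CAPS.items() if mask & (1 << bit))


def guest_capable(effective, vx_bcaps, cap):
    """Hypothetical use-time check (an assumption, not the kernel source):
    the process must hold the capability AND vx_bcaps must allow it."""
    bit = 1 << CAPS[cap]
    return bool(effective & vx_bcaps & bit)


# Simplified bcapabilities of "~CAP_SETUID": everything except CAP_SETUID.
# (A real guest also loses the capabilities vserver drops by default.)
vx_bcaps = ALL & ~(1 << CAPS["CAP_SETUID"])

print(decode(128))                                  # names set in cap-bound 128
print(guest_capable(128, vx_bcaps, "CAP_SETUID"))   # held, but masked at use
```

So the guest can still show 128 in cap-bound while being unable to use CAP_SETUID, which is the behaviour I agreed is fine.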

> > In fact, in examining the code, vx_cap_bset is only ever set from the
> > calling context. It's never pulled from anywhere else, as far as I
> > can tell. Bcapabilities are dumped directly into vx_bcaps.
>
> ... without the patch in my email, yes.

You didn't change clone or fork behaviour with your patch, did you?

> > As a side note, I spent a large amount of time searching for the
> > proper way to use bcapabilities and ncapabilities and such, but could
> > not find an explanation of what these were or what they were intended
> > to be used for.
> > It could be part of the wiki restructuring; I'm not sure. So my
> > knowledge in this regard is based on the code without context.
>
> http://linux-vserver.org/Capabilities_and_Flags (which is
> linked from http://linux-vserver.org/Documentation).

The documentation describes the Linux capability model incorrectly.

----
The inheritable capabilities are the capabilities of the current process that should be inherited by a program executed by the current process. The permitted set of a process is masked against the inheritable set during exec(). Nothing special happens during fork() or clone(). Child processes and threads are given an exact copy of the capabilities of the parent process.
----
This is untrue. Each fork or clone DOES add capabilities from /proc/sys/kernel/cap-bound into the effective and permitted sets of euid 0 processes
(as long as SECURE_NOROOT is not set, which is the default).
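
For what it's worth, the exec() side of this is the rule documented in capabilities(7) for kernels without file capabilities. Below is a rough Python model of it as I understand it. The function name exec_caps and the euid-only root test are simplifications of mine, not kernel code, and the sketch models exec only; it does not attempt to model fork or clone.

```python
CAP_SETUID = 1 << 7
CAP_NET_BIND_SERVICE = 1 << 10
ALL = (1 << 31) - 1  # bits 0..30


def exec_caps(p_inheritable, cap_bound, euid=0, secure_noroot=False):
    """Model the exec() transformation from capabilities(7):

        P'(permitted)   = (P(inheritable) & F(inheritable)) | (F(permitted) & cap_bound)
        P'(effective)   = P'(permitted) if F(effective) else 0
        P'(inheritable) = P(inheritable)

    Superuser emulation (assumed here to depend on euid only): for root,
    the file sets F(permitted) and F(inheritable) are all ones and
    F(effective) is on, unless SECURE_NOROOT is set.
    Returns (permitted, effective, inheritable) as bitmasks.
    """
    root = (euid == 0) and not secure_noroot
    f_permitted = ALL if root else 0
    f_inheritable = ALL if root else 0
    f_effective = root

    permitted = (p_inheritable & f_inheritable) | (f_permitted & cap_bound)
    effective = permitted if f_effective else 0
    return permitted, effective, p_inheritable


cap_bound = CAP_SETUID  # /proc/sys/kernel/cap-bound = 128

# Parent keeps CAP_NET_BIND_SERVICE inheritable: the exec'd root child has it.
kept = exec_caps(CAP_NET_BIND_SERVICE, cap_bound)

# Parent dropped it from its inheritable set: the child only gets cap-bound.
dropped = exec_caps(0, cap_bound)

print(kept, dropped)
```

With cap-bound at 128, the inheritable set alone decides whether a root child ends up with CAP_NET_BIND_SERVICE, which is the behaviour I rely on.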
----
Because the current Linux Capability system does not implement the filesystem related portions of POSIX Capabilities which would make setuid and setgid executables secure, and because it is much safer to have a secure upper bound for all processes within a context, an additional per-context capability mask has been added to limit all processes belonging to that context to this mask. The meaning of the individual caps (bits) of the capability bound mask is exactly the same as with the permitted capability set.
----
This is also fine; however, /proc/sys/kernel/cap-bound does NOT provide this functionality.  It only bounds the superuser emulation, i.e. caps received on euid 0 fork/exec/clone.

I think you're responding to a question like "Bcaps isn't bounding my processes".  That's not what I'm reporting.  I'm reporting that the guest is not handling the *inheritable* set properly, based on the way the kernel works with superuser emulation.

I need a guest to have CAP_NET_BIND_SERVICE, or else lighttpd can't get a port.  But once it does, and I program it to drop CAP_NET_BIND_SERVICE from its inheritable set, I need ANY forked, cloned, or exec'd process, even one at euid 0, through setuid or otherwise, to NOT have CAP_NET_BIND_SERVICE.

So my question is actually "Why don't vservers honor the inheritable set like the host does?  Why instead is cap-bound, which is only meant to bound SUPERUSER EMULATION, being used as a per-process mask?  That's what bcapabilities is supposed to be for."

If bcapabilities is instead supposed to be the cap-bound value, but you're ALSO using it to mask out permitted capabilities, that's wrong.  Not because of your model, but because it makes what I'm trying to do impossible.  And it works on the host, and on any Linux kernel, so it should also work in a guest.

> > [...]
> > So now, vserver.  I don't have an example of what it does unmodified
> > (since I'm using this patch in production.  And it does work.  It
> > makes the vserver behaviour consistent with the parent kernel), but
> > here's what it does modified.
> >
> > Host /proc/sys/kernel/cap-bound is 128
> > Bcapabilities is ~CAP_SYS_BOOT and ~CAP_AUDIT_WRITE (in addition to
> > the others vserver drops):
> >
> > # vserver support exec getpcaps =
> > Capabilities for `=': =
> > cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_net_bind_service,cap_sys_chroot,cap_sys_ptrace,cap_sys_tty_config,cap_lease+eip
> >
> > My new root process is limited by bcapabilities...
> >
> > #  execcap ' =eip cap_lease,cap_net_bind_service,cap_setpcap-eip ' vserver support exec getpcaps =
> > Capabilities for `=': =
> > cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_sys_chroot,cap_sys_ptrace,cap_sys_tty_config+eip
> >
> > But it's ALSO bounded by inheritable sets.
>
> None of these are interesting, as they all show the
> per-process values. If you'd actually try to use a capability
> that's not in vx_bcaps, you'd get EPERM, as expected, since
> you need to have _both_. /proc/<pid>/status shows the masked
> values, i.e. what's actually used by permission checks.

No, it wouldn't, because CAP_NET_BIND_SERVICE is NOT in my bcapabilities.  But any process descended from the execcap I ran would not have CAP_NET_BIND_SERVICE.

Again, I understand I can block things for all processes everywhere in the guest w/ bcapabilities.  That's not what I'm trying to do.  Otherwise, processes can't start.

> >> > The patch can be found here.
> >> >
> >> > http://users.k12system.com/mrwizard/software/patch-vserver-cap_bound_set.diff
> >>
> >> The only bug that I see is that we don't let guests modify
> >> vx_cap_bset at this time, but
> >> http://people.linux-vserver.org/~dhozac/p/k/delta-cap_bset-fix01.diff
> >> should take care of that...
> >
> > I wouldn't use this.  I wouldn't want my guest to change it anyway.
> > The fix here is the behavior of new processes and where and how they
> > receive capabilities from the parent.
>
> "Parent"? Process? The host?

Parent: PPID.  The process that calls fork, exec, or clone.
Process:  The program I'm running in the guest.
I didn't mention a host here. :)

For the purposes of this behavior, everything could be run in the guest context.  Like I said, the inheritance on exec is the issue, not bcapabilities.  It's just that they seem to be related and shouldn't be.

Joseph Gooch
Sapphire Suite Product Manager
K12 Systems, Inc.
(866) 366-9540
Received on Tue Jun 17 00:00:25 2008