RE: [vserver] Capabilities in Vserver Kernels

From: Joe Gooch <mrwizard_at_k12system.com>
Date: Mon 16 Jun 2008 - 22:14:17 BST
Message-ID: <A8B39189E548604792E5213D18A9DB62190F539D@deagol.win.k12system.com>

See inline.

Joseph Gooch
Sapphire Suite Product Manager
K12 Systems, Inc.
(866) 366-9540

> -----Original Message-----
> From: Daniel Hokka Zakrisson [mailto:daniel@hozac.com]
> Sent: Monday, June 16, 2008 4:27 PM
> To: Joe Gooch
> Cc: vserver@list.linux-vserver.org
> Subject: Re: [vserver] Capabilities in Vserver Kernels
>
> Joe Gooch wrote:
> > I have diagnosed what I believe to be an incorrect use of capabilities
> > in virtual contexts. I believe the vx_mask_cap_bset function in
> > kernel/vserver/context.c should mask based on vx_bcaps, not vx_cap_bset.
> >
> > Rationale:
> > vx_cap_bset is the capabilities set from /proc/sys/kernel/cap-bound,
> > which is global for the system, not by VM. Using vx_bcaps allows it
> > to be set by vm.
>
> Uh, no, it's the other way around. vx_bcaps can only be set
> from the host.
> The purpose of vx_cap_bset is to allow the guest to set
> /proc/sys/kernel/cap-bound on its own, independently from the
> host (obviously not to raise it).
>
> (Also, VM is really not the right term. It's not virtual, and
> it's not a machine. :-))

VS then. :)

OK, well if that's the intention, it's still not working.

Host# cat /proc/sys/kernel/cap-bound
128

(128 equates to CAP_SETUID)
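As a quick check of that number: capability masks are bit fields, with bit N standing for capability number N from <linux/capability.h>. A minimal Python sketch follows; decode_caps is a made-up helper name, and the table lists only the first nine capabilities.

```python
# Capability numbers as defined in <linux/capability.h> (first nine only).
CAP_NAMES = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    2: "CAP_DAC_READ_SEARCH",
    3: "CAP_FOWNER",
    4: "CAP_FSETID",
    5: "CAP_KILL",
    6: "CAP_SETGID",
    7: "CAP_SETUID",
    8: "CAP_SETPCAP",
}


def decode_caps(mask):
    """Return the names of the known capabilities whose bits are set in mask."""
    return [name for bit, name in sorted(CAP_NAMES.items()) if mask & (1 << bit)]


# The cap-bound value shown on the host.
print(decode_caps(128))

# The stock bound is "everything except CAP_SETPCAP" (-1 & ~CAP_SETPCAP),
# written here as a 32-bit mask.
default_bound = 0xFFFFFFFF & ~(1 << 8)
print(hex(default_bound))
```

So 128 is 1 << 7, and CAP_SETUID is capability number 7.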

Host# vserver support exec cat /proc/sys/kernel/cap-bound
128

Host# cat /etc/vservers/support/bcapabilities
~CAP_SETUID

In fact, in examining the code, vx_cap_bset is only ever set from the calling context. It's never pulled from anywhere else, as far as I can tell. Bcapabilities are dumped directly into vx_bcaps.

As a side note, I spent a large amount of time searching for the proper way to use bcapabilities, ncapabilities and the like, but could not find an explanation of what they are or what they are intended to be used for. It could be part of the wiki restructuring; I'm not sure. So my knowledge in this regard is based on the code without context.

> > In addition, as far as I can tell, the point of bcapabilities
> > (vx_bcaps) is to limit the entire set of capabilities available to a
> > VM. This is in line with the change I suggest.
> >
> > cap-bound, instead of limiting capabilities, merely limits
> > capabilities inherited automatically with superuser emulation.
> > (Assuming !SECURE_NOROOT) This feature does trickle down to the vms
> > and makes sense to be done on a system-wide basis.
> >
> > In short, this change makes capabilities work, for those of us who
> > still use them actively.
>
> I have no idea what that actually means. Care to elaborate?

Sure! I've been using capabilities since 2.2, so I'll give some history.

The POSIX draft was never really finished (at least last I knew), so in v2.2 Linux implemented capabilities, but ext2/3 couldn't support the filesystem component. Linux implemented superuser emulation to get around this. What it means is that when a program runs with euid 0 (i.e. it should have root), the only way it can get these privileges in its effective and permitted (ep) sets is if the kernel grants them. So exec takes -1 & cap-bound and dumps that into the ep sets, regardless of the inherited set or anything else. This happens every time a process is created with euid 0.

The problem was that there was no way to limit the capabilities of a root process, because superuser emulation would prevent it, and since it happened on the exec call, the reverse was also true. (Moving from euid 0 to a non-zero euid would DROP the ep sets.) This made it hard to elevate caps.

So, I hacked the libcap library to exec the child process, keep setpcap on the parent, set the process capabilities on the child and then die.
(See: http://linux.omnipotent.net/article.php?article_id=5425 Though I'm sure the code link is broken now, not to mention obsolete, read on)

Linux v2.4 came along and introduced prctl and KEEPCAPS, which was nice, because you could tell the kernel, "Hey, I'm capabilities aware and I want to maintain my capabilities from euid 0 to euid non-0. Can you set that up?" And if keepcaps is 1, then that's what happens. So now the execcap routine should be (and I believe it is now):

Set keepcaps 1
Drop credentials
Drop any caps you don't want to keep
Exec process.

Since keepcaps is still 1, exec doesn't muck with the capabilities. This functionality is still in v2.6.

Securebits were also introduced, specifically SECURE_NOROOT. If set, the superuser emulation features of exec are completely off. Root is just the uid 0 user, NOT a superuser. If you do this without filesystem capabilities support, or RSBAC, or something else that will elevate privileges when required, then you completely break your system. So, I don't do that.

The simplest way I've ever seen to use capabilities is to:
1) Modify init to give itself SETPCAP. (Because process 1 is the only one that gets this in its permitted set, and cap-bound excludes SETPCAP from superuser emulation) In addition, init should set a full inheritable set.
2) Lower cap-bound to something reasonable. (128 was suggested due to bugtraq issues with sendmail many moons ago)

What this means is when I become root, the only capability I get automatically is CAP_SETUID. It cannot be filtered by my inheritable set. However, I only receive +ep for the capabilities that were in my inherited set before I exec.

So, capabilities enabled kernel (this is my host):
/proc/sys/kernel/cap-bound: 128 (normal kernel is -1 & ~CAP_SETPCAP)
getpcaps 1: =eip cap_setpcap-i (normal kernel would be =ep cap_setpcap-e)
getpcaps 2: =eip cap_setpcap-ei (normal kernel would be =ep cap_setpcap-e)
getpcaps = (root process): =eip cap_setpcap-eip (notice since cap_setpcap is not in cap-bound, and is NOT in the inherited set, my root process doesn't have it at all)
getpcaps = (user process): =i cap_setpcap-i

On a normal kernel:

getpcaps = (root process): =ep cap_setpcap-ep
getpcaps = (user process): = cap_setpcap-i

The inheritable set on a non-modified kernel basically means nothing if cap-bound gives all those caps on privilege elevation.

So, as I said, I keep my cap-bound at 128 because my init process increases the inheritable set such that root processes inherit effective and permitted permissions from the unprivileged (or privileged) processes that spawn them. This allows for non-root processes, which you CAN'T have if cap-bound is really full.

For instance.
On a non-modified kernel as root (capbound -1 & ~CAP_SETPCAP):
# execcap ' = cap_setuid+eip ' getpcaps =
  Capabilities for `=': =ep cap_setuid+i cap_setpcap-ep
# execcap ' = cap_sys_chroot+eip ' getpcaps =
  Capabilities for `=': =ep cap_sys_chroot+i cap_setpcap-ep

Doesn't much matter what I do.

On my cap enabled kernel as root:
# execcap ' = cap_setuid+eip ' getpcaps =
  Capabilities for `=': = cap_setuid+eip
# execcap ' = cap_sys_chroot+eip ' getpcaps =
  Capabilities for `=': = cap_sys_chroot+eip cap_setuid+ep

Notice how cap_setuid is always given regardless of the +i set. That's cap-bound!
Notice also how I can have a root process that only has chroot and setuid.
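The runs above are consistent with a simplified model of the 2.6 exec rule for a root process, pP' = (fP & cap_bound) | (fI & pI), where superuser emulation treats the file's permitted (fP) and inheritable (fI) sets as full. The Python sketch below is only an illustration of that arithmetic, not kernel code; exec_as_root is a hypothetical name, and only the permitted set is modelled.

```python
# Capability numbers from <linux/capability.h>.
CAP_SETUID, CAP_SETPCAP, CAP_SYS_CHROOT = 7, 8, 18
FULL = 0xFFFFFFFF


def exec_as_root(inheritable, cap_bound):
    """Model the permitted set of a freshly exec'd euid-0 process.

    With superuser emulation the file sets are treated as full, so
    pP' = (fP & cap_bound) | (fI & pI) reduces to cap_bound | pI.
    """
    file_permitted = file_inheritable = FULL
    return (file_permitted & cap_bound) | (file_inheritable & inheritable)


stock_bound = FULL & ~(1 << CAP_SETPCAP)   # -1 & ~CAP_SETPCAP
lowered_bound = 1 << CAP_SETUID            # 128

# Stock bound: the inheritable set makes no difference, root gets
# everything except CAP_SETPCAP.
stock = exec_as_root(1 << CAP_SYS_CHROOT, stock_bound)

# Bound lowered to 128: root gets CAP_SETUID from cap-bound plus
# whatever was in the inheritable set, and nothing else.
lowered = exec_as_root(1 << CAP_SYS_CHROOT, lowered_bound)

print(hex(stock), hex(lowered))
```

With the bound at 128 and only cap_sys_chroot inheritable, the model yields exactly cap_sys_chroot plus cap_setuid, which is the second result shown for the cap enabled kernel.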

It's even better in daemons I've modified, because even if a daemon has all capabilities, if the inheritable set is empty, any process spawned from it is unprivileged, even if it's uid 0. This can prevent breaking out of chroots using something like cage, prevent unsafe functions in process trees (imagine removing CAP_SYS_MODULE from sshd, for instance), etc. It's what capabilities were meant to do: limit access. Even without filesystem support.

So now, vserver. I don't have an example of what it does unmodified, since I'm using this patch in production (and it does work; it makes the vserver behaviour consistent with the parent kernel), but here's what it does modified.

Host /proc/sys/kernel/cap-bound is 128
Bcapabilities is ~CAP_SYS_BOOT and ~CAP_AUDIT_WRITE (in addition to the others vserver drops):

# vserver support exec getpcaps =
Capabilities for `=': = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_net_bind_service,cap_sys_chroot,cap_sys_ptrace,cap_sys_tty_config,cap_lease+eip

My new root process is limited by bcapabilities...

# execcap ' =eip cap_lease,cap_net_bind_service,cap_setpcap-eip ' vserver support exec getpcaps =
Capabilities for `=': = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_sys_chroot,cap_sys_ptrace,cap_sys_tty_config+eip

But it's ALSO bounded by inheritable sets.
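Both vserver results can be reproduced with a small model in which the new root process gets (cap-bound | inheritable) & bcaps. This Python sketch is an assumption-laden illustration, not the patch itself: the BCAPS mask is simply taken from the first getpcaps output above, and guest_root_caps, mask and names are hypothetical helpers.

```python
# Capability numbers from <linux/capability.h> (only those needed here).
CAPS = {
    "cap_chown": 0, "cap_dac_override": 1, "cap_dac_read_search": 2,
    "cap_fowner": 3, "cap_fsetid": 4, "cap_kill": 5, "cap_setgid": 6,
    "cap_setuid": 7, "cap_setpcap": 8, "cap_net_bind_service": 10,
    "cap_sys_chroot": 18, "cap_sys_ptrace": 19, "cap_sys_tty_config": 26,
    "cap_lease": 28,
}
FULL = 0xFFFFFFFF


def mask(*cap_names):
    """Build a bit mask from capability names."""
    result = 0
    for name in cap_names:
        result |= 1 << CAPS[name]
    return result


def names(bits):
    """List the known capability names set in bits, in numeric order."""
    ordered = sorted(CAPS.items(), key=lambda item: item[1])
    return [name for name, bit in ordered if bits & (1 << bit)]


HOST_CAP_BOUND = mask("cap_setuid")  # 128, as on the host
# Assumption: the effective bcaps mask equals the first getpcaps output.
BCAPS = mask(*(n for n in CAPS if n != "cap_setpcap"))


def guest_root_caps(inheritable):
    """Model: superuser emulation result, then masked by bcaps."""
    return (HOST_CAP_BOUND | inheritable) & BCAPS


shell = FULL & ~mask("cap_setpcap")  # inheritable set of the calling shell
run1 = guest_root_caps(shell)
run2 = guest_root_caps(shell & ~mask("cap_lease", "cap_net_bind_service"))

print(",".join(names(run1)))
print(",".join(names(run2)))
```

In this model the second run loses cap_lease and cap_net_bind_service because they are missing from the inheritable set, while cap_setuid would survive in any case because it is in cap-bound.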

> > The patch can be found here.
> >
> > http://users.k12system.com/mrwizard/software/patch-vserver-cap_bound_set.diff
>
> The only bug that I see is that we don't let guests modify
> vx_cap_bset at this time, but
> http://people.linux-vserver.org/~dhozac/p/k/delta-cap_bset-fix01.diff
> should take care of that...

I wouldn't use this. I wouldn't want my guest to change it anyway. The fix here concerns the behavior of new processes: where and how they receive capabilities from the parent.

Thanks!
-Joe
Received on Mon Jun 16 22:14:24 2008
