[Vserver] fatal errors starting and stopping a guest

From: Chuck <chuck_at_sbbsnet.net>
Date: Tue 04 Oct 2005 - 19:50:40 BST
Message-Id: <200510041450.40901.chuck@sbbsnet.net>

this one is really odd.

i'm using gentoo with
        from vanilla kernel and vserver patch
        for some reason i could never get the 1.11.13-r1

the machine is a dual processor p3-500 machine with intel mobo.

i have been having stop/start problems with this installation from the start.
the single proc system has exactly the same setups and it has no problems.

initially it had problems with mount/unmount. even after a successful stop i
would randomly get an error mounting the fstab entries, and going in and
removing the contents of mtab usually cured it. when i enabled mounting the
shared portage directories in the etc/vserver/fstab, the stop would always
fail because it was not able to unmount the file systems. i suspect it tried
to unmount distfiles after portage was already unmounted, which made the
distfiles mount point invalid.
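
a rough sketch of the ordering that suspicion implies, written in python and
not taken from anything util-vserver actually does: nested mounts get
unmounted deepest first, so distfiles goes before portage. the function name
unmount_order and the paths /usr/portage and /usr/portage/distfiles are
assumptions for illustration only, not copied from the real fstab.

```python
import posixpath


def unmount_order(mount_points):
    """Return mount points sorted so the deepest paths are unmounted first.

    Hypothetical helper: it only orders path strings, it does not call umount.
    """
    def depth(path):
        # Count non-empty path components after normalising the path.
        return len([part for part in posixpath.normpath(path).split("/") if part])

    # sorted() is stable, so mount points of equal depth keep their given order.
    return sorted(mount_points, key=depth, reverse=True)


# Assumed example paths (not from the original fstab).
mounts = ["/usr/portage", "/usr/portage/distfiles"]
print(unmount_order(mounts))  # -> ['/usr/portage/distfiles', '/usr/portage']
```

simply unmounting in the reverse of the mount order would give the same result
here; sorting by depth is just one way to get a safe order when the mount
order is not known.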

it seems i have no trouble starting/stopping guests within a few minutes
after a host reboot, but after some time the error occurs. the last time i
tried an experiment: i went into each guest and did an orderly manual
shutdown of each running service, then exited and stopped the guests from the
host. the first one errored out with the messages below, and from that point
on none of them shut down cleanly. and when i went to reboot the host using
reboot, it did a power down instead!!! i have just made sure acpi is off in
the kernel, because at that point i could not restart the host by a remote
power cycle.

another weird thing is that the /etc/init.d/vserver script is not
auto-starting the guests. it tries, because the ip addresses are still listed
on each nic, but no trace of the guests is present in the process list. when
i start them manually, they start with no error other than the expected
nonfatal RTNETLINK file exists error.

the first error i received was on the first guest to be shut down:

apollo rio # vserver prometheus stop
/usr/lib/util-vserver/vserver.functions: line 804: 28440 Segmentation fault \
  $_NOHUP $_VWAIT --timeout "$VSHELPER_SYNC_TIMEOUT" --terminate \
--status-fd 3 "$2" >>$_is_tmpdir/out 2>$_is_tmpdir/err 3>$_is_tmpdir/fifo
internal error: 'vwait' exited with an unexpected status ''; I will
try to continue but be prepared for unexpected events.
* Prometheus Stopped

btw, the start and stopped messages are my additions in the
pre-start/post-stop scripts.

then, when i tried to start it again i got:

apollo ~ # vserver prometheus start
* Prometheus starting
<1>Unable to handle kernel NULL pointer dereference at virtual address
 printing eip:
*pde = 00000000
Oops: 0002 [#9]
Modules linked in:
CPU: 1
EIP: 0060:[<c0136b7e>] Not tainted VLI
EFLAGS: 00010246 (
EIP is at __dealloc_vx_info+0xe/0x50
eax: 00000000 ebx: 00000d4d ecx: 00000000 edx: f5807000
esi: ffffffef edi: f5807000 ebp: eb0da000 esp: eb0dbf68
ds: 007b es: 007b ss: 0068
Process vcontext (pid: 31199, threadinfo=eb0da000 task=f5adb040)
Stack: eb0da000 c0136d52 f5807000 00000d4d 00000000 fffffeff c01375b8 00000d4d
       c1923380 00000000 00000000 00000003 00000000 00000d4d c0136423 00000d4d
       00000000 09010001 0804bcd4 bfa70704 c0102ff9 09010001 00000d4d 00000000
Call Trace:
 [<c0136d52>] __create_vx_info+0x92/0x1c0
 [<c01375b8>] vc_ctx_create+0x98/0x100
 [<c0136423>] sys_vserver+0x163/0x540
 [<c0102ff9>] syscall_call+0x7/0xb
Code: 4d c2 83 f8 01 89 c1 7e 9e e9 bb fd ff ff 0f bc c0 e9 9f fd ff ff 8d b4
26 00 00 00 00 83 ec 04 8b 54 24 08 8b 02 8b 4a 04 85 c0 <89> 01 74 03 89 48
04 81 4a 18 00 80 00 00 c7 42 04 00 02 20 00
 /usr/lib/util-vserver/vserver.start: line 147: 31199 Segmentation fault
"$VSERVER_DIR"/ulimits $_VCONTEXT --create "${OPTS_VCONTEXT_CREATE[@]}" --
"$VSERVER_DIR"/rlimits --missingok -- $_VSCHED --xid self "${OPTS_VSCHED[@]}"
-- $_VUNAME --xid self --dir "$VSERVER_DIR"/uts --missingok --
"${VSERVER_EXTRA_CMDS[@]}" $_VUNAME --xid self --set -t
context="$VSERVER_DIR" -- $_VATTRIBUTE --set "${OPTS_VATTRIBUTE[@]}" --
$_SAVE_CTXINFO "$VSERVER_DIR" $_ENV -i -- $_VCONTEXT --migrate-self
--endsetup --chroot $SILENT_OPT "${OPTS_VCONTEXT_MIGRATE[@]}"

An error occured while executing the vserver startup sequence; when
there are no other messages, it is very likely that the init-script
(/sbin/init) failed.

Common causes are:
* /etc/rc.d/rc on Fedora Core 1 and RH9 fails always; the 'apt-rpm' build
  method knows how to deal with this, but on existing installations,
  appending 'true' to this file will help.

Failed to start vserver 'prometheus'


then, when i tried to shut the host down, i got this:


kernel BUG at kernel/vserver/context.c:144!
invalid operand: 0000 [#10]
Modules linked in:
CPU: 0
EIP: 0060:[<c0136cb0>] Not tainted VLI
EFLAGS: 00010246 (
EIP is at free_vx_info+0x70/0x80
eax: 00000001 ebx: f4a0e938 ecx: da1e0368 edx: f58e7000
esi: f58e7000 edi: c03d89a4 ebp: da1e030c esp: eeb63da4
ds: 007b es: 007b ss: 0068
Process find (pid: 3033, threadinfo=eeb62000 task=f58fe530)
Stack: c013c584 f58e7000 f4a0e938 00000020 00000004 00000000 c1907960 000000d0
       fffffff4 da1e030c f4af18a4 eeb63e4c c0175181 f4af18a4 da1e030c eeb63f10
       00000000 eeb63f10 eeb63e44 eeb63e4c c017557a f48d2d90 eeb63e4c eeb63f10
Call Trace:
 [<c013c584>] proc_virtual_lookup+0xd4/0x2a0
 [<c0175181>] real_lookup+0xd1/0x100
 [<c017557a>] do_lookup+0x13a/0x150
 [<c0175cf7>] __link_path_walk+0x767/0xe70
 [<c0146ca7>] filemap_nopage+0x207/0x3c0
 [<c0176449>] link_path_walk+0x49/0xe0
 [<c01767a4>] path_lookup+0x94/0x170
 [<c0176a43>] __user_walk+0x33/0x60
 [<c0170a5c>] vfs_lstat+0x1c/0x60
 [<c01711eb>] sys_lstat64+0x1b/0x40
 [<c01155e0>] do_page_fault+0x0/0x5db
 [<c0102ff9>] syscall_call+0x7/0xb
Code: ce b0 3b c0 eb dc f6 42 18 01 74 cf 0f 0b 95 00 ce b0 3b c0 eb c5 0f 0b
93 00 ce b0 3b c0 eb b7 0f 0b 92 00 ce b0 3b c0 eb a6 90 <0f> 0b 90 00 ce b0
3b c0 eb 94 8d b6 00 00 00 00 57 56 53 83 ec
  * Hiding /proc entries ...

apollo ~ #


and it sat forever at hiding proc entries
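
for anyone comparing the two oopses above, here is a small python sketch that
pulls the symbol, offset and size out of the "Call Trace" lines. the regular
expression and the name parse_call_trace are assumptions made for this
illustration, and the sample text is copied from the first trace in this
message; it is not a general oops decoder.

```python
import re

# Matches trace lines such as " [<c0136d52>] __create_vx_info+0x92/0x1c0".
# The EIP line has no "symbol+0xoff/0xsize" right after the address, so it is skipped.
TRACE_LINE = re.compile(
    r"\[<([0-9a-f]{8})>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def parse_call_trace(text):
    """Return a list of (symbol, offset, size, address) tuples, in trace order."""
    frames = []
    for match in TRACE_LINE.finditer(text):
        addr, symbol, offset, size = match.groups()
        frames.append((symbol, int(offset, 16), int(size, 16), int(addr, 16)))
    return frames


sample = """EIP: 0060:[<c0136b7e>] Not tainted VLI
Call Trace:
 [<c0136d52>] __create_vx_info+0x92/0x1c0
 [<c01375b8>] vc_ctx_create+0x98/0x100
 [<c0136423>] sys_vserver+0x163/0x540
 [<c0102ff9>] syscall_call+0x7/0xb
"""

frames = parse_call_trace(sample)
for symbol, offset, size, addr in frames:
    print(f"{symbol:20s} +{offset:#x} of {size:#x} (at {addr:#010x})")
```

both traces share only the final syscall_call frame; the first comes in through
sys_vserver and the second through a /proc lookup, which is easier to see once
the frames are listed side by side.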

i finally got pissed at it, logged back into it on another terminal and
issued init 0, which i found out called a halt rather than a shutdown, as it
has done in the past... i suppose i should have done an init 6 once again.
that may have been what caused the shutdown when i initially told it to
reboot. :)

now i get something really odd, and only when i am starting this one guest,
prometheus: it shows me the startup process! :)

apollo ~ # vserver prometheus start
* Prometheus starting
INIT: version 2.86 booting

Gentoo Linux; http://www.gentoo.org/
 Copyright 1999-2005 Gentoo Foundation; Distributed under the GPLv2

 * Setting hostname to prometheus ... [ ok ]
 * Updating environment ... [ ok ]
 * Cleaning /var/lock, /var/run ... [ ok ]
 * Cleaning /tmp directory ... [ ok ]
 * Setting DNS domainname to sbbsnet.net [ ok ]
INIT: Entering runlevel: 3
 * Starting clamd ... [ ok ]
 * Starting freshclam ... [ ok ]
                                                                        [ ok ]
 * Starting syslog-ng ... [ ok ]
 * Starting service scan ... [ ok ]
 * Starting spamd ... [ ok ]
 * Starting local ... [ ok ]
INIT: no more processes left in this runlevel


at this point it does not return to the host prompt unless i press enter

and when i stop it i now see the shutdown sequence but get no error.

apollo ~ # vserver prometheus stop
INIT: Sending processes the TERM signal
 * Stopping local ... [ ok ]
 * Stopping spamd ... [ ok ]
 * Stopping service scan ... [ ok ]
 * Stopping services ... [ ok ]
 * Stopping service logging ... [ ok ]
 * Stopping syslog-ng ... [ ok ]
 * Stopping clamd ...
* Failed to stop clamd [ !! ]
 * Stopping freshclam ... [ ok ]
* Prometheus Stopped
apollo ~ #


i do not see these sequences on other guests and they only became visible
after this last super crash and reboot.

i am hoping all these problems will go away when i set everything up fresh on
the big machine...

any clues what is happening?

if it's that kernel 'race bug' concerning smp, do you think the kernel.org
people will have it fixed in a few weeks? are they even aware of it? i'm
getting a bit apprehensive because the final machine, which is being
installed in about 2 weeks, must be absolutely perfect the first time. no
room for errors on that one.

"...and the hordes of M$*ft users descended upon me in their anger,
and asked 'Why do you not get the viruses or the BlueScreensOfDeath
or insecure system troubles and slowness or pay through the nose 
for an OS as *we* do?!!', and I answered...'I use Linux'. "
The Book of John, chapter 1, page 1, and end of book
Vserver mailing list
Received on Tue Oct 4 19:51:02 2005