[Vserver] vserver hosts "just stop responding" - cause??

From: GarconDuMonde <gdm_at_fifthhorseman.net>
Date: Mon 04 Sep 2006 - 15:08:43 BST
Message-ID: <44FC336B.2040004@fifthhorseman.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

hello,

i am involved in running two servers, both of which become unresponsive after
periods of time for reasons that are unclear. here is some background:

server 1.

P4 2.8GHz, 1GB RAM, 2x120GB IDE hard drives.
kernel - 2.6.16.10-vs2.0.2-rc18 installed 30 april 2006 (the box has mostly been
down since this point, and we have not had an opportunity to update it)

this box ran three vservers: one was in production use, one was for development
of the same software (twiki) and the third was for backups of a completely
different site/software.

server 2.

dual xeon 2.8GHz, 5GB RAM, 2x160GB SATA hard drives
kernel - was 2.6.8 when problems started, but then upgraded to
linux-image-2.6.16-2-vserver-686 from backports.org. however, it continued to
experience the same problems.

this box ran approx 15 vservers but neither the cpu nor the memory was ever
maxed out. there was no indication in any logfiles on the host of what the
problem might be.

we have now done extensive testing on the hardware using memtest86,
smartmontools and cpuburn without finding any problems. the server now has
uptime of ~40 days using RIP and a 2.6.17 kernel.

* * *

server 1 had an extensive amount of work done on it by someone extremely
knowledgeable in linux security, but the problem could not be found. attempting
to recreate the situation (on server 1) with apache bench did lead to the
situation where the box would complete the 3-way TCP handshake or respond to
ICMP echo requests but not handle TCP connections any further. however,
adjusting limits ('as' 'rss' and 'nproc') did not prevent the box from becoming
unresponsive again as soon as it was put back into production use.
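
for anyone wanting to check for the same state from another machine, here is a
minimal sketch of a probe that separates "handshake completes but nothing
answers" from "refused" and "responded". the function name probe, the result
labels and the HTTP HEAD payload are illustrative assumptions, not something
taken from the tests described above.

```python
import socket


def probe(host, port, timeout=5.0, payload=b"HEAD / HTTP/1.0\r\n\r\n"):
    """Report how far a TCP service gets: handshake first, then a first reply byte.

    The labels returned here are hypothetical names chosen for this sketch.
    """
    try:
        # Succeeds as soon as the kernel on the far side completes the handshake,
        # even if no process ever accepts the connection.
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        return "no-handshake"
    except OSError:
        return "refused-or-unreachable"

    with sock:
        sock.settimeout(timeout)
        try:
            sock.sendall(payload)
            data = sock.recv(1)  # wait for the first byte of any reply
        except socket.timeout:
            # Connected, request sent, but nothing came back in time:
            # the symptom described above.
            return "handshake-only"
        except OSError:
            return "reset"
        return "responded" if data else "closed-without-reply"
```

calling probe("192.0.2.10", 80) (a placeholder address) from cron on a second
box and logging the result with a timestamp would show when a host moves from
"responded" to "handshake-only".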

* * *

we do not know how to proceed from here in terms of diagnosing and fixing the
problem and making the machines once again suitable for production use. we are
now nearing the stage where we will have to give up using linux-vserver unless
we can solve the problem. this is a shame as quite a few of us have invested
time in learning about linux-vserver. does anyone have any ideas on how to
diagnose and fix the problem?

incidentally, i have heard of other machines that have experienced similar
problems.

if i can help with diagnosis in any way by providing more information, please
ask. i also have some munin graphs available that demonstrate some of the
variables (cpu, memory usage, uptime, individual vserver memory usage, etc),
taken shortly after the hosts became unreachable - i.e. with the most detail of
the hours leading up to them "stopping"

any help is gratefully received.

best,

        --gdm

- --

http://docs.indymedia.org/view/Main/GarconDuMonde
gpg --keyserver pgp.mit.edu --recv-keys 594B97C2
Key fingerprint = 7B70 F22D F275 D111 3A04 F9EE 0E25 4944 594B 97C2

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.1 (Darwin)

iD8DBQFE/DNpDiVJRFlLl8IRAncgAJwIth7f6uqxLvoT6zijA3YXZXxuSQCaA14G
9kh/+exbvk9OYlk2tqDgdbo=
=LkqV
-----END PGP SIGNATURE-----
_______________________________________________
Vserver mailing list
Vserver@list.linux-vserver.org
http://list.linux-vserver.org/mailman/listinfo/vserver
Received on Mon Sep 4 15:09:42 2006

Generated on Mon 04 Sep 2006 - 15:09:47 BST by hypermail 2.1.8