Re: [Vserver] vserver hosts "just stop responding" - cause??

From: GarconDuMonde <gdm_at_fifthhorseman.net>
Date: Tue 05 Sep 2006 - 00:02:12 BST
Message-ID: <44FCB074.6040104@fifthhorseman.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Herbert Poetzl wrote:
> On Mon, Sep 04, 2006 at 03:08:43PM +0100, GarconDuMonde wrote:
> hello,
>
> i am involved in running two servers, both of which become
> unresponsive after periods of time for reasons that are unclear.
> here is some background:
>
> server 1.
>
> P4 2.8GHz, 1GB RAM, 2x120GB IDE hard drives.
> kernel - 2.6.16.10-vs2.0.2-rc18 installed 30 april 2006 (the box has
> mostly been down since this point, and we have not had an opportunity
> to update it)
>
>> relatively old kernel

yes, i know, but that's because it's been down for such a long time!!

> this box ran three vservers: one was in production use, one was for
> development of the same software (twiki) and the third was for backups
> of a completely different site/software.
>
>
>
> server 2.
>
> dual xeon 2.8GHz, 5GB RAM, 2x160GB SATA hard drives
> kernel - was 2.6.8 when problems started, but then upgraded to
> linux-image-2.6.16-2-vserver-686 from backports.org. however,
> continued to experience the same problems.
>
>> very old kernel

sorry, my fault - that was a typo: it was also a 2.6.16 - and was the
latest kernel at the time. the 2.6.8 was definitely old, tho!

> this box ran approx 15 vservers but neither the cpu nor the memory was
> ever maxed out. there was no indication in any logfiles on the host of what
> the problem possibly was.
>
> we have now done extensive testing on the hardware using memtest86,
> smartmontools and cpuburn without finding any problems. the server now
> has uptime of ~40 days using RIP and a 2.6.17 kernel
>
>> well, how does it behave with the latest stable release?
>> (vs2.02 for 2.6.17.11)

we're going to try this when we reboot into the vservers later this week (or
early next), so i'll let you know.

> * * *
>
> server 1 had an extensive amount of work done on it by someone
> extremely knowledgeable in linux security, but the problem could not
> be found. attempting to recreate the situation (on server 1) with
> apache bench did lead to the situation where the box would complete
> the 3-way TCP handshake or respond to ICMP echo requests but not
> handle TCP connections any further. however, adjusting limits ('as'
> 'rss' and 'nproc') did not prevent the box from becoming unresponsive
> again as soon as it was put back into production use.
>
>> I'd suggest to test mainline (not a debian specific kernel)

hmm, ok, i will put that to the rest of the group as well - although one of the
advantages of using debian was the ease of maintenance. i have also tried hard
to pick micah's brains along the way ;-)

> * * *
>
> we do not know how to proceed from here in terms of diagnosing and
> fixing the problem and making the machines once again suitable for
> production use. we are now nearing the stage where we will have to
> give up using linux-vserver unless we can solve the problem. this is
> a shame as quite a few of us have invested time in learning about
> linux-vserver. does anyone have any ideas on how to diagnose and fix
> the problem?
>
>> well, many companies do use linux-vserver in production quite
>> fine, (I'm using it too :) and there are no reports of issues
>> with the stable branch (2.01 or 2.02), but it should be no
>> problem to track your specific issues down .. best would be
>> to pay a visit to the irc channel (#vserver @ irc.oftc.net)

thanks - have been there before and had good lessons from you :-)

i will try to come back again when i have a bit of time and also one of the
servers in front of me to play with properly - will likely be sometime next week.

> incidentally, i have heard of other machines that have experienced
> similar problems.
>
>> well, did not get reported back to us (at least not that I
>> know of)

no - this has taken me a while to hear as well, but i have now heard it
informally from several people who run vservers, that they've had (generally)
occasional problems with similar symptoms - just "hanging" and no response.
again, it is probable that they had older kernels.
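
for anyone wanting to tell this symptom (handshake completes, nothing after) apart from a host that is completely dead, a small probe along these lines might help. this is only a rough sketch: the function name `probe`, port 80 and the plain HEAD request are assumptions for illustration, not something we actually ran, and it does not cover the ICMP side (that needs raw sockets).

```python
import socket


def probe(host, port=80, timeout=5.0):
    """Classify a host as 'no-handshake', 'handshake-only' or 'responding'.

    Hypothetical helper: port 80 and the HEAD request are assumptions.
    """
    try:
        # Succeeds as soon as the kernel on the far side completes the
        # 3-way handshake, even if no process ever accepts the connection.
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "no-handshake"
    try:
        sock.settimeout(timeout)
        sock.sendall(b"HEAD / HTTP/1.0\r\n\r\n")
        data = sock.recv(1)
        # An empty read means the peer closed without answering; that is
        # lumped in with "handshake-only" here for simplicity.
        return "responding" if data else "handshake-only"
    except OSError:
        # Covers socket.timeout: connected, but nothing came back in time.
        return "handshake-only"
    finally:
        sock.close()
```

run from another machine at intervals, a result of "handshake-only" would match what we saw with apache bench, whereas "no-handshake" would point more towards the network or a full lockup.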

thanks, i'll be in touch again soon (and, of course, write anything useful up on
the wiki)

        --gdm

> if i can help with diagnosis in any way by providing more information,
> please ask. i also have some munin graphs available that demonstrate
> some of the variables (cpu, memory usage, uptime, individual vserver
> memory usage, etc), taken shortly after the hosts became unreachable -
> i.e. with the most detail of the hours leading up to them "stopping"
>
>> sure, please contact me on the irc channel and we will have
>> a closer look at your issues ... but please also use the
>> latest (stable) release ...
>
> any help is gratefully received.
>
>> best,
>> Herbert
>
> best,
>
> --gdm
>
> --
>
> http://docs.indymedia.org/view/Main/GarconDuMonde
> gpg --keyserver pgp.mit.edu --recv-keys 594B97C2
> Key fingerprint = 7B70 F22D F275 D111 3A04 F9EE 0E25 4944 594B 97C2
>
>
>
_______________________________________________
Vserver mailing list
Vserver@list.linux-vserver.org
http://list.linux-vserver.org/mailman/listinfo/vserver

- --

http://docs.indymedia.org/view/Main/GarconDuMonde
gpg --keyserver pgp.mit.edu --recv-keys 594B97C2
Key fingerprint = 7B70 F22D F275 D111 3A04 F9EE 0E25 4944 594B 97C2

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.1 (Darwin)

iD8DBQFE/LByDiVJRFlLl8IRAs6CAJ9QMMcph38mS391XDRe7m0XAEuuKACfUVdV
VVvf71hmdXeEyJMHQOUYuX4=
=esjx
-----END PGP SIGNATURE-----
Received on Tue Sep 5 00:03:18 2006
