-----BEGIN PGP SIGNED MESSAGE-----
Herbert Poetzl wrote:
> On Mon, Sep 04, 2006 at 03:08:43PM +0100, GarconDuMonde wrote:
> i am involved in running two servers, both of which become
> unresponsive after periods of time for reasons that are unclear.
> here is some background:
> server 1.
> P4 2.8GHz, 1GB RAM, 2x120GB IDE hard drives.
> kernel - 184.108.40.206-vs2.0.2-rc18 installed 30 april 2006 (the box has
> mostly been down since this point, and we have not had an opportunity
> to update it)
>> relatively old kernel
yes, i know, but that's because it's been down for such a long time!!
> this box ran three vservers: one was in production use, one was for
> development of the same software (twiki) and the third was for backups
> of a completely different site/software.
> server 2.
> dual xeon 2.8GHz, 5GB RAM, 2x160GB SATA hard drives
> kernel - was 2.6.8 when problems started, but then upgraded to
> linux-image-2.6.16-2-vserver-686 from backports.org. however,
> continued to experience the same problems.
>> very old kernel
sorry, my fault - that was a typo: it was also a 2.6.16 - and was the
latest kernel at the time. the 2.6.8 was definitely old, tho!
> this box ran approx 15 vservers but neither the cpu nor the memory was ever
> maxed out. there was no indication in any logfiles on the host of what
> the problem possibly was.
> we have now done extensive testing on the hardware using memtest86,
> smartmontools and cpuburn without finding any problems. the server now
> has uptime of ~40 days using RIP and a 2.6.17 kernel
>> well, how does it behave with the latest stable release?
>> (vs2.02 for 220.127.116.11)
we're going to try this when we reboot into the vservers later this week (or
early next), so i'll let you know.
> * * *
> server 1 had an extensive amount of work done on it by someone
> extremely knowledgeable in linux security, but the problem could not
> be found. attempting to recreate the situation (on server 1) with
> apache bench did lead to the situation where the box would complete
> the 3-way TCP handshake or respond to ICMP echo requests but not
> handle TCP connections any further. however, adjusting limits ('as'
> 'rss' and 'nproc') did not prevent the box from becoming unresponsive
> again as soon as it was put back into production use.
>> I'd suggest to test mainline (not a debian specific kernel)
hmm, ok, i will put that to the rest of the group as well - although one of the
advantages of using debian was the ease of maintenance. i have also tried hard
to pick micah's brains along the way ;-)
> * * *
> we do not know how to proceed from here in terms of diagnosing and
> fixing the problem and making the machines once again suitable for
> production use. we are now nearing the stage where we will have to
> give up using linux-vserver unless we can solve the problem. this is
> a shame as quite a few of us have invested time in learning about
> linux-vserver. does anyone have any ideas on how to diagnose and fix
> the problem?
>> well, many companies do use linux-vserver in production quite
>> fine, (I'm using it too :) and there are no reports of issues
>> with the stable branch (2.01 or 2.02), but it should be no
>> problem to track your specific issues down .. best would be
>> to pay a visit to the irc channel (#vserver @ irc.oftc.net)
thanks - have been there before and had good lessons from you :-)
i will try to come back again when i have a bit of time and also one of the
servers in front of me to play with properly - will likely be sometime next week.
> incidentally, i have heard of other machines that have experienced
> similar problems.
>> well, did not get reported back to us (at least not that I
>> know of)
no - this has taken me a while to hear as well, but i have now heard it
informally from several people who run vservers, that they've had (generally)
occasional problems with similar symptoms - just "hanging" and no response.
again, it is probable that they had older kernels.
thanks, i'll be in touch again soon (and, of course, write anything useful up on
> if i can help with diagnosis in any way by providing more information,
> please ask. i also have some munin graphs available that demonstrate
> some of the variables (cpu, memory usage, uptime, individual vserver
> memory usage, etc), taken shortly after the hosts became unreachable -
> i.e. with the most detail of
> the hours leading up to them "stopping"
>> sure, please contact me on the irc channel and we will have
>> a closer look at your issues ... but please also use the
>> latest (stable) release ...
> any help is gratefully received.
> gpg --keyserver pgp.mit.edu --recv-keys 594B97C2
> Key fingerprint = 7B70 F22D F275 D111 3A04 F9EE 0E25 4944 594B 97C2
Vserver mailing list
gpg --keyserver pgp.mit.edu --recv-keys 594B97C2
Key fingerprint = 7B70 F22D F275 D111 3A04 F9EE 0E25 4944 594B 97C2
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v18.104.22.168 (Darwin)
-----END PGP SIGNATURE-----
Received on Tue Sep 5 00:03:18 2006