From: Cathy Sarisky (cathy_at_acornhosting.net)
Date: Thu 07 Nov 2002 - 00:15:51 GMT
I have two servers hosting vservers.  The first is a 1 GHz Duron with 1 GB memory and an IDE disk, ext3.  It's on 60+ days of uptime, and the last time it was down was to upgrade the memory.  It runs 9 vservers and some stuff in the root server also, without a complaint.  Many of the vserver clients are running mostly idle AOLserver instances, so I have about 500MB of swap in use (2GB swap available) pretty regularly.  Loads are reasonable (about .7 during the day, often less than .2 overnight), the server is peppy, and everyone is happy.  This is a Red Hat 7.2 server, with the pre-built kernel (2.4.18-ctx12).  That kernel isn't set up for highmem, so I'm actually only using about 900MB of my 1GB.
Enter server #2.  Server #2 is a P4 1.6GHz, with 1GB memory and RAID1, ext3.  I wanted a highmem kernel, so I compiled this one.  This is a Red Hat 7.3 server, with 2.4.19ctx-13, patched and compiled by yours truly.  It has had 4-6 vservers running on it, loads in the .1-.5 range, and little if any swap in use.
(Currently:
Mem:  1033596K av, 1019324K used,   14272K free, 0K shrd, 272916K buff
Swap: 2048276K av, 796K used, 2047480K free 194720K cached)
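For anyone reading those numbers quickly: "used" includes buffers and page cache, so the box is not as short of memory as 14272K free suggests. Here is a rough sketch of the arithmetic, assuming that buffers and cached pages can be treated as reclaimable (an approximation, and the function name is just made up for illustration):

```python
def memory_breakdown(total_k, used_k, free_k, buffers_k, cached_k):
    """Split top's 'used' figure into application memory and reclaimable cache."""
    # Sanity check: top's used and free figures should add up to the total.
    assert used_k + free_k == total_k, "used + free does not match total"
    reclaimable_k = buffers_k + cached_k   # assumed reclaimable under pressure
    apps_k = used_k - reclaimable_k        # roughly what processes really hold
    available_k = free_k + reclaimable_k   # free now plus what could be freed
    return {
        "apps_k": apps_k,
        "reclaimable_k": reclaimable_k,
        "available_k": available_k,
    }


# Figures copied from the top output above, in kilobytes.
snapshot = memory_breakdown(1033596, 1019324, 14272, 272916, 194720)
print(snapshot)
```

By that estimate roughly 550MB is held by processes and about 480MB is buffers, cache, or free, which fits with almost no swap being in use.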
This server is very responsive for a while after a reboot, anywhere from a few days to maybe a week.  Then it will appear to hang.  It doesn't respond to SSH or http requests (to either root server or vserver), although it doesn't actually drop the packets.  It remains pingable.  It doesn't run cron jobs.  At the point where the problem starts, all logging stops, and nothing in the logs before that point hints that a problem is coming.  The server still responds at a console.  Twice I've had the data center tech run sar -u on it before rebooting.  Once it showed complete cpu usage, and once it showed the cpu almost entirely idle.  The vps run by the data center tech also didn't show anything unusual, although in both cases the server had been unresponsive for a while before the sar and vps commands were run.
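Since cron stops running and the logs go quiet, one thing I could try is a small standalone heartbeat that records the load average and forces each line to disk, so the last entry at least brackets when the trouble starts. This is only a sketch of the idea, not something I'm running: the function names and the log path are hypothetical, and it assumes /proc/loadavg is readable where the script runs.

```python
import os
import time


def heartbeat_line(loadavg_text, now=None):
    """Format one log line from the contents of /proc/loadavg."""
    # now=None means "current time"; timestamps are in GMT.
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now))
    fields = loadavg_text.split()
    # fields: 1-, 5- and 15-minute load, then running/total processes.
    return "%s load=%s,%s,%s procs=%s\n" % (
        stamp, fields[0], fields[1], fields[2], fields[3])


def write_heartbeat(log_path, loadavg_path="/proc/loadavg"):
    """Append one heartbeat line and push it to disk before returning."""
    with open(loadavg_path) as src:
        line = heartbeat_line(src.read())
    with open(log_path, "a") as log:
        log.write(line)
        log.flush()
        os.fsync(log.fileno())  # don't leave the line sitting in a buffer
    return line
```

It would need to run from its own loop (call write_heartbeat with some log file path, sleep a minute, repeat) rather than from cron. If the heartbeat stops too, the final timestamp still says when.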
Further weirdness: when the server is told to shut down at the console, it becomes ssh-able again for a few moments during the shutdown process.  This suggests to me that there's some process running that causes the server to be unresponsive, and when it's killed during the server shutdown, things revert to normal again.  (Of course, then the server reboots.)  I *really* wish this server wasn't in a data center half-way across the country!
The data center swapped out the network card, motherboard, and memory last week, but the server has hung again since.
I'm stumped.  I think the next course of action is to try running the precompiled kernel on this server, but that'll lose me the highmem features.
I realize that there are probably waaaay too many variables different between these two servers for the source of the problem to be obvious, but I wonder if anyone has seen anything similar and might suggest a course of action.  Do these symptoms sound at all familiar?  Trying to solve this sort of problem by experiment is wretched with a server in a data center and a problem that isn't reliably reproducible!
Thanks in advance for any ideas, suggestions for further investigation, or encouragement!
Cathy Sarisky 