From: Herbert Poetzl (herbert_at_13thfloor.at)
Date: Thu 07 Nov 2002 - 05:56:10 GMT
On Wed, Nov 06, 2002 at 04:15:51PM -0800, Cathy Sarisky wrote:
> I have two servers hosting vservers. The first is a 1 GHz Duron with 1 GB memory and an IDE disk, ext3. It's on 60+ days of uptime, and the last time it was down was to upgrade the memory. It runs 9 vservers and some stuff in the root server also, without a complaint. Many of the vserver clients are running mostly idle AOLserver instances, so I have about 500MB of swap in use (2GB swap available) pretty regularly. Loads are reasonable (about .7 during the day, often less than .2 overnight), the server is peppy, and everyone is happy. This is a Redhat 7.2 server, with the pre-built kernel (2.4.18-ctx12). That kernel isn't set up for highmem, so actually I'm only using about 900MB of my 1GB.
> Enter server #2. Server #2 is a P4 1.6GHz, with 1GB memory and RAID1, ext3. I wanted a highmem kernel, so I compiled this one. This is a Redhat 7.3 server, with 2.4.19ctx-13, patched and compiled by yours truly. It has had 4-6 vservers running on it, loads in the .1-.5 range, and little if any swap in use.
> Mem: 1033596K av, 1019324K used, 14272K free, 0K shrd, 272916K buff
> Swap: 2048276K av, 796K used, 2047480K free 194720K cached)
> This server is very responsive for a while after a reboot. Days to maybe a week. Then it will appear to hang. It doesn't respond to SSH or http requests (to either root server or vserver), although it doesn't actually drop the packets. It remains pingable. It doesn't run cron jobs. At the point where the problem starts, all logging stops, but there's no indication of a problem on the horizon prior to the cessation of logging. The server still responds at a console. Two times I've had the data center tech run sar -u on it before rebooting. Once showed complete cpu usage, once showed the cpu almost entirely idle. The vps run by the data center tech also doesn't show anything unusual, although in both cases the server had been unresponsive for a while before the sar and vps commands were run.
> Further weirdness: when the server is told to shutdown at the console, it becomes ssh-able again for a few moments during the shutdown process. This suggests to me that there's some process running that causes the server to be unresponsive, and when it's killed during the server shutdown, things revert to normal again. (Of course, then the server reboots.) I *really* wish this server wasn't in a data center half-way across the country!
let us assume there is such a process running ...
- where should such a process come from?
- cron jobs? no, you don't run cron jobs on the server!
- left over process from virtual server XY?
- what would a single process do to stop the logging
and block remote logins (by accident)?
- temporarily replace the filesystem/network/etc?
- capture all tcp/udp packets?
so I do not believe that a process could cause this kind
of starvation, more likely some device i/o or system
(read kernel) resource exhaustion will be the cause.
I would check for the following:
- file/inode maximum setting
- virtual memory limitations
- maximum process time (maybe log/ssh/etc gets killed?)
- how much time is spent in system state
- how many processes are there, and how long is
the oldest process running?
- what is the last entry in the log?
- what happened at this moment on other log files?
- what about power management? ACPI/APM/SpeedStep
- what about I/O errors on the harddisk?
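
A few of the checks above can be read straight from /proc. The following is a minimal sketch, not a finished tool: it assumes a Linux /proc layout in which /proc/sys/fs/file-nr holds three numbers (allocated, free, maximum file handles) and the first line of /proc/stat starts with "cpu user nice system idle". The function names are made up for illustration, and newer kernels add further CPU fields that this sketch ignores.

```python
import os


def parse_file_nr(text):
    """Parse /proc/sys/fs/file-nr: allocated, free and maximum file handles."""
    allocated, free, maximum = (int(x) for x in text.split()[:3])
    return {"allocated": allocated, "free": free, "max": maximum}


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into jiffy counters."""
    fields = line.split()
    user, nice, system, idle = (int(x) for x in fields[1:5])
    return {"user": user, "nice": nice, "system": system, "idle": idle}


def system_share(before, after):
    """Fraction of CPU time spent in system state between two samples."""
    delta = {key: after[key] - before[key] for key in before}
    total = sum(delta.values())
    return delta["system"] / total if total else 0.0


def count_processes(proc_dir="/proc"):
    """Count numeric entries in /proc, one per running process."""
    return sum(1 for name in os.listdir(proc_dir) if name.isdigit())


def collect():
    """Take one reading of the live system (Linux only)."""
    with open("/proc/sys/fs/file-nr") as f:
        files = parse_file_nr(f.read())
    with open("/proc/stat") as f:
        cpu = parse_cpu_line(f.readline())
    return {"files": files, "cpu": cpu, "processes": count_processes()}
```

Calling collect() twice, some seconds apart, and passing the two "cpu" entries to system_share() gives a rough figure for time spent in system state; an "allocated" value close to "max" would point at file handle exhaustion.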
I also would not draw many conclusions from the ping-ability
of the server, because the icmp echo reply is at such
a low (kernel stack) level, that often even when the kernel
is completely unresponsive, icmp echo replies come back.
Second, do the replies come from your machine at all?
Many firewalls nowadays send an icmp echo reply back,
without even asking the addressed machine ...
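
Since ping says so little, a probe at the application level is more telling. The sketch below is one possible way to do that, under assumptions: it opens a TCP connection and waits for the service to send its first bytes (an SSH daemon normally sends a banner right after connect). The function name and the four result strings are invented for illustration.

```python
import socket


def probe_service(host, port, timeout=5.0):
    """Classify a TCP service as answering, silent, closing, or unreachable."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, unreachable, or the handshake itself timed out.
        return "no-connect"
    with conn:
        try:
            # The timeout set by create_connection also applies to recv.
            data = conn.recv(256)
        except OSError:
            # The kernel accepted the connection, but the daemon never wrote.
            return "connected-but-silent"
    return "banner" if data else "closed"
```

A result of "connected-but-silent" on port 22 would match the description of a machine that does not drop packets yet never answers: the kernel still completes the handshake, but the daemon is not being scheduled or is blocked.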
> The datacenter swapped out the network card, motherboard, and memory last week but I've seen another server hang since.
> I'm stumped. I think the next course of action is to try running the precompiled kernel on this server, but that'll lose me the highmem features.
I would suggest you install some tools that
monitor the system state/resource usage and send
this data immediately to another host, and/or
store it on a dedicated partition/disk ...
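
As a rough illustration of the "send it to another host" idea, here is a small sketch that packs a few /proc files into a JSON datagram and sends it over UDP. The choice of UDP, the file list, and the function names are assumptions made for this example, not an existing tool; UDP is used here only because the sender never blocks waiting for the receiver.

```python
import json
import socket
import time


def build_snapshot(paths=("/proc/loadavg", "/proc/meminfo")):
    """Read the given files into a dict, noting any that cannot be read."""
    snap = {"time": time.time()}
    for path in paths:
        try:
            with open(path) as f:
                snap[path] = f.read()
        except OSError as exc:
            snap[path] = "unreadable: %s" % exc
    return snap


def send_snapshot(snapshot, host, port):
    """Send one snapshot as a single UDP datagram; return the bytes sent."""
    payload = json.dumps(snapshot).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return len(payload)
```

Run from a loop every few seconds, with the receiving host appending each datagram to a file, this leaves a record of the last known state on a machine that is not affected by the hang. The last snapshot received then marks the moment things went wrong.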
> I realize that there are probably waaaay too many variables different between these two servers for the source of the problem to be, but I wonder if anyone has seen anything similar and might suggest a course of action. Do these symptoms sound at all familiar? Trying to solve this sort of problem by experiment is wretched with a server in a datacenter and a problem that isn't reliably reproducible!
> Thanks in advance for any ideas, suggestions for further investigation, or encouragement!
try to narrow down the possibilities by proving
that some or all of my suggestions/assumptions are
wrong.
> Cathy Sarisky
> Sent via the WebMail system at webmail.pioneernet.net