
From: Paul Sladen (vserver_at_paul.sladen.org)
Date: Thu 07 Nov 2002 - 13:52:01 GMT


On Thu, 7 Nov 2002, Herbert Poetzl wrote:
> On Wed, Nov 06, 2002 at 04:15:51PM -0800, Cathy Sarisky wrote:
> > it becomes ssh-able again for a few moments during the shutdown process.
> I *really* wish this server wasn't in a data center across the country!

You're not the only one. My other problem is that I don't own any other
fast Intel stuff, other than my colo'ed stuff, and I haven't got the
wherewithal to run duplicate development setups here. (Or even any
development setups at all...)

> let us assume there is such a process running ...

Low-level kernel is there fine (except for the @#$%-ing watchdog...)
It's more a case of processes /not/ running (see below).

> - file/inode maximum setting

I've had this once on the second box. The symptoms when this happened were
very different in that existing connections to the box quite happily stayed
up until you did something that required exec()ing a new process or opening
a file-handle (such as running an external command from bash).
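
A minimal sketch of how that condition could be watched for, assuming a
Linux box where /proc/sys/fs/file-nr holds three numbers (allocated
handles, free handles, system-wide maximum). The function names are made
up for illustration; this is not something either box was actually running.

```python
import os

FILE_NR = "/proc/sys/fs/file-nr"  # Linux-specific path (assumption)


def parse_file_nr(text):
    """Parse the 'allocated free max' triple found in /proc/sys/fs/file-nr."""
    allocated, free, maximum = (int(field) for field in text.split()[:3])
    return allocated, free, maximum


def handles_in_use_fraction(text):
    """Fraction of the system-wide file-handle limit currently in use."""
    allocated, free, maximum = parse_file_nr(text)
    return (allocated - free) / maximum


# Only touch /proc when run directly on a machine that has the file.
if __name__ == "__main__" and os.path.exists(FILE_NR):
    with open(FILE_NR) as fh:
        fraction = handles_in_use_fraction(fh.read())
    print("file handles in use: {:.1%}".format(fraction))
```

A value creeping towards 100% shortly before a hang would point at this
cause; a flat, low value would help rule it out.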

> - maximum process time (maybe log/ssh/etc gets killed?)

/me tries to remember what happens if you telnet to a port that wasn't bound
whilst the machine was running.
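
For reference, a live host with nothing listening on the port normally
answers a TCP connect with a reset ("Connection refused"), whereas a dead
host (or a firewall that silently drops packets) just times out. A rough
sketch of telling the two apart from outside; the function names are
hypothetical and the five-second timeout is an arbitrary assumption:

```python
import socket


def classify_connect_error(exc):
    """Map a failed connect() to 'refused', 'timeout' or 'error'."""
    if isinstance(exc, ConnectionRefusedError):
        return "refused"   # kernel answered, but no process holds the port
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"   # nothing answered at all
    return "error"         # unreachable, DNS failure, and so on


def probe(host, port, timeout=5.0):
    """Try a TCP connect and report 'open' or the classified failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except OSError as exc:
        return classify_connect_error(exc)
```

A "refused" on port 22 while pings still come back would fit the picture
of a kernel that is alive with the daemons gone.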

> - how much time is spent in system state
> - how many processes are there, and how long is the oldest process running?
> - virtual memory limitations

Possible. The main problem is not being able to see the state at which it
did die. The best you can get from monitoring after-it-happened is the
stack-trace/register-dump from the console. I suppose I should really
follow these up and see if there is a common path.
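
One way to look for a common path, sketched under the assumption that each
console trace has already been reduced by hand to a list of function
names. The traces in the usage note are made-up example data, not real
dumps from these machines.

```python
def common_frames(traces):
    """Return frames present in every trace, in the order of the first trace."""
    if not traces:
        return []
    shared = set(traces[0]).intersection(*(set(t) for t in traces[1:]))
    seen = set()
    result = []
    for frame in traces[0]:
        # Keep each shared frame once, even if it recurs in the first trace.
        if frame in shared and frame not in seen:
            seen.add(frame)
            result.append(frame)
    return result
```

With three illustrative traces such as
["schedule", "schedule_timeout", "do_select", "sys_select", "system_call"],
["schedule", "pipe_wait", "pipe_read", "sys_read", "system_call"] and
["schedule", "schedule_timeout", "tcp_data_wait", "tcp_recvmsg", "system_call"],
only "schedule" and "system_call" survive, which would say little; a
less generic shared frame would be worth chasing.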

> - what is the last entry in the log?
> - what happened at this moment on other log files?

Generally the last `--MARK--'.

> - what about power management? ACPI/APM/SpeedStep

Disabled.

> - what about I/O errors on the harddisk?

Possible; and of the hardware-related possibilities, the most likely.

> I also would not draw many conclusions from the ping-ability of the
> server, because the icmp echo reply is at such a low (kernel stack)

Yup.

> Many firewalls nowadays send an icmp echo reply back, without even
> asking the addressed machine ...

Yup, these are coming from the machines.

> > I'm stumped.

It's personally embarrassing too! :-) Everywhere else I have 180day+
uptimes and don't need to keep ringing people up on a Saturday morning
with:

  ``Um, ...''
  ``*Again?* Certainly, I'll try to pop down later.''

Especially when the box in question sits between two W2K boxes that have
ridiculously high uptimes for a Windows box.

> > I think the next course of action is to try running the
> > precompiled kernel on this server, but that'll lose me the highmem
> > features.

I've so far had it happen with every kernel/ctx-patch combo I've tried and
I've stopped doing any more commercial stuff because I don't have the
confidence to go on holiday and know that nothing will have happened.

> I would suggest, you install some tools, which monitor the system

It seems very random. On several of the occasions I (or somebody else) have
been sitting there in an SSH session when it has gone down (possibly
without even doing anything). On the occasions when I've had pretty good
post-mortem data it hasn't shown /anything/ unusual.
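
A rough sketch of the kind of monitoring being suggested, assuming Linux
/proc paths and a log file that is synced to disk after every sample so
that the last state before a hang survives. The function names and the
line format are invented for illustration.

```python
import os
import time


def read_first_line(path):
    """Return the first line of a file, or '?' if it cannot be read."""
    try:
        with open(path) as fh:
            return fh.readline().strip()
    except OSError:
        return "?"


def count_processes(proc_dir="/proc"):
    """Count numeric entries in /proc (one per process), or -1 on failure."""
    try:
        return sum(1 for name in os.listdir(proc_dir) if name.isdigit())
    except OSError:
        return -1


def snapshot_line(now=None):
    """Build one log line: UTC timestamp, load average, process count, file-nr."""
    now = time.time() if now is None else now
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now))
    return "{} load=[{}] procs={} file-nr=[{}]".format(
        stamp,
        read_first_line("/proc/loadavg"),
        count_processes(),
        read_first_line("/proc/sys/fs/file-nr"),
    )


def append_snapshot(log_path, now=None):
    """Append one snapshot and force it to disk before returning."""
    line = snapshot_line(now)
    with open(log_path, "a") as log:
        log.write(line + "\n")
        log.flush()
        os.fsync(log.fileno())  # do not leave the sample in the page cache
    return line
```

Calling append_snapshot() once a minute from cron would leave a trail
between the `--MARK--' lines; if the box dies because it cannot start new
processes, the point at which the trail stops is itself a data point.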
 
> try to narrow down the possibilities by proving that some or all of my
> suggestions/assumptions are wrong ...

Many thanks for your checklist, it has helped me collect my thoughts.

So far, I'm thinking:

  Scheduler
  IP-stack (socket/file-handle count related)
  Virtual Memory shortage/corruption

The first two of which we are doing hacks to.

        -Paul

-- 
Nottingham, GB


Generated on Thu 07 Nov 2002 - 15:09:32 GMT by hypermail 2.1.3