Re: [Vserver] Re: Chasing kernel crashes on SMP

From: Dennis Roos
Date: Mon 28 Nov 2005 - 14:24:50 GMT

On Mon, 2005-11-28 at 12:02 +0100, Grzegorz Nosek wrote:
> 2005/11/28, Grzegorz Nosek <>:
> > Hello
> >
> > I'd like to report on my findings in my continuing crusade to find the
> > cause of AMD64 kernel crashes.
> >
> > First, it still crashes.
Is this the come-to-a-grinding-halt kind of crash? I am experiencing
that same kind of problem here and at home, on both uni- and
multiprocessor machines.

> >
> > Second, but now I have an oops trace :)
I have not been able to get anything from our machines except for a
black screen.

> > Third, it's not AMD64-specific after all (though it seems much more
> > frequent there)
My guesses at the cause of this bug have ranged from hardware (IDE/SATA
controller, CPU/RAM, cooling) to my latest: I/O.

We're running Intel only. Some machines have this problem and some
don't, regardless of whether the hardware is uni- (HT disabled) or
multiprocessor. Since all machines run the same kernel, we never really
suspected a kernel issue.

> > Fourth, since the last-but-one build (internal rev17) the oopses seem
> > more frequent.

> > As I've booted my test box (dual Xeon) with rev17, it found two extra
> > CPUs (I enabled ACPI in rev14 and it was running rev13 before) and
> > started crashing quite frequently (sometimes reaching uptime of only a
> > few minutes). I'm running the box with rev13 now (no ACPI, sees 2 CPUs
> > only) and it's at least usable (though it probably *will* crash sooner
> > or later :))
Could you generate lots of I/O on the vserver partition and check
whether it speeds up the crash? That is what triggers the problem on my
test machine, whether I'm running vservers or not. It happens with both
local and NFS-mounted storage, although the crashes tend to occur more
often when the storage is mounted locally.
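
For anyone who wants to try this, here is a minimal sketch of an I/O
load generator in Python. It is only an illustration, not the tool used
on the machines described above: the function name hammer_io and its
parameters are made up for this example, and whether this particular
access pattern is enough to provoke the crash is untested.

```python
import os
import tempfile


def hammer_io(directory, rounds=4, file_size=1 << 20, chunk=64 * 1024):
    """Write, fsync, read back and delete a scratch file `rounds` times.

    Hypothetical helper for illustration. Returns the total number of
    bytes written plus bytes read back.
    """
    total = 0
    payload = os.urandom(chunk)
    for _ in range(rounds):
        # Scratch file is created inside the directory under test.
        fd, path = tempfile.mkstemp(dir=directory, prefix="iostress-")
        try:
            with os.fdopen(fd, "wb") as f:
                written = 0
                while written < file_size:
                    f.write(payload)
                    written += len(payload)
                f.flush()
                os.fsync(f.fileno())  # push the data out to the device
            total += written
            # The read-back may be served from the page cache, so the
            # writes are what mostly reach the disk here.
            with open(path, "rb") as f:
                while True:
                    data = f.read(chunk)
                    if not data:
                        break
                    total += len(data)
        finally:
            os.unlink(path)
    return total
```

Calling hammer_io() with the mount point of the vserver partition and a
large rounds value, from several processes at once, should keep the
disk busy while you watch the uptime.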

> > The crash occurs in fs/proc/array.c:do_task_stat(), triggered by
> > pidof. It is clearly a NULL pointer dereference. I have attached an
> > oops from the amd64smp.17 kernel and a dump of do_task_stat assembly
> > code from amd64smp.18 (these two builds only differ in Fusion MPT SCSI
> > support so this file should be identical) with the oopsing instruction
> > marked.
> >
> > The p4 kernel crashes in the very same assembly instruction.
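
Since do_task_stat() is the function that produces /proc/<pid>/stat,
reading that file for every process in a loop exercises the same code
path pidof is reported to hit. Below is a rough Python sketch of such a
loop. The names read_all_stats and parse_comm are invented for this
example, and whether it reproduces the oops any faster than pidof is an
assumption, not something that has been verified.

```python
import os


def parse_comm(stat_line):
    """Return the command name from one /proc/<pid>/stat line.

    The name sits in parentheses and may itself contain spaces or ')',
    so take everything between the first '(' and the last ')'.
    """
    start = stat_line.find("(")
    end = stat_line.rfind(")")
    if start == -1 or end < start:
        return None
    return stat_line[start + 1:end]


def read_all_stats(proc_root="/proc"):
    """Read <proc_root>/<pid>/stat for each numeric entry.

    Hypothetical helper for illustration. Returns {pid: comm}.
    """
    result = {}
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return result
    for name in entries:
        if not (name.isascii() and name.isdigit()):
            continue  # not a process directory
        path = os.path.join(proc_root, name, "stat")
        try:
            with open(path, errors="replace") as f:
                line = f.read()
        except OSError:
            continue  # process exited between listdir() and open()
        comm = parse_comm(line)
        if comm is not None:
            result[int(name)] = comm
    return result
```

Running read_all_stats() in a tight loop from a few processes, while
other tasks are being started and stopped, would be one way to stress
this path on a test box.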
> >
> > I'm off to relate the assembly to the kernel source. I'll report as
> > soon as I find something but I wanted to share this with vserver-gurus
> > (it'll probably be easier to spot the mistake for you).
If I can be of assistance in tracking this down, I am on IRC (bware),
although I am asleep/out-of-office when the gurus are awake ;)

Dennis Roos
Network Engineer @ InTouch N.V.
Middenweg 76
1097 BS Amsterdam
Tel: +31 (0)20 6752060
Fax: +31 (0)20 6758429
-=[Assumption is the mother of all f*ckups]=-
Vserver mailing list
Received on Mon Nov 28 14:25:15 2005