Re: [Vserver] Re: Chasing kernel crashes on SMP

From: Grzegorz Nosek
Date: Mon 28 Nov 2005 - 15:15:03 GMT

2005/11/28, Dennis Roos:
> > > First, it still crashes.
> Is this the come-to-a-grinding-halt kinda crash ? As I am experiencing
> that same kind of problem here and at home, both on uni and multi
> processor machines.

Actually, I'm not sitting at the machine (it's in the server room two
doors down) so I don't know too much about the machine's last moments.
From my point of view it just freezes.

> > >
> > > Second, but now I have an oops trace :)
> I have not been able to get anything from our machines except for a
> black screen.

Have you tried netconsole? A real life-saver :)

> > > Third, it's not AMD64-specific after all (though it seems much more
> > > frequent there)
> I have been guessing the cause of this bug varying from hardware related
> (ide controller/sata controller), cpu/ram, cooling and my latest: I/O.
> We're running Intel only and some machines have this problem and some
> don't, does not matter if it's uni- (HT disabled) or multiprocessor
> hardware, hence we never really suspected a kernel issue, as all
> machines run the same kernel.

Well, a dual Xeon 1.8 (the test box) is (was?) quite brittle, while a
dual Xeon 3.0 is stable for the most part (it suffers a bit from
untraceable HTB problems in rev14 but doesn't oops; I will probably
boot rev19 there soon). So it manifests itself in unpredictable ways.

> > > Fourth, since the last-but-one build (internal rev17) the oopses seem
> > > more frequent.
> > > As I've booted my test box (dual Xeon) with rev17, it found two extra
> > > CPUs (I enabled ACPI in rev14 and it was running rev13 before) and
> > > started crashing quite frequently (sometimes reaching uptime of only a
> > > few minutes). I'm running the box with rev13 now (no ACPI, sees 2 CPUs
> > > only) and it's at least usable (though it probably *will* crash sooner
> > > or later :))
> Could you generate lots of I/O on the vserver partition and check if it
> speeds up the crash, this is what triggers the problem on my test
> machine whether I'm running vservers or not and there is no difference in
> local or nfs mounted storage, although when mounted locally the crashes
> tend to occur more often.

Well, most of the crashes of the test box are accompanied by a 'fsck!'
as my kernel compile freezes halfway, so I'd say that it is I/O
related :)
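
For anyone who wants to generate that kind of disk load without a full
kernel build, here is a minimal sketch in Python. The function name,
file naming scheme and default sizes are all assumptions made up for
illustration, and whether this load actually speeds up the crash is
untested.

```python
import os


def generate_io(directory, num_files=64, file_size=1 << 20, chunk=64 * 1024):
    """Write, fsync and delete num_files files of file_size bytes each.

    Returns the total number of bytes written. All names and defaults
    here are hypothetical; tune them to the box under test.
    """
    block = os.urandom(chunk)  # incompressible data, reused for every write
    written = 0
    for i in range(num_files):
        path = os.path.join(directory, "ioload-%04d.tmp" % i)
        with open(path, "wb") as f:
            remaining = file_size
            while remaining > 0:
                n = min(chunk, remaining)
                f.write(block[:n])
                remaining -= n
                written += n
            f.flush()
            os.fsync(f.fileno())  # force the data out to the disk
        os.remove(path)  # clean up so the partition does not fill
    return written
```

Pointing `directory` at the vserver partition and calling the function
in a loop should keep the disk busy; combine it with whatever else is
suspected to trigger the freeze.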

Actually, kernel builds seem to be the most problematic. On the rev17
kernel (the one crashing most often), my data points are:

- kernel build with local gcc-3.4 -j8 large cpu ratio
  (fill-rate/interval ~ num_cpus) - crash
  not a solid rule but it tends to encourage crashes

- kernel build with local gcc-3.4 -j8 small cpu ratio
  (fill-rate/interval << num_cpus) - ok
  the only way I have managed to build the new rev19 kernels was to
  build p4smp.19 with low cpu ratio (took ages), reboot with rev19 and
  build the rest (with high cpu ratio so it was quite snappy)

- php4 build with ccache distcc large cpu ratio - ok
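
To make the "cpu ratio" shorthand above concrete: read literally, it is
the token bucket's fill-rate divided by its interval, compared against
the number of CPUs. A minimal Python sketch under that reading follows.
The function names and the 1/4 cut-off between "small" and "large" are
illustrative assumptions, not values taken from the vserver scheduler.

```python
from fractions import Fraction


def cpu_ratio(fill_rate, interval):
    """Return fill_rate/interval as an exact fraction."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return Fraction(fill_rate, interval)


def classify_ratio(fill_rate, interval, num_cpus, cutoff=Fraction(1, 4)):
    """Label a ratio "large" or "small" relative to num_cpus.

    The cutoff is a hypothetical threshold chosen for this example:
    ratios at or above cutoff * num_cpus count as "large".
    """
    share = cpu_ratio(fill_rate, interval) / num_cpus
    return "large" if share >= cutoff else "small"
```

For example, `classify_ratio(8, 2, 4)` describes a context whose ratio
equals the CPU count (the crashing case above), while
`classify_ratio(1, 16, 4)` describes a heavily throttled one.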

Apparently it's caused by lots of disk I/O while running gcc locally,
because I didn't have a noticeable number of freezes while compiling
anything else (I compile userspace via ccache distcc with -j100 or

OTOH, I also had random crashes out of the blue in the middle of the
night while apparently doing nothing.

However, I can't see the reason why disk I/O would encourage this race
condition apart from plain and simple load increase.

> > > The crash occurs in fs/proc/array.c:do_task_stat(), triggered by
> > > pidof. It is clearly a NULL pointer dereference. I have attached an
> > > oops from the amd64smp.17 kernel and a dump of do_task_stat assembly
> > > code from amd64smp.18 (these two builds only differ in Fusion MPT SCSI
> > > support so this file should be identical) with the oopsing instruction
> > > marked.
> > >
> > > The p4 kernel crashes in the very same assembly instruction.
> > >
> > > I'm off to relate the assembly to the kernel source. I'll report as
> > > soon as I find something but I wanted to share this with vserver-gurus
> > > (it'll probably be easier to spot the mistake for you).
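
As far as I know, pidof works by walking /proc and reading each
/proc/<pid>/stat, and that read is what ends up in do_task_stat(). A
rough Python stand-in for that access pattern is sketched below, e.g.
to run in a loop next to a kernel build. The function names are
hypothetical, the code only reads, and whether it actually triggers
the oops is untested.

```python
import os


def parse_stat_line(line):
    """Parse the start of a /proc/<pid>/stat line.

    Returns (pid, comm, state, ppid). comm may itself contain spaces
    and parentheses, so it is delimited by the first "(" and last ")".
    """
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen])
    comm = line[lparen + 1:rparen]
    rest = line[rparen + 1:].split()
    return pid, comm, rest[0], int(rest[1])


def scan_proc(proc_root="/proc"):
    """Read stat for every numeric entry under proc_root, pidof-style."""
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                results.append(parse_stat_line(f.read()))
        except (OSError, ValueError, IndexError):
            # The process exited mid-scan or the line was malformed.
            continue
    return results


def hammer(iterations, proc_root="/proc"):
    """Scan proc_root repeatedly; return the total number of stat reads."""
    total = 0
    for _ in range(iterations):
        total += len(scan_proc(proc_root))
    return total
```

Running `hammer()` with a large iteration count while processes are
being created and destroyed quickly (a -j8 build, say) is the kind of
load that should exercise the suspected race.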
> If I can be of assistance in tracking this down, I am on irc (bware),
> although I am asleep/out-of-office when the gurus are awake ;)

I'm going home soon but I might be available on irc tomorrow (nick
blackfire). Feel free to /msg me :)

Best regards,
 Grzegorz Nosek
Received on Mon Nov 28 15:15:30 2005