Re: [vserver] opteron server dies with vserver patch.

From: Herbert Poetzl <herbert_at_13thfloor.at>
Date: Mon 08 Aug 2011 - 21:13:33 BST
Message-ID: <20110808201333.GS12671@MAIL.13thfloor.at>

On Mon, Aug 08, 2011 at 09:45:38PM +0200, Paweł Sikora wrote:
> On Monday 08 of August 2011 20:29:37 Herbert Poetzl wrote:

>> could you upload those sections (without line break) and
>> also mark the <xx> bytes in the dumps, I'm having a hard
>> time to determine them from the screenshot ...
>> (seems to be cut off on the right side)

> i'll try to catch these stack traces again tomorrow
> and adjust monitor to read <xx> markers...

>> btw, how long does it take till those traces show up

> it varies... from a few minutes to a few hours.

well, that's actually a good thing, i.e. the problem
reproduces quickly enough that we can easily enable
more debugging info/output and analyze the actual problem

>> and in what way do they affect the system? (total crash,
>> single cpu blocked, zombie, nothing)

> afaics this partial lock occurs mostly on heavy parallel i/o
> connected with high cpu load.

> most common scenario:
> - 16 farm slaves running on 16 cpu cores cleanup local working
> directories on /dev/md1 (raid-0, 4 disks).

> - 16 farm slaves decompress from common nfs share a .7z
> installer and install it into local dir.

> - after this i/o peak farm slaves start 100% cpu utilization
> with minimal i/o (storing results).

> - lock (maybe related to i/o sync? machine has 64GB and caches
> i/o in buffers until slave cpu processing).

> after lock:
> - the ssh is out, but machine responds to ping.
> - ipmi console (serial-via-bios-over-lan) is out.
> - (weird!) the sysrq on real console handles (only?)
> reboot sequence (e.g. cannot trace/terminate tasks).

try with something like 'echo 9 >/proc/sysrq-trigger' and
'echo l >/proc/sysrq-trigger' before anything happens
(that should also test that the ipmi console is working)

once the problem arises, use 'echo l ..' and 'echo w ..'
or send the equivalent (magic sysrq l/w) via console
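
the two steps above could be wrapped roughly like this. this is
only a sketch: the function names and the SYSRQ_TRIGGER variable
are made up for illustration (the variable just allows a dry run
against an ordinary file), and writing to the real trigger file
needs root:

```shell
#!/bin/sh
# Sketch only: SYSRQ_TRIGGER and the function names below are
# illustrative assumptions, not part of any existing tool.
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

# Write a single sysrq command character to the trigger file.
sysrq() {
    printf '%s\n' "$1" > "$SYSRQ_TRIGGER"
}

# Run before anything happens: raise the console log level to 9,
# then dump backtraces of all active CPUs. If the output shows up
# on the ipmi console, the console path is known to work.
sysrq_pretest() {
    sysrq 9 && sysrq l
}

# Run once the problem arises: backtraces of all active CPUs (l)
# followed by the list of blocked, uninterruptible tasks (w).
sysrq_on_lockup() {
    sysrq l && sysrq w
}
```

source the file as root, call sysrq_pretest while the machine is
still healthy, and sysrq_on_lockup from any shell that still
responds once the lock shows up. if no shell responds, the
console key sequence is the only option left.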

HTH,
Herbert

> - hdd leds aren't blinking and cpus stop processing
> (fans stop flushing hot air from rack).

> so it probably gets stuck in some kind of i/o deadlock.
Received on Mon Aug 8 21:13:43 2011
