Re: [vserver] opteron server dies with vserver patch.

From: Paweł Sikora <pluto_at_pld-linux.org>
Date: Mon 08 Aug 2011 - 20:45:38 BST
Message-Id: <201108082145.38860.pluto@pld-linux.org>

On Monday 08 of August 2011 20:29:37 Herbert Poetzl wrote:

> could you upload those sections (without line break) and
> also mark the <xx> bytes in the dumps, I'm having a hard
> time to determine them from the screenshot ...
> (seems to be cut off on the right side)

i'll try to catch these stack traces again tomorrow
and adjust the monitor so the <xx> markers are readable...

> btw, how long does it take till those traces show up

it varies... from a few minutes to a few hours.

> and in what way do they affect the system? (total crash,
> single cpu blocked, zombie, nothing)

afaics this partial lock occurs mostly under heavy parallel i/o combined with high cpu load.

most common scenario:
- 16 farm slaves running on 16 cpu cores clean up their local working directories on /dev/md1 (raid-0, 4 disks).
- 16 farm slaves decompress a .7z installer from a common nfs share and install it into a local dir.
- after this i/o peak the farm slaves run at 100% cpu utilization with minimal i/o (storing results).
- lock (maybe related to i/o sync? the machine has 64GB of ram and caches i/o in buffers until the slaves' cpu processing phase).
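
in case someone wants to try reproducing the load pattern above, here is a rough python sketch. it is not the real farm code: the names slave() and run_farm(), the file counts, the file sizes and the sha256 loop are all made up for illustration, and plain file writes stand in for unpacking the .7z installer from nfs. i have no idea whether this is enough to trigger the lock.

```python
import hashlib
import multiprocessing
import os
import shutil


def slave(args):
    """One hypothetical farm slave: cleanup, write burst, then cpu-only work."""
    workdir, n_files, file_size, cpu_rounds = args

    # phase 1: clean up the local working directory
    shutil.rmtree(workdir, ignore_errors=True)
    os.makedirs(workdir)

    # phase 2: stand-in for installing the unpacked installer,
    # i.e. a burst of buffered writes into the local dir
    chunk = os.urandom(file_size)
    for i in range(n_files):
        with open(os.path.join(workdir, "file%04d.bin" % i), "wb") as f:
            f.write(chunk)

    # phase 3: cpu-bound work with almost no i/o
    digest = chunk
    for _ in range(cpu_rounds):
        digest = hashlib.sha256(digest).digest()

    # report how many files ended up in the working directory
    return len(os.listdir(workdir))


def run_farm(base_dir, n_slaves=16, n_files=32, file_size=256 * 1024,
             cpu_rounds=200000):
    """Run n_slaves workers in parallel, one working directory each."""
    jobs = [(os.path.join(base_dir, "slave%02d" % i),
             n_files, file_size, cpu_rounds)
            for i in range(n_slaves)]
    with multiprocessing.Pool(n_slaves) as pool:
        return pool.map(slave, jobs)
```

to use it, call run_farm() with a directory on the raid-0 array from under an `if __name__ == "__main__":` guard, and raise n_files, file_size and cpu_rounds until the write burst and the cpu phase look like the real load.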

after lock:
- ssh is out, but the machine still responds to ping.
- the ipmi console (serial-via-bios-over-lan) is out.
- (weird!) sysrq on the real console handles (only?) the reboot sequence (e.g. cannot trace/terminate tasks).
- hdd leds aren't blinking and the cpus stop processing (fans stop blowing hot air out of the rack).

so it probably gets stuck in some kind of i/o deadlock.
Received on Mon Aug 8 20:45:58 2011
