
From: Yann Dupont (Yann.Dupont_at_univ-nantes.fr)
Date: Tue 05 Oct 2004 - 13:16:07 BST


Hello there. I'm seeing something strange here.
I'm not sure it's vserver related at all. It's probably just 2.6 related, but
before going to lkml I'd like to know whether anyone else is seeing
this kind of message:

I have one machine (dual Xeon, 2 GB RAM + QLogic FC & SAN) with 8
vservers on it.
Each vserver uses a dedicated EVMS volume on the SAN.
One of these vservers is very busy (an rsyncd master, with 100+
servers syncing against it every hour).
This vserver uses some large partitions (300 GB+, holding zillions of
files).

This was working fine with a 2.4 kernel.

I switched the host from 2.4 to 2.6 and started getting these
messages:

TCP: Treason uncloaked! Peer 172.20.12.49:37066/873 shrinks window
723200794:723201746. Repaired.
TCP: Treason uncloaked! Peer 172.20.12.49:37066/873 shrinks window
723200794:723201746. Repaired.
TCP: Treason uncloaked! Peer 192.168.100.17:53343/873 shrinks window
2029005703:2029007151. Repaired.
TCP: Treason uncloaked! Peer 192.168.100.17:53343/873 shrinks window
2029017287:2029018735. Repaired.
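
For anyone wanting to tally these messages per peer, here is a minimal Python sketch. It assumes the fields are peer address, peer port, local port, and then two sequence numbers bounding the data still in flight (snd_una:snd_nxt, as far as I know from the 2.6 TCP timer code). The name parse_treason and the dict keys are made up for illustration.

```python
import re
from collections import Counter

# \s+ lets the pattern match even when the mail client wrapped the
# message after "shrinks window".
TREASON_RE = re.compile(
    r"TCP: Treason uncloaked! Peer (\d+\.\d+\.\d+\.\d+):(\d+)/(\d+)\s+"
    r"shrinks\s+window\s+(\d+):(\d+)\.\s+Repaired\."
)

def parse_treason(text):
    """Return one dict per 'Treason uncloaked' message found in text."""
    events = []
    for m in TREASON_RE.finditer(text):
        peer, peer_port, local_port, start, end = m.groups()
        events.append({
            "peer": peer,
            "peer_port": int(peer_port),
            "local_port": int(local_port),  # assumed: the port on this host
            # Sequence numbers are 32-bit and may wrap, hence the modulo.
            "outstanding": (int(end) - int(start)) % 2**32,
        })
    return events

SAMPLE = """\
TCP: Treason uncloaked! Peer 172.20.12.49:37066/873 shrinks window
723200794:723201746. Repaired.
TCP: Treason uncloaked! Peer 172.20.12.49:37066/873 shrinks window
723200794:723201746. Repaired.
TCP: Treason uncloaked! Peer 192.168.100.17:53343/873 shrinks window
2029005703:2029007151. Repaired.
TCP: Treason uncloaked! Peer 192.168.100.17:53343/873 shrinks window
2029017287:2029018735. Repaired.
"""

events = parse_treason(SAMPLE)
per_peer = Counter(e["peer"] for e in events)
for peer, count in per_peer.items():
    print(peer, count)
```

If the field order assumed above is right, every message here involves local port 873 (rsync), which would fit the busy rsyncd vserver.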

Those IPs (the clients) are other vservers running a 2.4.27 kernel... The only
explanation I found is a broken TCP/IP stack on the client side,
which does not seem to be the case here...

More harmful:

swapper: page allocation failure. order:0, mode:0x20
 [<c013a545>] __alloc_pages+0x1ab/0x317
 [<c013a6c9>] __get_free_pages+0x18/0x24
 [<c013d529>] kmem_getpages+0x1a/0xbe
 [<c013e108>] cache_grow+0x9e/0x127
 [<c013e304>] cache_alloc_refill+0x173/0x218
 [<c013e710>] __kmalloc+0x7c/0x83
 [<c030f574>] alloc_skb+0x32/0xc3
 [<c0286c02>] e1000_alloc_rx_buffers+0x3b/0xd5
 [<c028690d>] e1000_clean_rx_irq+0x192/0x44c
 [<c02948c4>] scsi_io_completion+0x135/0x3ee
 [<c02864f1>] e1000_clean+0x3e/0xb3
 [<c0314bc5>] net_rx_action+0x70/0xef
 [<c011d078>] __do_softirq+0xb4/0xc3
 [<c011d0b4>] do_softirq+0x2d/0x2f
 [<c0106633>] do_IRQ+0x105/0x11e
 [<c0104768>] common_interrupt+0x18/0x20
 [<c0101f7a>] default_idle+0x0/0x2c
 [<c0101fa3>] default_idle+0x29/0x2c
 [<c010200c>] cpu_idle+0x33/0x3c
 [<c049a7d0>] start_kernel+0x15b/0x176
 [<c049a303>] unknown_bootoption+0x0/0x144
rsync: page allocation failure. order:0, mode:0x20
 [<c013a545>] __alloc_pages+0x1ab/0x317
 [<c011565c>] __wake_up+0x38/0x4e
 [<c013a6c9>] __get_free_pages+0x18/0x24
 [<c013d529>] kmem_getpages+0x1a/0xbe
 [<c013e108>] cache_grow+0x9e/0x127
 [<c013e304>] cache_alloc_refill+0x173/0x218
 [<c013e710>] __kmalloc+0x7c/0x83
 [<c030f574>] alloc_skb+0x32/0xc3
 [<c0286c02>] e1000_alloc_rx_buffers+0x3b/0xd5
 [<c028690d>] e1000_clean_rx_irq+0x192/0x44c
 [<c013a740>] __pagevec_free+0x17/0x1f
 [<c02864f1>] e1000_clean+0x3e/0xb3
 [<c0314bc5>] net_rx_action+0x70/0xef
 [<c011d078>] __do_softirq+0xb4/0xc3
 [<c011d0b4>] do_softirq+0x2d/0x2f
 [<c0106633>] do_IRQ+0x105/0x11e
 [<c0104768>] common_interrupt+0x18/0x20
 [<c011007b>] unknown_nmi_panic_callback+0x38/0x47
 [<c01408f3>] shrink_cache+0x109/0x388
 [<c012047d>] del_timer_sync+0x7d/0xb5
 [<c01204ca>] del_singleshot_timer_sync+0x15/0x23
 [<c0365d22>] schedule_timeout+0x6f/0xbb
 [<c0141105>] shrink_zone+0xa9/0xc0
 [<c0141170>] shrink_caches+0x54/0x56
 [<c0141229>] try_to_free_pages+0xb7/0x17f
 [<c013a58e>] __alloc_pages+0x1f4/0x317
 [<c030c073>] sock_aio_read+0xe2/0x13e
 [<c013a6c9>] __get_free_pages+0x18/0x24
 [<c0162e29>] __pollwait+0x80/0xc1
 [<c032ea66>] tcp_poll+0x1a/0x152
 [<c030c6d9>] sock_poll+0x12/0x14
 [<c01631a0>] do_select+0x25d/0x2b9
 [<c0162da9>] __pollwait+0x0/0xc1
 [<c01634af>] sys_select+0x29e/0x498
 [<c011c7da>] sys_time+0x16/0x50
 [<c0103d83>] syscall_call+0x7/0xb
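
To compare many such traces, a small Python sketch like the following could pull out the process name, order, mode and call chain of each failure. The two flag values are an assumption based on what I remember of include/linux/gfp.h in 2.6-era kernels (0x10 meaning the allocation may sleep, 0x20 meaning it may use the emergency reserves), so check them against your own tree. The function name parse_failures is made up for illustration.

```python
import re

HEADER_RE = re.compile(
    r"^(\S+): page allocation failure\. order:(\d+), mode:(0x[0-9a-fA-F]+)"
)
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")

# Assumed values for 2.6-era kernels; verify in include/linux/gfp.h.
GFP_WAIT = 0x10  # allocation is allowed to sleep and reclaim memory
GFP_HIGH = 0x20  # allocation may dip into the emergency reserves

def parse_failures(log):
    """Group trace frames under the failure header that precedes them."""
    failures = []
    for line in log.splitlines():
        header = HEADER_RE.match(line.strip())
        if header:
            comm, order, mode = header.groups()
            mode = int(mode, 16)
            failures.append({
                "comm": comm,
                "order": int(order),
                "mode": mode,
                # Without the wait bit the allocator cannot sleep, so it
                # fails as soon as no free page is available.
                "atomic": not (mode & GFP_WAIT),
                "frames": [],
            })
            continue
        frame = FRAME_RE.search(line)
        if frame and failures:
            failures[-1]["frames"].append(frame.group(1))
    return failures

SAMPLE = """\
swapper: page allocation failure. order:0, mode:0x20
 [<c013a545>] __alloc_pages+0x1ab/0x317
 [<c030f574>] alloc_skb+0x32/0xc3
 [<c0286c02>] e1000_alloc_rx_buffers+0x3b/0xd5
 [<c0314bc5>] net_rx_action+0x70/0xef
"""

failures = parse_failures(SAMPLE)
for f in failures:
    in_rx_path = "e1000_alloc_rx_buffers" in f["frames"]
    print(f["comm"], f["order"], hex(f["mode"]), f["atomic"], in_rx_path)
```

Under that reading of the flags, mode:0x20 is an atomic allocation made from the network receive softirq, which would explain why it cannot wait for memory to be reclaimed.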

This was with 2.6.9-rc2 + VS for it (2.6.9-rc2-vs1.9.2.28.4)

All this seems eepro1000 related, but I'm not sure. I saw that others have had
similar problems with eepro1000,
and that doing echo 2048 > /proc/sys/vm/min_free_kbytes seems to reduce those
problems. This is what I've done.
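
The same change can be scripted so that the reserve is only ever raised, never lowered. This is a minimal Python sketch, not a recommendation of a particular value: 2048 is simply the figure quoted above, and the helper names read_min_free and raise_min_free are made up for illustration. Writing to the real file needs root.

```python
from pathlib import Path

MIN_FREE = Path("/proc/sys/vm/min_free_kbytes")

def read_min_free(path=MIN_FREE):
    """Return the current reserve in kilobytes."""
    return int(Path(path).read_text().strip())

def raise_min_free(target_kb, path=MIN_FREE):
    """Raise the reserve to target_kb if it is currently lower.

    Returns the value in effect afterwards. A value that is already
    higher is left alone.
    """
    current = read_min_free(path)
    if current < target_kb:
        Path(path).write_text(f"{target_kb}\n")
        return target_kb
    return current
```

Calling raise_min_free(2048) as root has the same effect as the echo above. The setting is lost on reboot; to keep it, a line such as vm.min_free_kbytes = 2048 in /etc/sysctl.conf should do.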

This morning the server had crashed (after 14 days of uptime). I didn't
get a chance to see the oops.

So I compiled another kernel with all the bleeding-edge bits, to see whether
that changes anything.
This time:
2.6.9-rc3-bk4 + vs1.9.3-rc2
the device mapper has all the latest patches,
the eepro1000 driver has been updated to 5.4.11-NAPI (directly from the Intel page),
the qlogic driver has been updated to 8.00.00b21-k

... And the results are the same ...

I have no problems on a non-vserver-patched kernel, but that is on different
hardware. So the question is:
Is there a chance that allocations in the vserver code could affect
this?

Or do you think vserver is totally innocent in this case?

Sincerely yours,

-- 
Yann Dupont, Cri de l'université de Nantes
Tel: 02.51.12.53.91 - Fax: 02.51.12.58.60 - Yann.Dupont_at_univ-nantes.fr

_______________________________________________ Vserver mailing list Vserver_at_list.linux-vserver.org http://list.linux-vserver.org/mailman/listinfo/vserver


Generated on Tue 05 Oct 2004 - 13:33:12 BST by hypermail 2.1.3