From jonathan.sambrook@dsvr.co.uk Tue Mar 18 02:50:34 2003
Date: Mon, 17 Mar 2003 22:43:02 +0000
From: Jonathan Sambrook
To: vserver@solucorp.qc.ca
Cc: Herbert Poetzl, Sam Vilain, Eje Gustafsson, Dinesh Mistry, John Goerzen, Paul Sladen, Luis Miguel Silva, Jacques Gelinas, Warren Togami, Francois Deppierraz, Lyashkov Alexey
Subject: ctx16 sched.c BUG fix - if decoded oops fingers semaphore related activity

Hi all,

the list being down and IMO this being important enough, I've cc'd y'all - hope no-one's too offended?

Anyhow, as the subject line intimates, if the symptom of your kernel panic is semaphore related (ksymoops is your friend), the attached patch(es) could/should sort you out.

Note that the sched.c BUG is a symptom with a variety of possible causes, so there may well be more than one issue here. Without more widespread use of ksymoops...

sys_assign_ip_info() and sys_release_ip_info() can get called from softirq functions, thus (sometimes?) leading to oopses, because you can't use semaphores in code that runs at IRQ level: taking a semaphore may sleep, and sleeping isn't allowed there.

sched-oopsfix.patch replaces the semaphore with a spinlock, which works and is the correct solution if ip_info continues to need messing with at IRQ level. It might not be the optimum solution, since spinlocks do spin somewhat :) but since the code they're protecting is small and fast, they're probably the way to go? If not, there's always ctx17 and a refit.

sched-oopsfix-safer.patch also replaces the semaphore with a spinlock, but adds some protection against the (remote) possibility of ip_info->refcount getting out of kilter (munged) in multi-threaded applications. N.B. Jacques didn't think this was possible, but I've printk'd refcount and seen it get above 5400 - before sched.c BUG oopsed on me. I've not seen it skyrocket again, but since it's a timing-based problem it will be inherently intermittent. IIRC it's unlikely to be a problem with decrementing, so a small memory leak is the most common effect.
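For anyone who wants to see the idea without opening the patches, here is a minimal userspace sketch of a refcount guarded by a busy-wait lock. It is NOT the attached patch and not the ctx16 code: struct ip_info_sketch and the ip_info_sketch_*() functions are hypothetical names made up for illustration, and a C11 atomic_flag stands in for the kernel spinlock. In the kernel itself you'd presumably use a spinlock_t with spin_lock_bh() or spin_lock_irqsave(), depending on which contexts touch the data. The refusal to count past INT_MAX is likewise only an assumption about one way to stop a leaked count from overflowing.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for ip_info; only the refcount matters here. */
struct ip_info_sketch {
    atomic_flag lock;  /* busy-wait lock: never sleeps, unlike a semaphore */
    int refcount;      /* only read or written while lock is held */
};

static void sketch_lock(struct ip_info_sketch *info)
{
    /* Spin until the previous value of the flag was "clear". */
    while (atomic_flag_test_and_set_explicit(&info->lock, memory_order_acquire))
        ;
}

static void sketch_unlock(struct ip_info_sketch *info)
{
    atomic_flag_clear_explicit(&info->lock, memory_order_release);
}

/* Allocate an object holding one reference; returns NULL on failure. */
struct ip_info_sketch *ip_info_sketch_new(void)
{
    struct ip_info_sketch *info = malloc(sizeof *info);
    if (info == NULL)
        return NULL;
    atomic_flag_clear(&info->lock);
    info->refcount = 1;
    return info;
}

/* Take a reference. Returns false instead of letting the count overflow. */
bool ip_info_sketch_get(struct ip_info_sketch *info)
{
    bool ok = false;

    sketch_lock(info);
    if (info->refcount > 0 && info->refcount < INT_MAX) {
        info->refcount++;
        ok = true;
    }
    sketch_unlock(info);
    return ok;
}

/* Drop a reference. Returns true if this call freed the object. */
bool ip_info_sketch_put(struct ip_info_sketch *info)
{
    bool last;

    sketch_lock(info);
    assert(info->refcount > 0);  /* catch underflow in debug builds */
    last = (--info->refcount == 0);
    sketch_unlock(info);

    if (last)
        free(info);  /* freed outside the lock; nobody else holds a reference */
    return last;
}
```

The point of the sketch is only that the increment, the decrement and the "was that the last one?" test all happen under the same lock, and that nothing inside the locked region can sleep.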
In an extreme case refcount could _eventually_ wrap, leading to vfree being called, followed by an invalid pointer access...

Comments please...

Jonathan

-- 
Jonathan Sambrook
Software Developer
Designer Servers

[ Attachments not shown: two Text/PLAIN parts (48 and 63 lines) and an Application/PGP-SIGNATURE part (196 bytes). ]