Re: spinlock held too long on reboot

From: Attilio Rao <attilio_at_freebsd.org> Date: Wed, 29 Jul 2009 04:43:36 +0200

2009/5/23 Stefan Bethke <stb_at_lassitu.de>:
> I wrote:
>
>> Syncing disks, vnodes remaining...0 done
>> All buffers synced.
>> GEOM_MIRROR: Device diesel_root: provider mirror/diesel_root destroyed.
>> Uptime: 6m32s
>> GEOM_MIRROR: Device diesel_root destroyed.
>> Rebooting...
>> cpu_reset: Stopping other CPUs
>> spin lock 0xffffffff8078c900 (sched lock 1) held by 0xffffff00014d4ab0
>> (tid 100002) too long
>> panic: spin lock held too long
>> cpuid = 0
>> KDB: enter: panic
>> [thread pid 77 tid 100090 ]
>> Stopped at      kdb_enter+0x3d: movq    $0,0x48bbd0(%rip)
>> db> bt
>> Tracing pid 77 tid 100090 td 0xffffff000457bab0
>> kdb_enter() at kdb_enter+0x3d
>> panic() at panic+0x17b
>> _mtx_lock_spin_failed() at _mtx_lock_spin_failed+0x39
>> _mtx_lock_spin() at _mtx_lock_spin+0x9e
>> _mtx_lock_spin_flags() at _mtx_lock_spin_flags+0x72
>> sched_balance_group() at sched_balance_group+0xc5
>> sched_balance_group() at sched_balance_group+0x1f8
>> sched_balance() at sched_balance+0xa2
>> sched_clock() at sched_clock+0xf6
>> statclock() at statclock+0xbd
>> lapic_handle_timer() at lapic_handle_timer+0x197
>> Xtimerint() at Xtimerint+0x8c
>> --- interrupt, rip = 0xffffffff80541cc4, rsp = 0xffffff80771dba90, rbp =
>> 0xffffff80771dbab0 ---
>> DELAY() at DELAY+0x64
>> cpu_reset() at cpu_reset+0xdd
>> boot() at boot+0x2e6
>> reboot() at reboot+0x42
>> syscall() at syscall+0x1a5
>> Xfast_syscall() at Xfast_syscall+0xd0
>> --- syscall (55, FreeBSD ELF64, reboot), rip = 0x800788eec, rsp =
>> 0x7fffffffeca8, rbp = 0 ---
>
>
> I've only seen this once.  If I should encounter it again, is there
> something you'd like me to look at?

[ Sorry, I'm trying to CC everyone who has already reported this problem,
even though I know many of you experienced it on -STABLE ]

Could you try this patch against -CURRENT:
http://www.freebsd.org/~attilio/stop_nmi.diff

This patch basically does 2 things:
1) It removes the STOP_NMI option and adds the infrastructure for
using NMI on KDB invocation and normal stop IPIs on standard CPU
shutdown.
In order to accomplish that, and to allow for a better design than
what STOP_NMI offers now, 2 new functions are introduced:
* ipi_hstop_selected(), which, if the architecture offers such an
option, sends a "forced" IPI through a privileged channel (NMI on
amd64 and ia32) in order to stop the CPUs passed in the mask.  Note
that on architectures other than amd64 and ia32,
ipi_hstop_selected() defaults to ipi_selected(..., STOP_IPI), but
maintainers who want to override that can simply implement
something harder.
* stop_cpus_hard(), which is a 'more powerful' version of stop_cpus()
that uses ipi_hstop_selected() instead of ipi_selected(...,
STOP_IPI) in order to stop CPUs.
In the end, the shutdown subsystem keeps using stop_cpus(), while kdb
now uses stop_cpus_hard().
2) It disables interrupts on CPU0 while doing the stop_cpus() for the
others.  That prevents spurious fast handlers from preempting CPU0
while it is doing the stopping (e.g. timerint running hardclock()).
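
To make the split between the two stop paths concrete, here is a small
user-space C sketch of the dispatch described above.  It is not code from
the patch: every sim_* name, the bitmask type, the CPU count and the
boolean that stands in for the interrupt flag are hypothetical, and the
place where the real patch disables interrupts may differ.  It only
illustrates that the hard path uses NMI where available and falls back to
the normal stop IPI otherwise, and that the shutdown path sends its stop
IPIs with local interrupts off.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SIM_NCPU 4                 /* hypothetical CPU count */
typedef uint32_t sim_cpumask_t;    /* one bit per CPU (assumption) */

enum sim_ipi { SIM_IPI_NONE, SIM_IPI_STOP, SIM_IPI_NMI };

struct sim_cpu {
    bool stopped;            /* has this CPU been told to stop? */
    enum sim_ipi last_ipi;   /* which kind of IPI stopped it */
    bool sender_intr_off;    /* were the sender's interrupts off? */
};

static struct sim_cpu sim_cpus[SIM_NCPU];
static bool sim_have_nmi = true;      /* models "architecture offers NMI" */
static bool sim_intr_enabled = true;  /* models the calling CPU's intr flag */

/* Reset the simulated machine to a known state. */
static void sim_reset(bool have_nmi)
{
    for (int i = 0; i < SIM_NCPU; i++)
        sim_cpus[i] = (struct sim_cpu){ false, SIM_IPI_NONE, false };
    sim_have_nmi = have_nmi;
    sim_intr_enabled = true;
}

/* Stand-in for ipi_selected(mask, ...): deliver one IPI kind to a mask. */
static void sim_ipi_selected(sim_cpumask_t mask, enum sim_ipi kind)
{
    for (int i = 0; i < SIM_NCPU; i++) {
        if ((mask & (1u << i)) == 0)
            continue;
        sim_cpus[i].stopped = true;
        sim_cpus[i].last_ipi = kind;
        sim_cpus[i].sender_intr_off = !sim_intr_enabled;
    }
}

/* Stand-in for ipi_hstop_selected(): NMI if available, else the stop IPI. */
static void sim_ipi_hstop_selected(sim_cpumask_t mask)
{
    sim_ipi_selected(mask, sim_have_nmi ? SIM_IPI_NMI : SIM_IPI_STOP);
}

/* Stand-in for stop_cpus(): normal stop IPI, used by the shutdown path. */
static void sim_stop_cpus(sim_cpumask_t mask, int self)
{
    assert((mask & (1u << self)) == 0);   /* never stop the calling CPU */
    sim_ipi_selected(mask, SIM_IPI_STOP);
}

/* Stand-in for stop_cpus_hard(): forced stop, used by the debugger path. */
static void sim_stop_cpus_hard(sim_cpumask_t mask, int self)
{
    assert((mask & (1u << self)) == 0);
    sim_ipi_hstop_selected(mask);
}

/* Point 2: stop the other CPUs with interrupts disabled on the caller. */
static void sim_shutdown_stop_others(int self)
{
    sim_cpumask_t others = ((1u << SIM_NCPU) - 1) & ~(1u << self);
    bool saved = sim_intr_enabled;

    sim_intr_enabled = false;    /* no timer interrupt can preempt us now */
    sim_stop_cpus(others, self);
    sim_intr_enabled = saved;
}
```

In this model, sim_shutdown_stop_others(0) stops CPUs 1 to 3 with the
normal stop IPI while the caller's interrupts are off, and
sim_stop_cpus_hard() is what a debugger entry would call instead.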

It would be great if you could report whether this patch fixes the
problem for you.
I'm already well aware that this patch also needs an entry in UPDATING
if we verify that it solves the problem.
If someone wants to port this to STABLE_7 and is faster than me, they
are welcome.  Due to the invasiveness of the patch, it will need to be
modified if it is eventually ported to STABLE_7.
I tested it on i386, but I still need to run a make universe.  I will
do so ASAP.

* Please don't forget to drop STOP_NMI from your own custom config files *

Thanks,
Attilio

-- 
Peace can only be achieved by understanding - A. Einstein