Re: spinlock held too long on reboot

From: John Baldwin <jhb_at_freebsd.org>
Date: Wed, 29 Jul 2009 09:50:42 -0400

On Tuesday 28 July 2009 10:43:36 pm Attilio Rao wrote:
> 2009/5/23 Stefan Bethke <stb_at_lassitu.de>:
> > I wrote:
> >
> >> Syncing disks, vnodes remaining...0 done
> >> All buffers synced.
> >> GEOM_MIRROR: Device diesel_root: provider mirror/diesel_root destroyed.
> >> Uptime: 6m32s
> >> GEOM_MIRROR: Device diesel_root destroyed.
> >> Rebooting...
> >> cpu_reset: Stopping other CPUs
> >> spin lock 0xffffffff8078c900 (sched lock 1) held by 0xffffff00014d4ab0
> >> (tid 100002) too long
> >> panic: spin lock held too long
> >> cpuid = 0
> >> KDB: enter: panic
> >> [thread pid 77 tid 100090 ]
> >> Stopped at      kdb_enter+0x3d: movq    $0,0x48bbd0(%rip)
> >> db> bt
> >> Tracing pid 77 tid 100090 td 0xffffff000457bab0
> >> kdb_enter() at kdb_enter+0x3d
> >> panic() at panic+0x17b
> >> _mtx_lock_spin_failed() at _mtx_lock_spin_failed+0x39
> >> _mtx_lock_spin() at _mtx_lock_spin+0x9e
> >> _mtx_lock_spin_flags() at _mtx_lock_spin_flags+0x72
> >> sched_balance_group() at sched_balance_group+0xc5
> >> sched_balance_group() at sched_balance_group+0x1f8
> >> sched_balance() at sched_balance+0xa2
> >> sched_clock() at sched_clock+0xf6
> >> statclock() at statclock+0xbd
> >> lapic_handle_timer() at lapic_handle_timer+0x197
> >> Xtimerint() at Xtimerint+0x8c
> >> --- interrupt, rip = 0xffffffff80541cc4, rsp = 0xffffff80771dba90, rbp =
> >> 0xffffff80771dbab0 ---
> >> DELAY() at DELAY+0x64
> >> cpu_reset() at cpu_reset+0xdd
> >> boot() at boot+0x2e6
> >> reboot() at reboot+0x42
> >> syscall() at syscall+0x1a5
> >> Xfast_syscall() at Xfast_syscall+0xd0
> >> --- syscall (55, FreeBSD ELF64, reboot), rip = 0x800788eec, rsp =
> >> 0x7fffffffeca8, rbp = 0 ---
> >
> >
> > I've only seen this once.  If I should encounter it again, is there
> > something you'd like me to look at?
> 
> [ Sorry, trying to add anyone who already reported such a problem, even
> if I know many of you experienced it on -STABLE ]
> 
> Could you try this patch against -CURRENT:
> http://www.freebsd.org/~attilio/stop_nmi.diff
> 
> This patch basically does 2 things:
> 1) Removing the STOP_NMI option, and adding the infrastructure for
> using NMI on KDB invocation and normal stop IPIs on standard cpu
> shutdown.
> In order to accomplish that and foresee a better design than what
> STOP_NMI does now, 2 new functions are introduced:
> * ipi_hstop_selected(), which, if the architecture offers such an
> option, sends a "forced" IPI through a privileged channel (NMI on
> amd64 and ia32) in order to stop the CPUs passed in the mask.  Note
> that on architectures other than amd64 and ia32,
> ipi_hstop_selected() defaults to ipi_selected(..., STOP_IPI), but
> maintainers who want to override that can simply implement
> something harder

Why not just add a new IPI_STOP_HARD that maps to IPI_STOP on most archs and 
does the NMI logic on x86?  This avoids adding a new API 
(ipi_hstop_selected()) and instead just adds a new logical IPI.
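
A minimal sketch of that idea, written as a userland model rather than
kernel code.  Only IPI_STOP, IPI_STOP_HARD and ipi_selected() are names from
this thread; the numeric values, the delivery enum, arch_has_nmi_stop and the
last_* recording variables are assumptions made up for illustration, and the
ipi_selected() signature is simplified.

```c
#include <stdio.h>

/* Logical IPI types.  The names come from the thread; the values are
 * invented for this sketch. */
#define IPI_STOP       1
#define IPI_STOP_HARD  2

/* Hypothetical delivery channels, used only by this model. */
enum delivery { DELIVER_NONE, DELIVER_VECTOR, DELIVER_NMI };

/* Hypothetical per-arch capability: nonzero where a stop can go out as an
 * NMI (x86 in the discussion), zero elsewhere. */
static int arch_has_nmi_stop = 1;

/* What the model "sent" last, so the behaviour can be inspected. */
static enum delivery last_delivery = DELIVER_NONE;
static unsigned int last_mask;
static int last_ipi;

/* Stand-in for ipi_selected(): records the request instead of sending it.
 * Callers keep using the one existing entry point and only pick a
 * different logical IPI. */
static void
ipi_selected(unsigned int cpus, int ipi)
{
	/* On archs without a privileged channel, the hard stop simply
	 * degrades to the ordinary stop IPI. */
	if (ipi == IPI_STOP_HARD && !arch_has_nmi_stop)
		ipi = IPI_STOP;

	last_mask = cpus;
	last_ipi = ipi;
	/* Only the hard stop takes the NMI path; everything else is a
	 * normal vectored IPI. */
	last_delivery = (ipi == IPI_STOP_HARD) ? DELIVER_NMI : DELIVER_VECTOR;

	printf("ipi %d -> cpus 0x%x via %s\n", ipi, cpus,
	    last_delivery == DELIVER_NMI ? "NMI" : "vector");
}
```

With this shape, a KDB entry path would call ipi_selected(mask,
IPI_STOP_HARD) and a normal shutdown path ipi_selected(mask, IPI_STOP); no
second function is needed, and the per-arch difference stays inside the
IPI dispatch.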

-- 
John Baldwin