Re: spinlock held too long on reboot

From: Attilio Rao <attilio_at_freebsd.org>
Date: Wed, 29 Jul 2009 16:13:22 +0200

2009/7/29 John Baldwin <jhb_at_freebsd.org>:
> On Tuesday 28 July 2009 10:43:36 pm Attilio Rao wrote:
>> 2009/5/23 Stefan Bethke <stb_at_lassitu.de>:
>> > I wrote:
>> >
>> >> Syncing disks, vnodes remaining...0 done
>> >> All buffers synced.
>> >> GEOM_MIRROR: Device diesel_root: provider mirror/diesel_root destroyed.
>> >> Uptime: 6m32s
>> >> GEOM_MIRROR: Device diesel_root destroyed.
>> >> Rebooting...
>> >> cpu_reset: Stopping other CPUs
>> >> spin lock 0xffffffff8078c900 (sched lock 1) held by 0xffffff00014d4ab0
>> >> (tid 100002) too long
>> >> panic: spin lock held too long
>> >> cpuid = 0
>> >> KDB: enter: panic
>> >> [thread pid 77 tid 100090 ]
>> >> Stopped at      kdb_enter+0x3d: movq    $0,0x48bbd0(%rip)
>> >> db> bt
>> >> Tracing pid 77 tid 100090 td 0xffffff000457bab0
>> >> kdb_enter() at kdb_enter+0x3d
>> >> panic() at panic+0x17b
>> >> _mtx_lock_spin_failed() at _mtx_lock_spin_failed+0x39
>> >> _mtx_lock_spin() at _mtx_lock_spin+0x9e
>> >> _mtx_lock_spin_flags() at _mtx_lock_spin_flags+0x72
>> >> sched_balance_group() at sched_balance_group+0xc5
>> >> sched_balance_group() at sched_balance_group+0x1f8
>> >> sched_balance() at sched_balance+0xa2
>> >> sched_clock() at sched_clock+0xf6
>> >> statclock() at statclock+0xbd
>> >> lapic_handle_timer() at lapic_handle_timer+0x197
>> >> Xtimerint() at Xtimerint+0x8c
>> >> --- interrupt, rip = 0xffffffff80541cc4, rsp = 0xffffff80771dba90, rbp =
>> >> 0xffffff80771dbab0 ---
>> >> DELAY() at DELAY+0x64
>> >> cpu_reset() at cpu_reset+0xdd
>> >> boot() at boot+0x2e6
>> >> reboot() at reboot+0x42
>> >> syscall() at syscall+0x1a5
>> >> Xfast_syscall() at Xfast_syscall+0xd0
>> >> --- syscall (55, FreeBSD ELF64, reboot), rip = 0x800788eec, rsp =
>> >> 0x7fffffffeca8, rbp = 0 ---
>> >
>> >
>> > I've only seen this once.  If I should encounter it again, is there
>> > something you'd like me to look at?
>>
>> [ Sorry, trying to add everyone who already reported such a problem,
>> even if I know many of you experienced it on -STABLE ]
>>
>> Could you try this patch against -CURRENT:
>> http://www.freebsd.org/~attilio/stop_nmi.diff
>>
>> This patch basically does 2 things:
>> 1) It removes the STOP_NMI option and adds the infrastructure for
>> using an NMI on KDB invocation and normal stop IPIs on standard CPU
>> shutdown.
>> In order to accomplish that and to foresee a better design than what
>> STOP_NMI does now, 2 new functions are introduced:
>> * ipi_hstop_selected(), which, if the architecture offers such an
>> option, sends a "forced" IPI through a privileged channel (NMI on
>> amd64 and ia32) in order to stop the CPUs passed in the mask.  Note
>> that on architectures other than amd64 and ia32,
>> ipi_hstop_selected() defaults to ipi_selected(..., STOP_IPI), but
>> maintainers who want to override that can simply implement
>> something harder
>
> Why not just add a new IPI_STOP_HARD that maps to IPI_STOP on most archs and
> does the NMI logic on x86?  This avoids adding a new API
> (ipi_hstop_selected()) and instead just adds a new logical IPI.

When choosing between the two, since we already have APIs like
ipi_all_but_self(), I thought we gave preference to more explicit APIs
over logical ones.
Anyway, I can reimplement it that way if preferred; it is something I
like more as well. I just want to know whether the patch fixes the
problem for the users right now.
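
For reference, the logical-IPI variant could look roughly like the
userland sketch below. It is only a model of the idea, not code from
the patch: the enum values, the channel type, struct delivery,
ipi_resolve() and the arch_has_nmi flag are all hypothetical names made
up for illustration. Only IPI_STOP and IPI_STOP_HARD come from the
discussion above.

```c
#include <assert.h>

/* Hypothetical values for illustration; not the kernel's real IPI numbers. */
enum ipi { IPI_STOP = 1, IPI_STOP_HARD = 2 };

/* Hypothetical delivery channels: a maskable interrupt or an NMI. */
enum channel { CH_NORMAL, CH_NMI };

struct delivery {
	enum ipi	ipi;	/* IPI the target CPUs actually receive */
	enum channel	ch;	/* channel used to deliver it */
};

/*
 * Resolve a logical IPI for one architecture.  arch_has_nmi is an
 * assumed per-architecture flag (nonzero on x86 in this model).
 */
static struct delivery
ipi_resolve(enum ipi ipi, int arch_has_nmi)
{
	struct delivery d = { ipi, CH_NORMAL };

	assert(ipi == IPI_STOP || ipi == IPI_STOP_HARD);
	if (ipi == IPI_STOP_HARD) {
		if (arch_has_nmi)
			d.ch = CH_NMI;		/* x86: force the stop via NMI */
		else
			d.ipi = IPI_STOP;	/* elsewhere: plain stop IPI */
	}
	return (d);
}
```

With this shape callers keep using the existing send path and only pick
a different logical IPI, so no new entry point is needed; the
per-architecture difference stays inside the resolution step.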

Thanks,
Attilio

-- 
Peace can only be achieved by understanding - A. Einstein