Andriy Gapon wrote:
> on 31/07/2011 01:55 Rick Macklem said the following:
> > Andriy Gapon wrote:
> >> on 26/07/2011 00:44 Rick Macklem said the following:
> >>> hrs sent me this panic. I'm wondering if it might be relevant to this?
> >>
> >> To 'this' what?
> >
> > Well, my thinking (and it is quite likely completely incorrect) was that,
> > since this panic is a result of holding the spin lock too long, it might
> > be another symptom of the same problem as when the scheduler seems to
> > "lock up" for several seconds.
>
> I don't think that there is any issue with the scheduler "locking up" or
> taking too long to make scheduling decisions, at least when we talk about
> performance problems. There can be issues with the scheduler making
> sub-optimal scheduling decisions.
>
> > --> The bug might be some case where the spin lock isn't being released
> > when it should be, and that "could" result in "Heavy I/O blocks
> > FreeBSD box for several seconds". Note that the server was under
> > heavy I/O load when this panic occurred.
> >
> >> A panic is a panic.
> >>
> >>> spin lock 0xffffffff80cb52c0 (sched lock 1) held by
> >>> 0xffffff0012c7f8c0 (tid 100317) too long
> >>> panic: spin lock held too long
> >>
> >> I guess it's more related to the thread on stable_at_ which has the
> >> panic message as its subject.
> >>
> >>> cpuid = 0
> >>> KDB: stack backtrace:
> >>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
> >>> kdb_backtrace() at kdb_backtrace+0x37
> >>> panic() at panic+0x187
> >>> _mtx_lock_spin_failed() at _mtx_lock_spin_failed+0x39
> >>> _mtx_lock_spin() at _mtx_lock_spin+0x9e
> >>> sched_add() at sched_add+0x117
> >>> setrunnable() at setrunnable+0x78
> >>> sleepq_signal() at sleepq_signal+0x7a
> >>> cv_signal() at cv_signal+0x3b
> >>> xprt_active() at xprt_active+0xe3
> >>> svc_vc_soupcall() at svc_vc_soupcall+0xc
> >>> sowakeup() at sowakeup+0x69
> >>> tcp_do_segment() at tcp_do_segment+0x2cbd
> >>> tcp_input() at tcp_input+0xcdd
> >>> ip_input() at ip_input+0xac
> >>> netisr_dispatch_src() at netisr_dispatch_src+0x7e
> >>> ether_demux() at ether_demux+0x14d
> >>> ether_input() at ether_input+0x17d
> >>> em_rxeof() at em_rxeof+0x1ca
> >>> em_handle_que() at em_handle_que+0x5b
> >>> taskqueue_run_locked() at taskqueue_run_locked+0x85
> >>> taskqueue_thread_loop() at taskqueue_thread_loop+0x4e
> >>> fork_exit() at fork_exit+0x11f
> >>> fork_trampoline() at fork_trampoline+0xe
> >>> --- trap 0, rip = 0, rsp = 0xffffff8000160d00, rbp = 0 ---
> >>> KDB: enter: panic
> >>> [thread pid 0 tid 100033 ]
> >>> Stopped at kdb_enter+0x3b: movq $0,0x6b4e62(%rip)
> >>> db> ps
> >>
> >> --
> >> Andriy Gapon

Ok, so if the scheduler spin lock is being held too long, what could cause
this, if it isn't a scheduler bug? I can't think of how NFS would affect this
beyond causing a heavy I/O load, but if there is some way it does, I suppose
I need to know what that is?

rick
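For readers unfamiliar with what "held too long" means above: a spinning
waiter keeps retrying the lock and, after a bound is exceeded, reports the
owning thread and panics, which is roughly what _mtx_lock_spin_failed() and
the panic message convey. Below is a minimal user-space sketch of that idea;
it is not FreeBSD's actual _mtx_lock_spin() code, and the SPIN_LIMIT value
and the pthread-based owner bookkeeping are purely illustrative assumptions.

    /*
     * Illustrative sketch only: a spin lock whose waiters give up and
     * "panic" after a bounded number of spins, naming the current owner.
     */
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <pthread.h>
    #include <unistd.h>

    #define SPIN_LIMIT 100000000UL      /* illustrative bound, not the kernel's */

    struct spinlock {
        atomic_uintptr_t owner;         /* 0 == unlocked, else owner's id */
    };

    static void
    spin_lock(struct spinlock *lk)
    {
        uintptr_t self = (uintptr_t)pthread_self();
        uintptr_t expected = 0;
        unsigned long spins = 0;

        while (!atomic_compare_exchange_weak(&lk->owner, &expected, self)) {
            expected = 0;
            if (++spins > SPIN_LIMIT) {
                /* Analogue of _mtx_lock_spin_failed() followed by panic(). */
                fprintf(stderr, "spin lock %p held by %#lx too long\n",
                    (void *)lk, (unsigned long)atomic_load(&lk->owner));
                abort();
            }
        }
    }

    static void
    spin_unlock(struct spinlock *lk)
    {
        atomic_store(&lk->owner, 0);
    }

    static struct spinlock lock;

    static void *
    hog(void *arg)
    {
        /* Simulate an owner that never releases the lock (the suspected bug). */
        spin_lock(&lock);
        pause();
        return (NULL);
    }

    int
    main(void)
    {
        pthread_t t;

        pthread_create(&t, NULL, hog, NULL);
        sleep(1);               /* let the hog acquire the lock */
        spin_lock(&lock);       /* this waiter eventually reports and aborts */
        spin_unlock(&lock);
        return (0);
    }

In the kernel the waiter that gives up is the one printed in the backtrace,
and the owner it names is the thread shown in the panic message (tid 100317
here), so the question is what that owning thread was doing with sched lock 1.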