Re: r244036 kernel hangs under load.

From: Attilio Rao <attilio_at_freebsd.org>
Date: Tue, 11 Dec 2012 21:57:44 +0000

On Tue, Dec 11, 2012 at 9:55 PM, Rick Macklem <rmacklem_at_uoguelph.ca> wrote:
> Konstantin Belousov wrote:
>> On Mon, Dec 10, 2012 at 07:11:59PM -0500, Rick Macklem wrote:
>> > Konstantin Belousov wrote:
>> > > On Mon, Dec 10, 2012 at 01:38:21PM -0500, Rick Macklem wrote:
>> > > > Adrian Chadd wrote:
>> > > > > .. what was the previous kernel version?
>> > > > >
>> > > > Hopefully Tim has it narrowed down more, but I don't see
>> > > > the hangs on a Sept. 7 kernel from head and I do see them
>> > > > on a Dec. 3 kernel from head. (Don't know the exact rNNNNNN.)
>> > > >
>> > > > It seems to predate my commit (r244008), which was my first
>> > > > concern.
>> > > >
>> > > > I use old single core i386 hardware and can fairly reliably
>> > > > reproduce it by doing a kernel build and a "svn checkout"
>> > > > concurrently. No NFS activity. These are running on a local
>> > > > disk (UFS/FFS). (The kernel I reproduce it on is built via
>> > > > GENERIC for i386. If you want me to start a "binary search"
>> > > > for which rNNNNNN, I can do that, but it will take a while.:-)
>> > > >
>> > > > I can get out into DDB, but I'll admit I don't know enough
>> > > > about it to know where to look;-)
>> > > > Here's some lines from "db> ps", in case they give someone
>> > > > useful information. (I can leave this box sitting in DDB for
>> > > > the rest of to-day, in case someone can suggest what I should
>> > > > look for on it.)
>> > > >
>> > > > Just snippets...
>> > > >    Ss pause adjkerntz
>> > > >    DL sdflush [softdepflush]
>> > > >    RL [syncer]
>> > > >    DL vlruwt [vnlru]
>> > > >    DL psleep [bufdaemon]
>> > > >    RL [pagezero]
>> > > >    DL psleep [vmdaemon]
>> > > >    DL psleep [pagedaemon]
>> > > >    DL ccb_scan [xpt_thrd]
>> > > >    DL waiting_ [sctp_iterator]
>> > > >    DL ctl_work [ctl_thrd]
>> > > >    DL cooling [acpi_cooling0]
>> > > >    DL tzpoll [acpi_thermal]
>> > > >    DL (threaded) [usb]
>> > > >    ...
>> > > >    DL - [yarrow]
>> > > >    DL (threaded) [geom]
>> > > >    D - [g_down]
>> > > >    D - [g_up]
>> > > >    D - [g_event]
>> > > >    RL (threaded) [intr]
>> > > >    I [irq15: ata1]
>> > > >    ...
>> > > >    Run CPU0 [swi6: Giant taskq]
>> > > > --> does this one indicate the CPU is actually running this?
>> > > >    (after a db> cont, wait a while <ctrl><alt><esc> db> ps
>> > > >     it is still the same)
>> > > >    I [swi4: clock]
>> > > >    I [swi1: netisr 0]
>> > > >    I [swi3: vm]
>> > > >    RL [idle: cpu0]
>> > > >    SLs wait [init]
>> > > >    DL audit_wo [audit]
>> > > >    DLs (threaded) [kernel]
>> > > >    D - [deadlkres]
>> > > >    ...
>> > > >    D sched [swapper]
>> > > >
>> > > > I have no idea if this "ps" output helps, unless it indicates
>> > > > that it is looping on the Giant taskq?
>> > > Might be. You could do 'bt <pid>' for the process to see where it
>> > > loops.
>> > > Another good set of hints is at
>> > > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
>> >
>> > Kostik, you must be clairvoyant;-)
>> >
>> > When I did "show alllocks", I found that the syncer process held
>> > - exclusive sleep mutex mount mtx locked @ kern/vfs_subr.c:4720
>> > - exclusive lockmgr syncer locked @ kern/vfs_subr.c:1780
>> > The trace for this process goes like:
>> >  spinlock_exit
>> >  mtx_unlock_spin_flags
>> >  kern_yield
>> >  _mnt_vnode_next_active
>> >  vnode_next_active
>> >  vfs_msync()
>> >
>> > So, it seems like your r244095 commit might have fixed this?
>> > (I'm not good at this stuff, but from your description, it looks
>> >  like it did the kern_yield() with the mutex held and "maybe"
>> >  got into trouble trying to acquire Giant?)
>> >
>> > Anyhow, I'm going to test a kernel with r244095 in it and see
>> > if I can still reproduce the hang.
>> > (There wasn't much else in the "show alllocks", except a
>> >  process that held the exclusive vnode interlock mutex plus
>> >  a ufs vnode lock, but it's just doing a witness_unlock.)
>> There must be a thread blocked for the mount interlock for the loop
>> in the mnt_vnode_next_active to cause livelock.
>>
> Yes. I am getting hangs with the -current kernel and they seem
> easier for me to reproduce.

Can you report the svn rev number the kernel is built from?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein
Received on Tue Dec 11 2012 - 20:57:46 UTC
