Re: SUJ deadlock

From: David O'Brien <obrien_at_freebsd.org>
Date: Fri, 3 Sep 2010 16:30:44 -0700
On Wed, May 05, 2010 at 12:54:07PM -1000, Jeff Roberson wrote:
> On Mon, 3 May 2010, Fabien Thomas wrote:
>>>> I'm with r207548 now and since some days i've system deadlock.
>>>> It seems related to SUJ with process waiting on suspfs or ppwait.
>>> 
>>> I've also seen it stalled in suspfs, but this information is way better
>>> than what I was able to garner.   I was only able to tell via ctrl-t on
>>> a stalled 'ls' process in a terminal before hard booting.
[..]
> Can anyone who has experienced this hang test this patch:
> 
> Thanks,
> Jeff
> Index: ffs_softdep.c
> ===================================================================
> --- ffs_softdep.c       (revision 207480)
> +++ ffs_softdep.c       (working copy)
> @@ -9301,7 +9301,7 @@
>                         hadchanges = 1;
>         }
>         /* Leave this inodeblock dirty until it's in the list. */
> -       if ((inodedep->id_state & (UNLINKED | DEPCOMPLETE)) == UNLINKED)
> +       if ((inodedep->id_state & (UNLINKED | UNLINKONLIST)) == UNLINKED)
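
For readers following the patch, here is a minimal sketch of what the changed test does. The flag names come from the diff; the bit values are placeholders chosen for illustration (the real definitions are in the soft updates headers), and the helper names old_test, new_test and softdep_mask_demo are hypothetical.

```c
#include <assert.h>

/* Placeholder bit values for illustration only, NOT the kernel's real ones. */
#define UNLINKED      0x1u
#define DEPCOMPLETE   0x2u
#define UNLINKONLIST  0x4u

/* Test before the patch: true when UNLINKED is set and DEPCOMPLETE is clear. */
static int old_test(unsigned state)
{
    return (state & (UNLINKED | DEPCOMPLETE)) == UNLINKED;
}

/* Test after the patch: true when UNLINKED is set and UNLINKONLIST is clear. */
static int new_test(unsigned state)
{
    return (state & (UNLINKED | UNLINKONLIST)) == UNLINKED;
}

/* The two tests disagree for an unlinked inode that is DEPCOMPLETE
 * but has not yet been placed on the unlinked list. */
void softdep_mask_demo(void)
{
    unsigned state = UNLINKED | DEPCOMPLETE;

    assert(old_test(state) == 0);
    assert(new_test(state) == 1);
}
```

Under these assumptions, the patched condition matches the comment above it in the diff: it holds for as long as the inode is unlinked but not yet in the list, whether or not DEPCOMPLETE is set.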


Hi Jeff,
I didn't seem to experience this problem back in May, but I'm now
experiencing it on a regular basis.

I seem to trigger it every second or third day during the daily run.
I wind up with cvsup or svnsync stalled, and any 'ls' of my sources
partition waits on suspfs.
(Note: I am also running diskcheckd from ports.)

My kernel sources are at:
    Last Changed Author: davidxu
    Last Changed Rev: 211534
    Last Changed Date: 2010-08-20 16:51:34 -0700 (Fri, 20 Aug 2010)

I have also experienced it back to at least:
    Last Changed Author: yongari
    Last Changed Rev: 210152
    Last Changed Date: 2010-07-15 16:34:58 -0700 (Thu, 15 Jul 2010)


Weird thing is - I can still access this partition across NFS without
problems.

    dragon$ cd /src/fbsd
    dragon$ df .
    Filesystem      Size    Used   Avail Capacity  Mounted on
    /dev/da31s1f    271G    119G    130G    48%    /src
    dragon$ ls
    load: 0.12  cmd: ls 77901 [suspfs] 2.26r 0.00u 0.00s 0% 1212k

    quynh$ cd /src/fbsd
    quynh$ df .
    Filesystem     Size    Used   Avail Capacity  Mounted on
    dragon:/src    271G    119G    130G    48%    /src
    quynh$ ls
    .svn/                           lib/
    COPYRIGHT                       libexec/
    ..snip..


Processes also have a tendency to complete quite slowly at times - waiting
in vlruwk.

When I reboot, usually / and /src (but not 3 other partitions) give a
"Bad cg number {negative number}" error from fsck; so a full fsck is run.
This results in what seems like tens of thousands of iterations of:

    UNREF FILE I=[..snip..]
    RECONNECT? yes
    SORRY no space in lost+found directory
    unexpected soft update inconsistency
    CLEAR? yes


Thoughts?
-- 
-- David  (obrien_at_FreeBSD.org)
Received on Fri Sep 03 2010 - 21:41:22 UTC
