vlruwk deadlock (was: NFS troubles on recent -current.)

From: Bernd Walter <ticso_at_cicely12.cicely.de>
Date: Fri, 16 May 2003 19:23:24 +0200
On Mon, May 12, 2003 at 10:33:08PM +0200, Bernd Walter wrote:
> On Mon, May 12, 2003 at 01:45:21AM +0200, Bernd Walter wrote:
> > On Sun, May 11, 2003 at 04:29:34PM -0700, Don Lewis wrote:
> > > On 11 May, Bernd Walter wrote:
> > > > I have a -current (10th may) NFS server which serves for a diskless
> > > > NFS client (11th May).
> > > > The client has / on a CF media and /usr as well as /var/spool on the
> > > > server.
> > > > Also swap is on the server (/usr/swapfile) via md.
> > > > When I run installworld on the client the server hangs with
> > > > "lockmgr draning against myself".
> > > > I got no DDB prompt so no stacktrace or crashdump :(
> > > 
> > > Try the attached patch.  I didn't have any problems with
> > > 	make -j10 buildworld
> > > 	make DESTDIR=/mnt installworld
> > > with /usr/obj and /mnt NFS mounted from the filesystems exported on the
> > > same server (though I had to tweak some sysctl variables to avoid wdrain
> > > deadlocks in the bio code in this configuration).
> > > 
> > > Can you break into the debugger with CNTRL-Alt-Esc?  I also recommend
> > 
> > I havn't tried as it shouldn't be required after a panic.
> > 
> > > trying the DEBUG_VFS_LOCKS kernel configuration option, which will drop
> > > into ddb if any of the vnode locking assertions are violated.
> > 
> > I will try your patch and suggestions tomorrow.
> 
> I can't reproduce the panic with your patch anymore.
> Great work!

I had an io deadlook on the same server today while doing a make
release on an alpha nfs client.
I don't know if it is related to the given patch or not.
It happened shortly after the client checked out the ports.
Sorry - I forgot to take a dump.

One of the last commands I was able to send:
[244]cicely5# ps -axl
  UID   PID  PPID CPU PRI NI   VSZ  RSS MWCHAN STAT  TT       TIME COMMAND
    0     0     0   0 -16  0     0    4 sched  DLs   ??    0:02.92  (swapper)
    0     1     0   0   8  0   716  260 wait   ILs   ??    0:02.39 /sbin/init --
    0     2     0   0  -8  0     0   12 g_even DL    ??    0:54.29  (g_event)
    0     3     0   1  -8  0     0   12 g_up   DL    ??   23:28.80  (g_up)
    0     4     0   0  -8  0     0   12 g_down DL    ??   10:24.32  (g_down)
    0     5     0   0 -16  0     0   12 psleep DL    ??    1:36.59  (pagedaemon)
    0     6     0   0  20  0     0   12 psleep DL    ??    0:00.00  (vmdaemon)
    0     7     0   0 171  0     0   12 pgzero DL    ??    7:55.19  (pagezero)
    0     8     0   0 -16  0     0   12 psleep DL    ??    0:28.41  (bufdaemon)
    0     9     0   0  20  0     0   12 vlrup  DL    ??    0:34.80  (vnlru)
    0    10     0   0 -16  0     0   12 ktrace DL    ??    0:00.00  (ktrace)
    0    11     0  17 -16  0     0   12 -      RL    ??  5384:08.48  (idle)
    0    12     0   0 -48  0     0   12 -      WL    ??   18:42.52  (swi7: tty:sio clock)
    0    14     0   0 -44  0     0   12 -      WL    ??   41:44.12  (swi1: net)
    0    15     0   0  76  0     0   12 sleep  DL    ??    5:28.24  (random)
    0    18     0   1 -36  0     0   12 -      WL    ??    4:28.69  (swi3: cambio)
    0    22     0   0 -64  0     0   12 -      WL    ??    0:00.00  (irq15: ata1)
    0    23     0   1 -64  0     0   12 -      WL    ??    3:14.38  (irq5: ahc0 ahc1++)
    0    24     0   0 -64  0     0   12 -      WL    ??    0:32.48  (irq12: ahc3)
    0    25     0   0 -68  0     0   12 -      WL    ??   18:45.62  (irq10: de0)
    0    26     0   0 -64  0     0   12 -      WL    ??    0:26.23  (irq11: ahc5 ahc6+)
    0    27     0   0 -60  0     0   12 -      WL    ??    0:00.03  (irq1: atkbd0)
    0    28     0   0 -64  0     0   12 -      WL    ??    0:00.00  (irq6: fdc0)
    0    29     0   0 -60  0     0   12 -      WL    ??    0:00.00  (irq7: ppc0)
    0    30     0   0 -48  0     0   12 -      WL    ??    0:00.00  (swi0: tty:sio)
    0    35     0   0  20  0     0   12 syncer DL    ??   19:05.67  (syncer)
    0    36     0   0   8  0     0   12 nfsidl IL    ??    2:17.00  (nfsiod 0)
    0    37     0  12   8  0     0   12 nfsidl IL    ??    0:22.74  (nfsiod 1)
    0    38     0   0   8  0     0   12 nfsidl IL    ??    0:00.01  (nfsiod 2)
    0    39     0   0   8  0     0   12 nfsidl IL    ??    0:00.00  (nfsiod 3)
    0    73     1   6  -8  0   848  176 vinum  DLs   ??    0:00.02 vinum: vinum daemon (vinum)
    0   127     1  26  20  0   224   84 pause  Is    ??    0:00.00 adjkerntz -i
    0   278     1   0  96  0  1196  712 select Is    ??    1:17.76 /usr/sbin/syslogd -s
    0   294     1   0  96  0  1516  960 select Ss    ??    0:02.01 /usr/sbin/rpcbind
    0   310     1   0  96  0  1244  876 select Ss    ??    0:10.22 /usr/sbin/ypbind
    0   379     1   0  96  0  1404 1076 select Is    ??    0:00.79 /usr/sbin/mountd -r
    0   381     1   0  96  0  1212  832 select Is    ??    0:00.31 nfsd: master (nfsd)
    0   384   381   0  -4  0  1148  692 vlruwk D     ??   90:01.02 nfsd: server (nfsd)
    0   385   381   0   4  0  1148  692 nfsd   S     ??   20:15.06 nfsd: server (nfsd)
    0   386   381   0   4  0  1148  692 nfsd   S     ??    8:51.20 nfsd: server (nfsd)
    0   387   381   0   4  0  1148  692 nfsd   I     ??    5:39.17 nfsd: server (nfsd)
    0   389     1   0  96  0 263420  804 select Is    ??    0:01.52 /usr/sbin/rpc.statd
    0   427     1   0  96  0  6072 4052 select Ss    ??    1:53.33 /usr/local/sbin/httpd -k start -DSSL
    0   435     1   7  20  0  1832  696 lockf  Is    ??    0:00.01 /usr/local/sbin/saslauthd -a pam
    0   436   435   7   4  0  1832  696 accept I     ??    0:00.00 /usr/local/sbin/saslauthd -a pam
    0   437   435   7  20  0  1832  696 lockf  I     ??    0:00.00 /usr/local/sbin/saslauthd -a pam
    0   438   435   7  20  0  1832  696 lockf  I     ??    0:00.00 /usr/local/sbin/saslauthd -a pam
    0   439   435   7  20  0  1832  696 lockf  I     ??    0:00.00 /usr/local/sbin/saslauthd -a pam
    0   447     1   1  96  0  1220  768 select Is    ??    0:00.07 /usr/sbin/lpd
    0   462     1   0  96  0  1548  948 select Ss    ??    1:13.14 /usr/sbin/ntpd -p /var/run/ntpd.pid
    1   474     1   0  -4  0  1176  748 vlruwk Ds    ??    0:12.10 /usr/sbin/rwhod
    0   515     1   0  96  0  3320 1764 select Is    ??    0:05.30 /usr/sbin/sshd
    0   523     1   0  96  0  5208 2616 select Ss    ??    1:38.83 sendmail: accepting connections (sendmail)
   25   526     1   0  20  0  3452 2012 pause  Is    ??    0:01.18 sendmail: Queue runner_at_00:30:00 for /var/spool/clientmqueue (sen
    0   544     1   0  -4  0  1272  840 vlruwk Ds    ??    0:07.50 /usr/sbin/cron
    0   581     1   0  96  0  1348  888 select Is    ??    0:00.06 /usr/sbin/inetd -wW
    0   596     1  34   5  0  1212  732 siodcd I     ??    0:00.02 /usr/libexec/getty std.9600 ttyd0
    0   601   515   0   4  0  6124 2116 sbwait Is    ??    0:00.28 sshd: ticso [priv] (sshd)
 1007   603   601   0  96  0  6124 2196 select I     ??    1:15.78 sshd: ticso_at_ttyp0 (sshd)
    0  3121   515   0   4  0  6124 2132 sbwait Is    ??    0:00.25 sshd: ticso [priv] (sshd)
 1007  3123  3121   0  96  0  6124 2188 select I     ??    0:00.11 sshd: ticso_at_ttyp1 (sshd)
   80 24548   427   5  20  0  6072 4080 lockf  I     ??    0:00.05 /usr/local/sbin/httpd -k start -DSSL
   80 24549   427   5  20  0  6072 4080 lockf  I     ??    0:00.04 /usr/local/sbin/httpd -k start -DSSL
   80 24550   427   5  20  0  6072 4080 lockf  I     ??    0:00.04 /usr/local/sbin/httpd -k start -DSSL
   80 24551   427   6  20  0  6072 4080 lockf  I     ??    0:00.05 /usr/local/sbin/httpd -k start -DSSL
   80 24552   427   6  96  0  6072 4080 select I     ??    0:00.05 /usr/local/sbin/httpd -k start -DSSL
   60 28689   440   2  -4  0 24836 3240 vlruwk D     ??    2:27.68 imapd: imapd: cicely12.cicely.de[3ffe:400:8d0:301::12] ticso use
    0 29641   523   0   4  0  5500 3404 sbwait S     ??    0:00.29 sendmail: h4GGWjhx029641 gif-cicely.cosmo-project.de [IPv6:3ffe:
    0 29643   544   0  -8  0  1320 1004 piperd I     ??    0:00.00 cron: running job (cron)
    2 29644 29643   0   8  0   868  348 wait   Is    ??    0:00.04 /bin/sh -c /usr/libexec/save-entropy
    2 29645 29644   1   8  0   908  372 wait   I     ??    0:00.05 /bin/sh /usr/libexec/save-entropy
    2 29656 29645   0  -4  0   240  140 vlruwk D     ??    0:00.00 dd if=/dev/random of=/var/db/entropy/saved-entropy.1 bs=2048 cou
    0 29657   523   1   4  0  5268 2712 kqread S     ??    0:00.01 sendmail: startup with gif-cicely.cosmo-project.de (sendmail)
    0 29658   523   0   4  0  5268 2712 kqread S     ??    0:00.01 sendmail: startup with gif-cicely.cosmo-project.de (sendmail)
 1007   604   603   0  20  0  1596 1108 pause  Is    p0    0:00.19 -tcsh (tcsh)
    0 10000   604   0  20  0  1720 1252 pause  I     p0    0:01.95 tcsh
    0 29647 10000   0  96  0   700  324 -      R+    p0    0:00.01 ps -axl
 1007  3124  3123   0   5  0  1556 1052 ttyin  Is+   p1    0:00.17 -tcsh (tcsh)
 1007  3132  3124   0   8  0   896  368 -      T     p1    0:00.05 /bin/sh ./run-mozilla.sh ./mozilla-bin
 1007  3143  3132  11  97  0 48896 39668 -      T     p1   43:46.76 ./mozilla-bin
    0   592     1   0   8  0  1636 1180 wait   Is    v0    0:00.16 login [pam] (login)
 1007 13632   592   0  20  0  1548 1060 pause  I     v0    0:00.13 -tcsh (tcsh)
 1007 13636 13632   0  96  0  3260 2076 select I+    v0    0:23.72 ssh c6
    0   593     1  99   5  0  1216  796 ttyin  Is+   v1    0:00.02 /usr/libexec/getty Pc ttyv1
    0   594     1  99   5  0  1216  796 ttyin  Is+   v2    0:00.02 /usr/libexec/getty Pc ttyv2
    0   595     1  99   5  0  1216  796 ttyin  Is+   v3    0:00.02 /usr/libexec/getty Pc ttyv3
   60   440     1   0  96  0  2780 1348 select I    con-   0:15.58 /usr/local/cyrus/bin/master

I saw many of those process already over minutes waiting with vlruwk
in top output on another window.

-- 
B.Walter                   BWCT                http://www.bwct.de
ticso_at_bwct.de                                  info_at_bwct.de
Received on Fri May 16 2003 - 08:23:41 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:37:08 UTC