Re: NFS (& amd?) dysfunction descending a hierarchy

From: David Wolfskill <david_at_catwhisker.org>
Date: Tue, 9 Dec 2008 11:01:10 -0800
On Tue, Dec 02, 2008 at 04:15:38PM -0800, David Wolfskill wrote:
> I seem to have a fairly- (though not deterministly so) reproducible
> mode of failure with an NFS-mounted directory hierarchy:  An attempt to
> traverse a "sufficiently large" hierarchy (e.g., via "tar zcpf" or "rm
> -fr") will fail to "visit" some subdirectories, typically apparently
> acting as if the subdirectories in question do not actually exist
> (despite the names having been returned in the output of a previous
> readdir()).
> ... 

I was able to reproduce the external symptoms of the failure running
CURRENT as of yesterday, using "rm -fr" of a copy of a recent
/usr/ports hierachy on an NFS-mounted file system as a test case.
However, I believe the mechanism may be a bit different -- while
still being other than what I would expect.

One aspect in which the externally-observable symptoms were different
(under CURRENT, vs. RELENG_7) is that under CURRENT, once the error
condition occurred, the NFS client machine was in a state where it
merely kept repeating

	nfs server pid848_at_fbsd-build:/volume: not responding

until I logged in as root & rebooted it.


Here's a cut/paste of the kdump from the ktrace of the amd(8) process
under CURRENT, showing where the master amd(8) process (pid 848)
forks a child (4126) to try the unmount:

   848 amd      1228846258.722953 CALL  gettimeofday(0x8078e48,0)
   848 amd      1228846258.722964 RET   gettimeofday 0
   848 amd      1228846258.722982 CALL  sigprocmask(SIG_BLOCK,0xbfbfeaec,0xbfbfeadc)
   848 amd      1228846258.722993 RET   sigprocmask 0
   848 amd      1228846258.723003 CALL  fork
   848 amd      1228846258.730250 RET   fork 4126/0x101e
   848 amd      1228846258.730405 CALL  sigprocmask(SIG_SETMASK,0xbfbfeadc,0)
  4126 amd      1228846258.730252 RET   fork 0
  4126 amd      1228846258.730456 CALL  getpid
  4126 amd      1228846258.730467 RET   getpid 4126/0x101e
  4126 amd      1228846258.730493 CALL  unmount(0x2825f340,<invalid>0)
   848 amd      1228846258.730422 RET   sigprocmask 0
   848 amd      1228846258.730595 CALL  gettimeofday(0x8078e48,0)
   848 amd      1228846258.730608 RET   gettimeofday 0
...
   848 amd      1228846258.914814 CALL  sigprocmask(SIG_SETMASK,0xbfbfeba0,0)
   848 amd      1228846258.914826 RET   sigprocmask 0
   848 amd      1228846258.914838 CALL  select(0x400,0xbfbfec40,0,0,0xbfbfecd8)
  4126 amd      1228846259.090428 RET   unmount 0
  4126 amd      1228846259.090492 CALL  sigprocmask(SIG_BLOCK,0x2809b080,0xbfbfea0c)
  4126 amd      1228846259.090505 RET   sigprocmask 0
  4126 amd      1228846259.090518 CALL  sigprocmask(SIG_SETMASK,0x2809b090,0)
  4126 amd      1228846259.090530 RET   sigprocmask 0
  4126 amd      1228846259.090545 CALL  sigprocmask(SIG_BLOCK,0x2809b080,0xbfbfe9dc)
  4126 amd      1228846259.090556 RET   sigprocmask 0
  4126 amd      1228846259.090576 CALL  sigprocmask(SIG_SETMASK,0x2809b090,0)
  4126 amd      1228846259.090587 RET   sigprocmask 0
  4126 amd      1228846259.090605 CALL  exit(0)
   848 amd      1228846259.091248 RET   select -1 errno 4 Interrupted system call
   848 amd      1228846259.091277 PSIG  SIGCHLD caught handler=0x805e090 mask=0x0 code=0x0
   848 amd      1228846259.091298 CALL  wait4(0xffffffff,0xbfbfe83c,WNOHANG,0)
   848 amd      1228846259.091329 RET   wait4 4126/0x101e
   848 amd      1228846259.091342 CALL  wait4(0xffffffff,0xbfbfe83c,WNOHANG,0)
   848 amd      1228846259.091352 RET   wait4 -1 errno 10 No child processes
   848 amd      1228846259.091365 CALL  sigprocmask(SIG_SETMASK,0x80795bc,0)
   848 amd      1228846259.091377 RET   sigprocmask 0
   848 amd      1228846259.091390 CALL  sigprocmask(SIG_BLOCK,0x80792c4,0)
   848 amd      1228846259.091401 RET   sigprocmask 0
   848 amd      1228846259.091411 CALL  gettimeofday(0x8078e48,0)
   848 amd      1228846259.091422 RET   gettimeofday 0

Note that while the child didn't get EBUSY (as it does under RELENG_7)
-- indeed, the unmount call appears to have returned 0 -- the master
amd(8) process looks to be seeing "errno 4 Interrupted system call."


And here's a relevent part of the kdump from the "rm -fr" -- I had
kdump spit out Epoch timestamps with each in order to make correlation
easier:

  4121 rm       1228846258.736266 CALL  unlink(0x2821c148)
  4121 rm       1228846258.736281 NAMI  "distinfo"
  4121 rm       1228846258.738329 RET   unlink 0
  4121 rm       1228846258.738379 CALL  unlink(0x2821c1b8)
  4121 rm       1228846258.738401 NAMI  "pkg-descr"
  4121 rm       1228846258.739963 RET   unlink 0
  4121 rm       1228846258.739982 CALL  open(0x28178b6b,O_RDONLY,<unused>0)
  4121 rm       1228846258.740002 NAMI  ".."
  4121 rm       1228846258.740541 RET   open 4
  4121 rm       1228846258.740558 CALL  fstat(0x4,0xbfbfe96c)
  4121 rm       1228846258.740579 STRU  struct stat {dev=67174155, ino=22674937, mode=drwxr-xr-x , nlink=114, uid=9874, gid=929, rdev=0, atime=1228846258.184514000, stime
=1228846258.779501000, ctime=1228846258.779501000, birthtime=-1, size=12288, blksize=4096, blocks=24, flags=0x0 }
  4121 rm       1228846258.740593 RET   fstat 0
  4121 rm       1228846258.740608 CALL  fchdir(0x4)
  4121 rm       1228846258.740626 RET   fchdir 0
  4121 rm       1228846258.740641 CALL  close(0x4)
  4121 rm       1228846258.740976 RET   close 0
  4121 rm       1228846258.740991 CALL  rmdir(0x2821c538)
  4121 rm       1228846258.741007 NAMI  "dnscheck"
  4121 rm       1228846258.741764 RET   rmdir 0
  4121 rm       1228846258.741783 CALL  stat(0x2821d028,0xbfbfe900)
  4121 rm       1228846258.741799 NAMI  "dnsdoctor"
  4121 rm       1228846258.742050 STRU  struct stat {dev=67174155, ino=2519891, mode=drwxr-xr-x , nlink=3, uid=9874, gid=929, rdev=0, atime=1228844788, stime=1227555712, 
ctime=1228845836.981842000, birthtime=-1, size=4096, blksize=4096, blocks=8, flags=0x0 }
  4121 rm       1228846258.742066 RET   stat 0
  4121 rm       1228846258.742080 CALL  open(0x2821d028,O_NONBLOCK,<unused>0x28200000)
  4121 rm       1228846258.742097 NAMI  "dnsdoctor"
  4121 rm       1228846258.742419 RET   open 4
  4121 rm       1228846258.742435 CALL  fstat(0x4,0xbfbfe6a0)
  4121 rm       1228846258.742452 STRU  struct stat {dev=67174155, ino=2519891, mode=drwxr-xr-x , nlink=3, uid=9874, gid=929, rdev=0, atime=1228844788, stime=1227555712, 
ctime=1228845836.981842000, birthtime=-1, size=4096, blksize=4096, blocks=8, flags=0x0 }
  4121 rm       1228846258.742465 RET   fstat 0
  4121 rm       1228846258.742480 CALL  fcntl(0x4,F_SETFD,FD_CLOEXEC)
  4121 rm       1228846258.742495 RET   fcntl 0
  4121 rm       1228846258.742516 CALL  fstatfs(0x4,0xbfbfe700)
  4121 rm       1228846258.742792 RET   fstatfs -1 errno 2 No such file or directory
  4121 rm       1228846258.742809 CALL  close(0x4)
  4121 rm       1228846258.743187 RET   close 0
  4121 rm       1228846258.743203 CALL  stat(0x2821d098,0xbfbfe900)
  4121 rm       1228846258.743219 NAMI  "dnsflood"
  4121 rm       1228846258.743544 STRU  struct stat {dev=67174155, ino=2519892, mode=drwxr-xr-x , nlink=3, uid=9874, gid=929, rdev=0, atime=1228844788, stime=1227555712, ctime=1228845836.978868000, birthtime=-1, size=4096, blksize=4096, blocks=8, flags=0x0 }
  4121 rm       1228846258.743559 RET   stat 0
  4121 rm       1228846258.743574 CALL  open(0x2821d098,O_NONBLOCK,<unused>0x28200000)
  4121 rm       1228846258.743590 NAMI  "dnsflood"
  4121 rm       1228846258.743792 RET   open 4
  4121 rm       1228846258.743809 CALL  fstat(0x4,0xbfbfe6a0)
  4121 rm       1228846258.743826 STRU  struct stat {dev=67174155, ino=2519892, mode=drwxr-xr-x , nlink=3, uid=9874, gid=929, rdev=0, atime=1228844788, stime=1227555712, ctime=1228845836.978868000, birthtime=-1, size=4096, blksize=4096, blocks=8, flags=0x0 }
  4121 rm       1228846258.743840 RET   fstat 0
  4121 rm       1228846258.743854 CALL  fcntl(0x4,F_SETFD,FD_CLOEXEC)
  4121 rm       1228846258.743867 RET   fcntl 0
  4121 rm       1228846258.743882 CALL  fstatfs(0x4,0xbfbfe700)
  4121 rm       1228846258.744008 RET   fstatfs -1 errno 2 No such file or directory
  4121 rm       1228846258.744022 CALL  close(0x4)
  4121 rm       1228846258.744411 RET   close 0

[I included a moderate amount of successful processing near the
beginning of that excerpt, so folks could see the pattern.]


In contrast, here are the similar kdump excerpts from RELENG_7:

   702 amd      1228774660.297858 CALL  gettimeofday(0x807ad48,0)
   702 amd      1228774660.297864 RET   gettimeofday 0
   702 amd      1228774660.297872 CALL  sigprocmask(SIG_BLOCK,0xbfbfeaec,0xbfbfeadc)
   702 amd      1228774660.297878 RET   sigprocmask 0
   702 amd      1228774660.297883 CALL  fork
   702 amd      1228774660.302658 RET   fork 16840/0x41c8
 16840 amd      1228774660.302660 RET   fork 0
   702 amd      1228774660.302689 CALL  sigprocmask(SIG_SETMASK,0xbfbfeadc,0)
 16840 amd      1228774660.302707 CALL  getpid
 16840 amd      1228774660.302725 RET   getpid 16840/0x41c8
   702 amd      1228774660.302715 RET   sigprocmask 0
   702 amd      1228774660.302746 CALL  gettimeofday(0x807ad48,0)
 16840 amd      1228774660.302753 CALL  unmount(0x2837d310,<invalid>0)
   702 amd      1228774660.302753 RET   gettimeofday 0
...
   702 amd      1228774660.417933 CALL  select(0x400,0xbfbfec40,0,0,0xbfbfecd8)
 16840 amd      1228774660.434632 RET   unmount -1 errno 16 Device busy
 16840 amd      1228774660.434684 CALL  sigprocmask(SIG_BLOCK,0x28097c00,0xbfbfea0c)
 16840 amd      1228774660.434691 RET   sigprocmask 0
 16840 amd      1228774660.434699 CALL  sigprocmask(SIG_SETMASK,0x28097c10,0)
 16840 amd      1228774660.434705 RET   sigprocmask 0
 16840 amd      1228774660.434713 CALL  sigprocmask(SIG_BLOCK,0x28097c00,0xbfbfe9dc)
 16840 amd      1228774660.434718 RET   sigprocmask 0
 16840 amd      1228774660.434729 CALL  sigprocmask(SIG_SETMASK,0x28097c10,0)
 16840 amd      1228774660.434734 RET   sigprocmask 0
 16840 amd      1228774660.434745 CALL  exit(0x10)
   702 amd      1228774660.435214 RET   select -1 errno 4 Interrupted system call
   702 amd      1228774660.435227 PSIG  SIGCHLD caught handler=0x805de20 mask=0x0 code=0x0
   702 amd      1228774660.435237 CALL  wait4(0xffffffff,0xbfbfe83c,WNOHANG,0)
   702 amd      1228774660.435255 RET   wait4 16840/0x41c8
   702 amd      1228774660.435296 CALL  wait4(0xffffffff,0xbfbfe83c,WNOHANG,0)
   702 amd      1228774660.435302 RET   wait4 -1 errno 10 No child processes
   702 amd      1228774660.435307 CALL  sigprocmask(SIG_SETMASK,0x807ba7c,0)
   702 amd      1228774660.435312 RET   sigprocmask 0
   702 amd      1228774660.435317 CALL  sigprocmask(SIG_BLOCK,0x807b784,0)
   702 amd      1228774660.435323 RET   sigprocmask 0

and:

 16835 rm       1228774660.305162 CALL  open(0x2816280b,O_RDONLY,<unused>0)
 16835 rm       1228774660.305173 NAMI  ".."
 16835 rm       1228774660.305626 RET   open 4
 16835 rm       1228774660.305634 CALL  fstat(0x4,0xbfbfe93c)
 16835 rm       1228774660.305644 STRU  struct stat {dev=50396945, ino=29713037, mode=drwxr-xr-x , nlink=91, uid=9874, gid=929, rdev=0, atime=1228774657.877477000, stime=
1228774660.314260000, ctime=1228774660.314260000, birthtime=0, size=20480, blksize=4096, blocks=40, flags=0x0 }
 16835 rm       1228774660.305651 RET   fstat 0
 16835 rm       1228774660.305658 CALL  fchdir(0x4)
 16835 rm       1228774660.305667 RET   fchdir 0
 16835 rm       1228774660.305674 CALL  close(0x4)
 16835 rm       1228774660.305824 RET   close 0
 16835 rm       1228774660.305831 CALL  rmdir(0x2821afe8)
 16835 rm       1228774660.305838 NAMI  "p-interp"
 16835 rm       1228774660.306498 RET   rmdir 0
 16835 rm       1228774660.306513 CALL  stat(0x28218b48,0xbfbfe8cc)
 16835 rm       1228774660.306519 NAMI  "pcemu"
 16835 rm       1228774660.306756 STRU  struct stat {dev=50396945, ino=8465981, mode=drwxr-xr-x , nlink=5, uid=9874, gid=929, rdev=0, atime=1228772282, stime=1227555736, 
ctime=1228773351.399184000, birthtime=0, size=4096, blksize=4096, blocks=8, flags=0x0 }
 16835 rm       1228774660.306764 RET   stat 0
 16835 rm       1228774660.306770 CALL  open(0x28218b48,O_NONBLOCK,<unused>0x1)
 16835 rm       1228774660.306776 NAMI  "pcemu"
 16835 rm       1228774660.307003 RET   open 4
 16835 rm       1228774660.307010 CALL  fstat(0x4,0xbfbfe8cc)
 16835 rm       1228774660.307018 STRU  struct stat {dev=50396945, ino=8465981, mode=drwxr-xr-x , nlink=5, uid=9874, gid=929, rdev=0, atime=1228772282, stime=1227555736, 
ctime=1228773351.399184000, birthtime=0, size=4096, blksize=4096, blocks=8, flags=0x0 }
 16835 rm       1228774660.307025 RET   fstat 0
 16835 rm       1228774660.307031 CALL  fcntl(0x4,F_SETFD,FD_CLOEXEC)
 16835 rm       1228774660.307039 RET   fcntl 0
 16835 rm       1228774660.307049 CALL  fstatfs(0x4,0xbfbfe6f4)
 16835 rm       1228774660.307294 RET   fstatfs -1 errno 2 No such file or directory
 16835 rm       1228774660.307302 CALL  close(0x4)
 16835 rm       1228774660.307549 RET   close 0
 16835 rm       1228774660.307557 CALL  stat(0x28218b98,0xbfbfe8cc)
 16835 rm       1228774660.307563 NAMI  "pearpc"
 16835 rm       1228774660.307759 STRU  struct stat {dev=50396945, ino=27094159, mode=drwxr-xr-x , nlink=4, uid=9874, gid=929, rdev=0, atime=1228772282, stime=1227555736, ctime=1228773351.390236000, birthtime=0, size=4096, blksize=4096, blocks=8, flags=0x0 }
 16835 rm       1228774660.307767 RET   stat 0
 16835 rm       1228774660.307772 CALL  open(0x28218b98,O_NONBLOCK,<unused>0x1)
 16835 rm       1228774660.307779 NAMI  "pearpc"
 16835 rm       1228774660.308000 RET   open 4
 16835 rm       1228774660.308007 CALL  fstat(0x4,0xbfbfe8cc)
 16835 rm       1228774660.308015 STRU  struct stat {dev=50396945, ino=27094159, mode=drwxr-xr-x , nlink=4, uid=9874, gid=929, rdev=0, atime=1228772282, stime=1227555736, ctime=1228773351.390236000, birthtime=0, size=4096, blksize=4096, blocks=8, flags=0x0 }
 16835 rm       1228774660.308021 RET   fstat 0
 16835 rm       1228774660.308026 CALL  fcntl(0x4,F_SETFD,FD_CLOEXEC)
 16835 rm       1228774660.308032 RET   fcntl 0
 16835 rm       1228774660.308038 CALL  fstatfs(0x4,0xbfbfe6f4)
 16835 rm       1228774660.308101 RET   fstatfs -1 errno 2 No such file or directory
 16835 rm       1228774660.308108 CALL  close(0x4)
 16835 rm       1228774660.308350 RET   close 0


So either way, the user-level program attempting the directory hierarchy
traversal can be coded to be very careful, yet still receive a rude
surprise -- the system saying that a file that the program had already
opened (and still has opened) does not exist.

How rude!  :-{

I'll be very happy to test suggested patches, whether intended to fix
the problem or merely facilitate diagnosis.  That said, it shouldn't be
difficult to reproduce this -- I did it with a copy  of /usr/ports; a
colleague has reported doing so with a copy of /usr/src (though I
haven't witnessed that).

Peace,
david
-- 
David H. Wolfskill				david_at_catwhisker.org
Depriving a girl or boy of an opportunity for education is evil.

See http://www.catwhisker.org/~david/publickey.gpg for my public key.

Received on Tue Dec 09 2008 - 18:14:31 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:39:38 UTC