On Tue, Dec 02, 2008 at 04:15:38PM -0800, David Wolfskill wrote: > I seem to have a fairly- (though not deterministly so) reproducible > mode of failure with an NFS-mounted directory hierarchy: An attempt to > traverse a "sufficiently large" hierarchy (e.g., via "tar zcpf" or "rm > -fr") will fail to "visit" some subdirectories, typically apparently > acting as if the subdirectories in question do not actually exist > (despite the names having been returned in the output of a previous > readdir()). > ... I was able to reproduce the external symptoms of the failure running CURRENT as of yesterday, using "rm -fr" of a copy of a recent /usr/ports hierachy on an NFS-mounted file system as a test case. However, I believe the mechanism may be a bit different -- while still being other than what I would expect. One aspect in which the externally-observable symptoms were different (under CURRENT, vs. RELENG_7) is that under CURRENT, once the error condition occurred, the NFS client machine was in a state where it merely kept repeating nfs server pid848_at_fbsd-build:/volume: not responding until I logged in as root & rebooted it. Here's a cut/paste of the kdump from the ktrace of the amd(8) process under CURRENT, showing where the master amd(8) process (pid 848) forks a child (4126) to try the unmount: 848 amd 1228846258.722953 CALL gettimeofday(0x8078e48,0) 848 amd 1228846258.722964 RET gettimeofday 0 848 amd 1228846258.722982 CALL sigprocmask(SIG_BLOCK,0xbfbfeaec,0xbfbfeadc) 848 amd 1228846258.722993 RET sigprocmask 0 848 amd 1228846258.723003 CALL fork 848 amd 1228846258.730250 RET fork 4126/0x101e 848 amd 1228846258.730405 CALL sigprocmask(SIG_SETMASK,0xbfbfeadc,0) 4126 amd 1228846258.730252 RET fork 0 4126 amd 1228846258.730456 CALL getpid 4126 amd 1228846258.730467 RET getpid 4126/0x101e 4126 amd 1228846258.730493 CALL unmount(0x2825f340,<invalid>0) 848 amd 1228846258.730422 RET sigprocmask 0 848 amd 1228846258.730595 CALL gettimeofday(0x8078e48,0) 848 amd 1228846258.730608 RET gettimeofday 0 ... 848 amd 1228846258.914814 CALL sigprocmask(SIG_SETMASK,0xbfbfeba0,0) 848 amd 1228846258.914826 RET sigprocmask 0 848 amd 1228846258.914838 CALL select(0x400,0xbfbfec40,0,0,0xbfbfecd8) 4126 amd 1228846259.090428 RET unmount 0 4126 amd 1228846259.090492 CALL sigprocmask(SIG_BLOCK,0x2809b080,0xbfbfea0c) 4126 amd 1228846259.090505 RET sigprocmask 0 4126 amd 1228846259.090518 CALL sigprocmask(SIG_SETMASK,0x2809b090,0) 4126 amd 1228846259.090530 RET sigprocmask 0 4126 amd 1228846259.090545 CALL sigprocmask(SIG_BLOCK,0x2809b080,0xbfbfe9dc) 4126 amd 1228846259.090556 RET sigprocmask 0 4126 amd 1228846259.090576 CALL sigprocmask(SIG_SETMASK,0x2809b090,0) 4126 amd 1228846259.090587 RET sigprocmask 0 4126 amd 1228846259.090605 CALL exit(0) 848 amd 1228846259.091248 RET select -1 errno 4 Interrupted system call 848 amd 1228846259.091277 PSIG SIGCHLD caught handler=0x805e090 mask=0x0 code=0x0 848 amd 1228846259.091298 CALL wait4(0xffffffff,0xbfbfe83c,WNOHANG,0) 848 amd 1228846259.091329 RET wait4 4126/0x101e 848 amd 1228846259.091342 CALL wait4(0xffffffff,0xbfbfe83c,WNOHANG,0) 848 amd 1228846259.091352 RET wait4 -1 errno 10 No child processes 848 amd 1228846259.091365 CALL sigprocmask(SIG_SETMASK,0x80795bc,0) 848 amd 1228846259.091377 RET sigprocmask 0 848 amd 1228846259.091390 CALL sigprocmask(SIG_BLOCK,0x80792c4,0) 848 amd 1228846259.091401 RET sigprocmask 0 848 amd 1228846259.091411 CALL gettimeofday(0x8078e48,0) 848 amd 1228846259.091422 RET gettimeofday 0 Note that while the child didn't get EBUSY (as it does under RELENG_7) -- indeed, the unmount call appears to have returned 0 -- the master amd(8) process looks to be seeing "errno 4 Interrupted system call." And here's a relevent part of the kdump from the "rm -fr" -- I had kdump spit out Epoch timestamps with each in order to make correlation easier: 4121 rm 1228846258.736266 CALL unlink(0x2821c148) 4121 rm 1228846258.736281 NAMI "distinfo" 4121 rm 1228846258.738329 RET unlink 0 4121 rm 1228846258.738379 CALL unlink(0x2821c1b8) 4121 rm 1228846258.738401 NAMI "pkg-descr" 4121 rm 1228846258.739963 RET unlink 0 4121 rm 1228846258.739982 CALL open(0x28178b6b,O_RDONLY,<unused>0) 4121 rm 1228846258.740002 NAMI ".." 4121 rm 1228846258.740541 RET open 4 4121 rm 1228846258.740558 CALL fstat(0x4,0xbfbfe96c) 4121 rm 1228846258.740579 STRU struct stat {dev=67174155, ino=22674937, mode=drwxr-xr-x , nlink=114, uid=9874, gid=929, rdev=0, atime=1228846258.184514000, stime =1228846258.779501000, ctime=1228846258.779501000, birthtime=-1, size=12288, blksize=4096, blocks=24, flags=0x0 } 4121 rm 1228846258.740593 RET fstat 0 4121 rm 1228846258.740608 CALL fchdir(0x4) 4121 rm 1228846258.740626 RET fchdir 0 4121 rm 1228846258.740641 CALL close(0x4) 4121 rm 1228846258.740976 RET close 0 4121 rm 1228846258.740991 CALL rmdir(0x2821c538) 4121 rm 1228846258.741007 NAMI "dnscheck" 4121 rm 1228846258.741764 RET rmdir 0 4121 rm 1228846258.741783 CALL stat(0x2821d028,0xbfbfe900) 4121 rm 1228846258.741799 NAMI "dnsdoctor" 4121 rm 1228846258.742050 STRU struct stat {dev=67174155, ino=2519891, mode=drwxr-xr-x , nlink=3, uid=9874, gid=929, rdev=0, atime=1228844788, stime=1227555712, ctime=1228845836.981842000, birthtime=-1, size=4096, blksize=4096, blocks=8, flags=0x0 } 4121 rm 1228846258.742066 RET stat 0 4121 rm 1228846258.742080 CALL open(0x2821d028,O_NONBLOCK,<unused>0x28200000) 4121 rm 1228846258.742097 NAMI "dnsdoctor" 4121 rm 1228846258.742419 RET open 4 4121 rm 1228846258.742435 CALL fstat(0x4,0xbfbfe6a0) 4121 rm 1228846258.742452 STRU struct stat {dev=67174155, ino=2519891, mode=drwxr-xr-x , nlink=3, uid=9874, gid=929, rdev=0, atime=1228844788, stime=1227555712, ctime=1228845836.981842000, birthtime=-1, size=4096, blksize=4096, blocks=8, flags=0x0 } 4121 rm 1228846258.742465 RET fstat 0 4121 rm 1228846258.742480 CALL fcntl(0x4,F_SETFD,FD_CLOEXEC) 4121 rm 1228846258.742495 RET fcntl 0 4121 rm 1228846258.742516 CALL fstatfs(0x4,0xbfbfe700) 4121 rm 1228846258.742792 RET fstatfs -1 errno 2 No such file or directory 4121 rm 1228846258.742809 CALL close(0x4) 4121 rm 1228846258.743187 RET close 0 4121 rm 1228846258.743203 CALL stat(0x2821d098,0xbfbfe900) 4121 rm 1228846258.743219 NAMI "dnsflood" 4121 rm 1228846258.743544 STRU struct stat {dev=67174155, ino=2519892, mode=drwxr-xr-x , nlink=3, uid=9874, gid=929, rdev=0, atime=1228844788, stime=1227555712, ctime=1228845836.978868000, birthtime=-1, size=4096, blksize=4096, blocks=8, flags=0x0 } 4121 rm 1228846258.743559 RET stat 0 4121 rm 1228846258.743574 CALL open(0x2821d098,O_NONBLOCK,<unused>0x28200000) 4121 rm 1228846258.743590 NAMI "dnsflood" 4121 rm 1228846258.743792 RET open 4 4121 rm 1228846258.743809 CALL fstat(0x4,0xbfbfe6a0) 4121 rm 1228846258.743826 STRU struct stat {dev=67174155, ino=2519892, mode=drwxr-xr-x , nlink=3, uid=9874, gid=929, rdev=0, atime=1228844788, stime=1227555712, ctime=1228845836.978868000, birthtime=-1, size=4096, blksize=4096, blocks=8, flags=0x0 } 4121 rm 1228846258.743840 RET fstat 0 4121 rm 1228846258.743854 CALL fcntl(0x4,F_SETFD,FD_CLOEXEC) 4121 rm 1228846258.743867 RET fcntl 0 4121 rm 1228846258.743882 CALL fstatfs(0x4,0xbfbfe700) 4121 rm 1228846258.744008 RET fstatfs -1 errno 2 No such file or directory 4121 rm 1228846258.744022 CALL close(0x4) 4121 rm 1228846258.744411 RET close 0 [I included a moderate amount of successful processing near the beginning of that excerpt, so folks could see the pattern.] In contrast, here are the similar kdump excerpts from RELENG_7: 702 amd 1228774660.297858 CALL gettimeofday(0x807ad48,0) 702 amd 1228774660.297864 RET gettimeofday 0 702 amd 1228774660.297872 CALL sigprocmask(SIG_BLOCK,0xbfbfeaec,0xbfbfeadc) 702 amd 1228774660.297878 RET sigprocmask 0 702 amd 1228774660.297883 CALL fork 702 amd 1228774660.302658 RET fork 16840/0x41c8 16840 amd 1228774660.302660 RET fork 0 702 amd 1228774660.302689 CALL sigprocmask(SIG_SETMASK,0xbfbfeadc,0) 16840 amd 1228774660.302707 CALL getpid 16840 amd 1228774660.302725 RET getpid 16840/0x41c8 702 amd 1228774660.302715 RET sigprocmask 0 702 amd 1228774660.302746 CALL gettimeofday(0x807ad48,0) 16840 amd 1228774660.302753 CALL unmount(0x2837d310,<invalid>0) 702 amd 1228774660.302753 RET gettimeofday 0 ... 702 amd 1228774660.417933 CALL select(0x400,0xbfbfec40,0,0,0xbfbfecd8) 16840 amd 1228774660.434632 RET unmount -1 errno 16 Device busy 16840 amd 1228774660.434684 CALL sigprocmask(SIG_BLOCK,0x28097c00,0xbfbfea0c) 16840 amd 1228774660.434691 RET sigprocmask 0 16840 amd 1228774660.434699 CALL sigprocmask(SIG_SETMASK,0x28097c10,0) 16840 amd 1228774660.434705 RET sigprocmask 0 16840 amd 1228774660.434713 CALL sigprocmask(SIG_BLOCK,0x28097c00,0xbfbfe9dc) 16840 amd 1228774660.434718 RET sigprocmask 0 16840 amd 1228774660.434729 CALL sigprocmask(SIG_SETMASK,0x28097c10,0) 16840 amd 1228774660.434734 RET sigprocmask 0 16840 amd 1228774660.434745 CALL exit(0x10) 702 amd 1228774660.435214 RET select -1 errno 4 Interrupted system call 702 amd 1228774660.435227 PSIG SIGCHLD caught handler=0x805de20 mask=0x0 code=0x0 702 amd 1228774660.435237 CALL wait4(0xffffffff,0xbfbfe83c,WNOHANG,0) 702 amd 1228774660.435255 RET wait4 16840/0x41c8 702 amd 1228774660.435296 CALL wait4(0xffffffff,0xbfbfe83c,WNOHANG,0) 702 amd 1228774660.435302 RET wait4 -1 errno 10 No child processes 702 amd 1228774660.435307 CALL sigprocmask(SIG_SETMASK,0x807ba7c,0) 702 amd 1228774660.435312 RET sigprocmask 0 702 amd 1228774660.435317 CALL sigprocmask(SIG_BLOCK,0x807b784,0) 702 amd 1228774660.435323 RET sigprocmask 0 and: 16835 rm 1228774660.305162 CALL open(0x2816280b,O_RDONLY,<unused>0) 16835 rm 1228774660.305173 NAMI ".." 16835 rm 1228774660.305626 RET open 4 16835 rm 1228774660.305634 CALL fstat(0x4,0xbfbfe93c) 16835 rm 1228774660.305644 STRU struct stat {dev=50396945, ino=29713037, mode=drwxr-xr-x , nlink=91, uid=9874, gid=929, rdev=0, atime=1228774657.877477000, stime= 1228774660.314260000, ctime=1228774660.314260000, birthtime=0, size=20480, blksize=4096, blocks=40, flags=0x0 } 16835 rm 1228774660.305651 RET fstat 0 16835 rm 1228774660.305658 CALL fchdir(0x4) 16835 rm 1228774660.305667 RET fchdir 0 16835 rm 1228774660.305674 CALL close(0x4) 16835 rm 1228774660.305824 RET close 0 16835 rm 1228774660.305831 CALL rmdir(0x2821afe8) 16835 rm 1228774660.305838 NAMI "p-interp" 16835 rm 1228774660.306498 RET rmdir 0 16835 rm 1228774660.306513 CALL stat(0x28218b48,0xbfbfe8cc) 16835 rm 1228774660.306519 NAMI "pcemu" 16835 rm 1228774660.306756 STRU struct stat {dev=50396945, ino=8465981, mode=drwxr-xr-x , nlink=5, uid=9874, gid=929, rdev=0, atime=1228772282, stime=1227555736, ctime=1228773351.399184000, birthtime=0, size=4096, blksize=4096, blocks=8, flags=0x0 } 16835 rm 1228774660.306764 RET stat 0 16835 rm 1228774660.306770 CALL open(0x28218b48,O_NONBLOCK,<unused>0x1) 16835 rm 1228774660.306776 NAMI "pcemu" 16835 rm 1228774660.307003 RET open 4 16835 rm 1228774660.307010 CALL fstat(0x4,0xbfbfe8cc) 16835 rm 1228774660.307018 STRU struct stat {dev=50396945, ino=8465981, mode=drwxr-xr-x , nlink=5, uid=9874, gid=929, rdev=0, atime=1228772282, stime=1227555736, ctime=1228773351.399184000, birthtime=0, size=4096, blksize=4096, blocks=8, flags=0x0 } 16835 rm 1228774660.307025 RET fstat 0 16835 rm 1228774660.307031 CALL fcntl(0x4,F_SETFD,FD_CLOEXEC) 16835 rm 1228774660.307039 RET fcntl 0 16835 rm 1228774660.307049 CALL fstatfs(0x4,0xbfbfe6f4) 16835 rm 1228774660.307294 RET fstatfs -1 errno 2 No such file or directory 16835 rm 1228774660.307302 CALL close(0x4) 16835 rm 1228774660.307549 RET close 0 16835 rm 1228774660.307557 CALL stat(0x28218b98,0xbfbfe8cc) 16835 rm 1228774660.307563 NAMI "pearpc" 16835 rm 1228774660.307759 STRU struct stat {dev=50396945, ino=27094159, mode=drwxr-xr-x , nlink=4, uid=9874, gid=929, rdev=0, atime=1228772282, stime=1227555736, ctime=1228773351.390236000, birthtime=0, size=4096, blksize=4096, blocks=8, flags=0x0 } 16835 rm 1228774660.307767 RET stat 0 16835 rm 1228774660.307772 CALL open(0x28218b98,O_NONBLOCK,<unused>0x1) 16835 rm 1228774660.307779 NAMI "pearpc" 16835 rm 1228774660.308000 RET open 4 16835 rm 1228774660.308007 CALL fstat(0x4,0xbfbfe8cc) 16835 rm 1228774660.308015 STRU struct stat {dev=50396945, ino=27094159, mode=drwxr-xr-x , nlink=4, uid=9874, gid=929, rdev=0, atime=1228772282, stime=1227555736, ctime=1228773351.390236000, birthtime=0, size=4096, blksize=4096, blocks=8, flags=0x0 } 16835 rm 1228774660.308021 RET fstat 0 16835 rm 1228774660.308026 CALL fcntl(0x4,F_SETFD,FD_CLOEXEC) 16835 rm 1228774660.308032 RET fcntl 0 16835 rm 1228774660.308038 CALL fstatfs(0x4,0xbfbfe6f4) 16835 rm 1228774660.308101 RET fstatfs -1 errno 2 No such file or directory 16835 rm 1228774660.308108 CALL close(0x4) 16835 rm 1228774660.308350 RET close 0 So either way, the user-level program attempting the directory hierarchy traversal can be coded to be very careful, yet still receive a rude surprise -- the system saying that a file that the program had already opened (and still has opened) does not exist. How rude! :-{ I'll be very happy to test suggested patches, whether intended to fix the problem or merely facilitate diagnosis. That said, it shouldn't be difficult to reproduce this -- I did it with a copy of /usr/ports; a colleague has reported doing so with a copy of /usr/src (though I haven't witnessed that). Peace, david -- David H. Wolfskill david_at_catwhisker.org Depriving a girl or boy of an opportunity for education is evil. See http://www.catwhisker.org/~david/publickey.gpg for my public key.
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:39:38 UTC