On Thu, 24 Nov 2016 22:35:42 +0200, Konstantin Belousov
<kostikbel_at_gmail.com> wrote:

> On Thu, Nov 24, 2016 at 11:42:41AM -0700, Alan Somers wrote:
> > On Thu, Nov 24, 2016 at 5:53 AM, Rick Macklem <rmacklem_at_uoguelph.ca> wrote:
> > >
> > > On Wed, Nov 23, 2016 at 10:17:25PM -0700, Alan Somers wrote:
> > >> I have a FreeBSD 10.3-RELEASE-p12 server exporting its home
> > >> directories over both NFSv3 and NFSv4.  I have a TrueOS client (based
> > >> on 12.0-CURRENT on the drm-next-4.7 branch, built on 28-October)
> > >> mounting the home directories over NFSv4.  At first, everything is
> > >> fine and performance is good.  But if the client does a buildworld
> > >> using sources on NFS and locally stored objects, performance slowly
> > >> degrades.  The degradation is most noticeable with metadata-heavy
> > >> operations.  For example, "ls -l" in a directory with 153 files takes
> > >> less than 0.1 seconds right after booting.  But the longer the
> > >> buildworld goes on, the slower it gets.  Eventually that same "ls -l"
> > >> takes 19 seconds.  When the home directories are mounted over NFSv3
> > >> instead, I see no degradation.
> > >>
> > >> top shows negligible CPU consumption on the server, and very high
> > >> consumption on the client when using NFSv4 (nearly 100%).  The
> > >> NFS-using process is spending almost all of its time in system mode,
> > >> and dtrace shows that almost all of its time is spent in
> > >> ncl_getpages().
> > >>
> > > A couple of things you could do when it is slow (as well as what Kostik
> > > suggested):
> > > - nfsstat -c -e on the client and nfsstat -e -s on the server, to see
> > >   what RPCs are being done and how quickly.  (nfsstat -s -e will also
> > >   show you how big the DRC is, although a large DRC should show up as
> > >   increased CPU consumption on the server.)
> > > - capture packets with tcpdump -s 0 -w test.pcap host <other-one>
> > >   - then you can email me test.pcap as an attachment.  I can look at it
> > >     in wireshark and see if there seem to be protocol and/or TCP issues.
> > >     (You can look at it in wireshark yourself; look for NFS4ERR_xxx,
> > >     TCP segment retransmits, ...)
> > >
> > > If you are using either "intr" or "soft" on the mounts, try without
> > > those mount options.  (The Bugs section of mount_nfs recommends against
> > > using them.  If an RPC fails due to these options, something called a
> > > seqid# can be "out of sync" between client and server, and that causes
> > > serious problems.)
> > > --> These seqid#s are not used by NFSv4.1, so you could try that by
> > >     adding "minorversion=1" to your mount options.
> > >
> > > Good luck with it, rick
> >
> > I've reproduced the issue on stock FreeBSD 12, and I've also learned
> > that nullfs is a required factor.  Doing the buildworld directly on
> > the NFS mount doesn't cause any slowdown, but doing a buildworld on
> > the nullfs copy of the NFS mount does.  The slowdown affects the base
> > NFS mount as well as the nullfs copy.
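
For reference, the layering Alan describes boils down to something like the
following two mounts (the server name and paths are hypothetical, and an
NFSv4.1 retry per Rick's suggestion would add "minorversion=1"):

    # NFSv4 home directories from the 10.3 server
    mount -t nfs -o nfsv4 server:/home /usr/home
    # nullfs copy of that NFS mount; the buildworld runs against this path
    mount -t nullfs /usr/home /mnt/home-null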

> > Here is the nfsstat output for both server and client during "ls -al"
> > on the client:
> >
> > nfsstat -e -s -z
> >
> > Server Info:
> >   Getattr   Setattr    Lookup  Readlink      Read     Write    Create    Remove
> >       800         0       121         0         0         2         0         0
> >    Rename      Link   Symlink     Mkdir     Rmdir   Readdir  RdirPlus    Access
> >         0         0         0         0         0         0         0         8
> >     Mknod    Fsstat    Fsinfo  PathConf    Commit   LookupP   SetClId SetClIdCf
> >         0         0         0         0         1         3         0         0
> >      Open  OpenAttr OpenDwnGr  OpenCfrm DelePurge   DeleRet     GetFH      Lock
> >         0         0         0         0         0         0       123         0
> >     LockT     LockU     Close    Verify   NVerify     PutFH  PutPubFH PutRootFH
> >         0         0         0         0         0       674         0         0
> >     Renew RestoreFH    SaveFH   Secinfo RelLckOwn  V4Create
> >         0         0         0         0         0         0
> > Server:
> > Retfailed    Faults   Clients
> >         0         0         0
> > OpenOwner     Opens LockOwner     Locks    Delegs
> >         0         0         0         0         0
> > Server Cache Stats:
> >    Inprog      Idem  Non-idem    Misses CacheSize   TCPPeak
> >         0         0         0       674     16738     16738
> >
> > nfsstat -e -c -z
> >
> > Client Info:
> > Rpc Counts:
> >   Getattr   Setattr    Lookup  Readlink      Read     Write    Create    Remove
> >        60         0       119         0         0         0         0         0
> >    Rename      Link   Symlink     Mkdir     Rmdir   Readdir  RdirPlus    Access
> >         0         0         0         0         0         0         0         3
> >     Mknod    Fsstat    Fsinfo  PathConf    Commit   SetClId SetClIdCf      Lock
> >         0         0         0         0         0         0         0         0
> >     LockT     LockU      Open   OpenCfr
> >         0         0         0         0
> > OpenOwner     Opens LockOwner     Locks    Delegs  LocalOwn LocalOpen LocalLOwn
> >      5638    141453         0         0         0         0         0         0
> > LocalLock
> >         0
> > Rpc Info:
> >  TimedOut   Invalid X Replies   Retries  Requests
> >         0         0         0         0       662
> > Cache Info:
> > Attr Hits    Misses Lkup Hits    Misses BioR Hits    Misses BioW Hits    Misses
> >      1275        58       837       121         0         0         0         0
> > BioRLHits    Misses BioD Hits    Misses DirE Hits    Misses
> >         1         0         6         0         1         0
> >
> > And here are the most popular stack traces of "ls -al", as observed by
> > dtrace.  The number beneath each stack is the number of times dtrace
> > observed that exact stack:
> >
> >         kernel`bcmp+0x21
> >         kernel`vinactive+0xc6
> >         kernel`vputx+0x30e
> >         kernel`kern_statat+0x165
> >         kernel`sys_fstatat+0x2c
> >         kernel`amd64_syscall+0x314
> >         kernel`vputx+0x30e
> >         kernel`NDFREE+0xaa
> >         kernel`sys___acl_get_link+0x82
> >         kernel`amd64_syscall+0x314
> >         kernel`0xffffffff80eb95fb
> >          96
> >
> >         kernel`nfscl_doclose+0x383
> >         kernel`vinactive+0xc6
> >         kernel`vputx+0x30e
> >         kernel`NDFREE+0xaa
> >         kernel`sys___acl_get_link+0x82
> >         kernel`amd64_syscall+0x314
> >         kernel`0xffffffff80eb95fb
> >         183
> >
> >         kernel`nfscl_doclose+0x383
> >         kernel`vinactive+0xc6
> >         kernel`vputx+0x30e
> >         kernel`kern_statat+0x165
> >         kernel`sys_fstatat+0x2c
> >         kernel`amd64_syscall+0x314
> >         kernel`0xffffffff80eb95fb
> >         189
> >
> >         kernel`lock_delay+0x52
> >         kernel`nfs_lookup+0x337
> >         kernel`VOP_LOOKUP_APV+0xda
> >         kernel`lookup+0x6a2
> >         kernel`namei+0x57e
> >         kernel`sys___acl_get_link+0x55
> >         kernel`amd64_syscall+0x314
> >         kernel`0xffffffff80eb95fb
> >         194
> >
> >         kernel`lock_delay+0x52
> >         kernel`ncl_getattrcache+0x28
> >         kernel`nfs_getattr+0x92
> >         kernel`VOP_GETATTR_APV+0xda
> >         kernel`vn_stat+0xa3
> >         kernel`kern_statat+0xde
> >         kernel`sys_fstatat+0x2c
> >         kernel`amd64_syscall+0x314
> >         kernel`0xffffffff80eb95fb
> >         196
> >
> > What role could nullfs be playing?
>
> Can you check two things:
> 1. Does an NFSv3 mount used with nullfs your way cause the same issue, or
>    not?  You already said that NFSv3 somehow was not affected, but given
>    the discovery that nullfs is part of the scenario, can you please
>    confirm that this is still the case?
> 2. If you add the nocache option to the nullfs mount which degrades, does
>    the problem go away?
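
A minimal sketch of Kostik's second suggestion, disabling caching on the
degrading nullfs layer (paths again hypothetical):

    # remount the nullfs copy with caching disabled
    umount /mnt/home-null
    mount -t nullfs -o nocache /usr/home /mnt/home-null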

I'm curious whether this problem, if it really is related to nullfs, could
also affect "poudriere".