On 12 Nov, Mateusz Guzik wrote: > On Fri, Nov 11, 2016 at 11:29:48PM -0800, Don Lewis wrote: >> On 11 Nov, To: freebsd-current_at_FreeBSD.org wrote: >> > Yesterday I upgraded my ports builder box from r306349 to r308477. It >> > crashed overnight during a poudriere run. >> > >> > panic: mutex ncnegl not owned at /usr/src/sys/kern/vfs_cache.c:743 >> >> This problem is very reproduceable for me, so I added some assertions to >> the code to try to track it down: >> >> panic: Bad link elm 0xfffff802a59cb690 prev->next != elm >> #13 0xffffffff80b0d55a in cache_negative_remove (ncp=0xfffff802a59cb690, >> neg_locked=<optimized out>) at /usr/src/sys/kern/vfs_cache.c:747 >> #14 0xffffffff80b0d84a in cache_zap_locked (ncp=0xfffff802a59cb690, >> neg_locked=false) at /usr/src/sys/kern/vfs_cache.c:890 >> #15 0xffffffff80b0ab98 in cache_zap_rlocked_bucket (ncp=0xfffff802a59cb690, >> blp=<optimized out>) at /usr/src/sys/kern/vfs_cache.c:1021 >> #16 cache_lookup (dvp=<optimized out>, vpp=<optimized out>, >> cnp=<optimized out>, tsp=<optimized out>, ticksp=<optimized out>) >> at /usr/src/sys/kern/vfs_cache.c:1308 >> #17 0xffffffff80b0e97c in vfs_cache_lookup (ap=<optimized out>) >> at /usr/src/sys/kern/vfs_cache.c:2024 >> #18 0xffffffff8103fcea in VOP_LOOKUP_APV (vop=<optimized out>, >> a=<optimized out>) at vnode_if.c:127 >> #19 0xffffffff80b17ed2 in VOP_LOOKUP (vpp=0xfffffe085bfa86e8, >> cnp=0xfffffe085bfa8710, dvp=<optimized out>) at ./vnode_if.h:54 >> #20 lookup (ndp=<optimized out>) at /usr/src/sys/kern/vfs_lookup.c:830 >> #21 0xffffffff80b1748e in namei (ndp=<optimized out>) >> at /usr/src/sys/kern/vfs_lookup.c:397 >> #22 0xffffffff80b30bc4 in kern_renameat (td=<optimized out>, oldfd=10, >> old=0x28e72a59 <error: Cannot access memory at address 0x28e72a59>, >> newfd=10, >> new=0x28e72601 <error: Cannot access memory at address 0x28e72601>, >> pathseg=<optimized out>) at /usr/src/sys/kern/vfs_syscalls.c:3515 >> #23 0xffffffff80ff98a0 in syscallenter (td=0xfffff801ebfd8500, >> sa=<optimized out>) >> at /usr/src/sys/amd64/ia32/../../kern/subr_syscall.c:135 >> #24 ia32_syscall (frame=0xfffffe085bfa8a40) >> at /usr/src/sys/amd64/ia32/ia32_syscall.c:187 >> #25 <signal handler called> >> >> >> #13 0xffffffff80b0d55a in cache_negative_remove (ncp=0xfffff802a59cb690, >> neg_locked=<optimized out>) at /usr/src/sys/kern/vfs_cache.c:747 >> 747 TAILQ_REMOVE(&ncneg_hot.nl_list, ncp, nc_dst); >> (kgdb) list >> 742 } else { >> 743 mtx_assert(&neglist->nl_lock, MA_OWNED); >> 744 mtx_assert(&ncneg_hot.nl_lock, MA_OWNED); >> 745 } >> 746 if (ncp->nc_flag & NCF_HOTNEGATIVE) { >> 747 TAILQ_REMOVE(&ncneg_hot.nl_list, ncp, nc_dst); >> 748 } else { >> 749 TAILQ_REMOVE(&neglist->nl_list, ncp, nc_dst); >> 750 } >> 751 if (list_locked) >> >> (kgdb) print ncp >> $1 = (struct namecache *) 0xfffff802a59cb690 >> >> (kgdb) print *ncp >> $2 = {nc_hash = {le_next = 0x0, le_prev = 0xfffffe0001814178}, nc_src = { >> le_next = 0xfffff80415e22290, le_prev = 0xfffff8048e5597a8}, nc_dst = { >> tqe_next = 0xfffff80421744a10, tqe_prev = 0xfffff80421744a30}, >> nc_dvp = 0xfffff8048e559760, n_un = {nu_vp = 0x0, nu_neghits = 0}, >> nc_flag = 32 ' ', nc_nlen = 25 '\031', >> nc_name = 0xfffff802a59cb6d2 "gtk3-icon-browser.desktop"} >> (kgdb) print ncp->nc_dst->tqe_prev >> $8 = (struct namecache **) 0xfffff80421744a30 >> (kgdb) print *(ncp->nc_dst->tqe_prev) >> $9 = (struct namecache *) 0x0 >> >> >> The line number for the failed assertion doesn't make sense. The value >> of nc_flag is 32 (0x20), but NCF_HOTNEGATIVE is defined as 0x40 and >> NCF_NEGATIVE is defined as 0x20. In any case, the list seems to be >> messed up. >> > > Can you check if the problem persists after reverting r307650 and > r307685? > > I got another report of trouble with poudriere, but so far can neither > reproduce nor diagnose what happens here. > > For instance with this crash, in principle there could be a race where > the entry is demoted and loses the HOTNEGATIVE. But the demotion code > has the hot list locked and so does this code if it spots the flag. > What's more, the race is handled by re-checking the flag after the lock > was taken. > > That said, there is something seriously fishing going on here. I can try later. After several crashes in fairly short order, the machine currently has 11+ hours of uptime while running poudriere. I made no kernel changes this time. The earlier panics seemed to indicate that an entry was on the wrong list. cache_negative_zap_one() calls cache_negative_shrink_select(), which updates ncp and neglist, with the latter locked if ncp is not NULL. Later on, it drops the locks and relocks various things including neglist before calling cache_zap_locked(), which in turn calls cache_negative_remove(). The latter calls NCP2NEGLIST(ncp) to recompute neglist, which it finds is not locked. One of asserts that I added to cache_negative_zap_one() is a sanity check for the value of neglist returned by cache_negative_shrink_select(). The neglist value the latter returns is the index of the neglist array where it found the entry, not the result of calling NCP2NEGLIST() on the entry. That assert fired, indicating that there was an inconsistency, but so far I have not been able to track it further back. I also didn't see any way for things to get out of whack.Received on Sat Nov 12 2016 - 17:25:47 UTC
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:08 UTC