On Tue, Sep 29, 2020 at 02:59:43PM +0300, Michael Zhilin wrote: > Hi, > > I'm using FreeBSD 13-CURRENT (pre-ZoF, r359724) on my laptop with installed > Gnome. Sometimes > (once a week/month) gnome hangs and the system may be still responsible > (may be not). > This week it happened again and I've gathered information via ddb/textdump > and rebooted laptop. > > gnome-shell is trying to get exclusive lock on some directory according to > information > from "show alllocks" and "bt": > > Process 2355 (gnome-shell) thread 0xfffffe00b03d4700 (100483) > exclusive lockmgr zfs (zfs) r = 0 (0xfffff802112fe808) locked _at_ > /usr/src/sys/kern/vfs_subr.c:2930 > > Tracing command gnome-shell pid 2355 tid 100483 td 0xfffffe00b03d4700 > sched_switch() at sched_switch+0x5b2/frame 0xfffffe00aeff7220 > mi_switch() at mi_switch+0x155/frame 0xfffffe00aeff7240 > sleepq_switch() at sleepq_switch+0x11a/frame 0xfffffe00aeff7280 > sleeplk() at sleeplk+0x106/frame 0xfffffe00aeff72e0 > lockmgr_xlock_hard() at lockmgr_xlock_hard+0x21f/frame > 0xfffffe00aeff7380 > _vn_lock() at _vn_lock+0x54/frame 0xfffffe00aeff73e0 > zfs_lookup() at zfs_lookup+0x5f7/frame 0xfffffe00aeff74c0 > zfs_freebsd_cachedlookup() at zfs_freebsd_cachedlookup+0x8e/frame > 0xfffffe00aeff7600 > vfs_cache_lookup() at vfs_cache_lookup+0xa8/frame 0xfffffe00aeff7650 > lookup() at lookup+0x5e1/frame 0xfffffe00aeff76f0 > namei() at namei+0x524/frame 0xfffffe00aeff77e0 > vn_open_cred() at vn_open_cred+0x10b/frame 0xfffffe00aeff7930 > kern_openat() at kern_openat+0x1fa/frame 0xfffffe00aeff7aa0 > filemon_wrapper_openat() at filemon_wrapper_openat+0x15/frame > 0xfffffe00aeff7ad0 > amd64_syscall() at amd64_syscall+0x140/frame 0xfffffe00aeff7bf0 > fast_syscall_common() at fast_syscall_common+0x101/frame > 0xfffffe00aeff7bf0 > --- syscall (499, FreeBSD ELF64, filemon_wrapper_openat), rip = > 0x80148dbca, rsp = 0x7fffffffe248, rbp = 0x7fffffffe2c0 --- > > Line vfs_subr.c:2930 mentioned by lockmgr is part of vget_finish. I've > found blocked stack > with shared lock request: > > Tracing command gsd-color pid 2422 tid 100784 td 0xfffffe00b08e0700 > sched_switch() at sched_switch+0x5b2/frame 0xfffffe00b07084f0 > mi_switch() at mi_switch+0x155/frame 0xfffffe00b0708510 > sleepq_switch() at sleepq_switch+0x11a/frame 0xfffffe00b0708550 > sleeplk() at sleeplk+0x106/frame 0xfffffe00b07085b0 > lockmgr_slock_hard() at lockmgr_slock_hard+0x1ce/frame > 0xfffffe00b0708640 > _vn_lock() at _vn_lock+0x54/frame 0xfffffe00b07086a0 > vget_finish() at vget_finish+0x42/frame 0xfffffe00b07086d0 > cache_lookup() at cache_lookup+0x57c/frame 0xfffffe00b0708790 > vfs_cache_lookup() at vfs_cache_lookup+0x7d/frame 0xfffffe00b07087e0 > lookup() at lookup+0x5e1/frame 0xfffffe00b0708880 > namei() at namei+0x524/frame 0xfffffe00b0708970 > kern_accessat() at kern_accessat+0x106/frame 0xfffffe00b0708ad0 > amd64_syscall() at amd64_syscall+0x140/frame 0xfffffe00b0708bf0 > fast_syscall_common() at fast_syscall_common+0x101/frame > 0xfffffe00b0708bf0 > --- syscall (33, FreeBSD ELF64, sys_access), rip = 0x800fd0eba, rsp = > 0x7fffffffe268, rbp = 0x7fffffffe380 --- > > All other locked threads look more interesting. Thread 100747 is trying to > write > into file and waiting for page, but thread 101436 is trying to handle page > fault > and waiting for a shared lock of locked_range due to present writer. I > suppose it's > deadlock caused all troubles. > > Tracing command dconf-service pid 2384 tid 100747 td 0xfffffe00b08cc000 > sched_switch() at sched_switch+0x5b2/frame 0xfffffe00b0d762e0 > mi_switch() at mi_switch+0x155/frame 0xfffffe00b0d76300 > sleepq_switch() at sleepq_switch+0x11a/frame 0xfffffe00b0d76340 > _vm_page_busy_sleep() at _vm_page_busy_sleep+0x110/frame > 0xfffffe00b0d76390 > vm_page_acquire_unlocked() at vm_page_acquire_unlocked+0x177/frame > 0xfffffe00b0d763f0 > vm_page_grab_valid_unlocked() at vm_page_grab_valid_unlocked+0x51/frame > 0xfffffe00b0d76430 > zfs_freebsd_write() at zfs_freebsd_write+0x9b6/frame 0xfffffe00b0d76640 > VOP_WRITE_APV() at VOP_WRITE_APV+0xa7/frame 0xfffffe00b0d76750 > vn_write() at vn_write+0x2a4/frame 0xfffffe00b0d767d0 > vn_io_fault_doio() at vn_io_fault_doio+0x43/frame 0xfffffe00b0d76830 > vn_io_fault1() at vn_io_fault1+0x16c/frame 0xfffffe00b0d76980 > vn_io_fault() at vn_io_fault+0x182/frame 0xfffffe00b0d769f0 > dofilewrite() at dofilewrite+0x81/frame 0xfffffe00b0d76a40 > kern_pwritev() at kern_pwritev+0x62/frame 0xfffffe00b0d76a80 > sys_pwrite() at sys_pwrite+0x8a/frame 0xfffffe00b0d76ad0 > amd64_syscall() at amd64_syscall+0x140/frame 0xfffffe00b0d76bf0 > fast_syscall_common() at fast_syscall_common+0x101/frame > 0xfffffe00b0d76bf0 > --- syscall (476, FreeBSD ELF64, sys_pwrite), rip = 0x8007428ca, rsp = > 0x7fffffffdf08, rbp = 0x7fffffffdf20 --- > > Tracing command evolution pid 4536 tid 101436 td 0xfffffe00bf484c00 > sched_switch() at sched_switch+0x5b2/frame 0xfffffe00bfd446e0 > mi_switch() at mi_switch+0x155/frame 0xfffffe00bfd44700 > sleepq_switch() at sleepq_switch+0x11a/frame 0xfffffe00bfd44740 > _cv_wait() at _cv_wait+0x15a/frame 0xfffffe00bfd447a0 > rangelock_enter() at rangelock_enter+0x306/frame 0xfffffe00bfd447f0 This call to rangelock_enter() looks suspicious. This is a call to ZFS own rangelocks, not our rangelocks. Still, if write took rangelock on the same range, we get a deadlock due to LoR between rangelock and page busy. > zfs_freebsd_getpages() at zfs_freebsd_getpages+0x14f/frame > 0xfffffe00bfd448a0 > vnode_pager_getpages() at vnode_pager_getpages+0x37/frame > 0xfffffe00bfd448e0 > vm_pager_get_pages() at vm_pager_get_pages+0x4f/frame 0xfffffe00bfd44930 > vm_fault() at vm_fault+0x780/frame 0xfffffe00bfd44a40 > vm_fault_trap() at vm_fault_trap+0x6e/frame 0xfffffe00bfd44a80 > trap_pfault() at trap_pfault+0x1ee/frame 0xfffffe00bfd44ae0 > trap() at trap+0x44c/frame 0xfffffe00bfd44bf0 > calltrap() at calltrap+0x8/frame 0xfffffe00bfd44bf0 > --- trap 0xc, rip = 0x80a55de3f, rsp = 0x7fffffffcc60, rbp = > 0x7fffffffcc60 --- > > If it happens again, I will try to gather more information about locks, > corresponding > files and directories. For a while I wonder how to reproduce this case. For > instance, > one program has mmapped huge file and reads it infinitely, another program > write random > data in the middle of the file. But it's unclear how to trigger page fault > manually. > As I understand, kernel provides "vm_page_set_invalid" to mark it invalid. > Is there > any way to invalidate a page of backed object from userland? > > I will appreciate any advice on how to troubleshoot this issue. > > The full output of ddb is here: https://people.freebsd.org/~mizhka/ddb.txt > Thanks! > _______________________________________________ > freebsd-current_at_freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-current > To unsubscribe, send any mail to "freebsd-current-unsubscribe_at_freebsd.org"Received on Tue Sep 29 2020 - 11:20:43 UTC
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:25 UTC