Re: Since last week (today) current on my Ryzen box is unstable

From: Gleb Smirnoff <glebius_at_FreeBSD.org>
Date: Sat, 17 Feb 2018 18:35:45 -0800

  Andriy,

On Sun, Feb 18, 2018 at 12:54:21AM +0200, Andriy Gapon wrote:
A> > Today's rebuild has given me uptimes of below an hour, usually.  The box will stay up in single user mode long enough to rebuild world/kernel, but multi-user it is panicking at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:1592
A> > 
A> > The backtrace shows that it gets to this panic from a sendfile() syscall.  The line above is in the middle of a big edit that's part of svn revision 329363.  The tripping assertion seems to suggest that m->valid != 0, for whatever that's worth.
A> 
A> I am doing a bit of an offline investigation with Andrew and it seems that the
A> actual panic message is this:
A> 
A> panic: vm_page_assert_xbusied: page 0xfffff807ebbd8f98 not exclusive busy @
A> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:1592
A> 
A> The stack is this:
A> vpanic() at vpanic/frame 0xfffffe00b3c36390
A> dmu_read_pages() at dmu_read_pages+0x535/frame 0xfffffe00b3c36460
A> zfs_freebsd_getpages() at zfs_freebsd_getpages+0x24c/frame 0xfffffe00b3c36510
A> VOP_GETPAGES_APV() at VOP_GETPAGES_APV+0xd9/frame 0xfffffe00b3c36540
A> vop_stdgetpages_async() at vop_stdgetpages_async+0x49/frame 0xfffffe00b3c36590
A> VOP_GETPAGES_ASYNC_APV() at VOP_GETPAGES_ASYNC_APV+0xd9/frame 0xfffffe00b3c365c0
A> vnode_pager_getpages_async() at vnode_pager_getpages_async+0x81/frame 0xfffffe00b3c36650
A> vn_sendfile() at vn_sendfile+0xe70/frame 0xfffffe00b3c368e0
A> sendfile() at sendfile+0x149/frame 0xfffffe00b3c36980
A> amd64_syscall() at amd64_syscall+0x79b/frame 0xfffffe00b3c36ab0
A> fast_syscall_common() at fast_syscall_common+0x101/frame 0x7fffffffdb00
A> 
A> I looked at sendfile_swapin() code and it seems that it uses the pager API in an
A> undocumented way.  Specifically, it inserts bogus_page into the array of
A> requested pages.  For starters, bogus_page is not busied and VOP_GETPAGES is
A> documented to have all requested pages exclusively busied.  Second, I always had
A> an impression that bogus_page is an implementation detail of the unified buffer
A> / page cache and that other code need not be aware of it.
A> 
A> So, my opinion is that the sendfile code uses a "clever hack" that happens to
A> work with the buffer cache based filesystems, but that that hack is a bug.
A> So, I'd prefer that the problem is fixed in that code.
A> But I am open to being convinced that all VOP_GETPAGES implementations,
A> including that in ZFS, must be made aware of bogus_page.  Or, at least, that
A> they should not verify that the requested pages are busied.

This is an optimization that improves throughput when the file memory cache
is fragmented. Why don't you like adding the code to zfs_freebsd_getpages()?
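
For illustration only, here is a rough user-space model of what "making a
getpages routine aware of bogus_page" could mean. It is a sketch under
assumptions, not kernel code: struct model_page, model_bogus_page and
model_getpages() are made-up names, and an actual change to
zfs_freebsd_getpages() or dmu_read_pages() may look quite different.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a VM page; NOT the kernel's struct vm_page. */
struct model_page {
	bool xbusy;	/* models "exclusively busied" */
	bool filled;	/* set once the model has "read" data into the page */
};

/* Models the shared placeholder page that the caller may put in the array. */
static struct model_page model_bogus_page;

/*
 * Hypothetical getpages-style loop.  Placeholder entries are skipped
 * before the busy check, so only real pages must be exclusively busied.
 * Returns the number of pages filled, or -1 as soon as a real page is
 * found that is not exclusively busied (pages before it stay filled).
 */
static int
model_getpages(struct model_page **ma, size_t count)
{
	int filled = 0;

	for (size_t i = 0; i < count; i++) {
		struct model_page *m = ma[i];

		if (m == &model_bogus_page)
			continue;	/* placeholder: nothing to read here */
		if (!m->xbusy)
			return (-1);	/* models the failed xbusied assertion */
		m->filled = true;
		filled++;
	}
	return (filled);
}
```

With an array of two busied pages and one placeholder between them, the
model fills the two real pages and leaves the placeholder untouched; without
the pointer comparison, the placeholder would trip the busy check.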

-- 
Gleb Smirnoff
Received on Sun Feb 18 2018 - 01:54:00 UTC