vm_page_remove() crash on sys_exit() (possibly ZFS related)

From: Peter Schuller <peter.schuller_at_infidyne.com>
Date: Wed, 22 Jul 2009 19:17:41 +0200

Hello,

so I finally got my crash dump. I'll include some more history further
down. First off:

   http://distfiles.scode.org/mlref/crashdump_20090722/core.txt.0
   http://distfiles.scode.org/mlref/crashdump_20090722/backtrace.txt

Inline version of backtrace appears below[1] (after background).

So this is a general protection fault in vm_page_remove called
indirectly from sys_exit(). Worth noting is that at least once (the
previous crash, without a dump) I got a "logic" panic rather than a
memory error; I'm pretty sure the panic message was related to page
*inserts*. Grepping the source indicates:

  vm_page.c:              panic("vm_page_insert: page already inserted");
  vm_page.c:                      panic("vm_page_insert: offset already allocated");

However, I could not say for sure whether one of these was indeed the
exact panic I got; I have no crash dump from that occasion, nor was I
able to see a stack trace at the time.

Some further background and speculation:

This system is root-on-ZFS where I have been tracking CURRENT for
several months. I updated every month or so in part to test
improvements to ZFS; specifically the fixes that have gone in for
deadlock/hang issues.

My "test case" is to run a bulk build of all my ports (the port list
is a semi-typical desktop; about 700 packages in total). It
would very often hang (before) or crash (now) at least once during
such a build; the firefox build in particular is extremely
over-represented, at least now that I see the crash symptom.

Going back to my tracking of current, at some point, I think roughly a
couple of months ago by now, I stopped experiencing deadlocks/hangs
(or at least have not seen any yet), but instead began seeing
panics. No longer seeing hangs was expected because the reason I
updated that particular time, if I recall correctly, was specifically
that I believed that all the work-in-progress ZFS fixes had gone
in. However I am not 100% sure of the timing.

Since then I've updated a couple of times more, most recently to
BETA1, but am still seeing this crash.

Wannabe speculation based on insufficient understanding of the VM
system:

vm_page_remove() requires, according to its comments, that both the
object and the page be locked. The actual crash in this case happens
when checking m->oflags:

        if (m->oflags & VPO_BUSY) {
                m->oflags &= ~VPO_BUSY;
                vm_page_flash(m);
        }

The "m->oflags & VPO_BUSY" evaluation is the culprit, if line numbers
can be trusted.

If I recall correctly, at least one of the deadlock/hang fixes for ZFS
did involve a change to locking, so I'm thinking the introduction of
the crashing may in fact be related to the ZFS fix itself. However now
that I think about it perhaps the only locking changes were vnode ones
rather than vm objects/pages? Also, interestingly, reading m->object
right before succeeds, and so does the lock assert on the object.

Is it possible the vm page was NOT locked even though m->object was
locked?
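
One way to sanity-check the pointers involved is sketched below. The
trap type in the backtrace (type=9) is, as far as I understand, a
general protection fault, which on amd64 a plain data access usually
raises for a non-canonical address; an unmapped but canonical address
would normally give a page fault instead. The helper is a user-space
sketch, not kernel code; the name is_canonical() and the assumption
of 48-bit virtual addresses are mine.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of FreeBSD): report whether a 64-bit
 * virtual address is canonical, assuming 48-bit virtual addresses.
 * Under that assumption bits 63..47 must be all zeros or all ones.
 */
static int
is_canonical(uint64_t va)
{
	uint64_t top = va >> 47;	/* the 17 bits 63..47 */

	return (top == 0 || top == 0x1ffff);
}
```

Feeding it the values from frames #9 and #11 of the backtrace
(m=0xffffff00bebe7f90 and object=0xffffff0066392948) reports both as
canonical. If that reasoning holds, the faulting access may have gone
through some other pointer than m itself, which would fit the doubt
above about whether the line numbers can be trusted.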

[1] Inline backtrace:

#0  doadump () at pcpu.h:223
#1  0xffffffff801d248c in db_fncall (dummy1=Variable "dummy1" is not available.
) at /usr/src/sys/ddb/db_command.c:548
#2  0xffffffff801d27c1 in db_command (last_cmdp=0xffffffff80b667a0, cmd_table=Variable "cmd_table" is not available.
) at /usr/src/sys/ddb/db_command.c:445
#3  0xffffffff801d2a10 in db_command_loop () at /usr/src/sys/ddb/db_command.c:498
#4  0xffffffff801d49a9 in db_trap (type=Variable "type" is not available.
) at /usr/src/sys/ddb/db_main.c:229
#5  0xffffffff805b5f25 in kdb_trap (type=9, code=0, tf=0xffffff805b9608d0) at /usr/src/sys/kern/subr_kdb.c:534
#6  0xffffffff80812efd in trap_fatal (frame=0xffffff805b9608d0, eva=Variable "eva" is not available.
) at /usr/src/sys/amd64/amd64/trap.c:847
#7  0xffffffff80813a1d in trap (frame=0xffffff805b9608d0) at /usr/src/sys/amd64/amd64/trap.c:639
#8  0xffffffff807f9793 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:223
#9  0xffffffff807d941f in vm_page_remove (m=0xffffff00bebe7f90) at /usr/src/sys/vm/vm_page.c:730
#10 0xffffffff807d957d in vm_page_free_toq (m=0xffffff00bebe7f90) at /usr/src/sys/vm/vm_page.c:1394
#11 0xffffffff807d7c6b in vm_object_terminate (object=0xffffff0066392948) at /usr/src/sys/vm/vm_object.c:694
#12 0xffffffff807d821c in vm_object_deallocate (object=0xffffff0066392948) at /usr/src/sys/vm/vm_object.c:592
#13 0xffffffff807cfad0 in _vm_map_unlock (map=0xffffff0004811310, file=Variable "file" is not available.
) at /usr/src/sys/vm/vm_map.c:480
#14 0xffffffff807cff8f in vm_map_remove (map=0xffffff0004811310, start=Variable "start" is not available.
) at /usr/src/sys/vm/vm_map.c:2765
#15 0xffffffff807d2e44 in vmspace_exit (td=0xffffff004eb78ab0) at /usr/src/sys/vm/vm_map.c:329
#16 0xffffffff8055a33e in exit1 (td=0xffffff004eb78ab0, rv=0) at /usr/src/sys/kern/kern_exit.c:299
#17 0xffffffff8055b43e in sys_exit (td=Variable "td" is not available.
) at /usr/src/sys/kern/kern_exit.c:110
#18 0xffffffff80813546 in syscall (frame=0xffffff805b960c90) at /usr/src/sys/amd64/amd64/trap.c:984
#19 0xffffffff807f9a20 in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:364
#20 0x000000000047f63c in ?? ()
Previous frame inner to this frame (corrupt stack?)



-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller_at_infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey_at_scode.org
E-Mail: peter.schuller_at_infidyne.com Web: http://www.scode.org


Received on Wed Jul 22 2009 - 15:17:43 UTC
