amd64 crash in pmap_remove_pages(): page fault

From: Sean Chittenden <sean_at_gigave.com>
Date: Mon, 16 Jan 2006 13:27:44 -0800

Howdy.  I've got a "diskless" PXE boot client that is crashing about
once a week or so with the following backtrace and info:

#1  0x0000000000000004 in ?? ()
#2  0xffffffff80257623 in boot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:399
#3  0xffffffff80257c26 in panic (fmt=0xffffff006808f720 "")
    at /usr/src/sys/kern/kern_shutdown.c:555
#4  0xffffffff8039da92 in trap_fatal (frame=0xffffff006808f720,
    eva=18446742975955402752) at /usr/src/sys/amd64/amd64/trap.c:660
#5  0xffffffff8039ddaf in trap_pfault (frame=0xffffffffc72b19e0, usermode=0)
    at /usr/src/sys/amd64/amd64/trap.c:573
#6  0xffffffff8039e063 in trap (frame=
      {tf_rdi = -36495808960, tf_rsi = -1097607851600, tf_rdx = 0, tf_rcx = -1097607851600, tf_r8 = 0, tf_r9 = -1097427386368, tf_rax = 0, tf_rbx = -2134343104, tf_rbp = -2138286592, tf_r10 = 6447135, tf_r11 = 429809, tf_r12 = -1097607851680, tf_r13 = -1097375809288, tf_r14 = 140737488355328, tf_r15 = 0, tf_trapno = 12, tf_addr = -36495808960, tf_flags = -2135524400, tf_err = 2, tf_rip = -2143710185, tf_cs = 8, tf_rflags = 66118, tf_rsp = -953476448, tf_ss = 16})
    at /u[...].c:352
#7  0xffffffff8038d30b in calltrap ()
    at /usr/src/sys/amd64/amd64/exception.S:168
#8  0xffffffff80399417 in pmap_remove_pages (pmap=0xffffff0071795160, sva=0, 
    eva=140737488355328) at /usr/src/sys/amd64/amd64/pmap.c:2590
#9  0xffffffff8023b947 in exit1 (td=0xffffff006808f720, rv=256) at vm_map.h:252
#10 0xffffffff8023bc5e in sys_exit (td=0xfffffff780ae2640, 
    uap=0xffffff00717951b0) at /usr/src/sys/kern/kern_exit.c:97
#11 0xffffffff8039e8a1 in syscall (frame=
      {tf_rdi = 1, tf_rsi = 34365342200, tf_rdx = 0, tf_rcx = 4, tf_r8 = 0, tf_r9 = 59, tf_rax = 1, tf_rbx = 1, tf_rbp = 0, tf_r10 = 140737488349248, tf_r11 = 2, tf_r12 = 0, tf_r13 = 5520656, tf_r14 = 14400, tf_r15 = 1, tf_trapno = 12, tf_addr = 34368227008, tf_flags = 0, tf_err = 2, tf_rip = 34368000696, tf_cs = 43, tf_rflags = 518, tf_rsp = 140737488349560, tf_ss = 35})
    at /usr/src/sys/amd64/amd64/trap.c:792
#12 0xffffffff8038d4a8 in Xfast_syscall ()
    at /usr/src/sys/amd64/amd64/exception.S:270
#13 0x00000008007e12b8 in ?? ()
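
For anyone reading the trap frame in #6: gdb prints the registers as signed decimals, which hides the addresses. Below is a minimal sketch in C that converts them back and decodes tf_err using the amd64 page-fault error-code bits. The helper names are made up for illustration; the numbers come from the frame above. With these values tf_rip works out to 0xffffffff80399417 (the pmap_remove_pages address in #8) and tf_addr to 0xfffffff780ae2640, which is also the value gdb shows for td in #10. What that coincidence means, if anything, I can't say.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* amd64 page-fault error code bits (defined by the hardware). */
#define PF_PRESENT 0x1 /* set: protection violation, clear: page not present */
#define PF_WRITE   0x2 /* set: write access, clear: read access */
#define PF_USER    0x4 /* set: fault taken in user mode, clear: kernel mode */

/* gdb printed the frame fields as signed 64-bit decimals; reinterpret the
 * same bits as an unsigned address. (Hypothetical helper name.) */
static uint64_t frame_to_addr(int64_t v)
{
    return (uint64_t)v;
}

static int fault_is_write(int64_t err)     { return (err & PF_WRITE) != 0; }
static int fault_page_present(int64_t err) { return (err & PF_PRESENT) != 0; }
static int fault_from_user(int64_t err)    { return (err & PF_USER) != 0; }

/* Print a one-line summary of a page-fault trap frame. */
static void describe_fault(int64_t tf_rip, int64_t tf_addr, int64_t tf_err)
{
    printf("rip=0x%016" PRIx64 " fault addr=0x%016" PRIx64 ": %s %s, %s mode\n",
           frame_to_addr(tf_rip), frame_to_addr(tf_addr),
           fault_is_write(tf_err) ? "write to" : "read from",
           fault_page_present(tf_err) ? "protected page" : "non-present page",
           fault_from_user(tf_err) ? "user" : "kernel");
}
```

Calling describe_fault(-2143710185, -36495808960, 2) with the values from frame #6 reports a kernel-mode write to a non-present page.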

Dump header from device /dev/ad4s1b
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 2147024896B (2047 MB)
  Blocksize: 512
  Dumptime: Sun Jan 15 20:54:41 2006
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 6.0-STABLE #1: Wed Jan  4 22:40:57 PST 2006
    sean_at_host.example.org:/usr/obj/usr/src/sys/WEBHEAD
  Panic String: page fault
  Dump Parity: 1702696226
  Bounds: 42
  Dump Status: good

Any pearls of wisdom as to what's causing this?  My nfs options are:
rw,tcp,nfsv3,-r=32768,-w=32768 and my kernel config is included below.
I have the cores around if anyone's interested or I'm missing
something, but if I'm looking at this correctly, it seems like a race
condition which is causing a problem with one of the TAILQ macros.

amd64/amd64/pmap.c:2590
    TAILQ_REMOVE(&m->md.pv_list, pv, pv_list);
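
To make the suspicion concrete, here is a rough user-space sketch of what TAILQ_REMOVE() does, hand-expanded from the sys/queue.h macro. The struct pv below is a stripped-down stand-in for illustration, not the kernel's pv_entry, and the demo function is hypothetical. The point is that removal stores through the element's neighbour pointers, so if an entry were stale or already unlinked, one of those stores would go through a bad pointer. That would be consistent with a kernel write fault, but it is an assumption, not a diagnosis.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical, simplified stand-in for a pv entry on a per-page list. */
struct pv {
    int id;
    TAILQ_ENTRY(pv) pv_list;
};
TAILQ_HEAD(pv_head, pv);

/* Hand-expanded equivalent of TAILQ_REMOVE(head, elm, pv_list). */
static void pv_remove_by_hand(struct pv_head *head, struct pv *elm)
{
    if (elm->pv_list.tqe_next != NULL)
        /* Store through the next element: bad if tqe_next is stale. */
        elm->pv_list.tqe_next->pv_list.tqe_prev = elm->pv_list.tqe_prev;
    else
        head->tqh_last = elm->pv_list.tqe_prev;
    /* Store through the previous link: bad if tqe_prev is stale. */
    *elm->pv_list.tqe_prev = elm->pv_list.tqe_next;
}

/* Build the list 1, 2, 3, remove the entry whose id is `victim`, and
 * return the remaining ids packed as decimal digits (e.g. 13). */
static int demo_remove(int victim)
{
    struct pv_head head = TAILQ_HEAD_INITIALIZER(head);
    struct pv e[3] = { { .id = 1 }, { .id = 2 }, { .id = 3 } };
    struct pv *p;
    int digits = 0;

    for (int i = 0; i < 3; i++)
        TAILQ_INSERT_TAIL(&head, &e[i], pv_list);

    pv_remove_by_hand(&head, &e[victim - 1]);

    TAILQ_FOREACH(p, &head, pv_list)
        digits = digits * 10 + p->id;
    return digits;
}
```

On a healthy list this behaves exactly like the macro: removing the middle entry leaves 1 and 3. Nothing here checks that the element is actually on the list, which is why a double removal or a concurrent modification would only show up as a fault at this line.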

I've been digging around on -stable, -current, and following your
recent work on HEAD, but haven't seen anything that touches this area
of code.  Being a VM rookie, it seems to me that this is a bug that's
being tripped in pmap_remove_pages(), but isn't caused by a bug there.
The usage of TAILQ_*() in this function seems correct.  With NFS diskless
root, zero copy sockets, and sendfile(2) in use on these machines,
there are a number of places for potential problems and I'm at a loss
as to a fix.  Any ideas?  -sc

-- 
Sean Chittenden