Re: head -r331499 amd64/threadripper panic in vm_page_free_prep during "poudriere bulk -a", after 14h 22m or so.

From: Mark Millard <marklmi26-fbsd_at_yahoo.com>
Date: Thu, 5 Apr 2018 19:05:55 -0700

On 2018-Mar-26, at 6:35 AM, Mark Millard <marklmi26-fbsd at yahoo.com> wrote:

> [Unfortunately, I'd not be able to get back to this
> for many hours. I do not want to leave the machine
> at the db> prompt that long. So this is all there
> will be.]
> 
> It got a different crash last night, after a little over 12
> hours of poudriere bulk -a activity, again while I was
> sleeping. Hand typed:
> 
> kernel trap 12 with interrupts disabled
> 
> Fatal trap 12: page fault while in kernel mode
> cpuid = 13; apic id = 0d
> fault virtual address = 0x20
> fault code            = supervisor read data, page not present
> instruction pointer   = 0x20:0xffffffff80b70867
> stack pointer         = 0x28:0xfffffe00ebab8880
> frame pointer         = 0x28:0xfffffe00ebab8890
> code segment          = base 0x0, limit 0xfffff, type 0x1b
>                      = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags      = resume, IOPL = 0
> current process       = 44 (dom0)
> [ thread pid 44 tid 100277 ]
> Stopped at turnstile_broadcast+0x47: movq 0x20(%rbx,%rax,1),%rcx
> 
> (So an offset from a null pointer, apparently.)
> 
> bt shows:
> 
> Tracing pid 44 tid 100277 td 0xfffff8010f938560
> turnstile_broadcast() at turnstile_broadcast+0x47/frame 0xfffffe00ebab8890
> __mtx_unlock_sleep() at __mtx_unlock_sleep+0xb9/frame 0xfffffe00ebab88c0
> vm_pageout_page_lock() at vm_pageout_page_lock+0x179/frame 0xfffffe00ebab8960
> vm_pageout_worker() at vm_pageout_worker+0xd3a/frame 0xfffffe00ebab8a50
> vm_pageout() at vm_pageout+0x133/frame 0xfffffe00ebab8a70
> fork_exit() at fork_exit+0x83/frame 0xfffffe00ebab8ab0
> fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00ebab8ab0
> --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
> 
> Dump again failed, the same way but with some byte
> value differences.
> 
> (da1:storvsc1:0:0:0) WRITE(10). CDB 2a 00 35 39 8c c7 00 00 08 00
> (da1:storvsc1:0:0:0) CAM status Command timeout
> (da1:storvsc1:0:0:0) Error 5, Retries exhausted
> Aborting dump due to I/O error.
> 
> ** DUMP FAILED (ERROR 5) **
> Cannot dump: unknown error (error=5)
> 
> So this appears to be repeatable (for the Optane
> swap/page partition?).
> 
> show reg:
> 
> cs 0x20
> ds 0x3b ll+0x1a
> es 0x3b ll+0x1a
> fs 0x13
> gs 0x1b
> ss 0x28 ll+0x7
> rax 0
> rcx 0xfffff8010f938501
> rdx 0xfffff8010f938501
> rbx 0xfffffe00ebab8880
> rsp 0xfffffe00ebab8800
> rsi 0
> rdi 0
> r8  0
> r9  0
> r10 0
> r11 0
> r12 0
> r13 0xfffff8010f938560
> r14 0
> r15 0xffffffff81d67998 vm_dom+0x18
> rip 0xffffffff80b70867 turnstile_broadcast+0x47
> rflags 0x10056
> turnstile_broadcast+0x47: movq 0x20(%rbx,%rax,1),%rcx
> 
> Around where rbx points:
> 
> 0xfffffe00ebab8872: ab eb 0  fe ff ff 28 0  0  0  0  0  0  0
> 0xfffffe00ebab8880: 0  0  0  0  0  0  0  0  80 79 d6 81 ff ff
> 0xfffffe00ebab888e: ff ff c0 88 ab eb 0  fe ff ff 9  20 af 80
> 0xfffffe00ebab889c: ff ff ff ff 0  7b 2  d8 f  f8 ff ff 98 79
> 
> And it looks like we have that null pointer above.
> 
> And I'm afraid that is it: I need to be off doing other things.
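
The address arithmetic and the byte dump quoted above can be checked
mechanically. What follows is a minimal Python sketch, not something
from the original report: it assumes the hand-typed bytes are accurate,
and the helper names effective_address and qwords are made up for
illustration. It shows that the addressing mode in
"movq 0x20(%rbx,%rax,1),%rcx" yields the reported fault address 0x20
when base plus index is zero, and regroups the bytes starting at
0xfffffe00ebab8880 into little-endian 64-bit words.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def effective_address(base, index, scale, disp):
    """x86-64 effective address: base + index*scale + disp, wrapped to 64 bits."""
    return (base + index * scale + disp) & MASK64


def qwords(hex_bytes):
    """Group a space-separated hex byte dump into little-endian 64-bit words.

    Trailing bytes that do not fill a whole 8-byte word are ignored.
    """
    raw = bytes(int(b, 16) for b in hex_bytes.split())
    usable = len(raw) - len(raw) % 8
    return [struct.unpack_from("<Q", raw, off)[0] for off in range(0, usable, 8)]


# Bytes from 0xfffffe00ebab8880 onward, copied from the hand-typed dump.
START = 0xFFFFFE00EBAB8880
DUMP = (
    "0 0 0 0 0 0 0 0 80 79 d6 81 ff ff "
    "ff ff c0 88 ab eb 0 fe ff ff 9 20 af 80 "
    "ff ff ff ff 0 7b 2 d8 f f8 ff ff 98 79"
)

# disp = 0x20, scale = 1; a zero base and index give the fault address.
print(f"fault address if base+index == 0: {effective_address(0, 0, 1, 0x20):#x}")

for i, word in enumerate(qwords(DUMP)):
    print(f"{START + 8 * i:#018x}: {word:#018x}")
```

The first word printed for the dump is all zeros, which is the null
value the report points at. The second decodes to 0xffffffff81d67980,
which is 0x18 below the vm_dom+0x18 value shown for r15.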

I've since run 3 rounds of poudriere bulk -a, spanning
over 126 hours total, and I've not had any more failures.
Between rounds I updated /usr/src/ and did
buildworld/buildkernel/install sequences so I'd not be
far behind head.

I'm giving up on directly trying to replicate either of
the two types of failures that I'd reported.

At least I know to "show panic" now.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)