On 2018-Mar-26, at 6:35 AM, Mark Millard <marklmi26-fbsd at yahoo.com> wrote:

> [Unfortunately, I'd not be able to get back to this
> for many hours. I do not want to leave the machine
> at the db> prompt that long. So this is all there
> will be.]
>
> It got a different crash last night, after a little over 12
> hours of poudriere bulk -a activity, again while I was
> sleeping. Hand typed:
>
> kernel trap 12 with interrupts disabled
>
> Fatal trap 12: page fault while in kernel mode
> cpuid = 13; apic id = 0d
> fault virtual address = 0x20
> fault code = supervisor read data, page not present
> instruction pointer = 0x20:0xffffffff80b70867
> stack pointer = 0x28:0xfffffe00ebab8880
> frame pointer = 0x28:0xfffffe00ebab8890
> code segment = base 0x0, limit 0xfffff, type 0x1b
> = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags = resume, IOPL = 0
> current process = 44 (dom0)
> [ thread pid 44 tid 100277 ]
> Stopped at turnstile_broadcast+0x47: movq 0x20(%rbx,%rax,1),%rcx
>
> (So an offset from a null pointer, apparently.)
>
> bt shows:
>
> Tracing pid 44 tid 100277 td 0xfffff8010f938560
> turnstile_broadcast() at turnstile_broadcast+0x47/frame 0xfffffe00ebab8890
> __mtx_unlock_sleep() at __mtx_unlock_sleep+0xb9/frame 0xfffffe00ebab88c0
> vm_pageout_page_lock() at vm_pageout_page_lock+0x179/frame 0xfffffe00ebab8960
> vm_pageout_worker() at vm_pageout_worker+0xd3a/frame 0xfffffe00ebab8a50
> vm_pageout() at vm_pageout+0x133/frame 0xfffffe00ebab8a70
> fork_exit() at fork_exit+0x83/frame 0xfffffe00ebab8ab0
> fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00ebab8ab0
> --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
>
> Dump again failed, the same way but with some byte
> value differences.
>
> (da1:storvsc1:0:0:0) WRITE(10). CDB 2a 00 35 39 8c c7 00 00 08 00
> (da1:storvsc1:0:0:0) CAM status Command timeout
> (da1:storvsc1:0:0:0) Error 5, Retries exhausted
> Aborting dump due to I/O error.
>
> ** DUMP FAILED (ERROR 5) **
> Cannot dump: unknown error (error=5)
>
> So this appears to be repeatable (for the Optane
> swap/page partition?).
>
> show reg:
>
> cs      0x20
> ds      0x3b    ll+0x1a
> es      0x3b    ll+0x1a
> fs      0x13
> gs      0x1b
> ss      0x28    ll+0x7
> rax     0
> rcx     0xfffff8010f938501
> rdx     0xfffff8010f938501
> rbx     0xfffffe00ebab8880
> rsp     0xfffffe00ebab8800
> rsi     0
> rdi     0
> r8      0
> r9      0
> r10     0
> r11     0
> r12     0
> r13     0xfffff8010f938560
> r14     0
> r15     0xffffffff81d67998      vm_dom+0x18
> rip     0xffffffff80b70867      turnstile_broadcast+0x47
> rflags  0x10056
> turnstile_broadcast+0x47: movq 0x20(%rbx,%rax,1),%rcx
>
> Around where rbx points:
>
> 0xfffffe00ebab8872: ab eb 0 fe ff ff 28 0 0 0 0 0 0 0
> 0xfffffe00ebab8880: 0 0 0 0 0 0 0 0 80 79 d6 81 ff ff
> 0xfffffe00ebab888e: ff ff c0 88 ab eb 0 fe ff ff 9 20 af 80
> 0xfffffe00ebab889c: ff ff ff ff 0 7b 2 d8 f f8 ff ff 98 79
>
> And it looks like we have that null pointer above.
>
> And I'm afraid that is it: I need to be off doing other things.

3 rounds of bulk -a spanning over 126 hours total and I've not
had any more failures. Between rounds I updated /usr/src/ and
did buildworld/buildkernel/install sequences so I'd not be far
behind head.

I'm giving up on directly trying to replicate either of the
two types of failures that I'd reported. At least I know to
"show panic" now.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)

Received on Fri Apr 06 2018 - 00:26:16 UTC
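The hand-typed bytes around where rbx points are easier to read when regrouped into 64-bit little-endian words. The following is only a minimal sketch of that regrouping in Python, not anything from the original report: the helper name qwords and the choice to start at 0xfffffe00ebab8880 (skipping the unaligned row at ...8872) are assumptions made for illustration.

```python
import struct

# Bytes exactly as hand-typed in the report, starting at the 8-byte-aligned
# address 0xfffffe00ebab8880. Each row in the report holds 14 bytes.
BASE = 0xfffffe00ebab8880
ROWS = [
    "0 0 0 0 0 0 0 0 80 79 d6 81 ff ff",
    "ff ff c0 88 ab eb 0 fe ff ff 9 20 af 80",
    "ff ff ff ff 0 7b 2 d8 f f8 ff ff 98 79",
]


def qwords(base, rows):
    """Hypothetical helper: join the rows and decode little-endian 64-bit words.

    Returns a list of (address, value) pairs; trailing bytes that do not
    fill a whole 8-byte word are ignored.
    """
    data = bytes(int(tok, 16) for row in rows for tok in row.split())
    usable = len(data) - len(data) % 8
    return [
        (base + off, struct.unpack_from("<Q", data, off)[0])
        for off in range(0, usable, 8)
    ]


for addr, value in qwords(BASE, ROWS):
    print(f"{addr:#018x}: {value:#018x}")
```

Read this way, the word at 0xfffffe00ebab8880 is zero, which is consistent with the null pointer noted in the quoted text, and the word at 0xfffffe00ebab8890 is 0xfffffe00ebab88c0, the same value the backtrace lists as the frame for __mtx_unlock_sleep(). The word after it, 0xffffffff80af2009, may be the return address into __mtx_unlock_sleep(), but the report does not give that function's base address, so that reading is unconfirmed.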