Fatal trap 18 on boot after OpenZFS import

From: Tomoaki AOKI <junchoon_at_dec.sakura.ne.jp>
Date: Fri, 4 Sep 2020 22:03:01 +0900
Hi.

I'm encountering a boot failure with fatal trap 18, happening
(maybe) just before init() starts, possibly on the root remount
by the kernel or on zpool import by an rc.d script.
The last revision tried is r365316 (r364788 is the last one tried
with a clean rebuild).

The last healthy revision is r364744, just before the actual switch
to OpenZFS. amd64 on ThinkPad P52 (Core i7-8750H) w/discrete nvidia GPU.

r364751 with the diffs of r364777 and r364788 applied (so that it
builds successfully without unrelated-to-OpenZFS changes) fails.

Any suggestions and fixes are appreciated.


The trap screen is something like the one below (text attached),
typed up from a relatively clear photo, so there could be some typos.

This is shown just after the usual kernel startup output.
boot1.efi (as EFI/bootx64.efi on ESP) starts /boot/loader.efi
properly, and loader.efi seems to boot the kernel properly.

Since not even the single-user shell selection appears, loader.efi
is still the one from r364744. But the loaders also work when I
follow this irregular procedure:

  1)Update src tree
  2)Clean obj tree
  3)buildworld
  4)etcupdate -p
  5)buildkernel
  6)installkernel
  7)shutdown to single user WITHOUT reboot  <- Irregular!
  8)installworld
  9)etcupdate
 10)rebuild src/sys-dependent ports (kmods, nvidia-driver, ...)
 11)reboot
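
As a rough sketch, the sequence above maps to commands like the ones
below. The script only prints them instead of running them, because
everything after step 7 has to be typed at the single-user shell.
/usr/src as the source tree, svnlite for the checkout, the cleanworld
target for cleaning the obj tree, and -j4 are assumptions for
illustration, not details taken from the actual setup.

```shell
#!/bin/sh
# Prints the update sequence as shell commands; nothing is executed.
# SRCDIR, JOBS, and the use of svnlite/cleanworld are assumed values.
SRCDIR=/usr/src
JOBS=4

# Print one command line, prefixed with "+ ".
show() {
    printf '+ %s\n' "$*"
}

# Steps 1-7: run in multi-user mode.
before_single_user() {
    show svnlite update "$SRCDIR"                  # 1) update src tree
    show make -C "$SRCDIR" cleanworld              # 2) clean obj tree
    show make -C "$SRCDIR" -j"$JOBS" buildworld    # 3)
    show etcupdate -p                              # 4)
    show make -C "$SRCDIR" -j"$JOBS" buildkernel   # 5)
    show make -C "$SRCDIR" installkernel           # 6)
    show shutdown now                              # 7) single user, no reboot
}

# Steps 8-11: typed by hand at the single-user shell after step 7.
in_single_user() {
    show make -C "$SRCDIR" installworld            # 8)
    show etcupdate                                 # 9)
    # 10) rebuild src/sys-dependent ports here with the tool of choice
    show reboot                                    # 11)
}

before_single_user
in_single_user
```

The irregular part is step 7: the regular procedure reboots into the
new kernel before installworld, while here installworld runs on top
of the old, still-running kernel.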

loader.efi appears to do its job, and the panic happens after kernel
startup ends. Needless to say, rolling back to the r364744 state from
stable/12 on nvd0 fixes the issue.

Regards.

=====

Fatal trap 18: integer divide fault while in kernel mode
cpuid = 2; apic id = 02
instruction pointer     = 0x20:0xffffffff82bfa320
stack pointer           = 0x28:0xfffffe00e20c6900
frame pointer           = 0x28:0xfffffe00e20c6960
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 27 (vdev_open)
trap number             = 18
panic: integer divide fault
cpuid = 2
time = 16
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e20c6610
vpanic() at vpanic+0x182/frame fffffe00e20c6660
panic() at panic+0x43/frame fffffe00e20c66c0
trap_fatal() at trap_fatal+0x387/frame fffffe00e20c6720
trap() at trap+0x8e/frame fffffe00e20c6830
calltrap() at calltrap+0x8/frame fffffe00e20c6830
--- trap 0x12, rip = 0xffffffff82bfa320, rsp = 0xfffffe00e20c6900, rbp = 0xfffffe00e20c6960 ---
zio_wait() at zio_wait+0x60/frame 0xfffffe00e20c6960
vdev_open() at vdev_open+0x74d/frame 0xfffffe00e20c69c0
vdev_open_child() at vdev_open_child+0x1e/frame 0xfffffe00e20c69e0
taskq_run() at taskq_run+0x1f/frame 0xfffffe00e20c6a00
taskqueue_run_locked() at taskqueue_run_locked+0x181/frame 0xfffffe00e20c6a80
taskqueue_thread_loop() at taskqueue_thread_loop+0x118/frame 0xfffffe00e20c6ab0
fork_exit() at fork_exit+0x7d/frame 0xfffffe00e20c6af0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00e20c6af0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 27 tid 100570 ]
Stopped at      kdb_enter+0x37: movq    $0,0x1091556(%rip)
db> 

=====

Additional info:
 *A clean build with CPUTYPE removed from both the command line and
  make.conf (so it should be equivalent to nocona) didn't help.

 *A clean build with the WITH_KERNEL_RETPOLINE line and the
  WITH_RETPOLINE line in src.conf commented out didn't help.

 *The combination of the above two didn't help either (at r364788).

 *There are two root pools on different physical drives:
  stable/12 on nvd0 (primary) and head on ada0 (secondary).

 *The kernel is GENERIC-NODEBUG based (with options
  CAM_IOSCHED_DYNAMIC added).
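
To double-check that the CPUTYPE and retpoline knobs mentioned above
are really inactive before a clean build, something like the following
could be used. It is only a sketch: the function name active_knobs is
made up for illustration, and /etc/make.conf and /etc/src.conf are
assumed to be the files in use.

```shell
#!/bin/sh
# List assignments of the knobs from this report that are still active
# (i.e. not commented out). With no arguments, the assumed default
# locations /etc/make.conf and /etc/src.conf are checked.
active_knobs() {
    [ "$#" -gt 0 ] || set -- /etc/make.conf /etc/src.conf
    for f in "$@"; do
        [ -f "$f" ] || continue
        # Match "NAME=" or "NAME?=" at the start of a line; a leading
        # "#" makes the line a comment, so it is not matched.
        grep -HnE '^[[:space:]]*(CPUTYPE|WITH_RETPOLINE|WITH_KERNEL_RETPOLINE)[[:space:]]*[?]?=' "$f" || true
    done
}

active_knobs
```

No output means none of the three knobs is set in the files checked.
CPUTYPE given on the make command line is not covered by this.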

-- 
Tomoaki AOKI    <junchoon_at_dec.sakura.ne.jp>

Received on Fri Sep 04 2020 - 11:03:13 UTC
