On 2021-Mar-9, at 21:11, Mark Millard <marklmi at yahoo.com> wrote: > On 2021-Mar-9, at 19:17, Mark Millard <marklmi at yahoo.com> wrote: > > [My only testing context for this has been main, not 13.0. > But it might be a 13.0 worry.] > >> Using the quickest to so-far-reliably-fail type of example from >> another thread I used truss to see what happens, here filtered >> down to two processes that appear to be involved and only >> near the failure. (The overall truss output is huge from the >> prior activity in the poudriere bulk relatated activity). Also, >> this initiated watching from aarch64 but the failing code is >> armv7. >> >> 83630 100199: #340(0x1,0xffffd18c,0xffffd17c) = 0 (0x0) >> 83630 100199: #416(0x14,0xffffd1b4,0xffffd19c) = 0 (0x0) >> 83630 100199: #7(0xffffffff,0xffffd178,0x1,0x0) = 0 (0x0) >> 83731 100161: #240(0xffffd5f0,0xffffd5f0) = 0 (0x0) >> 83731 100161: #1(0x0) >> 83731 100161: process exit, rval = 0 >> 83630 100199: SIGNAL 20 (SIGCHLD) code=CLD_EXITED pid=83731 uid=0 status=0 >> 83630 100199: #341(0xffffd17c) ERR#4 'Interrupted system call' >> 83630 100199: SIGNAL 11 (SIGSEGV) code=SEGV_MAPERR trapno=36 addr=0xffffffe1 >> 83630 100199: process killed, signal = 11 (core dumped) >> >> As a reminder of the lldb backtrace of the sh.core >> and the like: >> >> (lldb) bt >> * thread #1, name = 'sh', stop reason = signal SIGSEGV >> * frame #0: 0xffffe190 >> frame #1: 0x00031aa8 sh`waitcmdloop(job=0x00064230) at jobs.c:608:11 >> frame #2: 0x00031a24 sh`waitcmd(argc=<unavailable>, argv=<unavailable>) at jobs.c:554:13 >> frame #3: 0x00028f54 sh`evalcommand(cmd=0x400ad0e4, flags=<unavailable>, backcmd=0x00000000) at eval.c:1107:16 >> frame #4: 0x00027800 sh`evaltree(n=0x400ad0e4, flags=<unavailable>) at eval.c:289:4 >> frame #5: 0x000344d0 sh`cmdloop(top=1) at main.c:221:4 >> frame #6: 0x000342f4 sh`main(argc=<unavailable>, argv=<unavailable>) at main.c:168:3 >> frame #7: 0x0002480c sh`__start(argc=8, argv=<unavailable>, env=<unavailable>, ps_strings=<unavailable>, obj=0x400b4004, cleanup=0x40081aa0) at crt1_c.c:92:7 >> (lldb) up >> frame #1: 0x00031aa8 sh`waitcmdloop(job=0x00064230) at jobs.c:608:11 >> 605 break; >> 606 } >> 607 } >> -> 608 } while (dowait(DOWAIT_BLOCK | DOWAIT_SIG, (struct job *)NULL) != -1); >> 609 >> 610 sig = pendingsig_waitcmd; >> 611 pendingsig_waitcmd = 0; >> >> (lldb) disass >> sh`waitcmdloop: >> 0x31a54 <+0>: push {r4, r5, r6, r7, r8, r9, r10, r11, lr} >> 0x31a58 <+4>: add r11, sp, #28 >> 0x31a5c <+8>: sub sp, sp, #4 >> 0x31a60 <+12>: movw r6, #0x3ea0 >> 0x31a64 <+16>: movw r7, #0x3e9c >> 0x31a68 <+20>: movw r9, #0x4040 >> 0x31a6c <+24>: movw r8, #0x3ea4 >> 0x31a70 <+28>: mov r4, r0 >> 0x31a74 <+32>: movt r6, #0x6 >> 0x31a78 <+36>: movt r7, #0x6 >> 0x31a7c <+40>: movt r9, #0x6 >> 0x31a80 <+44>: mov r10, #0 >> 0x31a84 <+48>: movt r8, #0x6 >> 0x31a88 <+52>: cmp r4, #0 >> 0x31a8c <+56>: beq 0x31ab4 ; <+96> at jobs.c:590:37 >> 0x31a90 <+60>: ldrb r0, [r4, #0x18] >> 0x31a94 <+64>: cmp r0, #2 >> 0x31a98 <+68>: beq 0x31b84 ; <+304> [inlined] getjobstatus at jobs.c:575 >> 0x31a9c <+72>: mov r0, #3 >> 0x31aa0 <+76>: mov r1, #0 >> 0x31aa4 <+80>: bl 0x32bcc ; dowait at jobs.c:1142 >> -> 0x31aa8 <+84>: cmn r0, #1 >> >> >> For reference a local context around the >> SIGSEGV looks like (all lines in the range >> selected): >> >> . . . >> 83833 102738: fcntl(2,F_DUPFD_CLOEXEC,0xa) = 10 (0xa) >> 83833 102738: openat(AT_FDCWD,"/dev/null",O_WRONLY|O_CREAT|O_TRUNC,0666) = 3 (0x3) >> 83833 102738: dup2(3,2) = 2 (0x2) >> 83833 102738: close(3) = 0 (0x0) >> 83833 102738: unlink("./.data.json.SYR1bCaL") = 0 (0x0) >> 83833 102738: dup2(10,2) = 2 (0x2) >> 83833 102738: close(10) = 0 (0x0) >> 83833 102738: exit(0x0) >> 83833 102738: process exit, rval = 0 >> 77872 100638: wait4(-1,{ EXITED,val=0 },0x0,0x0) = 83833 (0x14779) >> 77872 100638: fcntl(0,F_DUPFD_CLOEXEC,0xa) = 10 (0xa) >> 77872 100638: openat(AT_FDCWD,"/var/run/poudriere/lock-poudriere-shared-json_top.pid",O_RDONLY,00) = 3 (0x3) >> 77872 100638: dup2(3,0) = 0 (0x0) >> 77872 100638: close(3) = 0 (0x0) >> 77872 100638: fcntl(2,F_DUPFD_CLOEXEC,0xa) = 11 (0xb) >> 77872 100638: openat(AT_FDCWD,"/dev/null",O_WRONLY|O_CREAT|O_TRUNC,0666) = 3 (0x3) >> 77872 100638: dup2(3,2) = 2 (0x2) >> 77872 100638: close(3) = 0 (0x0) >> 77872 100638: lseek(0,0x0,SEEK_CUR) = 0 (0x0) >> 77872 100638: read(0,"77563",1024) = 5 (0x5) >> 77872 100638: read(0,0xffffffffb9e8,1024) = 0 (0x0) >> 77872 100638: dup2(10,0) = 0 (0x0) >> 77872 100638: close(10) = 0 (0x0) >> 77872 100638: dup2(11,2) = 2 (0x2) >> 77872 100638: close(11) = 0 (0x0) >> 77872 100638: fcntl(2,F_DUPFD_CLOEXEC,0xa) = 10 (0xa) >> 77872 100638: openat(AT_FDCWD,"/dev/null",O_WRONLY|O_CREAT|O_TRUNC,0666) = 3 (0x3) >> 77872 100638: dup2(3,2) = 2 (0x2) >> 77872 100638: close(3) = 0 (0x0) >> 77872 100638: rmdir("/var/run/poudriere/lock-poudriere-shared-json_top") = 0 (0x0) >> 77872 100638: dup2(10,2) = 2 (0x2) >> 77872 100638: close(10) = 0 (0x0) >> 77872 100638: sigprocmask(SIG_SETMASK,{ },0x0) = 0 (0x0) >> 77872 100638: fcntl(2,F_DUPFD_CLOEXEC,0xa) = 10 (0xa) >> 77872 100638: openat(AT_FDCWD,"/dev/null",O_WRONLY|O_CREAT|O_TRUNC,0666) = 3 (0x3) >> 77872 100638: dup2(3,2) = 2 (0x2) >> 77872 100638: close(3) = 0 (0x0) >> 77872 100638: sigaction(SIGINFO,{ 0x239c30 SA_RESTART ss_t },{ SIG_DFL 0x0 ss_t }) = 0 (0x0) >> 83731 100161: #240(0xffffd5f0,0xffffd5f0) = 0 (0x0) >> -- UNKNOWN FreeBSD32 SYSCALL 1 -- >> 83731 100161: #1(0x0) >> 83731 100161: process exit, rval = 0 >> 83630 100199: SIGNAL 20 (SIGCHLD) code=CLD_EXITED pid=83731 uid=0 status=0 >> 83630 100199: #341(0xffffd17c) ERR#4 'Interrupted system call' >> 83630 100199: SIGNAL 11 (SIGSEGV) code=SEGV_MAPERR trapno=36 addr=0xffffffe1 >> 83630 100199: process killed, signal = 11 (core dumped) >> 83316 100123: #7(0xffffffff,0xffffca58,0x0,0x0) = 83630 (0x146ae) >> -- UNKNOWN FreeBSD32 SYSCALL 477 -- >> 83316 100123: #477(0x0,0x7000,0x3,0xc001002,0xffffffff,0x40401428,0x0,0x0) = 1077833728 (0x403e7000) >> -- UNKNOWN FreeBSD32 SYSCALL 552 -- >> 83316 100123: #552(0xffffff9c,0xffffc504,0xffffc908,0x0) ERR#2 'No such file or directory' >> -- UNKNOWN FreeBSD32 SYSCALL 552 -- >> 83316 100123: #552(0xffffff9c,0xffffc504,0xffffc908,0x0) ERR#2 'No such file or directory' >> -- UNKNOWN FreeBSD32 SYSCALL 552 -- >> 83316 100123: #552(0xffffff9c,0xffffc504,0xffffc908,0x0) ERR#2 'No such file or directory' >> -- UNKNOWN FreeBSD32 SYSCALL 552 -- >> 83316 100123: #552(0xffffff9c,0xffffc504,0xffffc908,0x0) ERR#2 'No such file or directory' >> -- UNKNOWN FreeBSD32 SYSCALL 477 -- >> 83316 100123: #477(0x0,0x1000,0x3,0xc001002,0xffffffff,0x40401428,0x0,0x0) = 1077862400 (0x403ee000) >> -- UNKNOWN FreeBSD32 SYSCALL 4 -- >> 83316 100123: #4(0x2,0x403ee000,0x21) = 33 (0x21) >> -- UNKNOWN FreeBSD32 SYSCALL 477 -- >> 83316 100123: #477(0x0,0x1000,0x3,0xc001002,0xffffffff,0x40401428,0x0,0x0) = 1077866496 (0x403ef000) >> -- UNKNOWN FreeBSD32 SYSCALL 477 -- >> 83316 100123: #477(0x0,0x1000,0x3,0xc001002,0xffffffff,0x40401428,0x0,0x0) = 1077870592 (0x403f0000) >> -- UNKNOWN FreeBSD32 SYSCALL 477 -- >> 83316 100123: #477(0x0,0x1000,0x3,0xc001002,0xffffffff,0x40401428,0x0,0x0) = 1077874688 (0x403f1000) >> -- UNKNOWN FreeBSD32 SYSCALL 4 -- >> 83316 100123: #4(0x1,0x403ef000,0x2e) = 46 (0x2e) >> -- UNKNOWN FreeBSD32 SYSCALL 542 -- >> 83316 100123: #542(0xffffcd54,0x0) = 0 (0x0) >> -- UNKNOWN FreeBSD32 SYSCALL 2 -- >> 83842 100199: <new process> >> 83316 100123: #2() = 83842 (0x14782) >> -- UNKNOWN FreeBSD32 SYSCALL 6 -- >> -- UNKNOWN FreeBSD32 SYSCALL 6 -- >> 83316 100123: #6(0x7) = 0 (0x0) >> 83842 100199: #6(0x5) = 0 (0x0) >> . . . > > Turns out that the failure happens on the > processors with out-of-order execution and > the like but works on the strictly in-order > cortex-a53. (For as much testing as I've > done.) > > So it looks like some form of synchronization > is missing that in-order-only does not need. > (This would be the 2nd time I've run into such > for FreeBSD aarch64 if it holds true. The > prior example was fixed a fair time ago.) > > > The testing status . . . > > > Problem replicated using the following contexts > to attempt the textproc/itstool build, targeting > armv7 (cortex-a7): > > cortex-a72 aarch64 MACHHIATObin Double Shot > cortex-a57 aarch64 OverDrive 1000 > > (No successful builds for the above 2, > all stopping in configure the same way.) > > > No problem using the following to build > textproc/itstool, targeting armv7: > > cortex-a53 aarch64 Rock64 (armv7 on aarch64 case) > cortext-a7 armv7 OrangePi+ 2ed (native armv7 case) > > > It will take a long time to run a full > poudriere bulk that will build about > 200 ports, targeting the cortex-a7 on > the slower cortex-a53: days. So > further evidence that the cortex-a53 > does not get the problem will take a > while. I have confirmed that I still get the problem on the cortex-a72 when I substitute the kernel.txz and kernel-dbg.txz content from: https://artifact.ci.freebsd.org/snapshot/main/bad9fa56620eb82395c5ab66d300e91a0222dde2/arm64/aarch64/ and boot with it. ( My non-debug build is also of bad9fa56 .) This avoids worries about my kernel build being involved. The debug kernel did not report anything special while the port was building. Earlier it did report the classic: [00:00:06] Mounting system devices for FBSDFSSDjailArmV7-default lock order reversal: 1st 0xffffa0017b9915b0 ufs (ufs, lockmgr) _at_ /usr/src/sys/kern/vfs_mount.c:1071 2nd 0xffffa0017bdb8070 devfs (devfs, lockmgr) _at_ /usr/src/sys/kern/vfs_mount.c:1083 lock order devfs -> ufs established at: . . . I have not tried substituting base.txz and base-dbg.txz content. === Mark Millard marklmi at yahoo.com ( dsl-only.net went away in early 2018-Mar)Received on Wed Mar 10 2021 - 05:00:17 UTC
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:27 UTC