Re: FYI: main (bad9fa56620e based): some unexpected SIGSEGV's are tied to interrupted system calls (cortex-a57/a72 fail, cortex-a53/cortex-a7 work)

From: Mark Millard <marklmi_at_yahoo.com>
Date: Tue, 9 Mar 2021 22:00:07 -0800
On 2021-Mar-9, at 21:11, Mark Millard <marklmi at yahoo.com> wrote:

> On 2021-Mar-9, at 19:17, Mark Millard <marklmi at yahoo.com> wrote:
> 
> [My only testing context for this has been main, not 13.0.
> But it might be a 13.0 worry.]
> 
>> Using the quickest to so-far-reliably-fail type of example from
>> another thread I used truss to see what happens, here filtered
>> down to two processes that appear to be involved and only
>> near the failure. (The overall truss output is huge from the
>> prior activity in the poudriere bulk relatated activity). Also,
>> this initiated watching from aarch64 but the failing code is
>> armv7.
>> 
>> 83630 100199: #340(0x1,0xffffd18c,0xffffd17c)    = 0 (0x0)
>> 83630 100199: #416(0x14,0xffffd1b4,0xffffd19c)   = 0 (0x0)
>> 83630 100199: #7(0xffffffff,0xffffd178,0x1,0x0)  = 0 (0x0)
>> 83731 100161: #240(0xffffd5f0,0xffffd5f0)        = 0 (0x0)
>> 83731 100161: #1(0x0)                           
>> 83731 100161: process exit, rval = 0
>> 83630 100199: SIGNAL 20 (SIGCHLD) code=CLD_EXITED pid=83731 uid=0 status=0
>> 83630 100199: #341(0xffffd17c)                   ERR#4 'Interrupted system call'
>> 83630 100199: SIGNAL 11 (SIGSEGV) code=SEGV_MAPERR trapno=36 addr=0xffffffe1
>> 83630 100199: process killed, signal = 11 (core dumped)
>> 
>> As a reminder of the lldb backtrace of the sh.core
>> and the like:
>> 
>> (lldb) bt
>> * thread #1, name = 'sh', stop reason = signal SIGSEGV
>> * frame #0: 0xffffe190
>>   frame #1: 0x00031aa8 sh`waitcmdloop(job=0x00064230) at jobs.c:608:11
>>   frame #2: 0x00031a24 sh`waitcmd(argc=<unavailable>, argv=<unavailable>) at jobs.c:554:13
>>   frame #3: 0x00028f54 sh`evalcommand(cmd=0x400ad0e4, flags=<unavailable>, backcmd=0x00000000) at eval.c:1107:16
>>   frame #4: 0x00027800 sh`evaltree(n=0x400ad0e4, flags=<unavailable>) at eval.c:289:4
>>   frame #5: 0x000344d0 sh`cmdloop(top=1) at main.c:221:4
>>   frame #6: 0x000342f4 sh`main(argc=<unavailable>, argv=<unavailable>) at main.c:168:3
>>   frame #7: 0x0002480c sh`__start(argc=8, argv=<unavailable>, env=<unavailable>, ps_strings=<unavailable>, obj=0x400b4004, cleanup=0x40081aa0) at crt1_c.c:92:7
>> (lldb) up
>> frame #1: 0x00031aa8 sh`waitcmdloop(job=0x00064230) at jobs.c:608:11
>>  605 						break;
>>  606 				}
>>  607 			}
>> -> 608 		} while (dowait(DOWAIT_BLOCK | DOWAIT_SIG, (struct job *)NULL) != -1);
>>  609 	
>>  610 		sig = pendingsig_waitcmd;
>>  611 		pendingsig_waitcmd = 0;
>> 
>> (lldb) disass
>> sh`waitcmdloop:
>>   0x31a54 <+0>:   push   {r4, r5, r6, r7, r8, r9, r10, r11, lr}
>>   0x31a58 <+4>:   add    r11, sp, #28
>>   0x31a5c <+8>:   sub    sp, sp, #4
>>   0x31a60 <+12>:  movw   r6, #0x3ea0
>>   0x31a64 <+16>:  movw   r7, #0x3e9c
>>   0x31a68 <+20>:  movw   r9, #0x4040
>>   0x31a6c <+24>:  movw   r8, #0x3ea4
>>   0x31a70 <+28>:  mov    r4, r0
>>   0x31a74 <+32>:  movt   r6, #0x6
>>   0x31a78 <+36>:  movt   r7, #0x6
>>   0x31a7c <+40>:  movt   r9, #0x6
>>   0x31a80 <+44>:  mov    r10, #0
>>   0x31a84 <+48>:  movt   r8, #0x6
>>   0x31a88 <+52>:  cmp    r4, #0
>>   0x31a8c <+56>:  beq    0x31ab4                   ; <+96> at jobs.c:590:37
>>   0x31a90 <+60>:  ldrb   r0, [r4, #0x18]
>>   0x31a94 <+64>:  cmp    r0, #2
>>   0x31a98 <+68>:  beq    0x31b84                   ; <+304> [inlined] getjobstatus at jobs.c:575
>>   0x31a9c <+72>:  mov    r0, #3
>>   0x31aa0 <+76>:  mov    r1, #0
>>   0x31aa4 <+80>:  bl     0x32bcc                   ; dowait at jobs.c:1142
>> ->  0x31aa8 <+84>:  cmn    r0, #1
>> 
>> 
>> For reference a local context around the
>> SIGSEGV looks like (all lines in the range
>> selected):
>> 
>> . . .
>> 83833 102738: fcntl(2,F_DUPFD_CLOEXEC,0xa)       = 10 (0xa)
>> 83833 102738: openat(AT_FDCWD,"/dev/null",O_WRONLY|O_CREAT|O_TRUNC,0666) = 3 (0x3)
>> 83833 102738: dup2(3,2)                          = 2 (0x2)
>> 83833 102738: close(3)                           = 0 (0x0)
>> 83833 102738: unlink("./.data.json.SYR1bCaL")    = 0 (0x0)
>> 83833 102738: dup2(10,2)                         = 2 (0x2)
>> 83833 102738: close(10)                          = 0 (0x0)
>> 83833 102738: exit(0x0)                         
>> 83833 102738: process exit, rval = 0
>> 77872 100638: wait4(-1,{ EXITED,val=0 },0x0,0x0) = 83833 (0x14779)
>> 77872 100638: fcntl(0,F_DUPFD_CLOEXEC,0xa)       = 10 (0xa)
>> 77872 100638: openat(AT_FDCWD,"/var/run/poudriere/lock-poudriere-shared-json_top.pid",O_RDONLY,00) = 3 (0x3)
>> 77872 100638: dup2(3,0)                          = 0 (0x0)
>> 77872 100638: close(3)                           = 0 (0x0)
>> 77872 100638: fcntl(2,F_DUPFD_CLOEXEC,0xa)       = 11 (0xb)
>> 77872 100638: openat(AT_FDCWD,"/dev/null",O_WRONLY|O_CREAT|O_TRUNC,0666) = 3 (0x3)
>> 77872 100638: dup2(3,2)                          = 2 (0x2)
>> 77872 100638: close(3)                           = 0 (0x0)
>> 77872 100638: lseek(0,0x0,SEEK_CUR)              = 0 (0x0)
>> 77872 100638: read(0,"77563",1024)               = 5 (0x5)
>> 77872 100638: read(0,0xffffffffb9e8,1024)        = 0 (0x0)
>> 77872 100638: dup2(10,0)                         = 0 (0x0)
>> 77872 100638: close(10)                          = 0 (0x0)
>> 77872 100638: dup2(11,2)                         = 2 (0x2)
>> 77872 100638: close(11)                          = 0 (0x0)
>> 77872 100638: fcntl(2,F_DUPFD_CLOEXEC,0xa)       = 10 (0xa)
>> 77872 100638: openat(AT_FDCWD,"/dev/null",O_WRONLY|O_CREAT|O_TRUNC,0666) = 3 (0x3)
>> 77872 100638: dup2(3,2)                          = 2 (0x2)
>> 77872 100638: close(3)                           = 0 (0x0)
>> 77872 100638: rmdir("/var/run/poudriere/lock-poudriere-shared-json_top") = 0 (0x0)
>> 77872 100638: dup2(10,2)                         = 2 (0x2)
>> 77872 100638: close(10)                          = 0 (0x0)
>> 77872 100638: sigprocmask(SIG_SETMASK,{ },0x0)   = 0 (0x0)
>> 77872 100638: fcntl(2,F_DUPFD_CLOEXEC,0xa)       = 10 (0xa)
>> 77872 100638: openat(AT_FDCWD,"/dev/null",O_WRONLY|O_CREAT|O_TRUNC,0666) = 3 (0x3)
>> 77872 100638: dup2(3,2)                          = 2 (0x2)
>> 77872 100638: close(3)                           = 0 (0x0)
>> 77872 100638: sigaction(SIGINFO,{ 0x239c30 SA_RESTART ss_t },{ SIG_DFL 0x0 ss_t }) = 0 (0x0)
>> 83731 100161: #240(0xffffd5f0,0xffffd5f0)        = 0 (0x0)
>> -- UNKNOWN FreeBSD32 SYSCALL 1 --
>> 83731 100161: #1(0x0)                           
>> 83731 100161: process exit, rval = 0
>> 83630 100199: SIGNAL 20 (SIGCHLD) code=CLD_EXITED pid=83731 uid=0 status=0
>> 83630 100199: #341(0xffffd17c)                   ERR#4 'Interrupted system call'
>> 83630 100199: SIGNAL 11 (SIGSEGV) code=SEGV_MAPERR trapno=36 addr=0xffffffe1
>> 83630 100199: process killed, signal = 11 (core dumped)
>> 83316 100123: #7(0xffffffff,0xffffca58,0x0,0x0)  = 83630 (0x146ae)
>> -- UNKNOWN FreeBSD32 SYSCALL 477 --
>> 83316 100123: #477(0x0,0x7000,0x3,0xc001002,0xffffffff,0x40401428,0x0,0x0) = 1077833728 (0x403e7000)
>> -- UNKNOWN FreeBSD32 SYSCALL 552 --
>> 83316 100123: #552(0xffffff9c,0xffffc504,0xffffc908,0x0) ERR#2 'No such file or directory'
>> -- UNKNOWN FreeBSD32 SYSCALL 552 --
>> 83316 100123: #552(0xffffff9c,0xffffc504,0xffffc908,0x0) ERR#2 'No such file or directory'
>> -- UNKNOWN FreeBSD32 SYSCALL 552 --
>> 83316 100123: #552(0xffffff9c,0xffffc504,0xffffc908,0x0) ERR#2 'No such file or directory'
>> -- UNKNOWN FreeBSD32 SYSCALL 552 --
>> 83316 100123: #552(0xffffff9c,0xffffc504,0xffffc908,0x0) ERR#2 'No such file or directory'
>> -- UNKNOWN FreeBSD32 SYSCALL 477 --
>> 83316 100123: #477(0x0,0x1000,0x3,0xc001002,0xffffffff,0x40401428,0x0,0x0) = 1077862400 (0x403ee000)
>> -- UNKNOWN FreeBSD32 SYSCALL 4 --
>> 83316 100123: #4(0x2,0x403ee000,0x21)            = 33 (0x21)
>> -- UNKNOWN FreeBSD32 SYSCALL 477 --
>> 83316 100123: #477(0x0,0x1000,0x3,0xc001002,0xffffffff,0x40401428,0x0,0x0) = 1077866496 (0x403ef000)
>> -- UNKNOWN FreeBSD32 SYSCALL 477 --
>> 83316 100123: #477(0x0,0x1000,0x3,0xc001002,0xffffffff,0x40401428,0x0,0x0) = 1077870592 (0x403f0000)
>> -- UNKNOWN FreeBSD32 SYSCALL 477 --
>> 83316 100123: #477(0x0,0x1000,0x3,0xc001002,0xffffffff,0x40401428,0x0,0x0) = 1077874688 (0x403f1000)
>> -- UNKNOWN FreeBSD32 SYSCALL 4 --
>> 83316 100123: #4(0x1,0x403ef000,0x2e)            = 46 (0x2e)
>> -- UNKNOWN FreeBSD32 SYSCALL 542 --
>> 83316 100123: #542(0xffffcd54,0x0)               = 0 (0x0)
>> -- UNKNOWN FreeBSD32 SYSCALL 2 --
>> 83842 100199: <new process>
>> 83316 100123: #2()                               = 83842 (0x14782)
>> -- UNKNOWN FreeBSD32 SYSCALL 6 --
>> -- UNKNOWN FreeBSD32 SYSCALL 6 --
>> 83316 100123: #6(0x7)                            = 0 (0x0)
>> 83842 100199: #6(0x5)                            = 0 (0x0)
>> . . .
> 
> Turns out that the failure happens on the
> processors with out-of-order execution and
> the like but works on the strictly in-order
> cortex-a53. (For as much testing as I've
> done.)
> 
> So it looks like some form of synchronization
> is missing that in-order-only does not need.
> (This would be the 2nd time I've run into such
> for FreeBSD aarch64 if it holds true. The
> prior example was fixed a fair time ago.)
> 
> 
> The testing status . . .
> 
> 
> Problem replicated using the following contexts
> to attempt the textproc/itstool build, targeting
> armv7 (cortex-a7):
> 
> cortex-a72 aarch64 MACHHIATObin Double Shot
> cortex-a57 aarch64 OverDrive 1000
> 
> (No successful builds for the above 2,
> all stopping in configure the same way.)
> 
> 
> No problem using the following to build
> textproc/itstool, targeting armv7:
> 
> cortex-a53 aarch64 Rock64    (armv7 on aarch64 case)
> cortext-a7 armv7   OrangePi+ 2ed (native armv7 case)
> 
> 
> It will take a long time to run a full
> poudriere bulk that will build about
> 200 ports, targeting the cortex-a7 on
> the slower cortex-a53: days. So
> further evidence that the cortex-a53
> does not get the problem will take a
> while.

I have confirmed that I still get the problem on
the cortex-a72 when I substitute the kernel.txz
and kernel-dbg.txz content from:

https://artifact.ci.freebsd.org/snapshot/main/bad9fa56620eb82395c5ab66d300e91a0222dde2/arm64/aarch64/

and boot with it. ( My non-debug build is also of
bad9fa56 .) This avoids worries about my kernel
build being involved.

The debug kernel did not report anything special
while the port was building. Earlier it did report
the classic:

[00:00:06] Mounting system devices for FBSDFSSDjailArmV7-default
lock order reversal:
 1st 0xffffa0017b9915b0 ufs (ufs, lockmgr) _at_ /usr/src/sys/kern/vfs_mount.c:1071
 2nd 0xffffa0017bdb8070 devfs (devfs, lockmgr) _at_ /usr/src/sys/kern/vfs_mount.c:1083
lock order devfs -> ufs established at:
. . .

I have not tried substituting base.txz and
base-dbg.txz content.


===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)
Received on Wed Mar 10 2021 - 05:00:17 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:27 UTC