Re: FYI: main (bad9fa56620e based): some unexpected SIGSEGV's are tied to interrupted system calls (cortex-a57/a72 fail, cortex-a53/cortex-a7 work)

From: Mark Millard <marklmi_at_yahoo.com>
Date: Tue, 9 Mar 2021 21:11:22 -0800
On 2021-Mar-9, at 19:17, Mark Millard <marklmi at yahoo.com> wrote:

[My only testing context for this has been main, not 13.0.
But it might be a 13.0 worry.]

> Using the quickest to so-far-reliably-fail type of example from
> another thread I used truss to see what happens, here filtered
> down to two processes that appear to be involved and only
> near the failure. (The overall truss output is huge from the
> prior activity in the poudriere bulk relatated activity). Also,
> this initiated watching from aarch64 but the failing code is
> armv7.
> 
> 83630 100199: #340(0x1,0xffffd18c,0xffffd17c)    = 0 (0x0)
> 83630 100199: #416(0x14,0xffffd1b4,0xffffd19c)   = 0 (0x0)
> 83630 100199: #7(0xffffffff,0xffffd178,0x1,0x0)  = 0 (0x0)
> 83731 100161: #240(0xffffd5f0,0xffffd5f0)        = 0 (0x0)
> 83731 100161: #1(0x0)                           
> 83731 100161: process exit, rval = 0
> 83630 100199: SIGNAL 20 (SIGCHLD) code=CLD_EXITED pid=83731 uid=0 status=0
> 83630 100199: #341(0xffffd17c)                   ERR#4 'Interrupted system call'
> 83630 100199: SIGNAL 11 (SIGSEGV) code=SEGV_MAPERR trapno=36 addr=0xffffffe1
> 83630 100199: process killed, signal = 11 (core dumped)
> 
> As a reminder of the lldb backtrace of the sh.core
> and the like:
> 
> (lldb) bt
> * thread #1, name = 'sh', stop reason = signal SIGSEGV
>  * frame #0: 0xffffe190
>    frame #1: 0x00031aa8 sh`waitcmdloop(job=0x00064230) at jobs.c:608:11
>    frame #2: 0x00031a24 sh`waitcmd(argc=<unavailable>, argv=<unavailable>) at jobs.c:554:13
>    frame #3: 0x00028f54 sh`evalcommand(cmd=0x400ad0e4, flags=<unavailable>, backcmd=0x00000000) at eval.c:1107:16
>    frame #4: 0x00027800 sh`evaltree(n=0x400ad0e4, flags=<unavailable>) at eval.c:289:4
>    frame #5: 0x000344d0 sh`cmdloop(top=1) at main.c:221:4
>    frame #6: 0x000342f4 sh`main(argc=<unavailable>, argv=<unavailable>) at main.c:168:3
>    frame #7: 0x0002480c sh`__start(argc=8, argv=<unavailable>, env=<unavailable>, ps_strings=<unavailable>, obj=0x400b4004, cleanup=0x40081aa0) at crt1_c.c:92:7
> (lldb) up
> frame #1: 0x00031aa8 sh`waitcmdloop(job=0x00064230) at jobs.c:608:11
>   605 						break;
>   606 				}
>   607 			}
> -> 608 		} while (dowait(DOWAIT_BLOCK | DOWAIT_SIG, (struct job *)NULL) != -1);
>   609 	
>   610 		sig = pendingsig_waitcmd;
>   611 		pendingsig_waitcmd = 0;
> 
> (lldb) disass
> sh`waitcmdloop:
>    0x31a54 <+0>:   push   {r4, r5, r6, r7, r8, r9, r10, r11, lr}
>    0x31a58 <+4>:   add    r11, sp, #28
>    0x31a5c <+8>:   sub    sp, sp, #4
>    0x31a60 <+12>:  movw   r6, #0x3ea0
>    0x31a64 <+16>:  movw   r7, #0x3e9c
>    0x31a68 <+20>:  movw   r9, #0x4040
>    0x31a6c <+24>:  movw   r8, #0x3ea4
>    0x31a70 <+28>:  mov    r4, r0
>    0x31a74 <+32>:  movt   r6, #0x6
>    0x31a78 <+36>:  movt   r7, #0x6
>    0x31a7c <+40>:  movt   r9, #0x6
>    0x31a80 <+44>:  mov    r10, #0
>    0x31a84 <+48>:  movt   r8, #0x6
>    0x31a88 <+52>:  cmp    r4, #0
>    0x31a8c <+56>:  beq    0x31ab4                   ; <+96> at jobs.c:590:37
>    0x31a90 <+60>:  ldrb   r0, [r4, #0x18]
>    0x31a94 <+64>:  cmp    r0, #2
>    0x31a98 <+68>:  beq    0x31b84                   ; <+304> [inlined] getjobstatus at jobs.c:575
>    0x31a9c <+72>:  mov    r0, #3
>    0x31aa0 <+76>:  mov    r1, #0
>    0x31aa4 <+80>:  bl     0x32bcc                   ; dowait at jobs.c:1142
> ->  0x31aa8 <+84>:  cmn    r0, #1
> 
> 
> For reference a local context around the
> SIGSEGV looks like (all lines in the range
> selected):
> 
> . . .
> 83833 102738: fcntl(2,F_DUPFD_CLOEXEC,0xa)       = 10 (0xa)
> 83833 102738: openat(AT_FDCWD,"/dev/null",O_WRONLY|O_CREAT|O_TRUNC,0666) = 3 (0x3)
> 83833 102738: dup2(3,2)                          = 2 (0x2)
> 83833 102738: close(3)                           = 0 (0x0)
> 83833 102738: unlink("./.data.json.SYR1bCaL")    = 0 (0x0)
> 83833 102738: dup2(10,2)                         = 2 (0x2)
> 83833 102738: close(10)                          = 0 (0x0)
> 83833 102738: exit(0x0)                         
> 83833 102738: process exit, rval = 0
> 77872 100638: wait4(-1,{ EXITED,val=0 },0x0,0x0) = 83833 (0x14779)
> 77872 100638: fcntl(0,F_DUPFD_CLOEXEC,0xa)       = 10 (0xa)
> 77872 100638: openat(AT_FDCWD,"/var/run/poudriere/lock-poudriere-shared-json_top.pid",O_RDONLY,00) = 3 (0x3)
> 77872 100638: dup2(3,0)                          = 0 (0x0)
> 77872 100638: close(3)                           = 0 (0x0)
> 77872 100638: fcntl(2,F_DUPFD_CLOEXEC,0xa)       = 11 (0xb)
> 77872 100638: openat(AT_FDCWD,"/dev/null",O_WRONLY|O_CREAT|O_TRUNC,0666) = 3 (0x3)
> 77872 100638: dup2(3,2)                          = 2 (0x2)
> 77872 100638: close(3)                           = 0 (0x0)
> 77872 100638: lseek(0,0x0,SEEK_CUR)              = 0 (0x0)
> 77872 100638: read(0,"77563",1024)               = 5 (0x5)
> 77872 100638: read(0,0xffffffffb9e8,1024)        = 0 (0x0)
> 77872 100638: dup2(10,0)                         = 0 (0x0)
> 77872 100638: close(10)                          = 0 (0x0)
> 77872 100638: dup2(11,2)                         = 2 (0x2)
> 77872 100638: close(11)                          = 0 (0x0)
> 77872 100638: fcntl(2,F_DUPFD_CLOEXEC,0xa)       = 10 (0xa)
> 77872 100638: openat(AT_FDCWD,"/dev/null",O_WRONLY|O_CREAT|O_TRUNC,0666) = 3 (0x3)
> 77872 100638: dup2(3,2)                          = 2 (0x2)
> 77872 100638: close(3)                           = 0 (0x0)
> 77872 100638: rmdir("/var/run/poudriere/lock-poudriere-shared-json_top") = 0 (0x0)
> 77872 100638: dup2(10,2)                         = 2 (0x2)
> 77872 100638: close(10)                          = 0 (0x0)
> 77872 100638: sigprocmask(SIG_SETMASK,{ },0x0)   = 0 (0x0)
> 77872 100638: fcntl(2,F_DUPFD_CLOEXEC,0xa)       = 10 (0xa)
> 77872 100638: openat(AT_FDCWD,"/dev/null",O_WRONLY|O_CREAT|O_TRUNC,0666) = 3 (0x3)
> 77872 100638: dup2(3,2)                          = 2 (0x2)
> 77872 100638: close(3)                           = 0 (0x0)
> 77872 100638: sigaction(SIGINFO,{ 0x239c30 SA_RESTART ss_t },{ SIG_DFL 0x0 ss_t }) = 0 (0x0)
> 83731 100161: #240(0xffffd5f0,0xffffd5f0)        = 0 (0x0)
> -- UNKNOWN FreeBSD32 SYSCALL 1 --
> 83731 100161: #1(0x0)                           
> 83731 100161: process exit, rval = 0
> 83630 100199: SIGNAL 20 (SIGCHLD) code=CLD_EXITED pid=83731 uid=0 status=0
> 83630 100199: #341(0xffffd17c)                   ERR#4 'Interrupted system call'
> 83630 100199: SIGNAL 11 (SIGSEGV) code=SEGV_MAPERR trapno=36 addr=0xffffffe1
> 83630 100199: process killed, signal = 11 (core dumped)
> 83316 100123: #7(0xffffffff,0xffffca58,0x0,0x0)  = 83630 (0x146ae)
> -- UNKNOWN FreeBSD32 SYSCALL 477 --
> 83316 100123: #477(0x0,0x7000,0x3,0xc001002,0xffffffff,0x40401428,0x0,0x0) = 1077833728 (0x403e7000)
> -- UNKNOWN FreeBSD32 SYSCALL 552 --
> 83316 100123: #552(0xffffff9c,0xffffc504,0xffffc908,0x0) ERR#2 'No such file or directory'
> -- UNKNOWN FreeBSD32 SYSCALL 552 --
> 83316 100123: #552(0xffffff9c,0xffffc504,0xffffc908,0x0) ERR#2 'No such file or directory'
> -- UNKNOWN FreeBSD32 SYSCALL 552 --
> 83316 100123: #552(0xffffff9c,0xffffc504,0xffffc908,0x0) ERR#2 'No such file or directory'
> -- UNKNOWN FreeBSD32 SYSCALL 552 --
> 83316 100123: #552(0xffffff9c,0xffffc504,0xffffc908,0x0) ERR#2 'No such file or directory'
> -- UNKNOWN FreeBSD32 SYSCALL 477 --
> 83316 100123: #477(0x0,0x1000,0x3,0xc001002,0xffffffff,0x40401428,0x0,0x0) = 1077862400 (0x403ee000)
> -- UNKNOWN FreeBSD32 SYSCALL 4 --
> 83316 100123: #4(0x2,0x403ee000,0x21)            = 33 (0x21)
> -- UNKNOWN FreeBSD32 SYSCALL 477 --
> 83316 100123: #477(0x0,0x1000,0x3,0xc001002,0xffffffff,0x40401428,0x0,0x0) = 1077866496 (0x403ef000)
> -- UNKNOWN FreeBSD32 SYSCALL 477 --
> 83316 100123: #477(0x0,0x1000,0x3,0xc001002,0xffffffff,0x40401428,0x0,0x0) = 1077870592 (0x403f0000)
> -- UNKNOWN FreeBSD32 SYSCALL 477 --
> 83316 100123: #477(0x0,0x1000,0x3,0xc001002,0xffffffff,0x40401428,0x0,0x0) = 1077874688 (0x403f1000)
> -- UNKNOWN FreeBSD32 SYSCALL 4 --
> 83316 100123: #4(0x1,0x403ef000,0x2e)            = 46 (0x2e)
> -- UNKNOWN FreeBSD32 SYSCALL 542 --
> 83316 100123: #542(0xffffcd54,0x0)               = 0 (0x0)
> -- UNKNOWN FreeBSD32 SYSCALL 2 --
> 83842 100199: <new process>
> 83316 100123: #2()                               = 83842 (0x14782)
> -- UNKNOWN FreeBSD32 SYSCALL 6 --
> -- UNKNOWN FreeBSD32 SYSCALL 6 --
> 83316 100123: #6(0x7)                            = 0 (0x0)
> 83842 100199: #6(0x5)                            = 0 (0x0)
> . . .

Turns out that the failure happens on the
processors with out-of-order execution and
the like but works on the strictly in-order
cortex-a53. (For as much testing as I've
done.)

So it looks like some form of synchronization
is missing that in-order-only does not need.
(This would be the 2nd time I've run into such
for FreeBSD aarch64 if it holds true. The
prior example was fixed a fair time ago.)


The testing status . . .


Problem replicated using the following contexts
to attempt the textproc/itstool build, targeting
armv7 (cortex-a7):

cortex-a72 aarch64 MACHHIATObin Double Shot
cortex-a57 aarch64 OverDrive 1000

(No successful builds for the above 2,
all stopping in configure the same way.)


No problem using the following to build
textproc/itstool, targeting armv7:

cortex-a53 aarch64 Rock64    (armv7 on aarch64 case)
cortext-a7 armv7   OrangePi+ 2ed (native armv7 case)


It will take a long time to run a full
poudriere bulk that will build about
200 ports, targeting the cortex-a7 on
the slower cortex-a53: days. So
further evidence that the cortex-a53
does not get the problem will take a
while.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)
Received on Wed Mar 10 2021 - 04:11:32 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:27 UTC