Re: A reliable port cross-build failure (hangup) in my context (amd64->armv7 cross build, with native-tool speedup involved)

From: Mark Millard <marklmi_at_yahoo.com>
Date: Fri, 28 Dec 2018 00:16:48 -0800
[The historical notes are removed and replaced by partial trace
information from example hang-ups, not that I've figured out
what contributes yet.]

I ran into the following while trying to get evidence
about the hang-up for an amd64->armv7 cross-build of
multimedia/gstreamer1-qt_at_qt5 .

The following from trying to get evidence for the hang-up
via a manual run of "make multimedia/gstreamer1-qt FLAVOR=qt5”
in a poudriere bulk -i’s interactive mode for the context
that has the hang-up in normal poudriere-devel runs.


From top after the hang-up (to identify some context):

14528 root          2  52    0   100M    24M    0 kqread  11   0:00   0.00% /usr/local/bin/qemu-arm-static /usr/local/bin/cmake -E cmake_autogen /wrkdirs/usr/ports/multimedia/gstreamer1-qt/work-qt5/.
14527 root          2  52    0    88M    13M    0 select  22   0:00   0.00% /usr/local/bin/qemu-arm-static ninja -j1 -v all

from ps -auxd as well (to identify more context):

root       10114    0.0  0.0  10328  1756  1  I+J  13:47       0:00.01 |                 `-- make FLAVOR=qt5
root       14526    0.0  0.0  10204  1792  1  I+J  13:50       0:00.00 |                   `-- /bin/sh -e -c (cd /wrkdirs/usr/ports/multimedia/gstreamer1-qt/work-qt5/.build; if ! /usr/bin/env QT_SELE
root       14527    0.0  0.0  90304 13084  1  I+J  13:50       0:00.09 |                     `-- /usr/local/bin/qemu-arm-static ninja -j1 -v all
root       14528    0.0  0.0 102876 25060  1  IJ   13:50       0:00.12 |                       `-- /usr/local/bin/qemu-arm-static /usr/local/bin/cmake -E cmake_autogen /wrkdirs/usr/ports/multimedia/g

I had made a qemu-user-static that enabled do_strace when
it is used to run cmake or ninja.

The only do_strace lines from qemu-arm-static running cmake
or ninja mentioning process 14528 are included in the sequence:

(Before the below was a long list of "14527 fstatat” lines.
I’ll note that "'Unknown syscall 545” is from ppoll use.)

82400 sigprocmask(1,-1610620016,-191968524,-186261416,0,24) = 0
82400 sigaction(2,-1610620040,-191968596,-186261584,210460,0) = 0
82400 sigaction(15,-1610620040,-191968572,-186261584,210460,0) = 0
82400 sigaction(1,-1610620040,-191968548,-186261584,210460,0) = 0
82400 gettimeofday(-1610619984,0,4,-186261584,-1610619440,-1610619528) = 0
82400 gettimeofday(-1610619984,0,4,359949,1545969996,0) = 0
82400 gettimeofday(-1610620120,0,4,2,-184666112,-1610619520) = 0
82400 fstatat(-100,"elements/gstqtvideosink/CMakeFiles", 0x9fffe200, 0) = 0
82400 fstatat(-100,"elements/gstqtvideosink/gstqt5videosink_autogen", 0x9fffe200, 0) = 0
82400 pipe2(-1610620176,0,-1610620108,0,-1610620120,167084) = 0
82400 fcntl(5,1,-1610620108,-185863932,-192200556,-1610620228) = 0
82400 fcntl(5,2,1,-185863932,-192200556,-1610620228) = 0
82400 vfork(0,66450,-186876196,-1610620184,-1610620240,0) = 82401
82400 close(6) = 0
 = 0
82400 Unknown syscall 545
82401 setpgid(0,0,-186876196,-1610620184,-1610620240,0) = 0
82401 sigprocmask(3,-191586912,0,-1610620184,-1610620240,0) = 0
82401 close(5) = 0
82401 open("/dev/null",0,0) = 5
82401 dup2(5,0,0,-1610620184,-1610620240,0) = 0
82401 close(5) = 0
82401 fcntl(0,2,0,-1610620184,-1610620240,0) = 0
82401 dup2(6,1,0,-1610620184,-1610620240,0) = 1
82401 fcntl(1,2,0,-1610620184,-1610620240,0) = 0
82401 dup2(6,2,0,-1610620184,-1610620240,0)82400 sigpending(-1610620072,1,0,-191968524,0,0) = 0

The vfork then close(6) sequence for 82400 vs. the later
use of 6 in dup2 in 82401 may be rather odd. But it looks
like qemu-*-static uses do_freebsd_fork to implement
do_freebsd_vfork, despite reporting vfork before
calling do_freebsd_vfork. (Does the close(6) appear to
indicate a race for native operation of ninja for the
period when the address space is shared?)

Ninja has Subprocess::Start code that has:

#ifdef POSIX_SPAWN_USEVFORK
  flags |= POSIX_SPAWN_USEVFORK;
#endif


  if (posix_spawnattr_setflags(&attr, flags) != 0)
    Fatal("posix_spawnattr_setflags: %s", strerror(errno));

  const char* spawned_args[] = { "/bin/sh", "-c", command.c_str(), NULL };
  if (posix_spawn(&pid_, "/bin/sh", &action, &attr,
                  const_cast<char**>(spawned_args), environ) != 0)
    Fatal("posix_spawn: %s", strerror(errno));

that is in use here. I think that this explains the vfork use.


It turns out that putting the hung-up build in the background
and then killing 82401 with the likes of kill -6 leads to more
output that had apparently been buffered. It shows the use of
the (amd64 native) /bin/sh that in turn leads to
/usr/local/bin/cmake via qemu-arm-static. /bin/sh, being
native, gets no do_strace output from qemu-arm-static.

82400 sigpending(-1610620072,1,0,-191968524,0,0) = 0
82400 read(5,0x9fffd368,4096) = 58
82400 Unknown syscall 545
82400 sigpending(-1610620072,1,0,-191968524,0,0) = 0
82400 read(5,0x9fffd368,4096) = 0
82400 close(5) = 0
82400 wait4(82401,-1610620004,0,0,-191968640,0) = 82401
82400 mmap(0,86016,3,201330690,-1,-1610620169) = 0xf4777000
82400 gettimeofday(-1610620224,0,4,-1610619944,31,16777216) = 0
82400 write(1,0xf4950000,283)[1/129] cd /wrkdirs/usr/ports/multimedia/gstreamer1-qt/work-qt5/.build/elements/gstqtvideosink && /usr/local/bin/cmake -E cmake_autogen /wrkdirs/usr/ports/multimedia/gstreamer1-qt/work-qt5/.build/elements/gstqtvideosink/CMakeFiles/gstqt5videosink_autogen.dir/AutogenInfo.cmake Debug
 = 283
82400 write(1,0xf4950000,137)FAILED: elements/gstqtvideosink/CMakeFiles/gstqt5videosink_autogen elements/gstqtvideosink/gstqt5videosink_autogen/mocs_compilation.cpp 
 = 137
82400 write(1,0xf4950000,275)cd /wrkdirs/usr/ports/multimedia/gstreamer1-qt/work-qt5/.build/elements/gstqtvideosink && /usr/local/bin/cmake -E cmake_autogen /wrkdirs/usr/ports/multimedia/gstreamer1-qt/work-qt5/.build/elements/gstqtvideosink/CMakeFiles/gstqt5videosink_autogen.dir/AutogenInfo.cmake Debug
 = 275
82400 write(1,0xf4950000,5) = 2
 = 5

(Note that some 82400 writes are reporting 82401 information:)

82400 write(1,0xf4950000,49)82401 fcntl(2,2,0,-1610620184,-1610620240,0) = 0
 = 49
82400 write(1,0xf4950000,19)82401 close(6) = 0
 = 19
82400 write(1,0xf4950000,401)82401 execve("/bin/sh",{"/bin/sh","-c","cd /wrkdirs/usr/ports/multimedia/gstreamer1-qt/work-qt5/.build/elements/gstqtvideosink && /usr/local/bin/cmake -E cmake_autogen /wrkdirs/usr/ports/multimedia/gstreamer1-qt/work-qt5/.build/elements/gstqtvideosink/CMakeFiles/gstqt5videosink_autogen.dir/AutogenInfo.cmake Debug",NULL})82401 __sysctl({ 0 3 }, 2, 0x9fffda80, 0x9fffdf64, 0xf5002097, 0x0000000c) = 0
 = 401

(The /bin/sh activity is not logged: /bin/sh is native amd64 code here. The
below is from the later /usr/local/bin/cmake via qemu-arm-static.

. . . (much omitted) . . .

82400 write(1,0xf4950000,60)82401 mmap(0,28672,3,201330690,-1,-1610621989) = 0xf41a0000
 = 60
82400 write(1,0xf4950000,74)82401 clock_gettime(4,-1610621832,4,-199622492,-199622492,-199622656) = 0
 = 74
82400 write(1,0xf4950000,62)82401 kqueue(-199622656,0,53102,0,-199622656,-1610621444) = 3
 = 62
82400 write(1,0xf4950000,81)82401 ioctl(3, 0x20006601 { IO GRP:0x66('f') CMD:1 LEN:0 }, 0x0000cf6e, ...) = 0
 = 81

. . . (some omitted) . . .

(Then there is a fairly long sequence of access's and then a sequence of
fstatat's just before:)


82400 write(1,0xf4950000,32)82401 write(9,0xf4e1a945,1) = 1
 = 32
82400 write(1,0xf4950000,61)82401 clock_gettime(4,-1610622624,4,100863,1,-199483392) = 0
 = 61
82400 write(1,0xf4950000,106)82401 kevent(3,-1610688200,2,-1610688200,1024,0)qemu: uncaught target signal 6 (Abort trap) - core dumped
 = 106
82400 write(1,0xf4950000,41)ninja: build stopped: subcommand failed.
 = 41

So it was hung at the kevent until the kill -6 .


Via another experiment ninja was at the time waiting
in ppoll:

Reading symbols from ninja...done.
[New LWP 73023]
Core was generated by `ninja'.
Program terminated with signal SIGABRT, Aborted.
#0  0xf4e5e0dc in _ppoll () from /lib/libc.so.7
(gdb) bt
#0  0xf4e5e0dc in _ppoll () from /lib/libc.so.7
#1  0x00033bf0 in SubprocessSet::DoWork (this=<optimized out>) at src/subprocess-posix.cc:237
Backtrace stopped: previous frame inner to this frame (corrupt stack?)



===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)
Received on Fri Dec 28 2018 - 07:16:54 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:19 UTC