Re: FYI: a Pine64+ 2G aarch64 head -r308247 crash and some information about it

From: Mark Millard <markmi_at_dsl-only.net>
Date: Mon, 7 Nov 2016 22:42:08 -0800
On 2016-Nov-7, at 8:54 PM, Mark Millard <markmi at dsl-only.net> wrote:

> This is just in case any of the information happens to prove useful/interesting.
> I'm not expecting any assistance.
> 
> Note: After the crash ddb was not responding to input so this is all that I have.
> 
> Note: This was an experiment with head -r308247 but was built like stable for
>      performance issues (GENERIC included and then overridden, not GENERIC-UP
>      based).
> 
> The below was found via dmesg and/or /var/log/messages content while the Pine64
> was busy building lang/gcc6 and its prerequisites (as a stability test).
> 
> It got lots of spurious interrupt notices per second, such as:
> 
>> gic0: Spurious interrupt detected: last irq: 27 on CPU0
>> gic0: Spurious interrupt detected: last irq: 27 on CPU2
>> gic0: Spurious interrupt detected: last irq: 106 on CPU3
>> gic0: Spurious interrupt detected: last irq: 27 on CPU1
>> gic0: Spurious interrupt detected: last irq: 27 on CPU1
>> gic0: gic0: Spurious interrupt detected: last irq: 27 on CPU1
>> Spurious interrupt detected: last irq: 27 on CPU3
>> gic0: gic0: Spurious interrupt detected: last irq: 27 on CPU0
>> Spurious interrupt detected: last irq: 27 on CPU1
>> gic0: Spurious interrupt detected: last irq: 27 on CPU1
>> gic0: Spurious interrupt detected: last irq: 27 on CPU2
>> gic0: Spurious interrupt detected: last irq: 27 on CPU3
> 
> 27 happened the most by far. 106 was fairly rare. I'd not noticed any other
> figures. From what I saw all were "gic0".
> 
> sh had a few signal 11's and one signal 4 as of when I had last checked:
> 
>> pid 13900 (sh), uid 0: exited on signal 11 (core dumped)
>> pid 19325 (sh), uid 0: exited on signal 11 (core dumped)
>> pid 49697 (sh), uid 0: exited on signal 11 (core dumped)
>> pid 68390 (sh), uid 0: exited on signal 4 (core dumped)
>> pid 81149 (sh), uid 0: exited on signal 11 (core dumped)
> 
> I did not notice any other core dumps.
> 
> And the following happened once that I noticed:
> 
>> (da0:umass-sim0:0:0:0): WRITE(10). CDB: 2a 00 16 a3 a4 80 00 00 40 00 
>> (da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
>> (da0:umass-sim0:0:0:0): Retrying command
> 
> The root filesystem was on a USB SSD.
> 
> The above was all from a ssh session history. The below is from the serial
> console. . .
> 
> Later it got the crash:
> 
>> panic: ARM64TODO: reclaim_pv_chunk
>> cpuid = 2
>> KDB: stack backtrace:
>> db_trace_self() at db_trace_self_wrapper+0x28
>>        pc = 0xffff00000068b430  lr = 0xffff0000000631dc
>>        sp = 0xffff000083758080  fp = 0xffff000083758290
>> 
>> db_trace_self_wrapper() at vpanic+0x170
>>        pc = 0xffff0000000631dc  lr = 0xffff000000335f10
>>        sp = 0xffff0000837582a0  fp = 0xffff000083758320
>> 
>> vpanic() at panic+0x4c
>>        pc = 0xffff000000335f10  lr = 0xffff000000335d9c
>>        sp = 0xffff000083758330  fp = 0xffff0000837583b0
>> 
>> panic() at reclaim_pv_chunk+0x10
>>        pc = 0xffff000000335d9c  lr = 0xffff0000006a13d4
>>        sp = 0xffff0000837583c0  fp = 0xffff0000837583c0
>> 
>> reclaim_pv_chunk() at get_pv_entry+0x2bc
>>        pc = 0xffff0000006a13d4  lr = 0xffff00000069c270
>>        sp = 0xffff0000837583d0  fp = 0xffff000083758400
>> 
>> get_pv_entry() at pmap_enter+0x68c
>>        pc = 0xffff00000069c270  lr = 0xffff00000069b41c
>>        sp = 0xffff000083758410  fp = 0xffff0000837584b0
>> 
>> pmap_enter() at vm_fault_hold+0x2f0
>>        pc = 0xffff00000069b41c  lr = 0xffff000000641eb8
>>        sp = 0xffff0000837584c0  fp = 0xffff000083758600
>> 
>> vm_fault_hold() at vm_fault_quick_hold_pages+0x120
>>        pc = 0xffff000000641eb8  lr = 0xffff000000645004
>>        sp = 0xffff000083758610  fp = 0xffff000083758670
>> 
>> vm_fault_quick_hold_pages() at vn_io_fault1+0x250
>>        pc = 0xffff000000645004  lr = 0xffff00000042b788
>>        sp = 0xffff000083758680  fp = 0xffff0000837587c0
>> 
>> vn_io_fault1() at vn_io_fault+0x170
>>        pc = 0xffff00000042b788  lr = 0xffff0000004297a4
>>        sp = 0xffff0000837587d0  fp = 0xffff000083758840
>> 
>> vn_io_fault() at dofilewrite+0xbc
>>        pc = 0xffff0000004297a4  lr = 0xffff0000003a35e4
>>        sp = 0xffff000083758850  fp = 0xffff000083758890
>> 
>> dofilewrite() at kern_writev+0x6c
>>        pc = 0xffff0000003a35e4  lr = 0xffff0000003a32d4
>>        sp = 0xffff0000837588a0  fp = 0xffff0000837588e0
>> 
>> kern_writev() at sys_write+0x7c
>>        pc = 0xffff0000003a32d4  lr = 0xffff0000003a3258
>>        sp = 0xffff0000837588f0  fp = 0xffff000083758930
>> 
>> sys_write() at do_el0_sync+0x6fc
>>        pc = 0xffff0000003a3258  lr = 0xffff0000006a2778
>>        sp = 0xffff000083758940  fp = 0xffff000083758a70
>> 
>> do_el0_sync() at handle_el0_sync+0x64
>>        pc = 0xffff0000006a2778  lr = 0xffff00000068d1d0
>>        sp = 0xffff000083758a80  fp = 0xffff000083758b90
>> 
>> handle_el0_sync() at 0x696ff0
>>        pc = 0xffff00000068d1d0  lr = 0x0000000000696ff0
>>        sp = 0xffff000083758ba0  fp = 0x0000ffffffffc610
>> 
>> KDB: enter: panic
>> [ thread pid 850 tid 100149 ]
>> Stopped at      kdb_enter+0x40: undefined       d4200000
> 
> 
> Side notes:
> 
> To do the lang/gcc6 test I adjusted lang/gcc6/Makefile to (hopefully) allow
> an aarch64 build attempt:
> 
>> # svnlite diff /usr/ports/lang/gcc6
>> Index: /usr/ports/lang/gcc6/Makefile
>> ===================================================================
>> --- /usr/ports/lang/gcc6/Makefile	(revision 424540)
>> +++ /usr/ports/lang/gcc6/Makefile	(working copy)
>> _at__at_ -35,7 +35,7 _at__at_
>> DISTVERSION=	${PORTVERSION:C/([0-9]+).*\.s([0-9]+)/\1-\2/}
>> GCC_VERSION=	${PORTVERSION:C/(.+)\.s[0-9]{8}/\1/}
>> SUFFIX=		${PORTVERSION:C/([0-9]+).*/\1/}
>> -ONLY_FOR_ARCHS=	amd64 i386 powerpc powerpc64 sparc64 arm armv6 armv6hf
>> +ONLY_FOR_ARCHS=	amd64 i386 powerpc powerpc64 sparc64 arm armv6 armv6hf aarch64
>> USES=		compiler cpe gmake iconv libtool makeinfo perl5 tar:bzip2
>> USE_BINUTILS=	yes
>> USE_PERL5=	build
>> _at__at_ -73,7 +73,7 _at__at_
>> USE_GCC=	4.9+
>> .endif
>> 
>> -.if ${ARCH} == "armv6" || ${ARCH} == "armv6hf"
>> +.if ${ARCH} == "armv6" || ${ARCH} == "armv6hf" || ${ARCH} == "aarch64"
>> . if ${COMPILER_TYPE} == clang
>> .  if empty(PORT_OPTIONS:MBOOTSTRAP)
>> MAKE_ARGS+=CXXFLAGS=-fbracket-depth=512
> 
> It did not make it to lang/gcc6 itself but did build several things.
> 
> (I did not care if lang/gcc6 really worked or not: Environment stability
> test.)
> 
> ===
> Mark Millard
> markmi at dsl-only.net

There was fsck activity after cutting and applying power but luckily the SSD
content (largely) survived.

The compressed message log shows the following as an example of when the "gic0"
Spurious interrupts start:

> Nov  7 18:54:28 pine64 kernel: ugen3.2: <OWC Envoy Pro mini> at usbus3
> Nov  7 18:54:28 pine64 kernel: umass0 on uhub3
> Nov  7 18:54:28 pine64 kernel: umass0: <OWC Envoy Pro mini, class 0/0, rev 2.10/1.00, addr 2> on usbus3
> Nov  7 18:54:28 pine64 kernel: umass0:  SCSI over Bulk-Only; quirks = 0x0100
> Nov  7 18:54:28 pine64 kernel: umass0:0:0: Attached to scbus0
> Nov  7 18:54:28 pine64 kernel: mountroot: waiting for device /dev/ufs/PINE642Grootfs...
> Nov  7 18:54:28 pine64 kernel: da0 at umass-sim0 bus 0 scbus0 target 0 lun 0
> Nov  7 18:54:28 pine64 kernel: da0: <OWC Envoy Pro mini 0> Fixed Direct Access SPC-4 SCSI device
> Nov  7 18:54:28 pine64 kernel: da0: Serial Number <REMOVED>
> Nov  7 18:54:28 pine64 kernel: da0: 40.000MB/s transfers
> Nov  7 18:54:28 pine64 kernel: da0: 228936MB (468862128 512 byte sectors)
> Nov  7 18:54:28 pine64 kernel: da0: quirks=0x2<NO_6_BYTE>
> Nov  7 18:54:28 pine64 kernel: awg0: link state changed to DOWN
> Nov  7 18:54:28 pine64 kernel: gic0: Spurious interrupt detected: last irq: 27 on CPU2
> Nov  7 18:54:28 pine64 kernel: awg0: link state changed to UP
> Nov  7 18:54:28 pine64 kernel: gic0: Spurious interrupt detected: last irq: 27 on CPU3
> Nov  7 18:54:31 pine64 kernel: .
> Nov  7 18:54:31 pine64 kernel: gic0: Spurious interrupt detected: last irq: 27 on CPU2
> Nov  7 18:54:31 pine64 ntpd[602]: ntpd 4.2.8p8-a (1): Starting
> Nov  7 18:54:31 pine64 ntpd[603]: leapsecond file ('/var/db/ntpd.leap-seconds.list'): good hash signature
> Nov  7 18:54:31 pine64 ntpd[603]: leapsecond file ('/var/db/ntpd.leap-seconds.list'): loaded, expire=2017-06-01T00:00:00Z last=2017-01-01T00:00:00Z ofs=37
> Nov  7 19:36:17 pine64 kernel: gic0: gic0: Spurious interrupt detected: last irq: 27 on CPU1
> Nov  7 19:36:17 pine64 kernel: Spurious interrupt detected: last irq: 106 on CPU3
> Nov  7 19:36:17 pine64 kernel: gic0: Spurious interrupt detected: last irq: 27 on CPU1
> Nov  7 19:36:17 pine64 kernel: gic0: Spurious interrupt detected: last irq: 27 on CPU2
> Nov  7 19:36:17 pine64 kernel: gic0: Spurious interrupt detected: last irq: 27 on CPU0
> Nov  7 19:36:17 pine64 last message repeated 2 times
> Nov  7 19:36:17 pine64 kernel: gic0: gic0: Spurious interrupt detected: last irq: 27 on CPU0
> Nov  7 19:36:17 pine64 kernel: Spurious interrupt detected: last irq: 27 on CPU3

Always shortly after awg0's link state changed to down after the da0 information.


In looking at the logs I've seen a 32 and some 114's and 92's as well as the 27's and 106's.
There were a few brief periods over which non-27's were fairly frequent.

There are a few of the CAM status message blocks, about 7 over everything recorded,
including a few prior boots of the Pine64.


An interesting point is that somewhat before and while the lang/gcc6 related build activity
was active the "gic0" notices stopped:

> Nov  8 02:44:40 pine64 kernel: gic0: Spurious interrupt detected: last irq: 27 on CPU3
> Nov  8 03:01:15 pine64 kernel: pid 13900 (sh), uid 0: exited on signal 11 (core dumped)
> Nov  8 03:01:39 pine64 kernel: pid 19325 (sh), uid 0: exited on signal 11 (core dumped)
> Nov  8 03:02:02 pine64 pkg-static: indexinfo-0.2.5 installed
> Nov  8 03:02:04 pine64 pkg-static: gettext-runtime-0.19.8.1 installed
> Nov  8 03:04:42 pine64 kernel: pid 49697 (sh), uid 0: exited on signal 11 (core dumped)
> Nov  8 03:05:26 pine64 kernel: pid 68390 (sh), uid 0: exited on signal 4 (core dumped)
> Nov  8 03:06:10 pine64 kernel: pid 81149 (sh), uid 0: exited on signal 11 (core dumped)
> Nov  8 03:08:36 pine64 pkg-static: gettext-tools-0.19.8.1 installed
> Nov  8 03:09:07 pine64 pkg-static: gmake-4.2.1_1 installed
> Nov  8 03:27:01 pine64 pkg-static: perl5-5.24.1.r4 installed
> Nov  8 03:27:33 pine64 pkg-static: p5-Locale-gettext-1.06 installed
> Nov  8 03:27:42 pine64 pkg-static: help2man-1.43.3_1 installed
> Nov  8 03:30:22 pine64 pkg-static: texinfo-6.1.20160425,1 installed
> Nov  8 03:31:59 pine64 pkg-static: m4-1.4.17_1,1 installed
> Nov  8 03:34:04 pine64 pkg-static: bison-2.7.1,1 installed
> Nov  8 03:37:44 pine64 pkg-static: gmp-5.1.3_3 installed
> Nov  8 03:39:14 pine64 pkg-static: mpfr-3.1.5 installed
> Nov  8 03:44:53 pine64 kernel: (da0:umass-sim0:0:0:0): WRITE(10). CDB: 2a 00 16 bd 19 c0 00 00 80 00 
> Nov  8 03:44:53 pine64 kernel: (da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
> Nov  8 03:44:53 pine64 kernel: (da0:umass-sim0:0:0:0): Retrying command
> Nov  8 03:44:55 pine64 kernel: gic0: Spurious interrupt detected: last irq: 106 on CPU3
> Nov  8 03:44:55 pine64 kernel: gic0: Spurious interrupt detected: last irq: 106 on CPU3

Somewhat after the mpfr installation above the "gic0" notices return.

devel/binutils started to build but did not complete.



===
Mark Millard
markmi at dsl-only.net
Received on Tue Nov 08 2016 - 05:42:11 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:08 UTC