Re: TARGET_ARCH=powerpc head -r317820 production-style kernel: periodic panics always in pid=11 (the Idle threads)

From: Mark Millard <markmi_at_dsl-only.net>
Date: Sat, 20 May 2017 02:01:55 -0700
On 2017-May-19, at 9:42 PM, Mark Millard <markmi_at_dsl-only.net> wrote:

> On 2017-May-9, at 2:00 PM, Mark Millard <markmi_at_dsl-only.net> wrote:
> 
> . . .
>> fatal kernel trap:
>>  exception       = 0x903a64e (unknown)
>>  srr0            = 0x7ff760
>>  srr1            = 0xc1007c
>>  lr              = 0x907f
>>  curthread       = 0x147d6c0
>>         pid = 11, comm = idle: cpu0
>> [ thread pid 11 tid 100003 ]
>> Stopped at      ffs_truncate+0x1080:    stw     r11, 0xf8(r31)
>> 
>> 1 contains (cpu1 instead of cpu0, so different tid):
>> 
>> fatal kernel trap:
>>  exception       = 0x903a64e (unknown)
>>  srr0            = 0x7ff760
>>  srr1            = 0xc1007c
>>  lr              = 0x907f
>>  curthread       = 0x147d360
>>         pid = 11, comm = idle: cpu1
>> [ thread pid 11 tid 100004 ]
>> Stopped at      ffs_truncate+0x1080:    stw     r11, 0xf8(r31)
>> 
>> 1 contains:
> 
> I've discovered where to find the trapframe
> in the vmcore.* files for these specific
> examples with 0x903a64e as the exception
> and such.
> 
> In the vmcore the memory image starts at
> byte offset 0x1000.
> 
> To see the values reported the only
> place in the image file to start that
> produces those values at the offsets
> for in side the powerpc trapframe is:
> 
> offset 0x1001 in the vmcore.* file.
> 
> So memory address 0x1 is being used
> as the trapframe address when that
> odd exception information is being
> displayed. Yep: misaligned.
> 
> The decoding is not of the actual
> trapframe: it is garbage that is
> not to be believed.
> 
> 
> Note: I lucked out after the above and
> got a somewhat different odd trap information
> that lead to actually getting a backtrace
> that included the actual pid 11 cpu 1 kernel
> thread stack bt associated with that odd
> information display.

Typo: That should have been "cpu 2".

> I'll send a separate reply for that information
> as it will take some transcribing from camera
> pictures and such.


As indicated, I got a different odd trap report
that gave a backtrace. . .

fatal user trap

exception = 0x4210000 (unknown)
srr0      = 0xc1007c09
srr1      = 0x3a64e80
lr        = 0xc0807fc9
curthread = 0x147d000
pid = 11, comm = idle: cpu 2

Now at this point it attempted to
db_print_loc_and_inst and got
another exception (at offset +0x60
in the routine).

So the backtrace has both the
consequences of that and what lead
up to that: an EXI trap was
attempting to report trap frame
information but was using a bad
address for the supposed frame.

The details of the backtrace:

panic: data storage interrupt trap
cpuid = 2
time = 145187154
KDB: stack backtrace
0xdf5ef2c0: at kdb_backtrace+0x5c
0xdf5ef3a0: at panic+0x54
0xdf5ef3f0: at trap_fatal+0x1cc
0xdf5ef420: at powerpc_interrupt+0x180
0xdf5ef5c0: kernel DSI read trap _at_ 0xc1007c09
            by db_disasm+0x30:
            srr1=0x1032
            r1  =0xdf5ef6b0
            cr  =0x24009022
            xer =0
            ctr =0x1852cc
            sr  =0x40000000
0xdf5ef6b0: at 0x1007480
0xdf5ef6d0: at db_print_loc_and_inst+0x60
0xdf5ef700: at db_trap+0x104
0xdf5ef790: at kdb_trap+0x1bc
0xdf5ef810: at trap_fatal+0x1b0
0xdf5ef840: at trap+0x1184
0xdf5ef870: kernel EXI trap
            by cpu_idle_60x+0x88:
            srr1=0x1032
            r1  =0xdf5ef930
            cr  =0x40000042
            xer =0x20000000
            ctr =0x8e3bd8
saved LR(0x2) is invalid.

So an EXI trap was attempting to
report a trap frame.

(Note: the LR's for pid 11 cpu threads
normally report an invalid LR in ddb.)



The actual EXI trapframe starts at 013f0878 in
vmcore.5:

013f0870  df 5e f9 30 00 10 08 f8  00 04 90 32 df 5e f9 30  |.^.0.......2.^.0|
013f0880  01 47 d0 00 00 00 00 00  25 94 48 3f 00 00 00 00  |.G......%.H?....|
013f0890  25 94 48 3f 00 4a a9 c8  00 00 00 00 00 00 00 44  |%.H?.J.........D|
013f08a0  01 fc a0 55 00 00 90 32  df 5d 1d 00 00 00 00 00  |...U...2.]......|
013f08b0  00 d4 bd ec 00 cb 98 98  00 c9 66 bc 00 c4 5d 08  |..........f...].|
013f08c0  00 c9 66 bc 00 d4 c5 3c  df 5e f9 e0 00 eb a7 80  |..f....<.^......|
013f08d0  00 c9 66 bc 01 47 d0 00  df 5e f9 8c 00 00 00 06  |..f..G...^......|
013f08e0  00 00 00 06 00 eb b5 80  00 00 00 00 00 8e 3b d8  |..............;.|
013f08f0  00 d2 6b f0 df 5e f9 30  00 8e 3b f4 40 00 00 42  |..k..^.0..;._at_..B|
013f0900  20 00 00 00 00 8e 3b d8  00 8e 3c 60 00 00 90 32  | .....;...<`...2|
013f0910  00 00 05 00 41 a1 d5 d4  42 00 00 00 00 00 00 00  |....A...B.......|

So:

r0    = 0x00049032
r1    = 0xdf5ef930
r2    = 0x0147d000
r3    = 0x00000000
r4    = 0x2594483f
r5    = 0x00000000
r6    = 0x2594483f
r7    = 0x004aa9c8
r8    = 0x00000000
r9    = 0x00000044
r10   = 0x01fca055
r11   = 0x00009032
r12   = 0xdf5d1d00
r13   = 0x00000000
r14   = 0x00d4bdec
r15   = 0x00cb9898
r16   = 0x00c966bc
r17   = 0x00c45d08
r18   = 0x00c966bc
r19   = 0x00d4c53c
r20   = 0xdf5ef9e0
r21   = 0x00eba780
r22   = 0x00c966bc
r23   = 0x1047d000
r24   = 0xdf5ef98c
r25   = 0x00000006 (this value shows up later in a bad spot)
r26   = 0x00000006 (this value shows up next to that)
r27   = 0x00ebb580
r28   = 0x00000000
r29   = 0x008e3bd8
r30   = 0x00d26bf0
r31   = 0xdf5ef930
lr    = 0x008e3bf4
cr    = 0x40000042
xer   = 0x20000000
ctr   = 0x008e3bd8
srr0  = 0x008e3c60
srr1  = 0x00009032
exc   = 0x00000500
dar   = 0x41a1d5d4
dsisr = 0x42000000

Other elements of the stack leading to this are:

013f0920  45 8b a7 b5 b1 fd 96 be  00 00 00 00 00 00 00 04  |E...............|
013f0930  df 5e f9 50 00 00 00 06  00 00 00 06 00 eb b5 80  |.^.P............|
                    (odd lr value; value repeats
                     above, not  ; here. r25/r26
                     a multiple  ; pair saved
                     of 4 even.  ; in those?
                     matches r25 ; matches r26
                     in trapframe; in trapframe)

013f0940  00 00 00 00 00 d4 ca 34  00 d2 6b f0 df 5e f9 50  |.......4..k..^.P|
013f0950  df 5e f9 70 00 8e 31 7c  00 00 00 02 00 eb b5 80  |.^.p..1|........|
013f0960  00 f2 d5 fc 00 00 00 06  00 d1 ca ac df 5e f9 70  |.............^.p|
013f0970  df 5e fa 50 00 53 6e 58  df 5e f9 80 00 00 00 00  |.^.P.SnX.^......|
. . .
013f0a50  df 5e fa 80 00 4a 3c b4  df 5e fa 60 fa 50 05 af  |.^...J<..^.`.P..|
013f0a60  df 5e fa 80 00 00 00 00  00 00 00 00 00 00 00 00  |.^..............|
013f0a70  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
013f0a80  00 00 00 00 00 8f 18 90  00 53 69 84 00 00 00 00  |.........Si.....|

From which I get from the stack
backpointer/lr-value pairs(via objdump
on the kernel):

00000006: ????????????????????
008e317c: cpu_idle+0x58
00536e58: scheduletd+0x4d4
004a3cb4: fork_exit+0xf8
008f1890: fork_trampoline+0x10

But cpu_idle+0x58 is the bl ,cpu_activeclock>
in the below and the prior bctrl after
the bl <cpu_idleclock> is what made the
call.

008e3124 <cpu_idle> stwu    r1,-32(r1)
008e3128 <cpu_idle+0x4> mflr    r0
008e312c <cpu_idle+0x8> stw     r29,20(r1)
008e3130 <cpu_idle+0xc> stw     r30,24(r1)
008e3134 <cpu_idle+0x10> stw     r31,28(r1)
008e3138 <cpu_idle+0x14> stw     r0,36(r1)
008e313c <cpu_idle+0x18> mr      r31,r1
008e3140 <cpu_idle+0x1c> bcl-    20,4*cr7+so,008e3144 <cpu_idle+0x20>
008e3144 <cpu_idle+0x20> mflr    r30
008e3148 <cpu_idle+0x24> lwz     r0,-36(r30)
008e314c <cpu_idle+0x28> add     r30,r0,r30
008e3150 <cpu_idle+0x2c> lwz     r29,-32768(r30)
008e3154 <cpu_idle+0x30> lwz     r0,0(r29)
008e3158 <cpu_idle+0x34> cmpwi   cr7,r0,0
008e315c <cpu_idle+0x38> beq-    cr7,008e3198 <cpu_idle+0x74>
008e3160 <cpu_idle+0x3c> cmpwi   cr7,r3,0
008e3164 <cpu_idle+0x40> bne-    cr7,008e3188 <cpu_idle+0x64>
008e3168 <cpu_idle+0x44> bl      005002a4 <critical_enter>
008e316c <cpu_idle+0x48> bl      008ad894 <cpu_idleclock>
008e3170 <cpu_idle+0x4c> lwz     r29,0(r29)
008e3174 <cpu_idle+0x50> mtctr   r29
008e3178 <cpu_idle+0x54> bctrl
008e317c <cpu_idle+0x58> bl      008ad794 <cpu_activeclock>

But ctr was reported as: 0x8e3bd8
which is cpu_idle_60x.

(So no surprises so far.)

The code through the reported cpu_idle_60x+0x88
is:

008e3bd8 <cpu_idle_60x> stwu    r1,-32(r1)
008e3bdc <cpu_idle_60x+0x4> mflr    r0
008e3be0 <cpu_idle_60x+0x8> stw     r30,24(r1)
008e3be4 <cpu_idle_60x+0xc> stw     r31,28(r1)
008e3be8 <cpu_idle_60x+0x10> stw     r0,36(r1)
008e3bec <cpu_idle_60x+0x14> mr      r31,r1
008e3bf0 <cpu_idle_60x+0x18> bcl-    20,4*cr7+so,008e3bf4 <cpu_idle_60x+0x1c>
008e3bf4 <cpu_idle_60x+0x1c> mflr    r30
008e3bf8 <cpu_idle_60x+0x20> lwz     r0,-32(r30)
008e3bfc <cpu_idle_60x+0x24> add     r30,r0,r30
008e3c00 <cpu_idle_60x+0x28> lwz     r9,-32756(r30)
008e3c04 <cpu_idle_60x+0x2c> lwz     r0,0(r9)
008e3c08 <cpu_idle_60x+0x30> cmpwi   cr7,r0,0
008e3c0c <cpu_idle_60x+0x34> beq-    cr7,008e3c78 <cpu_idle_60x+0xa0>
008e3c10 <cpu_idle_60x+0x38> mfmsr   r11
008e3c14 <cpu_idle_60x+0x3c> mfsprg  r0,7
008e3c18 <cpu_idle_60x+0x40> rlwinm  r9,r0,16,16,31
008e3c1c <cpu_idle_60x+0x44> cmpwi   cr7,r9,68
008e3c20 <cpu_idle_60x+0x48> beq-    cr7,008e3c4c <cpu_idle_60x+0x74>
008e3c24 <cpu_idle_60x+0x4c> cmplwi  cr7,r9,68
008e3c28 <cpu_idle_60x+0x50> bgt-    cr7,008e3c40 <cpu_idle_60x+0x68>
008e3c2c <cpu_idle_60x+0x54> cmpwi   cr7,r9,57
008e3c30 <cpu_idle_60x+0x58> beq-    cr7,008e3c4c <cpu_idle_60x+0x74>
008e3c34 <cpu_idle_60x+0x5c> cmpwi   cr7,r9,60
008e3c38 <cpu_idle_60x+0x60> bne+    cr7,008e3c64 <cpu_idle_60x+0x8c>
008e3c3c <cpu_idle_60x+0x64> b       008e3c4c <cpu_idle_60x+0x74>
008e3c40 <cpu_idle_60x+0x68> addi    r0,r9,-32768
008e3c44 <cpu_idle_60x+0x6c> cmplwi  cr7,r0,4
008e3c48 <cpu_idle_60x+0x70> bgt-    cr7,008e3c64 <cpu_idle_60x+0x8c>
008e3c4c <cpu_idle_60x+0x74> oris    r0,r11,4
008e3c50 <cpu_idle_60x+0x78> dssall
008e3c54 <cpu_idle_60x+0x7c> sync    
008e3c58 <cpu_idle_60x+0x80> mtmsr   r0
008e3c5c <cpu_idle_60x+0x84> isync
008e3c60 <cpu_idle_60x+0x88> b       008e3c78 <cpu_idle_60x+0xa0>

So that is the context for the EXI trap.
The "mtsmr r0" merged in PSL_POW to the
msr value when it was originally not
set. (r11 vs. r0 value comparison in
the trapframe.)

Why it would be going from without
POW for so long and then merging in
POW I do not know. (Or may be I just
did not find code that turns POW
back off on occasion.)

Still setting POW in msr seems to be
what is unique to the failure context.
Mkes me wonder if smu_doorbell_intr
and its irq is somehow involved.

There is no obvious, local tie to
r25 and r26 in/for the spots identified
earlier for the 013f0930 line of the
thread stack area.



Interestingly when I looked up POW in
mtsmsr use what I found reported:

Synchronization Required Prior: implementation dependent
Synchronization Required After: implementation dependent

The source code looks like:

static void
cpu_idle_60x(sbintime_t sbt)
{
        register_t msr;
        uint16_t vers;

        if (!powerpc_pow_enabled)
                return;

        msr = mfmsr();
        vers = mfpvr() >> 16;

#ifdef AIM
        switch (vers) {
        case IBM970:
        case IBM970FX:
        case IBM970MP:
        case MPC7447A:
        case MPC7448:
        case MPC7450:
        case MPC7455:
        case MPC7457:
                __asm __volatile("\
                            dssall; sync; mtmsr %0; isync"
                            :: "r"(msr | PSL_POW));
                break;
        default:
                powerpc_sync();
                mtmsr(msr | PSL_POW);
                isync();
                break;
        }
#endif
}

I've not yet figured out how to check
that the details are right here for
IBM970MP used via 32-bit FreeBSD.

===
Mark Millard
markmi at dsl-only.net
Received on Sat May 20 2017 - 07:01:59 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:11 UTC