Re: panic: sbuf_vprintf called with a NULL sbuf pointer

From: John Baldwin <jhb_at_freebsd.org> Date: Wed, 20 Jan 2016 17:14:56 -0800 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:02 UTC

On Wednesday, January 20, 2016 03:15:06 PM Don Lewis wrote:
> On  8 Dec, John Baldwin wrote:
> > On Tuesday, December 08, 2015 11:39:56 AM Don Lewis wrote:
> >> On  8 Dec, John Baldwin wrote:
> >> > On Monday, December 07, 2015 10:10:51 PM Don Lewis wrote:
> >> >> On  2 Dec, John Baldwin wrote:
> >> >> > On Wednesday, December 02, 2015 01:25:56 PM Don Lewis wrote:
> >> >> >> > If you want to look at this further, try going to frame 16 and dissassembling the
> >> >> >> > instructions before the call to see if you can spot which register the first
> >> >> >> > parameter (saved in %rdi IIRC) comes from.
> >> >> >> 
> >> >> >> Dump of assembler code for function sbuf_printf:
> >> >> >>    0xffffffff80a673e0 <+0>:	push   %rbp
> >> >> >>    0xffffffff80a673e1 <+1>:	mov    %rsp,%rbp
> >> >> >>    0xffffffff80a673e4 <+4>:	push   %r14
> >> >> >>    0xffffffff80a673e6 <+6>:	push   %rbx
> >> >> >>    0xffffffff80a673e7 <+7>:	sub    $0x50,%rsp
> >> >> >>    0xffffffff80a673eb <+11>:	mov    %rsi,%r14
> >> >> >>    0xffffffff80a673ee <+14>:	mov    %rdi,%rbx
> >> >> >>    0xffffffff80a673f1 <+17>:	mov    %r9,-0x38(%rbp)
> >> >> >>    0xffffffff80a673f5 <+21>:	mov    %r8,-0x40(%rbp)
> >> >> >>    0xffffffff80a673f9 <+25>:	mov    %rcx,-0x48(%rbp)
> >> >> >>    0xffffffff80a673fd <+29>:	mov    %rdx,-0x50(%rbp)
> >> >> >>    0xffffffff80a67401 <+33>:	lea    -0x60(%rbp),%rax
> >> >> >>    0xffffffff80a67405 <+37>:	mov    %rax,-0x20(%rbp)
> >> >> >>    0xffffffff80a67409 <+41>:	lea    0x10(%rbp),%rax
> >> >> >>    0xffffffff80a6740d <+45>:	mov    %rax,-0x28(%rbp)
> >> >> >>    0xffffffff80a67411 <+49>:	movl   $0x30,-0x2c(%rbp)
> >> >> >>    0xffffffff80a67418 <+56>:	movl   $0x10,-0x30(%rbp)
> >> >> >>    0xffffffff80a6741f <+63>:	mov    $0xffffffff8137bdf8,%rdi
> >> >> >>    0xffffffff80a67426 <+70>:	mov    %rbx,%rsi
> >> >> >>    0xffffffff80a67429 <+73>:	callq  0xffffffff80a66c00 <_assert_sbuf_integrity>
> >> >> >> 
> >> >> >> 
> >> >> >>    0xffffffff80a237b9 <+825>:	jmpq   0xffffffff80a236fd <sigexit+637>
> >> >> >>    0xffffffff80a237be <+830>:	mov    $0xffffffff80fd8ad3,%rsi
> >> >> >>    0xffffffff80a237c5 <+837>:	xor    %eax,%eax
> >> >> >>    0xffffffff80a237c7 <+839>:	mov    %r12,%rdi
> >> >> >>    0xffffffff80a237ca <+842>:	mov    -0x228(%rbp),%rdx
> >> >> >>    0xffffffff80a237d1 <+849>:	callq  0xffffffff80a673e0 <sbuf_printf>
> >> >> >> => 0xffffffff80a237d6 <+854>:	inc    %r14d
> >> >> >>    0xffffffff80a237d9 <+857>:	jmpq   0xffffffff80a236fd <sigexit+637>
> >> >> > 
> >> >> > So maybe try 'p $r12' in the corefile_open() frame.
> >> >> 
> >> >> #17 0xffffffff80a237d6 in corefile_open (compress=0, comm=<optimized out>, 
> >> >>     uid=<optimized out>, pid=<optimized out>, td=<optimized out>, 
> >> >>     vpp=<optimized out>, namep=<optimized out>)
> >> >>     at /usr/src/sys/kern/kern_sig.c:3188
> >> >> 3188					sbuf_printf(&sb, "%s", comm);
> >> >> (kgdb) p $r12
> >> >> $1 = 0
> >> > 
> >> > So it's definitely zero. :(  The next step is probably to disassemble the
> >> > corefile_open function (ugh) and walk backwards to find where %r12 is read
> >> > from.  It might be from a local variable on the stack, so then you would
> >> > want to examine that memory in the stack and the surrounding memory to see
> >> > if there is memory corruption and perhaps if there is anything recognizable
> >> > about it (e.g. if the corruption contains some sort of data you recognize,
> >> > or if the corruption is bounded by a certain length, etc.).  It's a bit of
> >> > a shot in the dark though.
> >> > 
> >> > Is this reproducible?
> >> 
> >> No it's not.  The only time it happened was when there was a swap
> >> timeout, probably because of a lengthy deep recovery on one of the
> >> mirrored swap devices.
> >> 
> >> The code in question is:
> >>         struct sbuf sb;
> >> [snip]
> >>         (void)sbuf_new(&sb, name, MAXPATHLEN, SBUF_FIXEDLEN);
> >> [snip]
> >>         for (i = 0; format[i] != '\0'; i++) {
> >>                 switch (format[i]) {
> >>                 case '%':       /* Format character */
> >>                         i++;
> >>                         switch (format[i]) {
> >> [snip]
> >>                         case 'N':       /* process name */
> >>                                 sbuf_printf(&sb, "%s", comm);
> >>                                 break;
> >> 
> >> 
> >> &sb is used in a bunch of places, so the compiler probably computes its
> >> value once by adding the proper offset to the stack pointer and stashing
> >> the result in a register.  Since kern.corefile is "%N.core", the failure
> >> is happening on the first interation of the loop, so there isn't much
> >> opportunity for things to get corrupted.  Also, the control flow in this
> >> function only depends on the format, so there shouldn't be anything
> >> special about a swap timeout vs. a segfault generated core.
> > 
> > Yes, r12 is call-safe (IIRC), so I expect it only computes it once as well,
> > but I've sometimes seen the compiler spill local vars onto the stack due to
> > register pressure.  That said, I think it is unlikely it would have to spill
> > &sb during the early part of the function. :(
> 
> I had some more time to look at this.  It's a bit messy because
> corefile_open() and coredump() both get inlined into sig_exit().  Here's
> the section of code from the malloc() call and sbuf_new() to the failing
> sbuf_printf() call:
> 
>    0xffffffff80a23685 <+517>:   callq  0xffffffff80a00680 <malloc>
>    0xffffffff80a2368a <+522>:   mov    %rax,-0x238(%rbp)
>    0xffffffff80a23691 <+529>:   lea    -0x138(%rbp),%r12
>   ....
>    0xffffffff80a237b9 <+825>:   jmpq   0xffffffff80a236fd <sigexit+637>
>    0xffffffff80a237be <+830>:   mov    $0xffffffff80fd8ad3,%rsi
>    0xffffffff80a237c5 <+837>:   xor    %eax,%eax
>    0xffffffff80a237c7 <+839>:   mov    %r12,%rdi
>    0xffffffff80a237ca <+842>:   mov    -0x228(%rbp),%rdx
>    0xffffffff80a237d1 <+849>:   callq  0xffffffff80a673e0 <sbuf_printf>
> => 0xffffffff80a237d6 <+854>:   inc    %r14d
>    0xffffffff80a237d9 <+857>:   jmpq   0xffffffff80a236fd <sigexit+637>
>    0xffffffff80a237de <+862>:   mov    $0xffffffff81361469,%rsi
> 
> Basically it's:
> 	lea    -0x138(%rbp),%r12
> 	[ snip ]
> 	mov     %r12,%rdi
> 	callq	sbuf_printf

Yeah, at this point I can only think of bizarre-o things like a trap
clobbering %r12 in the saved trapframe or the like.  I can't imagine why
that would happen though.  If we had a bug that clobbered saved registers
in the trapframe you would expect it to happen far more often.  OTOH, if
some code had a stale pointer that was later reused by a kthread's stack
and the thread was interrupted (e.g. by a device interrupt) while in
this routine, _and_ the stale pointer was indirected and trashed the
saved %r12, then that would explain this.  I just feel like I have better
odds of seeing a unicorn in my driveway tonight. :-/

-- 
John Baldwin