Re: panic: sbuf_vprintf called with a NULL sbuf pointer

From: John Baldwin <jhb_at_freebsd.org>
Date: Tue, 08 Dec 2015 12:27:48 -0800
On Tuesday, December 08, 2015 11:39:56 AM Don Lewis wrote:
> On  8 Dec, John Baldwin wrote:
> > On Monday, December 07, 2015 10:10:51 PM Don Lewis wrote:
> >> On  2 Dec, John Baldwin wrote:
> >> > On Wednesday, December 02, 2015 01:25:56 PM Don Lewis wrote:
> >> >> > If you want to look at this further, try going to frame 16 and dissassembling the
> >> >> > instructions before the call to see if you can spot which register the first
> >> >> > parameter (saved in %rdi IIRC) comes from.
> >> >> 
> >> >> Dump of assembler code for function sbuf_printf:
> >> >>    0xffffffff80a673e0 <+0>:	push   %rbp
> >> >>    0xffffffff80a673e1 <+1>:	mov    %rsp,%rbp
> >> >>    0xffffffff80a673e4 <+4>:	push   %r14
> >> >>    0xffffffff80a673e6 <+6>:	push   %rbx
> >> >>    0xffffffff80a673e7 <+7>:	sub    $0x50,%rsp
> >> >>    0xffffffff80a673eb <+11>:	mov    %rsi,%r14
> >> >>    0xffffffff80a673ee <+14>:	mov    %rdi,%rbx
> >> >>    0xffffffff80a673f1 <+17>:	mov    %r9,-0x38(%rbp)
> >> >>    0xffffffff80a673f5 <+21>:	mov    %r8,-0x40(%rbp)
> >> >>    0xffffffff80a673f9 <+25>:	mov    %rcx,-0x48(%rbp)
> >> >>    0xffffffff80a673fd <+29>:	mov    %rdx,-0x50(%rbp)
> >> >>    0xffffffff80a67401 <+33>:	lea    -0x60(%rbp),%rax
> >> >>    0xffffffff80a67405 <+37>:	mov    %rax,-0x20(%rbp)
> >> >>    0xffffffff80a67409 <+41>:	lea    0x10(%rbp),%rax
> >> >>    0xffffffff80a6740d <+45>:	mov    %rax,-0x28(%rbp)
> >> >>    0xffffffff80a67411 <+49>:	movl   $0x30,-0x2c(%rbp)
> >> >>    0xffffffff80a67418 <+56>:	movl   $0x10,-0x30(%rbp)
> >> >>    0xffffffff80a6741f <+63>:	mov    $0xffffffff8137bdf8,%rdi
> >> >>    0xffffffff80a67426 <+70>:	mov    %rbx,%rsi
> >> >>    0xffffffff80a67429 <+73>:	callq  0xffffffff80a66c00 <_assert_sbuf_integrity>
> >> >> 
> >> >> 
> >> >>    0xffffffff80a237b9 <+825>:	jmpq   0xffffffff80a236fd <sigexit+637>
> >> >>    0xffffffff80a237be <+830>:	mov    $0xffffffff80fd8ad3,%rsi
> >> >>    0xffffffff80a237c5 <+837>:	xor    %eax,%eax
> >> >>    0xffffffff80a237c7 <+839>:	mov    %r12,%rdi
> >> >>    0xffffffff80a237ca <+842>:	mov    -0x228(%rbp),%rdx
> >> >>    0xffffffff80a237d1 <+849>:	callq  0xffffffff80a673e0 <sbuf_printf>
> >> >> => 0xffffffff80a237d6 <+854>:	inc    %r14d
> >> >>    0xffffffff80a237d9 <+857>:	jmpq   0xffffffff80a236fd <sigexit+637>
> >> > 
> >> > So maybe try 'p $r12' in the corefile_open() frame.
> >> 
> >> #17 0xffffffff80a237d6 in corefile_open (compress=0, comm=<optimized out>, 
> >>     uid=<optimized out>, pid=<optimized out>, td=<optimized out>, 
> >>     vpp=<optimized out>, namep=<optimized out>)
> >>     at /usr/src/sys/kern/kern_sig.c:3188
> >> 3188					sbuf_printf(&sb, "%s", comm);
> >> (kgdb) p $r12
> >> $1 = 0
> > 
> > So it's definitely zero. :(  The next step is probably to disassemble the
> > corefile_open function (ugh) and walk backwards to find where %r12 is read
> > from.  It might be from a local variable on the stack, so then you would
> > want to examine that memory in the stack and the surrounding memory to see
> > if there is memory corruption and perhaps if there is anything recognizable
> > about it (e.g. if the corruption contains some sort of data you recognize,
> > or if the corruption is bounded by a certain length, etc.).  It's a bit of
> > a shot in the dark though.
> > 
> > Is this reproducible?
> 
> No it's not.  The only time it happened was when there was a swap
> timeout, probably because of a lengthy deep recovery on one of the
> mirrored swap devices.
> 
> The code in question is:
>         struct sbuf sb;
> [snip]
>         (void)sbuf_new(&sb, name, MAXPATHLEN, SBUF_FIXEDLEN);
> [snip]
>         for (i = 0; format[i] != '\0'; i++) {
>                 switch (format[i]) {
>                 case '%':       /* Format character */
>                         i++;
>                         switch (format[i]) {
> [snip]
>                         case 'N':       /* process name */
>                                 sbuf_printf(&sb, "%s", comm);
>                                 break;
> 
> 
> &sb is used in a bunch of places, so the compiler probably computes its
> value once by adding the proper offset to the stack pointer and stashing
> the result in a register.  Since kern.corefile is "%N.core", the failure
> is happening on the first interation of the loop, so there isn't much
> opportunity for things to get corrupted.  Also, the control flow in this
> function only depends on the format, so there shouldn't be anything
> special about a swap timeout vs. a segfault generated core.

Yes, r12 is call-safe (IIRC), so I expect it only computes it once as well,
but I've sometimes seen the compiler spill local vars onto the stack due to
register pressure.  That said, I think it is unlikely it would have to spill
&sb during the early part of the function. :(

> How is gdb able to print the register contents for an arbitrary stack
> frame?  It's not like this is a SPARC with register windows.  Aren't
> only the final register values when the core dump was generated saved
> along with the memory contents?  Unless a register contents are pushed
> onto the stack, I would think that if a callee overwrites a register,
> its contents are gone.

My understanding is that when the compiler saves a register (e.g. on the
stack during a function prologue) it also emits DWARF records that
indicate where it saved each register.  As the stack unwinder walks back
up the stack it populates the current "register view" based on the debug
info at that frame (e.g. register X is stored at address Y, register Z
is gone (<optimized out> in gdb parlance)).

-- 
John Baldwin
Received on Wed Dec 09 2015 - 17:23:45 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:01 UTC