Re: panic: sbuf_vprintf called with a NULL sbuf pointer

From: Don Lewis <truckman_at_FreeBSD.org> Date: Tue, 8 Dec 2015 11:39:56 -0800 (PST) · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:01 UTC

On  8 Dec, John Baldwin wrote:
> On Monday, December 07, 2015 10:10:51 PM Don Lewis wrote:
>> On  2 Dec, John Baldwin wrote:
>> > On Wednesday, December 02, 2015 01:25:56 PM Don Lewis wrote:
>> >> > If you want to look at this further, try going to frame 16 and dissassembling the
>> >> > instructions before the call to see if you can spot which register the first
>> >> > parameter (saved in %rdi IIRC) comes from.
>> >> 
>> >> Dump of assembler code for function sbuf_printf:
>> >>    0xffffffff80a673e0 <+0>:	push   %rbp
>> >>    0xffffffff80a673e1 <+1>:	mov    %rsp,%rbp
>> >>    0xffffffff80a673e4 <+4>:	push   %r14
>> >>    0xffffffff80a673e6 <+6>:	push   %rbx
>> >>    0xffffffff80a673e7 <+7>:	sub    $0x50,%rsp
>> >>    0xffffffff80a673eb <+11>:	mov    %rsi,%r14
>> >>    0xffffffff80a673ee <+14>:	mov    %rdi,%rbx
>> >>    0xffffffff80a673f1 <+17>:	mov    %r9,-0x38(%rbp)
>> >>    0xffffffff80a673f5 <+21>:	mov    %r8,-0x40(%rbp)
>> >>    0xffffffff80a673f9 <+25>:	mov    %rcx,-0x48(%rbp)
>> >>    0xffffffff80a673fd <+29>:	mov    %rdx,-0x50(%rbp)
>> >>    0xffffffff80a67401 <+33>:	lea    -0x60(%rbp),%rax
>> >>    0xffffffff80a67405 <+37>:	mov    %rax,-0x20(%rbp)
>> >>    0xffffffff80a67409 <+41>:	lea    0x10(%rbp),%rax
>> >>    0xffffffff80a6740d <+45>:	mov    %rax,-0x28(%rbp)
>> >>    0xffffffff80a67411 <+49>:	movl   $0x30,-0x2c(%rbp)
>> >>    0xffffffff80a67418 <+56>:	movl   $0x10,-0x30(%rbp)
>> >>    0xffffffff80a6741f <+63>:	mov    $0xffffffff8137bdf8,%rdi
>> >>    0xffffffff80a67426 <+70>:	mov    %rbx,%rsi
>> >>    0xffffffff80a67429 <+73>:	callq  0xffffffff80a66c00 <_assert_sbuf_integrity>
>> >> 
>> >> 
>> >>    0xffffffff80a237b9 <+825>:	jmpq   0xffffffff80a236fd <sigexit+637>
>> >>    0xffffffff80a237be <+830>:	mov    $0xffffffff80fd8ad3,%rsi
>> >>    0xffffffff80a237c5 <+837>:	xor    %eax,%eax
>> >>    0xffffffff80a237c7 <+839>:	mov    %r12,%rdi
>> >>    0xffffffff80a237ca <+842>:	mov    -0x228(%rbp),%rdx
>> >>    0xffffffff80a237d1 <+849>:	callq  0xffffffff80a673e0 <sbuf_printf>
>> >> => 0xffffffff80a237d6 <+854>:	inc    %r14d
>> >>    0xffffffff80a237d9 <+857>:	jmpq   0xffffffff80a236fd <sigexit+637>
>> > 
>> > So maybe try 'p $r12' in the corefile_open() frame.
>> 
>> #17 0xffffffff80a237d6 in corefile_open (compress=0, comm=<optimized out>, 
>>     uid=<optimized out>, pid=<optimized out>, td=<optimized out>, 
>>     vpp=<optimized out>, namep=<optimized out>)
>>     at /usr/src/sys/kern/kern_sig.c:3188
>> 3188					sbuf_printf(&sb, "%s", comm);
>> (kgdb) p $r12
>> $1 = 0
> 
> So it's definitely zero. :(  The next step is probably to disassemble the
> corefile_open function (ugh) and walk backwards to find where %r12 is read
> from.  It might be from a local variable on the stack, so then you would
> want to examine that memory in the stack and the surrounding memory to see
> if there is memory corruption and perhaps if there is anything recognizable
> about it (e.g. if the corruption contains some sort of data you recognize,
> or if the corruption is bounded by a certain length, etc.).  It's a bit of
> a shot in the dark though.
> 
> Is this reproducible?

No it's not.  The only time it happened was when there was a swap
timeout, probably because of a lengthy deep recovery on one of the
mirrored swap devices.

The code in question is:
        struct sbuf sb;
[snip]
        (void)sbuf_new(&sb, name, MAXPATHLEN, SBUF_FIXEDLEN);
[snip]
        for (i = 0; format[i] != '\0'; i++) {
                switch (format[i]) {
                case '%':       /* Format character */
                        i++;
                        switch (format[i]) {
[snip]
                        case 'N':       /* process name */
                                sbuf_printf(&sb, "%s", comm);
                                break;

&sb is used in a bunch of places, so the compiler probably computes its
value once by adding the proper offset to the stack pointer and stashing
the result in a register.  Since kern.corefile is "%N.core", the failure
is happening on the first interation of the loop, so there isn't much
opportunity for things to get corrupted.  Also, the control flow in this
function only depends on the format, so there shouldn't be anything
special about a swap timeout vs. a segfault generated core.

How is gdb able to print the register contents for an arbitrary stack
frame?  It's not like this is a SPARC with register windows.  Aren't
only the final register values when the core dump was generated saved
along with the memory contents?  Unless a register contents are pushed
onto the stack, I would think that if a callee overwrites a register,
its contents are gone.