On Wednesday, January 20, 2016 03:15:06 PM Don Lewis wrote: > On 8 Dec, John Baldwin wrote: > > On Tuesday, December 08, 2015 11:39:56 AM Don Lewis wrote: > >> On 8 Dec, John Baldwin wrote: > >> > On Monday, December 07, 2015 10:10:51 PM Don Lewis wrote: > >> >> On 2 Dec, John Baldwin wrote: > >> >> > On Wednesday, December 02, 2015 01:25:56 PM Don Lewis wrote: > >> >> >> > If you want to look at this further, try going to frame 16 and dissassembling the > >> >> >> > instructions before the call to see if you can spot which register the first > >> >> >> > parameter (saved in %rdi IIRC) comes from. > >> >> >> > >> >> >> Dump of assembler code for function sbuf_printf: > >> >> >> 0xffffffff80a673e0 <+0>: push %rbp > >> >> >> 0xffffffff80a673e1 <+1>: mov %rsp,%rbp > >> >> >> 0xffffffff80a673e4 <+4>: push %r14 > >> >> >> 0xffffffff80a673e6 <+6>: push %rbx > >> >> >> 0xffffffff80a673e7 <+7>: sub $0x50,%rsp > >> >> >> 0xffffffff80a673eb <+11>: mov %rsi,%r14 > >> >> >> 0xffffffff80a673ee <+14>: mov %rdi,%rbx > >> >> >> 0xffffffff80a673f1 <+17>: mov %r9,-0x38(%rbp) > >> >> >> 0xffffffff80a673f5 <+21>: mov %r8,-0x40(%rbp) > >> >> >> 0xffffffff80a673f9 <+25>: mov %rcx,-0x48(%rbp) > >> >> >> 0xffffffff80a673fd <+29>: mov %rdx,-0x50(%rbp) > >> >> >> 0xffffffff80a67401 <+33>: lea -0x60(%rbp),%rax > >> >> >> 0xffffffff80a67405 <+37>: mov %rax,-0x20(%rbp) > >> >> >> 0xffffffff80a67409 <+41>: lea 0x10(%rbp),%rax > >> >> >> 0xffffffff80a6740d <+45>: mov %rax,-0x28(%rbp) > >> >> >> 0xffffffff80a67411 <+49>: movl $0x30,-0x2c(%rbp) > >> >> >> 0xffffffff80a67418 <+56>: movl $0x10,-0x30(%rbp) > >> >> >> 0xffffffff80a6741f <+63>: mov $0xffffffff8137bdf8,%rdi > >> >> >> 0xffffffff80a67426 <+70>: mov %rbx,%rsi > >> >> >> 0xffffffff80a67429 <+73>: callq 0xffffffff80a66c00 <_assert_sbuf_integrity> > >> >> >> > >> >> >> > >> >> >> 0xffffffff80a237b9 <+825>: jmpq 0xffffffff80a236fd <sigexit+637> > >> >> >> 0xffffffff80a237be <+830>: mov $0xffffffff80fd8ad3,%rsi > >> >> >> 0xffffffff80a237c5 <+837>: xor %eax,%eax > >> >> >> 0xffffffff80a237c7 <+839>: mov %r12,%rdi > >> >> >> 0xffffffff80a237ca <+842>: mov -0x228(%rbp),%rdx > >> >> >> 0xffffffff80a237d1 <+849>: callq 0xffffffff80a673e0 <sbuf_printf> > >> >> >> => 0xffffffff80a237d6 <+854>: inc %r14d > >> >> >> 0xffffffff80a237d9 <+857>: jmpq 0xffffffff80a236fd <sigexit+637> > >> >> > > >> >> > So maybe try 'p $r12' in the corefile_open() frame. > >> >> > >> >> #17 0xffffffff80a237d6 in corefile_open (compress=0, comm=<optimized out>, > >> >> uid=<optimized out>, pid=<optimized out>, td=<optimized out>, > >> >> vpp=<optimized out>, namep=<optimized out>) > >> >> at /usr/src/sys/kern/kern_sig.c:3188 > >> >> 3188 sbuf_printf(&sb, "%s", comm); > >> >> (kgdb) p $r12 > >> >> $1 = 0 > >> > > >> > So it's definitely zero. :( The next step is probably to disassemble the > >> > corefile_open function (ugh) and walk backwards to find where %r12 is read > >> > from. It might be from a local variable on the stack, so then you would > >> > want to examine that memory in the stack and the surrounding memory to see > >> > if there is memory corruption and perhaps if there is anything recognizable > >> > about it (e.g. if the corruption contains some sort of data you recognize, > >> > or if the corruption is bounded by a certain length, etc.). It's a bit of > >> > a shot in the dark though. > >> > > >> > Is this reproducible? > >> > >> No it's not. The only time it happened was when there was a swap > >> timeout, probably because of a lengthy deep recovery on one of the > >> mirrored swap devices. > >> > >> The code in question is: > >> struct sbuf sb; > >> [snip] > >> (void)sbuf_new(&sb, name, MAXPATHLEN, SBUF_FIXEDLEN); > >> [snip] > >> for (i = 0; format[i] != '\0'; i++) { > >> switch (format[i]) { > >> case '%': /* Format character */ > >> i++; > >> switch (format[i]) { > >> [snip] > >> case 'N': /* process name */ > >> sbuf_printf(&sb, "%s", comm); > >> break; > >> > >> > >> &sb is used in a bunch of places, so the compiler probably computes its > >> value once by adding the proper offset to the stack pointer and stashing > >> the result in a register. Since kern.corefile is "%N.core", the failure > >> is happening on the first interation of the loop, so there isn't much > >> opportunity for things to get corrupted. Also, the control flow in this > >> function only depends on the format, so there shouldn't be anything > >> special about a swap timeout vs. a segfault generated core. > > > > Yes, r12 is call-safe (IIRC), so I expect it only computes it once as well, > > but I've sometimes seen the compiler spill local vars onto the stack due to > > register pressure. That said, I think it is unlikely it would have to spill > > &sb during the early part of the function. :( > > I had some more time to look at this. It's a bit messy because > corefile_open() and coredump() both get inlined into sig_exit(). Here's > the section of code from the malloc() call and sbuf_new() to the failing > sbuf_printf() call: > > 0xffffffff80a23685 <+517>: callq 0xffffffff80a00680 <malloc> > 0xffffffff80a2368a <+522>: mov %rax,-0x238(%rbp) > 0xffffffff80a23691 <+529>: lea -0x138(%rbp),%r12 > .... > 0xffffffff80a237b9 <+825>: jmpq 0xffffffff80a236fd <sigexit+637> > 0xffffffff80a237be <+830>: mov $0xffffffff80fd8ad3,%rsi > 0xffffffff80a237c5 <+837>: xor %eax,%eax > 0xffffffff80a237c7 <+839>: mov %r12,%rdi > 0xffffffff80a237ca <+842>: mov -0x228(%rbp),%rdx > 0xffffffff80a237d1 <+849>: callq 0xffffffff80a673e0 <sbuf_printf> > => 0xffffffff80a237d6 <+854>: inc %r14d > 0xffffffff80a237d9 <+857>: jmpq 0xffffffff80a236fd <sigexit+637> > 0xffffffff80a237de <+862>: mov $0xffffffff81361469,%rsi > > Basically it's: > lea -0x138(%rbp),%r12 > [ snip ] > mov %r12,%rdi > callq sbuf_printf Yeah, at this point I can only think of bizarre-o things like a trap clobbering %r12 in the saved trapframe or the like. I can't imagine why that would happen though. If we had a bug that clobbered saved registers in the trapframe you would expect it to happen far more often. OTOH, if some code had a stale pointer that was later reused by a kthread's stack and the thread was interrupted (e.g. by a device interrupt) while in this routine, _and_ the stale pointer was indirected and trashed the saved %r12, then that would explain this. I just feel like I have better odds of seeing a unicorn in my driveway tonight. :-/ -- John BaldwinReceived on Thu Jan 21 2016 - 00:15:36 UTC
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:02 UTC