Re: panic: in_pcblookup_local (?)

From: Peter Wemm <peter_at_wemm.org> Date: Fri, 3 May 2013 11:16:32 -0700

On Thu, May 2, 2013 at 11:32 AM, John Baldwin <jhb_at_freebsd.org> wrote:
> On Thursday, May 02, 2013 1:53:47 pm Ian FREISLICH wrote:
>> John Baldwin wrote:
>> > On Thursday, May 02, 2013 7:25:08 am Robert N. M. Watson wrote:
>> > >
>> > > On 2 May 2013, at 11:42, Glen Barber wrote:
>> > >
>> > > > Hmm.  Perhaps it would be worthwhile for me to rebuild the current
>> > > > kernel with DDB support.  It looks like the machine has panicked a few
>> > > > times over the last two weeks or so, but based on the timestamps of the
>> > > > crash dumps and nagios complaints, happened during the middle of the
>> > > > night when I would not have really noticed, or otherwise would have just
>> > > > blamed my ISP.
>> > > >
>> > > > Two of the panics are ath(4) related.  One looks similar to the one
>> > > > referenced in this thread, similarly triggered by a CFEngine process.
>> > > >
>> > > > In that case, the backtrace looks like:
>> > > >
>> > > > #4 0xffffffff808cdbb3 at calltrap+0x8
>> > > > #5 0xffffffff807371d8 at in_pcb_lport+0x128
>> > > > #6 0xffffffff8073745a at in_pcbbind_setup+0x16a
>> > > > #7 0xffffffff80737d8e at in_pcbconnect_setup+0x71e
>> > > > #8 0xffffffff80737df9 at in_pcbconnect_mbuf+0x59
>> > > > #9 0xffffffff807bf29f at udp_connect+0x11f
>> > > > #10 0xffffffff80680615 at kern_connectat+0x275
>> > > >
>> > > > Regarding DDB though, it would be rather difficult to access the machine
>> > > > if it drops to a DDB debugger session, since the machine acts as my
>> > > > firewall.
>> > >
>> > > Thanks -- will take a look at the attached.
>> > >
>> > > FWIW, though, I'm worried by the number of panics you are seeing,
>> > > especially given that they involve multiple subsystems, and in
>> > > particular, John's observation about a potentially corrupted pointer.
>> > > This makes me wonder whether (a) you are experiencing hardware faults
>> > > -- it would be worth running some memory/cpu/etc tests and (b) if we
>> > > might be seeing a software memory corruption bug of some sort.
>> >
>> > Other users have reported this (Ian Lepore), and Peter Wemm can now reproduce
>> > these at will as well, so I think this is a software bug.  What might be
>> > easiest if we can't figure this out from the crashdump is just to bisect the
>> > offending revision.
>>
>> I've started a binary search.  I'll let you know what that turns up.
>
> Thanks, and sorry for getting my Ian's mixed up. :-/
>
> --
> John Baldwin

I forgot to roll back one of the routers at nyi.freebsd.org and it
panicked again, the same way as before:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 3; apic id = 03
instruction pointer     = 0x20:0xffffffff8067284c
stack pointer           = 0x28:0xffffff8098688760
frame pointer           = 0x28:0xffffff80986887a0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 15041 (svn)
[ thread pid 15041 tid 100208 ]
Stopped at      in_pcblookup_local+0x5c:        cmpw    %r12w,0x18(%rax)

#8  0xffffffff80829dff in calltrap () at ../../../amd64/amd64/exception.S:228
#9  0xffffffff8067284c in in_pcblookup_local (pcbinfo=0xffffffff80c9e180, laddr=
      {s_addr = 708980576}, lport=607, lookupflags=1, cred=0xfffffe006956d700)
    at ../../../netinet/in_pcb.c:1438
#10 0xffffffff80672d38 in in_pcb_lport (inp=0xfffffe00098aa620,
    laddrp=0xffffff809845d860, lportp=0xffffff809845d86e,
    cred=0xfffffe006956d700,
    lookupflags=1) at ../../../netinet/in_pcb.c:457
#11 0xffffffff80672fba in in_pcbbind_setup (inp=0xfffffe00098aa620, nam=0x0,
    laddrp=0xffffff809845d900, lportp=0xffffff809845d90e,
    cred=0xfffffe006956d700)
    at ../../../netinet/in_pcb.c:615
#12 0xffffffff806738ee in in_pcbconnect_setup (inp=0xfffffe00098aa620,
    nam=<value optimized out>, laddrp=0xffffff809845d9b8,
    lportp=0xffffff809845d9be,
    faddrp=0xffffff809845d9b4, fportp=0xffffff809845d9bc, oinpp=0x0,
    cred=0xfffffe006956d700) at ../../../netinet/in_pcb.c:1019
#13 0xffffffff80673959 in in_pcbconnect_mbuf (inp=0xfffffe00098aa620,
    nam=<value optimized out>, cred=<value optimized out>, m=0x0)
    at ../../../netinet/in_pcb.c:645
#14 0xffffffff806fafcf in udp_connect (so=0xfffffe002e150d48,
    nam=0xfffffe00264df3b0,
    td=0xfffffe00091df490) at ../../../netinet/udp_usrreq.c:1530
#15 0xffffffff805faea5 in kern_connectat (td=0xfffffe00091df490, dirfd=-100,
    fd=<value optimized out>, sa=0xfffffe00264df3b0)
    at ../../../kern/uipc_syscalls.c:593
#16 0xffffffff805fafc1 in sys_connect (td=0xfffffe00091df490,
    uap=0xffffff809845db70)
    at ../../../kern/uipc_syscalls.c:559
#17 0xffffffff8083f571 in amd64_syscall (td=0xfffffe00091df490, traced=0)
    at subr_syscall.c:134

Two separate machines have hit this exact panic / trace, at least
twice each, always while doing an 'svn update'.

Rolling back to r249172 (April 5th) solves it.  (There's nothing
particular about that rev, except that it was top-of-tree when the
last update was done.)

I see a number of locking changes in the area.  Note that this is UDP,
most likely a dns lookup.
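
For anyone reading the laddr and lport arguments in frame #9, here is a
rough sketch of how they might be decoded.  It assumes gdb printed the
raw in-memory integers on a little-endian amd64 host and that both
fields are stored in network byte order; neither assumption is stated
in the trace itself, and the helper names are made up for illustration.

```python
import socket
import struct


def decode_laddr(s_addr: int) -> str:
    """Turn the integer gdb shows for in_addr.s_addr into a dotted quad.

    Assumption: the value was read as a host-order uint32 on a
    little-endian machine, so repacking it little-endian recovers the
    original network-order bytes.
    """
    return socket.inet_ntoa(struct.pack("<I", s_addr))


def decode_lport(lport: int) -> int:
    """Byte-swap a 16-bit port that gdb printed in little-endian host order.

    Assumption: the port is kept in network byte order in the kernel.
    """
    return struct.unpack(">H", struct.pack("<H", lport))[0]


if __name__ == "__main__":
    # Values copied from frame #9 of the backtrace.
    print(decode_laddr(708980576))
    print(decode_lport(607))
```

Under those assumptions the local address comes out as 96.47.66.42 and
the local port as 24322, i.e. a port in the ephemeral range rather than
literally port 607.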

-- 
Peter Wemm - peter_at_wemm.org; peter_at_FreeBSD.org; peter_at_yahoo-inc.com; KI6FJV