On Thu, May 2, 2013 at 11:32 AM, John Baldwin <jhb_at_freebsd.org> wrote:
> On Thursday, May 02, 2013 1:53:47 pm Ian FREISLICH wrote:
>> John Baldwin wrote:
>> > On Thursday, May 02, 2013 7:25:08 am Robert N. M. Watson wrote:
>> > >
>> > > On 2 May 2013, at 11:42, Glen Barber wrote:
>> > >
>> > > > Hmm. Perhaps it would be worthwhile for me to rebuild the current
>> > > > kernel with DDB support. It looks like the machine has panicked a few
>> > > > times over the last two weeks or so, but based on the timestamps of the
>> > > > crash dumps and nagios complaints, happened during the middle of the
>> > > > night when I would not have really noticed, or otherwise would have just
>> > > > blamed my ISP.
>> > > >
>> > > > Two of the panics are ath(4) related. One looks similar to the one
>> > > > referenced in this thread, similarly triggered by a CFEngine process.
>> > > >
>> > > > In that case, the backtrace looks like:
>> > > >
>> > > > #4 0xffffffff808cdbb3 at calltrap+0x8
>> > > > #5 0xffffffff807371d8 at in_pcb_lport+0x128
>> > > > #6 0xffffffff8073745a at in_pcbbind_setup+0x16a
>> > > > #7 0xffffffff80737d8e at in_pcbconnect_setup+0x71e
>> > > > #8 0xffffffff80737df9 at in_pcbconnect_mbuf+0x59
>> > > > #9 0xffffffff807bf29f at udp_connect+0x11f
>> > > > #10 0xffffffff80680615 at kern_connectat+0x275
>> > > >
>> > > > Regarding DDB though, it would be rather difficult to access the machine
>> > > > if it drops to a DDB debugger session, since the machine acts as my
>> > > > firewall.
>> > >
>> > > Thanks -- will take a look at the attached.
>> > >
>> > > FWIW, though, I'm worried by the number of panics you are seeing,
>> > > especially given that they involve multiple subsystems, and in
>> > > particular, John's observation about a potentially corrupted pointer.
>> > > This makes me wonder whether (a) you are experiencing hardware
>> > > faults -- it would be worth running some memory/cpu/etc tests and
>> > > (b) if we might be seeing a software memory corruption bug of some
>> > > sort.
>> >
>> > Other users have reported this (Ian Lepore), and Peter Wemm can now reproduce
>> > these at will as well, so I think this is a software bug. What might be
>> > easiest if we can't figure this out from the crashdump is just to bisect the
>> > offending revision.
>>
>> I've started a binary search. I'll let you know what that turns up.
>
> Thanks, and sorry for getting my Ian's mixed up. :-/
>
> --
> John Baldwin

I forgot to roll back one of the routers at nyi.freebsd.org and it
panicked again, the same way as before:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 3; apic id = 03
instruction pointer     = 0x20:0xffffffff8067284c
stack pointer           = 0x28:0xffffff8098688760
frame pointer           = 0x28:0xffffff80986887a0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 15041 (svn)
[ thread pid 15041 tid 100208 ]
Stopped at      in_pcblookup_local+0x5c:        cmpw    %r12w,0x18(%rax)

#8  0xffffffff80829dff in calltrap ()
    at ../../../amd64/amd64/exception.S:228
#9  0xffffffff8067284c in in_pcblookup_local (pcbinfo=0xffffffff80c9e180,
    laddr={s_addr = 708980576}, lport=607, lookupflags=1,
    cred=0xfffffe006956d700) at ../../../netinet/in_pcb.c:1438
#10 0xffffffff80672d38 in in_pcb_lport (inp=0xfffffe00098aa620,
    laddrp=0xffffff809845d860, lportp=0xffffff809845d86e,
    cred=0xfffffe006956d700, lookupflags=1)
    at ../../../netinet/in_pcb.c:457
#11 0xffffffff80672fba in in_pcbbind_setup (inp=0xfffffe00098aa620,
    nam=0x0, laddrp=0xffffff809845d900, lportp=0xffffff809845d90e,
    cred=0xfffffe006956d700) at ../../../netinet/in_pcb.c:615
#12 0xffffffff806738ee in in_pcbconnect_setup (inp=0xfffffe00098aa620,
    nam=<value optimized out>, laddrp=0xffffff809845d9b8,
    lportp=0xffffff809845d9be, faddrp=0xffffff809845d9b4,
    fportp=0xffffff809845d9bc, oinpp=0x0, cred=0xfffffe006956d700)
    at ../../../netinet/in_pcb.c:1019
#13 0xffffffff80673959 in in_pcbconnect_mbuf (inp=0xfffffe00098aa620,
    nam=<value optimized out>, cred=<value optimized out>, m=0x0)
    at ../../../netinet/in_pcb.c:645
#14 0xffffffff806fafcf in udp_connect (so=0xfffffe002e150d48,
    nam=0xfffffe00264df3b0, td=0xfffffe00091df490)
    at ../../../netinet/udp_usrreq.c:1530
#15 0xffffffff805faea5 in kern_connectat (td=0xfffffe00091df490,
    dirfd=-100, fd=<value optimized out>, sa=0xfffffe00264df3b0)
    at ../../../kern/uipc_syscalls.c:593
#16 0xffffffff805fafc1 in sys_connect (td=0xfffffe00091df490,
    uap=0xffffff809845db70) at ../../../kern/uipc_syscalls.c:559
#17 0xffffffff8083f571 in amd64_syscall (td=0xfffffe00091df490, traced=0)
    at subr_syscall.c:134

Two separate machines have now hit this exact panic and trace, at least
twice each, always while doing an 'svn update'.

Rolling back to April 5th (r249172) solves it. (There's nothing
particular about that rev, except it was top-of-tree when the last
update was done.)

I see a number of locking changes in the area. Note that this is UDP,
most likely a DNS lookup.

-- 
Peter Wemm - peter_at_wemm.org; peter_at_FreeBSD.org; peter_at_yahoo-inc.com; KI6FJV

Received on Fri May 03 2013 - 16:16:33 UTC
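
For reference, the syscall path in the trace (sys_connect, then udp_connect, then in_pcbbind_setup with nam=0x0, then in_pcb_lport) is what one would expect from a connect(2) on a UDP socket that was never bound, where the kernel has to pick a local port itself. The following is a minimal userland sketch of that call sequence. It is an illustration only and not a confirmed reproducer of the panic; the helper name udp_connect_local_port and its arguments are hypothetical and do not come from the thread.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): connect an unbound UDP
 * socket to addr:port and return the local port the kernel picked,
 * or -1 on any error.
 */
int
udp_connect_local_port(const char *addr, unsigned short port)
{
	struct sockaddr_in dst, local;
	socklen_t len = sizeof(local);
	int s, result = -1;

	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	dst.sin_port = htons(port);
	if (inet_pton(AF_INET, addr, &dst.sin_addr) != 1)
		return (-1);

	s = socket(AF_INET, SOCK_DGRAM, 0);
	if (s == -1)
		return (-1);

	/* No bind(2) first, so connect(2) must choose the local port. */
	if (connect(s, (struct sockaddr *)&dst, sizeof(dst)) == 0 &&
	    getsockname(s, (struct sockaddr *)&local, &len) == 0)
		result = ntohs(local.sin_port);

	close(s);
	return (result);
}
```

Calling udp_connect_local_port("127.0.0.1", 53) returns whichever local port the kernel assigned. A resolver doing a DNS lookup over a connected UDP socket would go through the same connect(2) step, which fits the "most likely a DNS lookup" reading above.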