Lawrence Stewart <lstewart_at_freebsd.org> wrote: > On 06/20/10 21:15, Fabian Keil wrote: > > Lawrence Stewart<lstewart_at_freebsd.org> wrote: > > > >> On 06/20/10 03:58, Fabian Keil wrote: > >>> Lawrence Stewart<lstewart_at_freebsd.org> wrote: > >>> > >>>> On 06/13/10 18:12, Lawrence Stewart wrote: > >>> > >>>>> The time has come to solicit some external testing for my SIFTR tool. > >>>>> I'm hoping to commit it within a week or so unless problems are discovered. > >>> > >>>>> I'm interested in all feedback and reports of success/failure, along > >>>>> with details of the architecture tested and number of CPUs if you would > >>>>> be so kind. > >>> > >>> I got the following hand-transcribed panic maybe a second after > >>> sysctl net.inet.siftr.enabled=1 > >>> > >>> Fatal trap 12: page fault while in kernel mode > >>> cpuid = 1; apic id = 01 > >>> [...] > >>> current process = 12 (swi4: clock) > >>> [ thread pid 12 tid 100006 ] > >>> Stopped at siftr_chkpkt+0xd0: addq $0x1,0x8(%r14) > >>> db> where > >>> Tracing pid 12 tid 100006 td 0xffffff00034037e0 > >>> siftr_chkpt() at siftr_chkpkt+0xd0 > >>> pfil_run_hooks() at pfil_run_hooks+0xb4 > >>> ip_output() at ip_output+0x382 > >>> tcp_output() tcp_output+0xa41 > >>> tcp_timer_rexmt() at tcp_timer_rexmt+0x251 > >>> softclock() at softclock+0x291 > >>> intr_event_execute_handlers() at intr_event_execute_handlers+0x66 > >>> ithread_loop at ithread_loop+0x8e > >>> fork_exit() at fork_exit+0x112 > >>> fork_trampoline() at fork_trampoline+0xe > >>> --- trap 0, rip = 0, rsp = 0xffffff800003ad30, rbp = 0 --- > >> > >> So I've tracked down the line of code where the page fault is occurring: > >> > >> if (dir == PFIL_IN) > >> ss->n_in++; > >> else > >> ss->n_out++; > >> > >> ss is a DPCPU (dynamic per-cpu) variable used to keep a set of stats > >> per-cpu and is initialised at the start of the function like so: > >> > >> ss = DPCPU_PTR(ss); > >> > >> So for ss to be NULL, that implies DPCPU_PTR() is returning NULL on your > >> machine. I know very little about the inner workings of the DPCPU_* > >> macros, but I'm pretty sure the way I use them in SIFTR is correct or at > >> least as intended. > > > > siftr_chkpkt() passes ss to siftr_chkreinject() before dereferencing > > it itself. I think if ss was NULL, the panic should already occur in > > siftr_chkreinject(). > > Yes but siftr_chkreinject() only dereferences ss in the exceptional case > of a malloc failure or duplicate pkt. It's unlikely either case happens > for you and so wouldn't trigger the panic. > > > To be sure I added: > > > > diff --git a/sys/netinet/siftr.c b/sys/netinet/siftr.c > > index 8bc3498..b9fdfe4 100644 > > --- a/sys/netinet/siftr.c > > +++ b/sys/netinet/siftr.c > > _at__at_ -788,6 +788,16 _at__at_ siftr_chkpkt(void *arg, struct mbuf **m, struct ifnet *ifp, int dir, > > if (siftr_chkreinject(*m, dir, ss)) > > goto ret; > > > > + if (ss == NULL) { > > + printf("ss is NULL"); > > + ss = DPCPU_PTR(ss); > > + if (ss == NULL) { > > + printf("ss is still NULL"); > > + goto ret; > > + } > > + } > > + > > + > > if (dir == PFIL_IN) > > ss->n_in++; > > else > > > > which doesn't seem to affect the problem. > > As in it still panics and the "ss is NULL" message is not printed? I > would have expected to at least see "ss is NULL" printed if my > hypothesis was correct... hmm. Yes, it still panics, but no message is printed. > Perhaps the way I discovered the line number at which the panic occurred > was wrong. I compiled SIFTR on my amd64 dev server with "CFLAGS+=-g" in > the SIFTR Makefile to get debug symbols, ran "objdump -Sd siftr.ko | vim > -", searched for the instruction reported in the panic message i.e. > "addq $0x1,0x8(%r14)" and then with a bit of trial and error, recompiled > SIFTR with the line of code "volatile int blah = 0; blah = 2;" at > various points in the function and looking at the change in the objdump > output to pinpoint which line of C code corresponded with the "addq" > instruction. > > The "volatile int blah = 0; blah = 2;" compiles to "movl > $0x0,0xffffffffffffffd4(%rbp)" followed immediately by "movl > $0x2,0xffffffffffffffd4(%rbp)". When I put that code above the "if (dir > == PFIL_IN)" statement I see the objdump output show the assembly code > before the "addq" instruction and when I move it after the if statement > the assembly code moves after the "addq" instruction. That's a neat trick. > Perhaps you could reproduce the above procedure and see if you identify > the same point in the siftr_chkpkt function I did for the instruction > referenced by the panic message? I do. Using: diff --git a/sys/netinet/siftr.c b/sys/netinet/siftr.c index b9fdfe4..fc6bd9a 100644 --- a/sys/netinet/siftr.c +++ b/sys/netinet/siftr.c _at__at_ -797,12 +797,15 _at__at_ siftr_chkpkt(void *arg, struct mbuf **m, struct ifnet *ifp, int dir, } } + volatile int blah = 0; blah = 2; if (dir == PFIL_IN) ss->n_in++; else ss->n_out++; + volatile int foo = 0; foo = 3; + /* * Create a tcphdr struct starting at the correct offset * in the IP packet. ip->ip_hl gives the ip header length I get: 803: c7 45 d4 00 00 00 00 movl $0x0,0xffffffffffffffd4(%rbp) 80a: c7 45 d4 02 00 00 00 movl $0x2,0xffffffffffffffd4(%rbp) 811: 0f 84 8a 02 00 00 je aa1 <siftr_chkpkt+0x371> 817: 49 83 46 08 01 addq $0x1,0x8(%r14) 81c: c7 45 d0 00 00 00 00 movl $0x0,0xffffffffffffffd0(%rbp) 823: c7 45 d0 03 00 00 00 movl $0x3,0xffffffffffffffd0(%rbp) > >> Could you please go ahead and retest using a GENERIC kernel and see if > >> you can reproduce? There could be something in your custom kernel > >> causing the offsets or linker set magic used by the DPCPU bits to break > >> which in turn is triggering this panic in SIFTR. > > > > I'll retry without pf first, and with GENERIC afterwards. > > Sounds good, thanks. Taking pf (and altq) out of the picture doesn't seem to make a difference. Fabian
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:04 UTC