In the last episode (Oct 12), Kris Kennaway said:
> Dan Nelson wrote:
>> In the last episode (Oct 10), John Baldwin said:
>>> On Wednesday 12 September 2007 02:50:37 pm Ivan Voras wrote:
>>>> Dan Nelson wrote:
>>>>> The same panic was also reported for 6.2 via PR 107865 and PR
>>>>> 112490. 112490 included a workaround patch (I haven't tried it;
>>>>> just found it).
>>>>
>>>> The proposed patch in kern/112490 looks trivial but someone who
>>>> knows more about net locking should check it out. Unfortunately it
>>>> lacks context and I don't know the code in question to apply it
>>>> safely on a production machine :(
>>>
>>> I also get panics with what appears to be a double free of
>>> rt_gwroute in rtexpunge(), so I think while this PR may help some
>>> with figuring out the problem, I'm not sure it solves the root bug.
>>>
>>> Hmm, possibly try this patch:
>>
>> This makes the panics more frequent on my machine, actually :)
>
> Since you can reproduce this frequently the best thing might be to
> instrument all the route handling with KTR so that you can do
> post-mortem and try to figure out where the double-free or missing
> reference happened.

I've added some KTR debugging (why are the macros named CTR* instead of
KTR*?) and I think I've got the problem nailed down, but I don't know
anything about networking so I don't know what the solution is. I've
attached a KTR dump and the debugging patches I made (done to preserve
line numbering at the expense of style).

It looks like two threads are entering rt_check at the same time. In
the ktrdump, lrt0 is 0xc674d000 and lrt0->rt_gwroute is 0xc674ae88 for
both threads.

The thread on CPU1 locks lrt0 at line :1287 (ktr index 641), then locks
lrt0->rt_gwroute at :1303 (k642). It frees lrt0->rt_gwroute at :1305
(k643), then unlocks lrt0 at :1308 (k651) before calling rtalloc1().

Meanwhile, the thread on CPU0 has entered the rt_check function and is
spinning on the lrt0 lock at line :1287 (k649).
When CPU1 unlocks lrt0 (k651 above), lrt0->rt_gwroute is still pointing
to the freed rtentry. CPU0 then attempts to lock the now-freed
lrt0->rt_gwroute and crashes.

So, the problem is that lrt0->rt_gwroute is being left in an
inconsistent state while lrt0 is unlocked. What's the solution? Zero
out rt_gwroute before unlocking lrt0, then do some extra checks after
re-locking to handle the case where another thread has called rtalloc1
before us, or something else? Or is there some other locking problem
higher up that's allowing rt_check to be called in parallel on the same
rtentry in the first place?

-- 
	Dan Nelson
	dnelson_at_allantgroup.com
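
To make the interleaving described above easier to follow, here is a
small user-space model of it. This is only a sketch and not the kernel
code: struct rt_model, the freed flag, and the function names are all
hypothetical stand-ins, locks are represented only by the order of the
calls, and "freeing" is simulated by setting a flag so the example stays
well-defined. It replays the two threads' steps in the order seen in the
ktrdump, once with rt_gwroute left alone and once with it zeroed before
the unlock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct rtentry. */
struct rt_model {
	struct rt_model *rt_gwroute;	/* cached gateway route */
	bool freed;			/* set when the entry is "freed" */
};

/* What the second thread finds once it gets the lock on lrt0. */
enum outcome { SAW_NULL, SAW_LIVE, SAW_FREED };

/*
 * Steps of the first thread (CPU1 in the post) while it holds the
 * modelled lock on lrt0: release the gateway route, optionally zero
 * the cached pointer, then drop the lock.
 */
static void
first_thread_drop_gwroute(struct rt_model *lrt0, bool clear_before_unlock)
{
	struct rt_model *gw = lrt0->rt_gwroute;

	assert(gw != NULL);
	gw->freed = true;		/* models the free at :1305 */
	if (clear_before_unlock)
		lrt0->rt_gwroute = NULL; /* the "zero out" idea from the post */
	/* the unlock of lrt0 at :1308 would happen here */
}

/*
 * Steps of the second thread (CPU0 in the post) after it stops
 * spinning and acquires the lock on lrt0.
 */
static enum outcome
second_thread_inspect(const struct rt_model *lrt0)
{
	const struct rt_model *gw = lrt0->rt_gwroute;

	if (gw == NULL)
		return SAW_NULL;	/* must look the gateway up again */
	return gw->freed ? SAW_FREED : SAW_LIVE;
}

/* Replay the interleaving: first thread runs, then the second. */
static enum outcome
replay(bool clear_before_unlock)
{
	struct rt_model gw = { NULL, false };
	struct rt_model lrt0 = { &gw, false };

	first_thread_drop_gwroute(&lrt0, clear_before_unlock);
	return second_thread_inspect(&lrt0);
}
```

In this model replay(false) returns SAW_FREED, which corresponds to CPU0
trying to lock the freed rtentry, and replay(true) returns SAW_NULL. The
second case still needs the extra checks after re-locking that the post
mentions, since another thread may have already called rtalloc1; the
model does not attempt to show that part.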