Re: Panic in arpresolve->rt_check?

From: Dan Nelson <dnelson_at_allantgroup.com>
Date: Tue, 16 Oct 2007 11:31:23 -0500

In the last episode (Oct 12), Kris Kennaway said:
> Dan Nelson wrote:
>> In the last episode (Oct 10), John Baldwin said:
>>> On Wednesday 12 September 2007 02:50:37 pm Ivan Voras wrote:
>>>> Dan Nelson wrote:
>>>>> The same panic was also reported for 6.2 via PR 107865 and PR
>>>>> 112490.  112490 included a workaround patch (I haven't tried it;
>>>>> just found it).
>>>>
>>>> The proposed patch in kern/112490 looks trivial but someone who
>>>> knows more about net locking should check it out. Unfortunately it
>>>> lacks context and I don't know the code in question to apply it
>>>> safely on a production machine :(
>>>
>>> I also get panics with what appears to be a double free of
>>> rt_gwroute in rtexpunge(), so I think while this PR may help some
>>> with figuring out the problem, I'm not sure it solves the root bug.
>>> 
>>> Hmm, possibly try this patch:
>>
>> This makes the panics more frequent on my machine, actually :)
> 
> Since you can reproduce this frequently the best thing might be to
> instrument all the route handling with KTR so that you can do
> post-mortem and try to figure out where the double-free or missing
> reference happened.

I've added some KTR debugging (why are the macros named CTR* instead of
KTR*?) and I think I've got the problem nailed down, but I don't know
anything about networking, so I don't know what the solution is.

I've attached a KTR dump and the debugging patches I made (done to
preserve line numbering at the expense of style).

It looks like two threads are entering rt_check at the same time.  In
the ktrdump, lrt0 is 0xc674d000 and lrt0->rt_gwroute is 0xc674ae88 for
both threads.

The thread on CPU1 locks lrt0 at line :1287 (ktr index 641), then locks
lrt0->rt_gwroute at :1303 (k642).  It frees lrt0->rt_gwroute at :1305
(k643), then unlocks lrt0 at :1308 (k651) before calling rtalloc1().

Meanwhile, the thread on CPU0 has entered the rt_check function and is
spinning on the lrt0 lock at line :1287 (k649).  When CPU1 unlocks lrt0
(k651 above), lrt0->rt_gwroute is still pointing to the freed rtentry.
CPU0 then attempts to lock the now-freed lrt0->rt_gwroute and crashes.
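
To make the interleaving easier to follow, here is a rough user-space
replay of it.  This is only a sketch, not kernel code: struct
sim_rtentry, the sim_* helpers and the "freed" flag are made up for
illustration (the flag stands in for the real free so the stale pointer
can be inspected safely), and the two CPUs are replayed sequentially in
the order the ktrdump shows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct rtentry; not the real layout. */
struct sim_rtentry {
    bool locked;                    /* models the per-entry route lock */
    bool freed;                     /* set instead of really freeing the entry */
    struct sim_rtentry *rt_gwroute; /* cached gateway route */
};

static void sim_lock(struct sim_rtentry *rt)
{
    assert(!rt->locked);            /* a real lock would spin here instead */
    rt->locked = true;
}

static void sim_unlock(struct sim_rtentry *rt)
{
    assert(rt->locked);
    rt->locked = false;
}

/*
 * Replays the sequence from the ktrdump.  Returns true if CPU0 ends up
 * with a pointer to an entry that CPU1 has already freed.
 */
static bool sim_replay_race(void)
{
    struct sim_rtentry gw   = { false, false, NULL };
    struct sim_rtentry lrt0 = { false, false, &gw };

    /* CPU1 */
    sim_lock(&lrt0);                /* :1287 (k641) */
    sim_lock(lrt0.rt_gwroute);      /* :1303 (k642) */
    lrt0.rt_gwroute->freed = true;  /* :1305 (k643), gateway released */
    sim_unlock(&lrt0);              /* :1308 (k651), pointer left in place */

    /* CPU0: was waiting at :1287 (k649), now gets the lrt0 lock */
    sim_lock(&lrt0);
    struct sim_rtentry *stale = lrt0.rt_gwroute;
    bool touches_freed = (stale != NULL && stale->freed);
    sim_unlock(&lrt0);

    return touches_freed;
}
```

In this model sim_replay_race() returns true: the pointer CPU0 reads
under the lrt0 lock still refers to the entry CPU1 released.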

So, the problem is that lrt0->rt_gwroute is being left in an
inconsistent state while lrt0 is unlocked.  What's the solution?  Zero
out rt_gwroute before unlocking lrt0, then do some extra checks after
re-locking to handle the case where another thread has called rtalloc1
before us, or something else?  Or is there some other locking problem
higher up that's allowing rt_check to be called in parallel on the same
rtentry in the first place?
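
For what it's worth, the first idea (clear the pointer before
unlocking, re-check after re-locking) might look roughly like this in
the same kind of user-space model.  Everything here is an assumption
for illustration: struct fix_rtentry and the fix_* names are made up,
"fresh" stands in for whatever rtalloc1() would return, and "racer"
models a second thread installing its own gateway during the unlocked
window.  I have no idea whether this is the right fix for the real
rt_check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct rtentry; not the real layout. */
struct fix_rtentry {
    bool locked;                    /* models the per-entry route lock */
    bool freed;                     /* set instead of really freeing the entry */
    struct fix_rtentry *rt_gwroute; /* cached gateway route */
};

static void fix_lock(struct fix_rtentry *rt)
{
    assert(!rt->locked);
    rt->locked = true;
}

static void fix_unlock(struct fix_rtentry *rt)
{
    assert(rt->locked);
    rt->locked = false;
}

/*
 * Replace lrt0's gateway.  The stale pointer is cleared while lrt0 is
 * still locked, so nobody can pick it up during the unlocked window.
 * Returns the gateway that ends up installed in lrt0.
 */
static struct fix_rtentry *
fix_refresh_gwroute(struct fix_rtentry *lrt0, struct fix_rtentry *fresh,
    struct fix_rtentry *racer)
{
    fix_lock(lrt0);
    struct fix_rtentry *old = lrt0->rt_gwroute;
    if (old != NULL) {
        lrt0->rt_gwroute = NULL;    /* zero out before unlocking */
        old->freed = true;
    }
    fix_unlock(lrt0);

    /* Unlocked window: the lookup (rtalloc1 in the kernel) runs here. */
    if (racer != NULL) {
        /* Another thread gets in first and sees NULL, never a freed entry. */
        fix_lock(lrt0);
        assert(lrt0->rt_gwroute == NULL);
        lrt0->rt_gwroute = racer;
        fix_unlock(lrt0);
    }

    /* Re-lock and re-check before installing our own result. */
    fix_lock(lrt0);
    if (lrt0->rt_gwroute == NULL)
        lrt0->rt_gwroute = fresh;
    else
        fresh->freed = true;        /* lost the race: discard our result */
    struct fix_rtentry *result = lrt0->rt_gwroute;
    fix_unlock(lrt0);

    return result;
}
```

With racer == NULL the fresh entry is installed; with a racer, the
racer's entry stays in place and ours is thrown away.  In both cases
lrt0->rt_gwroute never points at a freed entry while lrt0 is unlocked.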

-- 
	Dan Nelson
	dnelson_at_allantgroup.com

Received on Tue Oct 16 2007 - 14:31:25 UTC
