Re: panic with tcp timers

From: Julien Charbon <jch_at_freebsd.org> Date: Mon, 20 Jun 2016 11:55:55 +0200 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:06 UTC

 Hi,

On 6/20/16 9:39 AM, Gleb Smirnoff wrote:
> On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote:
> J> > Comparing stable/10 and head, I see two changes that could
> J> > affect that:
> J> > 
> J> > - callout_async_drain
> J> > - switch to READ lock for inp info in tcp timers
> J> > 
> J> > That's why you are in To, Julien and Hans :)
> J> > 
> J> > We continue investigating, and I will keep you updated.
> J> > However, any help is welcome. I can share cores.
> 
> Now, spending some time with cores and adding a bunch of
> extra CTRs, I have a sequence of events that lead to the
> panic. In short, the bug is in the callout system. It seems
> to be not relevant to the callout_async_drain, at least for
> now. The transition to READ lock unmasked the problem, that's
> why NetflixBSD 10 doesn't panic.
> 
> The panic requires heavy contention on the TCP info lock.
> 
> [CPU 1] the callout fires, tcp_timer_keep entered
> [CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
> [CPU 2] schedules the callout
> [CPU 2] tcp_discardcb called
> [CPU 2] callout successfully canceled
> [CPU 2] tcpcb freed
> [CPU 1] unblocks... panic
> 
> When the lock was WLOCK, all contenders were resumed in a
> sequence they came to the lock. Now, that they are readers,
> once the lock is released, readers are resumed in a "random"
> order, and this allows tcp_discardcb to go before the old
> running callout, and this unmasks the panic.

 Highly interesting.  I should be able to reproduce that (will be useful
for testing the corresponding fix).

 Fix proposal:  If callout_async_drain() returns 0 (fail) (instead of 1
(success) here) when the callout cancellation is a success _but_ the
callout is current running, that should fix it.

 For the history:  It comes back to my old callout question:

 Does _callout_stop_safe() is allowed to return 1 (success) even if the
callout is still currently running;  a.k.a. it is not because you
successfully cancelled a callout that the callout is not currently running.

 We did propose a patch to make _callout_stop_safe() returns 0 (fail)
when the callout is currently running:

callout_stop() should return 0 when the callout is currently being
serviced and indeed unstoppable
https://reviews.freebsd.org/differential/changeset/?ref=62513&whitespace=ignore-most

 But this change impacted too many old code paths and was interesting
only for TCP timers and thus was abandoned.

 My 2 cents.

--
Julien