Re: panic with tcp timers

From: Julien Charbon <jch_at_freebsd.org>
Date: Thu, 21 Jul 2016 09:54:20 +0200

 Hi,

On 7/14/16 11:02 PM, Larry Rosenman wrote:
> On 2016-07-14 12:01, Julien Charbon wrote:
>> On 6/20/16 11:55 AM, Julien Charbon wrote:
>>> On 6/20/16 9:39 AM, Gleb Smirnoff wrote:
>>>> On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote:
>>>> J> > Comparing stable/10 and head, I see two changes that could
>>>> J> > affect that:
>>>> J> >
>>>> J> > - callout_async_drain
>>>> J> > - switch to READ lock for inp info in tcp timers
>>>> J> >
>>>> J> > That's why you are in To, Julien and Hans :)
>>>> J> >
>>>> J> > We continue investigating, and I will keep you updated.
>>>> J> > However, any help is welcome. I can share cores.
>>>>
>>>> Now, after spending some time with cores and adding a bunch of
>>>> extra CTRs, I have a sequence of events that leads to the
>>>> panic. In short, the bug is in the callout system. It seems
>>>> to be unrelated to callout_async_drain, at least for
>>>> now. The transition to READ lock unmasked the problem, that's
>>>> why NetflixBSD 10 doesn't panic.
>>>>
>>>> The panic requires heavy contention on the TCP info lock.
>>>>
>>>> [CPU 1] the callout fires, tcp_timer_keep entered
>>>> [CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
>>>> [CPU 2] schedules the callout
>>>> [CPU 2] tcp_discardcb called
>>>> [CPU 2] callout successfully canceled
>>>> [CPU 2] tcpcb freed
>>>> [CPU 1] unblocks... panic
>>>>
>>>> When the lock was WLOCK, all contenders were resumed in the
>>>> order in which they came to the lock. Now that they are
>>>> readers, once the lock is released they are resumed in a
>>>> "random" order. This allows tcp_discardcb to go before the
>>>> old running callout, and that unmasks the panic.
>>>
>>>  Highly interesting.  I should be able to reproduce that (will be useful
>>> for testing the corresponding fix).
>>
>>  Finally, I was able to reproduce it (without glebius' fix).  The trick
>> was to drastically lower the TCP keepalive timer values:
>>
>> $ sysctl -a | grep tcp.keep
>> net.inet.tcp.keepidle: 7200000
>> net.inet.tcp.keepintvl: 75000
>> net.inet.tcp.keepinit: 75000
>> net.inet.tcp.keepcnt: 8
>> $ sudo bash -c "sysctl net.inet.tcp.keepidle=10 &&
>> sysctl net.inet.tcp.keepintvl=50 && sysctl net.inet.tcp.keepinit=10"
>> Password:
>> net.inet.tcp.keepidle: 7200000 -> 10
>> net.inet.tcp.keepintvl: 75000 -> 50
>> net.inet.tcp.keepinit: 75000 -> 10
>>
>>  Note: It will certainly close all your ssh connections to the tested
>> server.
>>
>>  Now I will test in order:
>>
>> #1. glebius fix
>> https://svnweb.freebsd.org/base?view=revision&revision=302350
>>
>> #2. rrs extra fix
>> https://reviews.freebsd.org/D7135
>>
>> #3. rrs TCP Timer cleanup
>> https://reviews.freebsd.org/D7136
> 
> please see also https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884

 My test results so far:

#1. r302350:  First glebius TCP timer fix:  No more TCP timer kernel
panics during 48h under a load of 200k TCP queries per second.

 Sadly I was unable to reproduce the issue described here:

panic: bogus refcnt 0 on lle 0xfffff80004608c00
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884

#2. r303098:  Got all kernel callout changes since r302350 (updates to
callout code are indeed always full of surprises):
https://svnweb.freebsd.org/base/head/sys/kern/kern_timeout.c?view=log&pathrev=303098

 No kernel panic either.

 Still to test:

#3. rrs extra fix (if still relevant now)
https://reviews.freebsd.org/D7135

#4. rrs TCP Timer cleanup:
https://reviews.freebsd.org/D7136

 My 2 cents.

--
Julien