Re: panic with tcp timers

From: Hans Petter Selasky <hps_at_selasky.org> Date: Thu, 21 Jul 2016 10:05:48 +0200

On 07/21/16 09:54, Julien Charbon wrote:
>
>  Hi,
>
> On 7/14/16 11:02 PM, Larry Rosenman wrote:
>> On 2016-07-14 12:01, Julien Charbon wrote:
>>> On 6/20/16 11:55 AM, Julien Charbon wrote:
>>>> On 6/20/16 9:39 AM, Gleb Smirnoff wrote:
>>>>> On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote:
>>>>> J> > Comparing stable/10 and head, I see two changes that could
>>>>> J> > affect that:
>>>>> J> >
>>>>> J> > - callout_async_drain
>>>>> J> > - switch to READ lock for inp info in tcp timers
>>>>> J> >
>>>>> J> > That's why you are in To, Julien and Hans :)
>>>>> J> >
>>>>> J> > We continue investigating, and I will keep you updated.
>>>>> J> > However, any help is welcome. I can share cores.
>>>>>
>>>>> Now, after spending some time with cores and adding a bunch of
>>>>> extra CTRs, I have a sequence of events that leads to the
>>>>> panic. In short, the bug is in the callout system. It seems
>>>>> to be unrelated to callout_async_drain, at least for
>>>>> now. The transition to READ lock unmasked the problem, that's
>>>>> why NetflixBSD 10 doesn't panic.
>>>>>
>>>>> The panic requires heavy contention on the TCP info lock.
>>>>>
>>>>> [CPU 1] the callout fires, tcp_timer_keep entered
>>>>> [CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
>>>>> [CPU 2] schedules the callout
>>>>> [CPU 2] tcp_discardcb called
>>>>> [CPU 2] callout successfully canceled
>>>>> [CPU 2] tcpcb freed
>>>>> [CPU 1] unblocks... panic
>>>>>
>>>>> When the lock was WLOCK, all contenders were resumed in the
>>>>> order they came to the lock. Now that they are readers,
>>>>> once the lock is released, readers are resumed in a "random"
>>>>> order, and this allows tcp_discardcb to go before the old
>>>>> running callout, and this unmasks the panic.
>>>>
>>>>  Highly interesting.  I should be able to reproduce that (will be useful
>>>> for testing the corresponding fix).
>>>
>>>  Finally, I was able to reproduce it (without glebius fix).  The trick
>>> was to drastically lower the TCP keep timer expiration values:
>>>
>>> $ sysctl -a | grep tcp.keep
>>> net.inet.tcp.keepidle: 7200000
>>> net.inet.tcp.keepintvl: 75000
>>> net.inet.tcp.keepinit: 75000
>>> net.inet.tcp.keepcnt: 8
>>> $ sudo bash -c "sysctl net.inet.tcp.keepidle=10 && sysctl net.inet.tcp.keepintvl=50 && sysctl net.inet.tcp.keepinit=10"
>>> Password:
>>> net.inet.tcp.keepidle: 7200000 -> 10
>>> net.inet.tcp.keepintvl: 75000 -> 50
>>> net.inet.tcp.keepinit: 75000 -> 10
>>>
>>>  Note: It will certainly close all your ssh connections to the tested
>>> server.
>>>
>>>  Now I will test in order:
>>>
>>> #1. glebius fix
>>> https://svnweb.freebsd.org/base?view=revision&revision=302350
>>>
>>> #2. rrs extra fix
>>> https://reviews.freebsd.org/D7135
>>>
>>> #3. rrs TCP Timer cleanup
>>> https://reviews.freebsd.org/D7136
>>
>> please see also https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884
>
>  My test results so far:
>
> #1. r302350:  First glebius TCP timer fix:  No more TCP timer kernel
> panic during 48h under a load of 200k TCP queries per second.
>
>  Sadly I was unable to reproduce the issue described here:
>
> panic: bogus refcnt 0 on lle 0xfffff80004608c00
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884
>
> #2. r303098:  Got all kernel callout changes since r302350 (updates on
> callout code are indeed always full of surprises):
> https://svnweb.freebsd.org/base/head/sys/kern/kern_timeout.c?view=log&pathrev=303098
>
>  No kernel panic either.
>
>  Still to test:
>
> #3. rrs extra fix (if still relevant now)
> https://reviews.freebsd.org/D7135
>
> #4. rrs TCP Timer cleanup:
> https://reviews.freebsd.org/D7136
>
>  My 2 cents.
>

Hi,

You should also check for memory leaks using "vmstat -m".
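
One possible way to do that, as a rough sketch only: save the "vmstat -m" output before the load test and again after the connections have drained, then look at which lines changed. The helper name vmstat_m_changed and the /tmp file names below are placeholders made up for illustration, not existing tools. A line that changed is not proof of a leak by itself; it only tells you which malloc types to watch over repeated runs.

```shell
# Hypothetical helper (name is illustrative): print the lines that differ
# between two saved "vmstat -m" snapshots. "<" marks the line from the
# first snapshot, ">" marks the line from the second one.
vmstat_m_changed() {
    before=$1
    after=$2
    # diff exits non-zero when the files differ and grep exits non-zero
    # when nothing matches; both are expected here, so mask the status.
    diff "$before" "$after" | grep '^[<>]' || true
}

# Usage on the machine under test (not executed by this snippet):
#   vmstat -m > /tmp/vmstat-m.before
#   ... run the load test, then let the TCP connections drain ...
#   vmstat -m > /tmp/vmstat-m.after
#   vmstat_m_changed /tmp/vmstat-m.before /tmp/vmstat-m.after
```

If the InUse or MemUse figures of some type keep growing across several test cycles and never come back down once the machine is idle, that type is worth a closer look.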

--HPS