 Hi,

On 7/14/16 11:02 PM, Larry Rosenman wrote:
> On 2016-07-14 12:01, Julien Charbon wrote:
>> On 6/20/16 11:55 AM, Julien Charbon wrote:
>>> On 6/20/16 9:39 AM, Gleb Smirnoff wrote:
>>>> On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote:
>>>> J> > Comparing stable/10 and head, I see two changes that could
>>>> J> > affect that:
>>>> J> >
>>>> J> > - callout_async_drain
>>>> J> > - switch to READ lock for inp info in tcp timers
>>>> J> >
>>>> J> > That's why you are in To, Julien and Hans :)
>>>> J> >
>>>> J> > We continue investigating, and I will keep you updated.
>>>> J> > However, any help is welcome. I can share cores.
>>>>
>>>> Now, spending some time with cores and adding a bunch of
>>>> extra CTRs, I have a sequence of events that lead to the
>>>> panic. In short, the bug is in the callout system. It seems
>>>> to be not relevant to the callout_async_drain, at least for
>>>> now. The transition to READ lock unmasked the problem, that's
>>>> why NetflixBSD 10 doesn't panic.
>>>>
>>>> The panic requires heavy contention on the TCP info lock.
>>>>
>>>> [CPU 1] the callout fires, tcp_timer_keep entered
>>>> [CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
>>>> [CPU 2] schedules the callout
>>>> [CPU 2] tcp_discardcb called
>>>> [CPU 2] callout successfully canceled
>>>> [CPU 2] tcpcb freed
>>>> [CPU 1] unblocks... panic
>>>>
>>>> When the lock was WLOCK, all contenders were resumed in a
>>>> sequence they came to the lock. Now, that they are readers,
>>>> once the lock is released, readers are resumed in a "random"
>>>> order, and this allows tcp_discardcb to go before the old
>>>> running callout, and this unmasks the panic.
>>>
>>> Highly interesting. I should be able to reproduce that (will be useful
>>> for testing the corresponding fix).
>>
>> Finally, I was able to reproduce it (without glebius fix).
>> The trick was to really lower TCP keep timer expiration:
>>
>> $ sysctl -a | grep tcp.keep
>> net.inet.tcp.keepidle: 7200000
>> net.inet.tcp.keepintvl: 75000
>> net.inet.tcp.keepinit: 75000
>> net.inet.tcp.keepcnt: 8
>> $ sudo bash -c "sysctl net.inet.tcp.keepidle=10 && sysctl
>> net.inet.tcp.keepintvl=50 && sysctl net.inet.tcp.keepinit=10"
>> Password:
>> net.inet.tcp.keepidle: 7200000 -> 10
>> net.inet.tcp.keepintvl: 75000 -> 50
>> net.inet.tcp.keepinit: 75000 -> 10
>>
>> Note: It will certainly close all your ssh connections to the tested
>> server.
>>
>> Now I will test in order:
>>
>> #1. glebius fix
>> https://svnweb.freebsd.org/base?view=revision&revision=302350
>>
>> #2. rss extra fix
>> https://reviews.freebsd.org/D7135
>>
>> #3. rrs TCP Timer cleanup
>> https://reviews.freebsd.org/D7136
>
> please see also https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884

 My test results so far:

 #1. r302350: First glebius TCP timer fix:

 No more TCP timer kernel panics during 48h under a load of 200k TCP
 queries per second. Sadly, I was unable to reproduce the issue
 described here:

 panic: bogus refcnt 0 on lle 0xfffff80004608c00
 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884

 #2. r303098: Got all kernel callout changes since r302350 (updates on
 callout code are indeed always full of surprises):

 https://svnweb.freebsd.org/base/head/sys/kern/kern_timeout.c?view=log&pathrev=303098

 No kernel panic either.

 Still to test:

 #3. rrs extra fix (if still relevant now)
 https://reviews.freebsd.org/D7135

 #4. rrs TCP Timer cleanup:
 https://reviews.freebsd.org/D7136

 My 2 cents.

--
Julien
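
Since the reproduction depends on lowering the keep timers and is noted to drop ssh sessions, a small wrapper that records the original values before changing them may be handy. This is only a sketch under assumptions: the three sysctl names and the lowered values (10/50/10) are taken from the quoted session, while the function names and the state-file path are hypothetical and not part of any FreeBSD tool. It has to run as root on the test machine.

```shell
#!/bin/sh
# Sketch: temporarily lower the TCP keep timers, then put them back.
# Function names and STATE_FILE are hypothetical (illustration only).

STATE_FILE="${STATE_FILE:-/tmp/tcp_keep.saved}"
KEEP_OIDS="net.inet.tcp.keepidle net.inet.tcp.keepintvl net.inet.tcp.keepinit"

# Record the current values as "name=value" lines, one per sysctl.
save_keep_timers() {
    : > "$STATE_FILE"
    for oid in $KEEP_OIDS; do
        echo "$oid=$(sysctl -n "$oid")" >> "$STATE_FILE"
    done
}

# Apply the aggressive values used in the reproduction above.
lower_keep_timers() {
    sysctl net.inet.tcp.keepidle=10
    sysctl net.inet.tcp.keepintvl=50
    sysctl net.inet.tcp.keepinit=10
}

# Re-apply whatever save_keep_timers recorded.
restore_keep_timers() {
    while IFS= read -r setting; do
        sysctl "$setting"
    done < "$STATE_FILE"
}
```

A possible use: source the file, run `save_keep_timers && lower_keep_timers` from the console before starting the load, and run `restore_keep_timers` once the test is over.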