Hi Gleb,

On 6/17/16 6:53 AM, Gleb Smirnoff wrote:
> At Netflix we are observing a race in TCP timers with head.
> The problem is a regression that doesn't happen on stable/10.
> The panic usually happens after several hours at 55 Gbit/s of
> traffic.
>
> What happens is that tcp_timer_keep finds t_tcpcb being
> NULL. Some coredumps have the tcpcb already initialized,
> with a non-NULL t_tcpcb and in TCPS_ESTABLISHED state, which
> means that another CPU was working on the tcpcb while the
> faulted one was panicking. So this all looks like a
> use-after-free that conflicts with a new allocation.
>
> Comparing stable/10 and head, I see two changes that could
> affect that:
>
> - callout_async_drain
> - switch to READ lock for inp info in TCP timers
>
> That's why you are in To, Julien and Hans :)
>
> We continue investigating, and I will keep you updated.
> However, any help is welcome. I can share cores.

Thanks for sharing. Let me run our TCP tests on a recent version of
HEAD to see if by chance I can reproduce it. If I am not able to
reproduce it, I will ask for a debug kernel and cores and see if I
can help.

A few notes:

- Around two months ago I tested HEAD with callout_async_drain() in
  TCP timers using our TCP QA test suite, and saw no kernel panic.
  That said, I did not let the tests run for several hours.
- At Verisign we run 10 with the "READ lock for inp info in tcp
  timers" change. Again, that does not mean this change has no impact
  here.

My 2 cents.

--
Julien
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:06 UTC