Re: panic with tcp timers

From: Bjoern A. Zeeb <bzeeb-lists_at_lists.zabbadoz.net> Date: Fri, 17 Jun 2016 14:41:02 +0000

On 17 Jun 2016, at 4:53, Gleb Smirnoff wrote:

>   Hi!
>
>   At Netflix we are observing a race in TCP timers with head.
> The problem is a regression that doesn't happen on stable/10.
> The panic usually happens after several hours at 55 Gbit/s of
> traffic.
>
> What happens is that tcp_timer_keep finds t_tcpcb to be
> NULL. Some coredumps show the tcpcb already initialized,
> with non-NULL t_tcpcb and in TCPS_ESTABLISHED state, which
> means that another CPU was working on the tcpcb while
> the faulting one was handling the panic. So this all looks
> like a use after free that collides with a new allocation.
>
> Comparing stable/10 and head, I see two changes that could
> affect that:
>
> - callout_async_drain
> - switch to READ lock for inp info in tcp timers
>
> That's why you are in To, Julien and Hans :)
>
> We continue investigating, and I will keep you updated.
> However, any help is welcome. I can share cores.

There’s also the change that stopped marking the zones NO_FREE.
At the time I was convinced that, in theory, this should no longer be
an issue.

If I overlooked something, or follow-up timer changes invalidated my
assumptions, then that could also be a source of trouble.
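
For readers following the thread, the kind of assumption at stake can be illustrated with a minimal userland sketch in C. It is not FreeBSD code: struct conn, conn_new, conn_rele, timer_arm, conn_drop, timer_fire and conns_freed are all hypothetical names, and the sketch is single-threaded, so it leaves out the locking a real kernel would need. It only shows one common way to keep a late timer from touching freed memory: an armed timer holds its own reference, and the callback checks a "dropped" flag before doing any work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical connection block; NOT the real struct tcpcb. */
struct conn {
    int  refs;     /* owner reference plus one per armed timer */
    bool dropped;  /* set once the connection has been torn down */
};

int conns_freed;   /* counts frees, for illustration only */

struct conn *conn_new(void)
{
    struct conn *c = calloc(1, sizeof(*c));
    if (c != NULL)
        c->refs = 1;            /* the owner's reference */
    return c;
}

/* Drop one reference; free on the last one. Returns true if freed. */
bool conn_rele(struct conn *c)
{
    assert(c->refs > 0);
    if (--c->refs > 0)
        return false;
    free(c);
    conns_freed++;
    return true;
}

/* Arming a timer takes a reference so the block outlives the timer. */
void timer_arm(struct conn *c)
{
    c->refs++;
}

/* Teardown marks the block dropped and releases the owner's reference. */
bool conn_drop(struct conn *c)
{
    c->dropped = true;
    return conn_rele(c);
}

/*
 * Timer callback: returns true if it would do keepalive work, false if
 * the connection was already dropped. Either way it releases the
 * reference that timer_arm() took.
 */
bool timer_fire(struct conn *c)
{
    bool did_work = !c->dropped;
    conn_rele(c);
    return did_work;
}
```

In this sketch, calling conn_drop() while a timer is still armed does not free the block; the free happens only when the late timer_fire() releases the last reference, so the callback never dereferences freed or reallocated memory.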

That said, I have not been able to trigger any related panics or log
entries lately (but my branch is currently slightly behind head).

However, we should get the problem fixed properly and not try to
“paint over” it again.

/bz