Re: Survey results very helpful, thanks! (was: Re: net.inet.tcp.timer_race: does anyone have a non-zero value?)

From: Robert Watson <rwatson_at_FreeBSD.org>
Date: Mon, 8 Mar 2010 20:33:32 +0000 (GMT)
On Mon, 8 Mar 2010, Doug Hardie wrote:

> I run a number of 4 core systems with em interfaces.  These are production 
> systems that are unmanned and located a long way from me.  Under unusual 
> conditions it can take up to 6 hours to get there.  I have been waiting to 
> switch to 8.0 because of the discussions on the em device and now it sounds 
> like I had better just skip 8.x and wait for 9.  7.2 is working just fine.

Not sure that any information in this survey thread should be relevant to that 
decision.  This race has existed since before FreeBSD, having appeared in the 
original BSD network stack, and is just as present in FreeBSD 7.x as 8.x or 
9.x.  When I learned about the race during the early 7.x development cycle, I 
added a counter/statistic to measure how much it happened in practice, but was 
not able to exercise it in my testing, and so left the counter in to appear in 
7.0 and later so that we could perform this survey as core counts/etc 
increase.

The two likely outcomes were "it is never exercised" and "it is exercised but 
only very infrequently", neither really justifying the quite complex change to 
correct it given requirements at the time.  On-going development work on the 
virtual network stack is what justifies correcting the bug at this point, 
moving from detecting and handling the race to preventing it from occuring as 
an invariant.  The motivation here, BTW, is that we'd like to eliminate the 
type-stable storage requirement for connection state (which ensures that 
memory once used for a connection block is only ever used for connection 
blocks in the future), allowing memory to be fully freed when a virtual 
network stack is destroyed.  Using type-stable storage helped address this 
bug, but was primarily present to reduce the overhead of monitoring using 
netstat(1).  We'll now need to use a slightly more expensive solution (true 
reference counts) in that context, although in practice it will almost 
certainly be an unmeasurable cost.

Which is to say that while there might be something in the em/altq/... thread 
to reasonably lead you to avoid 8.0, nothing in the TCP timer race thread 
should do so, since it affects 7.2 just as much as 8.0.  Even if you do see a 
non-zero counter, that's not a matter for operational concern, just useful 
from the perspective of a network stack developer to understanding timing and 
behaviors in the stack.  :-)

Robert
Received on Mon Mar 08 2010 - 19:33:33 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:01 UTC