odd TCP rtt/retransmit timeout issue...

From: John-Mark Gurney <gurney_j_at_resnet.uoregon.edu>
Date: Mon, 25 Sep 2006 02:57:45 -0700
I was bringing up another interface that I just added to /etc/rc.conf and
ran the command /etc/rc.d/netif start to initialize it...  But then my
connection never came back.... I found that the shell was still active
as I could type commands like sleep 5, and another session's w would
see sleep 5 run on the session...  even filling up the send-q w/ 32k
of data didn't get the HEAD box to send any data to the client...

With the help of silby, I managed to find that the t_rxtcur value in
the tcpcb was getting a very large value.  The session that hung had
a retransmit timeout of 19 days...  This led us to find that the
TCPT_RANGESET macro was letting very large tvmin values override the
more sane tvmax values due to an extra else.  I have fixed that so
we shouldn't see any more multi day timeouts, but we still apparently
have a problem where the rtt value calculated is wildly incorrect...
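
As a minimal sketch of the clamping problem described above (this is not
the kernel source; the macro names, the wrapper functions and the sample
values are assumptions made for illustration), the extra else means the
upper bound is never checked once the lower bound has been applied:

```c
/*
 * Hypothetical stand-ins for a range-clamping macro such as TCPT_RANGESET.
 * The names and layout here are illustrative only.
 */

/* Buggy shape: if tvmin is applied, the "else" skips the tvmax check. */
#define RANGESET_BUGGY(tv, value, tvmin, tvmax) do {			\
	(tv) = (value);							\
	if ((unsigned long)(tv) < (unsigned long)(tvmin))		\
		(tv) = (tvmin);						\
	else if ((unsigned long)(tv) > (unsigned long)(tvmax))		\
		(tv) = (tvmax);						\
} while (0)

/* Fixed shape: both bounds are always checked, so tvmax has the last word. */
#define RANGESET_FIXED(tv, value, tvmin, tvmax) do {			\
	(tv) = (value);							\
	if ((unsigned long)(tv) < (unsigned long)(tvmin))		\
		(tv) = (tvmin);						\
	if ((unsigned long)(tv) > (unsigned long)(tvmax))		\
		(tv) = (tvmax);						\
} while (0)

/* Small wrappers so the two behaviours can be compared directly. */
static int
rangeset_buggy(int value, int tvmin, int tvmax)
{
	int tv;

	RANGESET_BUGGY(tv, value, tvmin, tvmax);
	return (tv);
}

static int
rangeset_fixed(int value, int tvmin, int tvmax)
{
	int tv;

	RANGESET_FIXED(tv, value, tvmin, tvmax);
	return (tv);
}
```

With a bogus tvmin of 1662654093 and a tvmax of 64000 (both picked from
the values in this mail, purely as an example), rangeset_buggy(203, ...)
returns 1662654093 while rangeset_fixed(203, ...) returns 64000.  When
tvmin is sane, both versions agree.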

It appears that each connection will get a different "random" rtt
value...  From a few connections to my machine:
(kgdb) print ((struct tcpcb *)0xc3a34af8)->t_rxtcur
$3 = 64000
(kgdb) print ((struct tcpcb *)0xc3a3457c)->t_rxtcur
$6 = 1662654093
(kgdb) print ((struct tcpcb *)0xc3a343a8)->t_rxtcur
$12 = 1358
(kgdb) print ((struct tcpcb *)0xc3a9e1d4)->t_rxtcur
$17 = 203
(kgdb) print ((struct tcpcb *)0xc3a9e000)->t_rxtcur
$19 = 284155863

Most connections are stable around the "picked" value, though I have
seen some connections oscillate between 64000 and a really large value..
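
For a sense of scale, the t_rxtcur values above can be converted from
ticks to wall-clock time.  The sketch below assumes hz = 1000 ticks per
second, which is not stated in this mail; the constant and function
names are made up for the example.  Under that assumption 64000 ticks is
64 seconds, and 1662654093 ticks comes out at a little over 19 days.

```c
/* Assumption: the kernel runs at 1000 ticks per second (hz = 1000). */
#define ASSUMED_HZ	1000

#define SECONDS_PER_DAY	86400.0

/* Convert a timer value in ticks to seconds. */
static double
ticks_to_seconds(long ticks)
{
	return ((double)ticks / ASSUMED_HZ);
}

/* Convert a timer value in ticks to days. */
static double
ticks_to_days(long ticks)
{
	return (ticks_to_seconds(ticks) / SECONDS_PER_DAY);
}
```

The same conversion puts the 284155863 value at roughly 3.3 days.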

I was trying to track this down, and a kernel as of 9/17 exhibits the
problem.  I thought I had tracked it down to a RELENG_6 commit (which
obviously would affect HEAD), until I realized that each connection got
a different value, and that in my older tests I was just getting lucky
in not having a bad timeout...

To obtain these values, I used kgdb kernel /dev/mem, and put the value
returned by netstat -Aanfinet's first column in as the tcpcb pointer
above..  (Why is the column named Socket, when it's the control block
struct and not the socket struct?)

Anyone want to track down why we are getting such large values in
there?  I'll try to back track farther to see how old this issue is..

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."
Received on Mon Sep 25 2006 - 07:57:57 UTC
