Re: strange ping response times...

From: Bruce Evans <brde_at_optusnet.com.au>
Date: Thu, 12 Apr 2012 21:22:02 +1000 (EST)
On Thu, 12 Apr 2012, Luigi Rizzo wrote:

> On Thu, Apr 12, 2012 at 01:18:59PM +1000, Bruce Evans wrote:
>> On Wed, 11 Apr 2012, Luigi Rizzo wrote:
>>
>>> On Wed, Apr 11, 2012 at 02:16:49PM +0200, Andre Oppermann wrote:
> ...
>>> ping takes a timestamp in userspace before trying to transmit
>>> the packet, and then the timestamp for the received packet
>>> is recorded in the kernel (in the interrupt or netisr thread
>>> i believe -- anyways, not in userspace).
>>
>> No, all timestamps recorded used by ping are recorded in userland.
>
> Bruce, look at the code in ping.c -- SO_TIMESTAMP is defined,
> so the program does (successfully) a
>
> 	setsockopt(s, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof(on));
>
> and then (verified that at runtime the code follows this path)
> ...

Indeed it does.  This accounts for the previously unaccounted for
binuptime call in the kernel.

This is part of fenner's change (perhaps the most important part) to
put the timestamping closer to the actual i/o, so that the RTT doesn't
count setup overheads.  Timestamping still works when SO_TIMESTAMP is
undefed (after fixing the ifdefs on it).  Not using it costs only 1
more gettimeofday() call in ping, but it increases my apparent RTT
from 7-8 usec to 10-11 usec.  3 extra usec seems a lot for the overhead.

I found that ping -f saves 1 gettimeofday() call per packet, and my
version saves another 1.

     -current ping -q  localhost: 4
     my       ping -q  localhost: 3
     -current ping -fq localhost: 3
     my       ping -fq localhost: 2

1 gettimeofday() call is needed for putting a timestamp in the output
packet.  Apparently, this is not well integrated with the bookkeeping
for select(), and up to 3 more gettimeofday() calls are used.  select()
timeouts only have 1/HZ granulatrity, so the gettimeofday() calls to
set them up could use clock_gettime() with CLOCK_REALTIME_FAST_N_BROKEN,
but this would be a bogus optimization since most of the overhead is in
the syscall provided the timecounter hardware is not slow (it takes 9-80
cycles to read TSC timecounter hardware and 600-800 cycles for
clock_gettime() with a TSC timecounter).

I just noticed Note that CLOCK_MONOTONIC cannot be used for the timestamp
in the output packet if SO_TIMESTAMP is used for the timestamp in the
input packet, since SO_TIMESTAMP uses CLOCK_REALTIME for historical
reasons.

There are 2 select()s per packet.  Perhaps the number of gettimeofday()s
can be reduced to 1 per packet in most cases (get one for the output packet
and use it for both select()s).  With my version the truss trace for
ping localhost is:

% 64 bytes from 127.0.0.1: icmp_seq=6 ttl=64 time=0.473 ms
% write(1,0x80d4000,57)				 = 57 (0x39)
% gettimeofday({1334226305 409249},0x0)		 = 0 (0x0)

Need this for the next select().  There is a ~1 second pause here.

% select(4,{3},0x0,0x0,{0 989607})		 = 0 (0x0)

Truss doesn't show this until select() returns ~1 second later.  The
gettimeofday() call was needed because we don't simply use a 1 second
pause, but adjust for overheads.  My version uses a fancier adjustment.

% gettimeofday({1334226306 408632},0x0)		 = 0 (0x0)

We need this accurately to put in the packet.  But all the other timestamps
can be derived from this for the non-flood case.  We can just try to send
every `interval' and adjust the timeouts a bit when we see that we get
here a bit early or late, and when we get here more than a bit early or
late we can either recalibrate or adjust by more.

Note that this gettimeofday() returned 1.0 - 0.000617 seconds after the
previous one, although we asked for a timeout of ~1/HZ = 0.0010000 seconds.
Select timeouts have a large granularity and we expect errors of O(1/HZ)
and mostly compensate for them.  The drift without compensation would be
1% with HZ = 100 and -i 1.0, and much larger with -i <small>.  My
version compensates more accurately than -current.

% sendto(0x3,0x80b8d34,0,0x0,{ AF_INET 127.0.0.1:0 },0x10) = 64 (0x40)
% gettimeofday({1334226306 408831},0x0)		 = 0 (0x0)
% select(4,{3},0x0,0x0,{0 990025})		 = 1 (0x1)
% recvmsg(0x3,0xbfbfe8c0,0x0)			 = 84 (0x54)
% gettimeofday({1334226306 409104},0x0)		 = 0 (0x0)

I don't know what this is for.  We got a timestamp in the returned packet,
and use it.  Since this timestamp has microsecond resolution and select()
only has 1/Hz resolution. this timestamp should be more than good enough
for sleeping for the interval.

% 64 bytes from 127.0.0.1: icmp_seq=7 ttl=64 time=0.472 ms
% write(1,0x80d4000,57)				 = 57 (0x39)

Next packet:

% gettimeofday({1334226306 409279},0x0)		 = 0 (0x0)
% ...

But select() timeouts are not needed at all.  Versions before fenner's
changes used a select() timeout for flood pings.  alarm() was used to
generate other timeouts.  The alarm() code was not very good.  It did
a lot of work in the signal handler to set up the next alarm (1 call
to signal() and 1 call to alarm()), and it did unsafe work in the signal
handler.  It was not careful about drift.  These overheads and the drift
can can be avoided using a periodic itimer.  Ideally, expiry of each period
would just cause select() to return, but AFAIK there is no way to make
select() return without catching the signal, and even a null signal catcher
has a lot of overhead.  It is unclear if the overhead for a null signal
catcher is larger than that for select timeouts.  The current overhead for
select timeouts is 2 or 3 gettimeofday() calls in ping and 1 or 2
getmicrouptime() calls in each select().  These take at least 1 usec
on my Athlon64 2GHz UP, and signal handling also takes about 1 usec
in old versions of FreeBSD, but now takes more like 1.5 usec.

SO_TIMESTAMP seems to have been invented in 1996 by fenner to make
ping work better.  (The log messages seem to only mention fixing it
but I can't find it in earlier code.)  There seems to be no way to
timestamp outgoing packets properly in the kernel.  ICMP_TIMESTAMP
might give milliseconds resolution for ping, but FreeBSD ping has given
microseconds resulotion since 1993.

Bruce
Received on Thu Apr 12 2012 - 09:22:17 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:25 UTC