Re: CLOCK_MONOTONIC / CLOCK_UPTIME is not really monotonic between threads

From: Konstantin Belousov <kostikbel_at_gmail.com>
Date: Fri, 3 Jun 2016 08:06:28 +0300
On Thu, Jun 02, 2016 at 09:16:47PM -0700, Maxim Sobolev wrote:
> Hi there, we have an application here which is trying to measure UDP
> command/response round-trip-time. It runs two posix threads (more actually,
> but that's probably irrelevant), one (let's call it A) that does high-level
> logic and the second one (B) that does network packet I/O.
> 
> The sending side is done by first thread A forming the request, then
> calling the function clock_gettime(CLOCK_MONOTONIC) and passing the packet
> into the thread B. Obtained timestamp is stored with some logical
> transaction ID allowing us to pull that stored value later on when response
> arrives. Then we have a separate process that receives those requests,
> processing them and sending back some form of response.
> 
> Upon receiving a response from the network, the network I/O thread (B)
> timestamps it by running clock_gettime(CLOCK_MONOTONIC) and passes the
> packet data along with that value via queue to the thread A for processing.
> 
> So if we put things into timeline, what our app does would probably look
> something like the following:
> 
> 1. Thread A generates request.
> 2. A calls clock_gettime(CLOCK_MONOTONIC), storing value as t1 internally
> 3. A passes packet to thread B
> 4. B sends out packet via sendto() to server process running on the same
> box (fully separate, not a thread)
> 
> [some microseconds later]
> 
> 5. B receives response from server with recvfrom()
> 6. B instantly calls clock_gettime(CLOCK_MONOTONIC), assigns returned value
> to t2
> 7. B passes packet data along with t2 to the A via queue
> 8. A picks up packet, parses it and retrieves corresponding t1 stored at
> step 2.
> 9. A calculates RTT by doing t2 - t1 assuming it's going to be positive...
> 
> As you might have guessed if you are still reading, from time to time t2 -
> t1 comes out slightly negative! Provided it's not some obscure bug in our
> app, there is no way this could happen if clock_gettime(CLOCK_MONOTONIC)
> would work as advertised. Event (2) could not possibly happen earlier than
> (6), which is guaranteed by the fact that the request needs to be processed
> by the external entity first in order for the response to be seen by our
> app at all. I've added some logs and it seems to be confirming that the
> server only sees a single request, there is no chance for the client to
> receive some other packet and confuse it. I've also confirmed with tcpdump,
> which shows reasonable time delay between request and reply of few hundreds
> microseconds.
> 
> I've checked all logic and I could not find any mistakes on my end here, so
> I added some logging for such events. The distribution appears to be
> centered around 0.00006s, but there are some events that go as far up as
> 0.012s.
> 
> I've also tried using CLOCK_UPTIME_PRECISE instead, but it makes no
> difference whatsoever.
> 
> My questions therefore are:
> 
> 1. Is it intended/expected behavior of the said API?
No.

> 2. Has anyone else bumped into this?
Not that I am aware of.

> 3. I know we are doing some clever optimizations using TSC to speed those
> APIs up, could be it related to that?
No idea.

> 4. If the answer for (3) is yes, then what is the method to disable using
> TSC and use slower but possibly more reliable method?
Set sysctl kern.timecounter.fast_gettime to 0.

> 5. Is there any way/plan to fix this long-term?
Fix what ?

> 6. If the behavior is intended/expected, what is the maximum discrepancy we
> can expect from this effect?
> 
> In general some time ago we've spend quite lot of time switching our app
> from using REALTIME into MONOTIME in the hope to eliminate any wizardry
> needed to deal with realtime possibly jumping back and forth due to NTP,
> leap seconds etc, it's shame that this is not working either. Apart from
> measuring command processing delay, that app does lot of high-volume voip
> call billing, so even such small discrepancy can easily build up into a
> bigger problem, not to mention the fact that every time we deal with
> duration we now need to have some check in place to make sure we don't have
> some negative values popping our of nowhere.
> 
> Any hints, suggestions, pointers are appreciated. I can also give more
> debug information as needed. Thanks!

In fact, your rather long mail does not contain any facts except a statement
that you possibly see clock_gettime(CLOCK_MONOTONIC) going backward.  First
obvious question is whether you did tried to reduce the testcase to some
usable size ?

That said, RDTSC going backward, or perceive of RDTSC value by
processors going provable backward, was one of the concerns when
userspace gettimeofday(2) code was developed.  We have at least three
things which should prevent that:
1. kernel tests TSC for SMP coherency on boot
2. since RDTSC is not serialized instruction, we bracket its execution
   with neccessary CPU-model specific fences to ensure that the counter
   is not read too early.
3. still, despite item 2, at least 1 lowest bit of the read counter is
   dropped to accomodate for jitter.

When testing the userspace gettimeofday(2) code, I found the following
program by Ingo Molnar.  It verifies SMP-coherency for the bare TSC and
for the library routines.  I specifically tested with it on multi-socket
Nehalems at that time, AFAIR.

I am not asking for details about your machine config, system version and
other neccessary information, before either the Ingo' program demostrates
the back-stepping, or you provide tiny test which shows the problem.

/*
 * Copyright (C) 2005, Ingo Molnar
 *
 * time-warp-test.c: check TSC synchronity on x86 CPUs. Also detects
 *                   gettimeofday()-level time warps.
 *
 * Compile with: gcc -Wall -O2 -o time-warp-test time-warp-test.c -lrt
 * $Id: time-warp-test.c,v 1.1 2012/07/25 19:11:05 kostik Exp kostik $
 */
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <assert.h>
#include <ctype.h>
#include <fcntl.h>
#include <signal.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define TEST_TSC 1
#define TEST_TOD 1
#define TEST_CLOCK 1

#if !TEST_TSC && !TEST_TOD && !TEST_CLOCK
# error this setting makes no sense ...
#endif

#if DEBUG
# define Printf(x...) printf(x)
#else
# define Printf(x...) do { } while (0)
#endif

/*
 * Shared locks and variables between the test tasks:
 */
enum {
	SHARED_LOCK		= 0,
	SHARED_TSC		= 2,
	SHARED_TOD		= 4,
	SHARED_CLOCK		= 6,
	SHARED_WORST_TSC	= 8,
	SHARED_WORST_TOD	= 10,
	SHARED_WORST_CLOCK	= 12,
	SHARED_NR_TSC_LOOPS	= 14,
	SHARED_NR_TSC_WARPS	= 16,
	SHARED_NR_TOD_LOOPS	= 18,
	SHARED_NR_TOD_WARPS	= 20,
	SHARED_NR_CLOCK_LOOPS	= 22,
	SHARED_NR_CLOCK_WARPS	= 24,
	SHARED_END		= 26,
};

#define SHARED(x)	(*(shared + SHARED_##x))
#define SHARED_LL(x)	(*(long long *)(shared + SHARED_##x))

#define BUG_ON(c) assert(!(c))

typedef unsigned long long cycles_t;
typedef unsigned long long usecs_t;
typedef unsigned long long u64;

#ifdef __x86_64__
#define DECLARE_ARGS(val, low, high)    unsigned low, high
#define EAX_EDX_VAL(val, low, high)     ((low) | ((u64)(high) << 32))
#define EAX_EDX_ARGS(val, low, high)    "a" (low), "d" (high)
#define EAX_EDX_RET(val, low, high)     "=a" (low), "=d" (high)
#else
#define DECLARE_ARGS(val, low, high)    unsigned long long val
#define EAX_EDX_VAL(val, low, high)     (val)
#define EAX_EDX_ARGS(val, low, high)    "A" (val)
#define EAX_EDX_RET(val, low, high)     "=A" (val)
#endif

static inline unsigned long long __rdtscll(void)
{
	DECLARE_ARGS(val, low, high);

#if 0
	asm volatile(/*"cpuid;" "lfence;*/"rdtsc" : EAX_EDX_RET(val, low, high));
#else
	asm volatile("lfence;rdtsc;"/* shrd $8,%%edx,%%eax"*/ : EAX_EDX_RET(val, low, high));
#endif

	return EAX_EDX_VAL(val, low, high);
}

#define rdtscll(val) do { (val) = __rdtscll(); } while (0)

#define rdtod(val)					\
do {							\
	struct timeval tv;				\
							\
	gettimeofday(&tv, NULL);			\
	(val) = tv.tv_sec * 1000000ULL + tv.tv_usec;	\
} while (0)

#define rdclock(val)					\
do {							\
	struct timespec ts;				\
							\
	clock_gettime(CLOCK_MONOTONIC, &ts);		\
	(val) = ts.tv_sec * 1000000000ULL + ts.tv_nsec;	\
} while (0)

static unsigned long *setup_shared_var(void)
{
	char zerobuff [4096] = { 0, };
	int ret, fd;
	unsigned long *buf;

	fd = creat(".tmp_mmap", 0700);
	BUG_ON(fd == -1);
	close(fd);

	fd = open(".tmp_mmap", O_RDWR|O_CREAT|O_TRUNC);
	BUG_ON(fd == -1);
	ret = write(fd, zerobuff, 4096);
	BUG_ON(ret != 4096);

	buf = (void *)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	BUG_ON(buf == (void *)-1);

	close(fd);
	unlink(".tmp_mmap");

	return buf;
}

static inline void lock(unsigned long *flag)
{
#if 0
	__asm__ __volatile__(
		"1: lock; btsl $0,%0\n"
		"jc 1b\n"
			     : "=g"(*flag) : : "memory");
#else
	__asm__ __volatile__(
               "1: lock; btsl $0,%0\n\t"
               "jnc 3f\n"
               "2: testl $1,%0\n\t"
               "je 1b\n\t"
               "rep ; nop\n\t"
               "jmp 2b\n"
               "3:"
               : "+m"(*flag) : : "memory");
#endif
}

static inline void unlock(unsigned long *flag)
{
#if 0
	__asm__ __volatile__(
		"lock; btrl $0,%0\n"
			     : "=g"(*flag) :: "memory");
	__asm__ __volatile__("rep; nop");
#else
	__asm__ __volatile__("movl $0,%0; rep; nop" : "=g"(*flag) :: "memory");
#endif
}

static void print_status(unsigned long *shared)
{
	const char progress[] = "\\|/-";

	static unsigned long long sum_tsc_loops, sum_tod_loops, sum_clock_loops,
				  sum_tod;
	static unsigned int count1, count2;
	static usecs_t prev_tod;

	usecs_t tod;

	if (!prev_tod)
		rdtod(prev_tod);

	count1++;
	if (count1 < 1000)
		return;
	count1 = 0;

	rdtod(tod);
	if (abs(tod - prev_tod) < 100000ULL)
		return;

	sum_tod += tod - prev_tod;
	sum_tsc_loops += SHARED_LL(NR_TSC_LOOPS);
	sum_tod_loops += SHARED_LL(NR_TOD_LOOPS);
	sum_clock_loops += SHARED_LL(NR_CLOCK_LOOPS);
	SHARED_LL(NR_TSC_LOOPS) = 0;
	SHARED_LL(NR_TOD_LOOPS) = 0;
	SHARED_LL(NR_CLOCK_LOOPS) = 0;

	if (TEST_TSC)
		printf(" | TSC: %.2fus, fail:%ld",
			(double)sum_tod/(double)sum_tsc_loops,
			SHARED(NR_TSC_WARPS));

	if (TEST_TOD)
		printf(" | TOD: %.2fus, fail:%ld",
			(double)sum_tod/(double)sum_tod_loops,
			SHARED(NR_TOD_WARPS));

	if (TEST_CLOCK)
		printf(" | CLK: %.2fus, fail:%ld",
			(double)sum_tod/(double)sum_clock_loops,
			SHARED(NR_CLOCK_WARPS));

	prev_tod = tod;
	count2++;
	printf(" %c\r", progress[count2 & 3]);
	fflush(stdout);
}

static inline void test_TSC(unsigned long *shared)
{
#if TEST_TSC
	cycles_t t0, t1;
	long long delta;

	lock(&SHARED(LOCK));
	rdtscll(t1);
	t0 = SHARED_LL(TSC);
	SHARED_LL(TSC) = t1;
	SHARED_LL(NR_TSC_LOOPS)++;
	unlock(&SHARED(LOCK));

	delta = t1-t0;
	if (delta < 0) {
		lock(&SHARED(LOCK));
		SHARED(NR_TSC_WARPS)++;
		if (delta < SHARED_LL(WORST_TSC)) {
			SHARED_LL(WORST_TSC) = delta;
			fprintf(stderr, "\rnew TSC-warp maximum: %9Ld cycles, %016Lx -> %016Lx\n",
				delta, t0, t1);
		}
		unlock(&SHARED(LOCK));
	}
	if (!((unsigned long)t0 & 31))
		asm volatile ("rep; nop");
#endif
}

static inline void test_TOD(unsigned long *shared)
{
#if TEST_TOD
	usecs_t T0, T1;
	long long delta;

	lock(&SHARED(LOCK));
	rdtod(T1);
	T0 = SHARED_LL(TOD);
	SHARED_LL(TOD) = T1;
	SHARED_LL(NR_TOD_LOOPS)++;
	unlock(&SHARED(LOCK));

	delta = T1-T0;
	if (delta < 0) {
		lock(&SHARED(LOCK));
		SHARED(NR_TOD_WARPS)++;
		if (delta < SHARED_LL(WORST_TOD)) {
			SHARED_LL(WORST_TOD) = delta;
			fprintf(stderr, "\rnew TOD-warp maximum: %9Ld usecs,  %016Lx -> %016Lx\n",
				delta, T0, T1);
		}
		unlock(&SHARED(LOCK));
	}
#endif
}

static inline void test_CLOCK(unsigned long *shared)
{
#if TEST_CLOCK
	usecs_t T0, T1;
	long long delta;

	lock(&SHARED(LOCK));
	rdclock(T1);
	T0 = SHARED_LL(CLOCK);
	SHARED_LL(CLOCK) = T1;
	SHARED_LL(NR_CLOCK_LOOPS)++;
	unlock(&SHARED(LOCK));

	delta = T1-T0;
	if (delta < 0) {
		lock(&SHARED(LOCK));
		SHARED(NR_CLOCK_WARPS)++;
		if (delta < SHARED_LL(WORST_CLOCK)) {
			SHARED_LL(WORST_CLOCK) = delta;
			fprintf(stderr, "\rnew CLOCK-warp maximum: %9Ld nsecs,  %016Lx -> %016Lx\n",
				delta, T0, T1);
		}
		unlock(&SHARED(LOCK));
	}
#endif
}

int main(int argc, char **argv)
{
	int i, parent, me;
	unsigned long *shared;
	unsigned long cpus, tasks;

	cpus = sysconf(_SC_NPROCESSORS_CONF);

	if (argc > 2) {
usage:
		fprintf(stderr,
			"usage: tsc-sync-test <threads>\n");
		exit(-1);
	}
	if (argc == 2) {
		tasks = atol(argv[1]);
		if (!tasks)
			goto usage;
	} else
		tasks = cpus;

	printf("%ld CPUs, running %ld parallel test-tasks.\n", cpus, tasks);
	printf("checking for time-warps via:\n"
#if TEST_TSC
	"- read time stamp counter (RDTSC) instruction (cycle resolution)\n"
#endif
#if TEST_TOD
	"- gettimeofday (TOD) syscall (usec resolution)\n"
#endif
#if TEST_CLOCK
	"- clock_gettime(CLOCK_MONOTONIC) syscall (nsec resolution)\n"
#endif
		"\n"
		);
	shared = setup_shared_var();

	parent = getpid();

	for (i = 1; i < tasks; i++) {
		if (!fork())
			break;
	}
	me = getpid();

	while (1) {
		int i;

		for (i = 0; i < 10; i++)
			test_TSC(shared);
		for (i = 0; i < 10; i++)
			test_TOD(shared);
		for (i = 0; i < 10; i++)
			test_CLOCK(shared);

		if (me == parent)
			print_status(shared);
	}

	return 0;
}
Received on Fri Jun 03 2016 - 03:06:43 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:05 UTC