Re: system call performance 4.x vs 5.x [and UP vs MP]

From: Robert Watson <rwatson_at_freebsd.org>
Date: Wed, 28 Jan 2004 13:17:46 -0500 (EST)
On Wed, 28 Jan 2004, Don Bowman wrote:

> This is a very simplistic benchmark, so don't get too hung up on the
> accuracy.
> 
> If you run this on a given machine on 4.x vs 5.x, you will notice a
> dramatic difference [yes, invariants, et al are disabled]. 
> 
> For example, on a 2.0GHz P4-Xeon, HTT enabled, MP kernel, I can
> do ~1M socket() calls/s on 4.7, but only ~250K/s on 5.2.

Make sure you are running with sys/proc.h:1.366; this removes two
lock/unlock operations on the process lock from the system call path.  I've
been doing some benchmarks of system call performance between 4.9 and
5.2-current on a dual-processor PIII box here, and by making this change I
saw about a 20% reduction in the cost of a system call.  You're running
without INVARIANTS, which means you're already skipping the other big
source of gratuitous locking in the system call path: the assertion checks
in the trap code related to signalling.

Also, I tend to use clock_gettime() for time measurements; among other
things, it lets you ask for the clock resolution, and it offers
finer-grained timestamps:

#include <sys/time.h>	/* timespecsub(); define it locally if not visible */
#include <assert.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define	NUM	200000		/* iteration count; see below */

int
main(void)
{
	struct timespec ts_start, ts_end, ts_res;
#if 0
	struct timespec ts_dummy;
#endif
	int i;

	assert(clock_getres(CLOCK_REALTIME, &ts_res) == 0);
	printf("Clock resolution: %ld.%09ld\n", (long)ts_res.tv_sec,
	    ts_res.tv_nsec);

	assert(clock_gettime(CLOCK_REALTIME, &ts_start) == 0);
	for (i = 0; i < NUM; i++)
#if 1
		/*
		 * Should require no locks; all thread-local data with
		 * an MPSAFE system call.
		 */
		getuid();
#endif
#if 0
		/*
		 * This needs to grab the process lock to follow the
		 * parent process pointer, and should cost more.
		 */
		getppid();
#endif
#if 0
		clock_gettime(CLOCK_REALTIME, &ts_dummy);
#endif

	assert(clock_gettime(CLOCK_REALTIME, &ts_end) == 0);
	timespecsub(&ts_end, &ts_start);

	printf("%ld.%09ld for %d iterations\n", (long)ts_end.tv_sec,
	    ts_end.tv_nsec, NUM);
	printf("%lld ns per iteration\n",
	    ((long long)ts_end.tv_sec * 1000000000LL + ts_end.tv_nsec) / NUM);
	return (0);
}

I usually do about 100,000 or 200,000 iterations.  Too many, and you lose
the CPU (one mitigation is sketched below).  Also, depending on hardware,
I've seen performance with SCHED_ULE plus the recent IPI changes either
improve relative to SCHED_4BSD or get worse, so you might want to try both.
If you're using ULE, try switching back to 4BSD and let us know what
changes.
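
On the point about losing the CPU: one way to reduce the chance of being
preempted partway through a run is to put the process at real-time priority
before entering the timed loop.  A minimal sketch using rtprio(2) follows
(requires root; go_realtime() is just an illustrative helper name, not part
of the benchmark above):

#include <sys/types.h>
#include <sys/rtprio.h>
#include <err.h>

/* Move the current process to real-time priority; requires root. */
static void
go_realtime(void)
{
	struct rtprio rtp;

	rtp.type = RTP_PRIO_REALTIME;
	rtp.prio = 0;			/* 0 is the highest rtprio level */
	if (rtprio(RTP_SET, 0, &rtp) == -1)
		err(1, "rtprio");
}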

For the socket code, you really want the socket locking changes found in
the netperf_socket branch.  Once we get 5.2.1 out the door, the next
priority will be to begin merging that, which pushes Giant off the
majority of the network stack.  In the meantime, you might want to try
measuring with pipe() instead, since the pipe code is Giant-free.
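
For instance, here is a minimal sketch of such a measurement, reusing the
clock_gettime() approach above to time a one-byte write()/read() round trip
through a pipe (this is just a sketch, not the program that produced any of
the numbers in this thread):

#include <sys/types.h>
#include <assert.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define	NUM	200000

int
main(void)
{
	struct timespec ts_start, ts_end;
	long long ns;
	char c = 0;
	int fd[2], i;

	assert(pipe(fd) == 0);
	assert(clock_gettime(CLOCK_REALTIME, &ts_start) == 0);
	for (i = 0; i < NUM; i++) {
		/*
		 * One byte out and back; the write never blocks because
		 * the pipe buffer easily absorbs a single byte.
		 */
		assert(write(fd[1], &c, 1) == 1);
		assert(read(fd[0], &c, 1) == 1);
	}
	assert(clock_gettime(CLOCK_REALTIME, &ts_end) == 0);

	ns = (long long)(ts_end.tv_sec - ts_start.tv_sec) * 1000000000LL +
	    (ts_end.tv_nsec - ts_start.tv_nsec);
	printf("%lld ns total, %lld ns per write/read pair\n", ns, ns / NUM);
	return (0);
}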

There are a number of performance optimizations in the works for things
like interrupt scheduling latency, the cost of kernel context switches,
etc., and hopefully we'll see those patches posted soon.  I've benchmarked
some of the early versions and see pretty dramatic improvements.  The 5.x
branch is built on a lot of long-term investment in infrastructure, and
many of the local optimizations have been deferred in order to get the
architecture right.  The result is hopefully an architecture that offers
much more scalability and performance, but in the short term it shows
poorly in micro-benchmarks.  We've about reached the point where we're
ready to start on the local optimizations, and I think you'll see a pretty
rapid payoff.  For example, with the various context switch/interrupt
latency/... changes in the pipeline, I measure a halving of end-to-end
packet delivery latency.  That doesn't mean we don't have further to go,
of course, but it does suggest there's a lot of hope.

BTW, is your table below "4.7 UP vs 5.x MP"?  The title left me unsure.
Generally, the results I see suggest 5.x UP is currently slower than 4.x UP
(something we should make back up over the next three or four months), but
that 5.x MP is quite a bit faster than 4.x MP in many interesting cases
(e.g., network throughput, builds, etc.).  Especially with the recent IPI
and scheduling changes, I see substantially lower latency in scheduling
various kernel threads on 5.x MP compared to 4.x MP, which means a lot
more work gets done.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert_at_fledge.watson.org      Senior Research Scientist, McAfee Research


> 
>      syscall      4.7        5.2
>        write  1015036     169800 
>       socket  1078994     223253
>       select   430564     155077
> gettimeofday   252762     183620
> 
> As a side note, any idea why gettimeofday is so much more
> expensive than socket?
> 
> Any suggestion on why such a difference between 4.x and 5.x?
> code is compiled the same on each, 'gcc -O2', no threading
> options chosen.
> 
> For interest, you can try the same program on 4.x in UP vs MP,
> and the difference is very dramatic too.
> 
> #include <sys/types.h>
> #include <sys/uio.h>
> #include <unistd.h>
> #include <sys/socket.h>
> #include <sys/time.h>
> #include <stdio.h>
> 
> #define M(n) measure(#n, (void (*)(int, ...))(n))
> 
> static void
> measure(char *name, void (*fp)(int,...))
> {
>     double speed;
>     int j;
>     unsigned long long i = 0;
>     unsigned long long us;
>     struct timeval tp,tp1;
>     gettimeofday(&tp, 0);
>     tp1 = tp;
> 
>     while (tp1.tv_sec - tp.tv_sec < 10)
>     {
>         for (j = 0; j < 1000000; j++)
>         {
>             fp(0,0,0,0);
>             i++;
>         }
>         gettimeofday(&tp1, 0);
>     }
>     us = ((tp1.tv_sec - tp.tv_sec) * 1000000) + (tp1.tv_usec - tp.tv_usec);
>     speed = (1000000.0 * i) / us;
>     printf("{%s: %llu %llu %6.2f}\n", name, i,us, speed);
> }
> 
> static void
> doGettimeofday()
> {
>     double speed;
>     unsigned long long i = 0;
>     unsigned long long us;
>     struct timeval tp,tp1;
>     gettimeofday(&tp, 0);
>     tp1 = tp;
> 
>     while (tp1.tv_sec - tp.tv_sec < 10)
>     {
>         gettimeofday(&tp1, 0);
>         i++;
>     }
>     us = ((tp1.tv_sec - tp.tv_sec) * 1000000) + (tp1.tv_usec - tp.tv_usec);
>     speed = (1000000.0 * i) / us;
>     printf("{gettimeofday: %llu %llu %6.2f}\n", i,us, speed);
> }
> 
> int
> main(int argc, char **argv)
> {
>     M(write);
>     M(socket);
>     M(select);
>     doGettimeofday();
>     return 0;
> }
> 