Re: 80386 support in -current

From: Maxim Sobolev <sobomax@portaone.com>
Date: Mon, 26 Jan 2004 12:44:48 +0200
Out of curiosity I ran the benchmark on my good ol' P4 2GHz notebook, 
and was very surprised that it is much slower than even a PIII-400 in most cases:

(default)                       212.22 cycles/call
-DNO_MPLOCKED -DI386_CPU        117.32 cycles/call
-DI386_CPU                      117.19 cycles/call
-DNO_MPLOCKED                   31.39 cycles/call

So xchg is indeed a *lot* slower than cmpxchgl on the P4 in the non-SMP 
case. I tried rewriting atomic_readandclear_int() using cmpxchg; in the 
non-SMP case it became more than 10 times faster than the current xchg 
version (15 cycles vs. 200 cycles). However, when I hacked all 
functions in atomic.h to use cmpxchg instead of xchg and ran a make 
world benchmark on kernels with and without this change, I found hardly 
any improvement in performance, despite the expected speedup of the 
mutex unlock operation.

-Maxim

===================================================================
RCS file: /sys/i386/include/atomic.h,v
retrieving revision 1.1
diff -du -r1.1 /sys/i386/include/atomic.h
--- /sys/i386/include/atomic.h  2004/01/25 22:25:56     1.1
+++ /sys/i386/include/atomic.h  2004/01/26 10:31:14
@@ -217,10 +217,10 @@
  static __inline void                                   \
  atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)\
  {                                                      \
-       __asm __volatile(SOP                            \
+       __asm __volatile(__XSTRING(MPLOCKED) SOP        \
         : "+m" (*p),                    /* 0 */         \
           "+r" (v)                      /* 1 */         \
-       : : "memory");                                  \
+       : : "memory", "eax");                           \
  }                                                      \
  struct __hack

@@ -258,10 +258,10 @@
  ATOMIC_ASM(add,             long,  "addl %1,%0",  "ir",  v);
  ATOMIC_ASM(subtract, long,  "subl %1,%0",  "ir",  v);

-ATOMIC_STORE_LOAD(char,        "cmpxchgb %b0,%1", "xchgb %b1,%0");
-ATOMIC_STORE_LOAD(short,"cmpxchgw %w0,%1", "xchgw %w1,%0");
-ATOMIC_STORE_LOAD(int, "cmpxchgl %0,%1",  "xchgl %1,%0");
-ATOMIC_STORE_LOAD(long,        "cmpxchgl %0,%1",  "xchgl %1,%0");
+ATOMIC_STORE_LOAD(char,       "cmpxchgb %b0,%1", "movb %b1,%%al; 1: cmpxchgb %b1,%0; jnz 1b");
+ATOMIC_STORE_LOAD(short,      "cmpxchgw %w0,%1", "movw %w1,%%ax; 1: cmpxchgw %w1,%0; jnz 1b");
+ATOMIC_STORE_LOAD(int,        "cmpxchgl %0,%1",  "movl %1,%%eax; 1: cmpxchgl %1,%0;  jnz 1b");
+ATOMIC_STORE_LOAD(long,       "cmpxchgl %0,%1",  "movl %1,%%eax; 1: cmpxchgl %1,%0;  jnz 1b");

  #undef ATOMIC_ASM
  #undef ATOMIC_STORE_LOAD
@@ -418,11 +418,15 @@
         u_int result;

         __asm __volatile (
-       "       xorl    %0,%0 ;         "
-       "       xchgl   %1,%0 ;         "
+       "       " __XSTRING(MPLOCKED) " "
+       "1:                             "
+       "       cmpxchgl %3,%2 ;        "
+       "       jnz 1b ;                "
         "# atomic_readandclear_int"
-       : "=&r" (result)                /* 0 (result) */
-       : "m" (*addr));                 /* 1 (addr) */
+       : "=a" (result)                 /* 0 (result) */
+       : "0" (*addr),                  /* 1 */
+         "m" (*addr),                  /* 2 (addr) */
+         "r" (0));                     /* 3 */

         return (result);
  }
@@ -433,11 +437,15 @@
         u_long result;

         __asm __volatile (
-       "       xorl    %0,%0 ;         "
-       "       xchgl   %1,%0 ;         "
+       "       " __XSTRING(MPLOCKED) " "
+       "1:                             "
+       "       cmpxchgl %3,%2 ;        "
+       "       jnz 1b ;                "
         "# atomic_readandclear_int"
-       : "=&r" (result)                /* 0 (result) */
-       : "m" (*addr));                 /* 1 (addr) */
+       : "=a" (result)                 /* 0 (result) */
+       : "0" (*addr),                  /* 1 */
+         "m" (*addr),                  /* 2 (addr) */
+         "r" (0));                     /* 3 */

         return (result);
  }


Bruce Evans wrote:

> On Sat, 24 Jan 2004, Dag-Erling Smørgrav wrote:
> 
> 
>>Peter Jeremy <PeterJeremy@optushome.com.au> writes:
>>
>>>Does anyone know why FreeBSD 5.x would not run on a 386SX/387SX
>>>combination?  I realise the performance would be very poor but I
>>>don't see any reason why it wouldn't work at all.
>>
>>It should run fine (though quite slowly) on a 386 with a 387 FPU, but
>>you need to roll your own release.  The reason why we don't support
>>the 386 out of the box is that a kernel that will run on a 386 will be
>>very inefficient on newer CPUs (the synchronization code relies on a
>>particular instruction which was introduced with the 486 and must be
>>emulated on the 386).
> 
> 
> This is the specious reason.  The synchronization code relies on a
> particular instruction that might be very inefficient to emulate on a
> 386, but emulation is not done; the instruction is just replaced by
> an instruction or sequence of instructions that is slower in some cases
> and faster in others (mostly slower, but not especially so, except
> probably on P4's).  The actual reason was mostly that the 386 version
> doesn't work in the SMP case or ([or]?) on "P6 [sic] or higher" and
> making it work well would be too hard.
> 
> SMP is now in GENERIC, so support for it is more important than when
> I386_CPU was removed from GENERIC.
> 
> The ifdef tangle for this stuff combined with lack of testing seems to
> have broken the 386 support in practice.  Libraries are now chummy with
> the kernel implementation of atomic operations, but not chummy enough to
> know when it actually works in userland.  libthr uses the kernel
> atomic_cmpset_*(), but this never works on plain i386's in userland
> (the I386_CPU version doesn't work unless the application gains i/o
> privilege since it uses cli/sti, and the !I386_CPU version doesn't
> work because it uses cmpxchg).
> 
> Some benchmarks for atomic_cmpset_int() run in userland:
> 
> Athlon XP1600          NO_MPLOCKED:             2.02 cycles/call
> Athlon XP1600:                                 18.07 cycles/call
> Athlon XP1600 I386_CPU NO_MPLOCKED:            19.06 cycles/call
> Athlon XP1600 I386_CPU:                        19.06 cycles/call
> Celeron 400            NO_MPLOCKED:             5.03 cycles/call
> Celeron 400:                                   25.36 cycles/call
> Celeron 400   I386_CPU NO_MPLOCKED:            35.27 cycles/call
> Celeron 400   I386_CPU:                        35.32 cycles/call
> 
> %%%
> #include <sys/types.h>
> 
> /*
>  * This is userland benchmark, so lock prefixes are normally forced (for
>  * the !I386_CPU version only).  Compile it with -DNO_MPLOCKED to cancel
>  * this.
>  */
> #ifdef NO_MPLOCKED
> #define	_KERNEL
> #endif
> #include <machine/atomic.h>
> #undef _KERNEL
> 
> #include <machine/cpufunc.h>
> 
> #include <err.h>
> #include <fcntl.h>
> #include <stdio.h>
> 
> #define	NITER	100000000
> 
> int
> main(void)
> {
> 	uint64_t tsc0, tsc1, tsc2;
> 	volatile u_int dst;
> 	int i;
> 
> #ifdef I386_CPU
> 	if (open("/dev/io", O_RDONLY) < 0)
> 		err(1, "open");
> #endif
> 	dst = 0;
> 	tsc0 = rdtsc();
> 	for (i = 0; i < NITER; i++) {
> #if 0
> 		atomic_store_rel_int(&dst, 0);
> #else
> 		dst = 0;
> #endif
> 	}
> 	tsc1 = rdtsc();
> 	for (i = 0; i < NITER; i++) {
> 		atomic_cmpset_int(&dst, 0, 1);
> #if 0
> 		/*
> 		 * XXX mtx_unlock*() would use this, but it expands to
> 		 * xchgl in the !I386_CPU case so it gives a locked
> 		 * instruction even in the !SMP case.  The locking
> 		 * more than doubles the runtime for this benchmark.
> 		 * Don't do it, since we're benchmarking
> 		 * atomic_cmpset_int(), not atomic_store_rel_int().
> 		 */
> 		atomic_store_rel_int(&dst, 0);
> #else
> 		dst = 0;
> #endif
> 	}
> 	tsc2 = rdtsc();
> 	printf("%.2f cycles/call\n",
> 	    ((tsc2 - tsc1) - (tsc1 - tsc0)) / (double)NITER);
> 	return (0);
> }
> %%%
> 
> Notes:
> - the atomic_cmpset_int() tests the usual case of an uncontested lock.
> - cli/sti takes about the same time as a lock prefix on the benchmarked
>   CPUs.  The lock is always forced in userland, so the I386_CPU version
>   gives only a tiny pessimization for time in userland on these CPUs.
>   It mainly pessimizes for use (it doesn't actually work without i/o
>   privilege even in the !SMP case).
> - the kernel sometimes uses xchg instead of "[lock] cmpxchg".  The lock
>   prefix for xchg is implicit.  So the !SMP case uses unnecessary lock
>   prefixes.  This pessimizes mtx_unlock*() by about the same amount as
>   not supporting I386_CPU optimizes mtx_lock*() (on the benchmarked
>   CPUs).  Also, the cli/sti in the I386_CPU version of atomic_cmpset*()
>   are just a waste of time for use in mtx_lock_spin(), since
>   mtx_lock_spin() has already done the cli.  So the inefficiency of
>   the I386_CPU version is just a misimplementation detail in many cases.
> - I believe cli and/or sti takes 300 cycles on a P4, so the I386_CPU
>   version is correctly described as "very inefficient" for P4's.
> 
> Bruce
> _______________________________________________
> freebsd-current@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"
Received on Mon Jan 26 2004 - 01:45:57 UTC
