(unknown charset) Re: 80386 support in -current

From: (unknown charset) Bruce Evans <bde_at_zeta.org.au> Date: Sun, 25 Jan 2004 17:28:31 +1100 (EST) · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:37:39 UTC

On Sat, 24 Jan 2004, Dag-Erling [iso-8859-1] Smørgrav wrote:

> Peter Jeremy <PeterJeremy_at_optushome.com.au> writes:
> > Does anyone know why FreeBSD 5.x would not run on a 386SX/387SX
> > combination?  I realise the performance would be very poor but I
> > don't see any reason why it wouldn't work at all.
>
> It should run fine (though quite slowly) on a 386 with a 387 FPU, but
> you need to roll your own release.  The reason why we don't support
> the 386 out of the box is that a kernel that will run on a 386 will be
> very inefficient on newer CPUs (the synchronization code relies on a
> particular instruction which was introduced with the 486 and must be
> emulated on the 386)

This is the specious reason.  The synchronization code relies on a
particular instruction that might be very inefficient to emulate on a
386, but emulation is not done; the instruction is just replaced by
an instruction or sequence of instructions that is slower in some cases
and faster in others (mostly slower, but not especially so, except
probably on P4's).  The actual reason was mostly that the 386 version
doesn't work in the SMP case or ([or]?) on "P6 [sic] or higher" and
making it work well would be too hard.

SMP is now in GENERIC, so support for it is more important than when
I386_CPU was removed from GENERIC.

The ifdef tangle for this stuff combined with lack of testing seems to
have broken the 386 support in practice.  Libraries are now chummy with
the kernel implementation of atomic operations, but not chummy enough to
know when it actually works in userland.  libthr uses the kernel
atomic_cmpset_*(), but this never works on plain i386's in userland
(the I386_CPU version doesn't work unless the application gains i/o
privilege since it uses cli/sti, and the !I386_CPU version doesn't
work because it uses cmpxchg).

Some benchmarks for atomic_cmpset_int() run in userland:

Athlon XP1600          NO_MPLOCKED:             2.02 cycles/call
Athlon XP1600:                                 18.07 cycles/call
Athlon XP1600 I386_CPU NO_MPLOCKED:            19.06 cycles/call
Athlon XP1600 I386_CPU:                        19.06 cycles/call
Celeron 400            NO_MPLOCKED:             5.03 cycles/call
Celeron 400:                                   25.36 cycles/call
Celeron 400   I386_CPU NO_MPLOCKED:            35.27 cycles/call
Celeron 400   I386_CPU:                        35.32 cycles/call

%%%
#include <sys/types.h>

/*
 * This is userland benchmark, so lock prefixes are normally forced (for
 * the !I386_CPU version only).  Compile it with -DNO_MPLOCKED to cancel
 * this.
 */
#ifdef NO_MPLOCKED
#define	_KERNEL
#endif
#include <machine/atomic.h>
#undef _KERNEL

#include <machine/cpufunc.h>

#include <err.h>
#include <fcntl.h>
#include <stdio.h>

#define	NITER	100000000

int
main(void)
{
	uint64_t tsc0, tsc1, tsc2;
	volatile u_int dst;
	int i;

#ifdef I386_CPU
	if (open("/dev/io", O_RDONLY) < 0)
		err(1, "open");
#endif
	dst = 0;
	tsc0 = rdtsc();
	for (i = 0; i < NITER; i++) {
#if 0
		atomic_store_rel_int(&dst, 0);
#else
		dst = 0;
#endif
	}
	tsc1 = rdtsc();
	for (i = 0; i < NITER; i++) {
		atomic_cmpset_int(&dst, 0, 1);
#if 0
		/*
		 * XXX mtx_unlock*() would use this, but it expands to
		 * xchgl in the !I386_CPU case so it gives a locked
		 * instruction even in the !SMP case.  The locking
		 * more than doubles the runtime for this benchmark.
		 * Don't do it, since we're benchmarking
		 * atomic_cmpset_int(), not atomic_store_rel_int().
		 */
		atomic_store_rel_int(&dst, 0);
#else
		dst = 0;
#endif
	}
	tsc2 = rdtsc();
	printf("%.2f cycles/call\n",
	    ((tsc2 - tsc1) - (tsc1 - tsc0)) / (double)NITER);
	return (0);
}
%%%

Notes:
- the atomic_cmpset_int() tests the usual case of an uncontested lock.
- cli/sti takes about the same time as a lock prefix on the benchmarked
  CPUs.  The lock is always forced in userland, so the I386_CPU version
  gives only a tiny pessimization for time in userland on these CPUs.
  It mainly pessimizes for use (it doesn't actually work without i/o
  privilege even in the !SMP case).
- the kernel sometimes uses xchg instead of "[lock] cmpxchg.  The lock
  prefix for xchg is implicit.  So the !SMP case uses unnecessary lock
  prefixes.  This pessimizes mtx_unlock*() by about the same amount as
  not supporting I386_CPU optimizes mtx_lock*() (on the benchmarked
  CPUs).  Also, the cli/sti in the I386_CPU version of atomic_cmpset*()
  are just a waste of time for use in mtx_lock_spin(), since
  mtx_lock_spin() has already done the cli.  So the inefficiency of
  I386_VERSION is just a misimplementation detail in many cases.
- I believe cli and/or sti takes 300 cycles on a P4, so the I386_CPU
  version is correctly described as "very inefficient" for P4's.

Bruce