Re: fork speed vs /bin/sh

From: Matthew Dillon <dillon_at_apollo.backplane.com>
Date: Thu, 27 Nov 2003 11:48:47 -0800 (PST)

:What this shows is that vfork() is 3 times faster than fork() on static
:binaries, and 9 times faster on dynamic binaries.  If people are
:worried about a 40% slowdown, then perhaps they'd like to investigate
:a speedup that works no matter whether it's static or dynamic?  There is
:a reason that popen(3) uses vfork().  /bin/sh should too, regardless of
:whether it's dynamic or static.  csh/tcsh already uses vfork() for the
:same reason.
:
:NetBSD have already taken advantage of this speedup and their /bin/sh uses
:vfork().  Some enterprising individual who cares about /bin/sh speed should
:check that out.  Start looking near #ifdef DO_SHAREDVFORK.

    That isn't really a fair comparison because your vfork is hitting a
    degenerate case and isn't actually doing anything significant.  You
    really need to exec() something.  I've included a program below 
    that [v]fork/exec's "./sh -c exit 0" 5000 times.

	Dell2550, 2xCPU (MP build), DFly

    0.000u  4.095s 0:02.53 161.6%   154+107k 0+0io 0pf+0w VFORK/EXEC STATIC SH
    0.000u  6.681s 0:04.04 165.3%   94+97k   0+0io 0pf+0w FORK/EXEC STATIC SH
    0.500u 16.844s 0:16.34 106.1%   53+84k   0+0io 0pf+0w VFORK/EXEC DYNAMIC SH
    0.093u 18.303s 0:23.86 77.0%    42+79k   0+0io 0pf+0w FORK/EXEC DYNAMIC SH

	Athlon64, 2xCPU (UP), DFly

    0.078u 0.687s 0:00.74 101.3%    399+226k 0+0io 0pf+0w VFORK/EXEC STATIC SH
    0.117u 0.968s 0:01.07 100.0%    273+208k 0+0io 0pf+0w FORK/EXEC STATIC SH
    2.218u 2.484s 0:04.71 99.5%     121+180k 0+0io 1pf+0w VFORK/EXEC DYNAMIC SH
    2.281u 2.773s 0:04.98 101.4%    113+179k 0+0io 0pf+0w FORK/EXEC DYNAMIC SH

    1.304u 2.289s 0:03.60 99.4%     121+180k 0+0io 0pf+0w VFORK/EXEC DYNAMIC SH WITH PREBINDING
    1.296u 2.648s 0:03.90 100.7%    112+180k 0+0io 1pf+0w FORK/EXEC DYNAMIC SH WITH PREBINDING

    These results were rather unexpected, actually.  I'm not sure why the
    numbers on the DELL box are so bad with a dynamic 'sh' but I suspect that
    the dynamic linking is blowing out the L1 cache.

    In any case, taking the Athlon64 system, the difference between static
    and dynamic is around 4 seconds while the difference between vfork and
    fork is only around 0.25 seconds, so while moving to vfork() helps, it
    doesn't help all that much.

    Unless you happen to be hitting a boundary condition on the L1 cache,
    that is.  If that is the case on the Dell box, as I presume (it has only
    a 16K L1 cache whereas the AMD64 has a 64K L1 cache), then the
    difference is around 14 seconds between vfork static and vfork dynamic
    versus an additional 8 seconds going from vfork to fork.  Vfork would
    probably be a significant improvement on the Dell box.

    Prebinding gives around a 20% improvement in elapsed time for the
    dynamic 'sh' on the Athlon64, but on the Dell2550 prebinding actually
    made things go slower (not shown above), from 23.8 seconds to 26
    seconds.  I think there is an edge case due to prebinding having a
    greater L1 cache impact.  For larger, more complex programs prebinding
    shows definite, if small, improvements.

						-Matt

/*
 * CD into the directory containing the ./sh executable before running
 */
#include <sys/types.h>
#include <sys/wait.h>		/* waitpid() */
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int i;
    pid_t pid;

    for (i = 0; i < 5000; ++i) {
        if ((pid = vfork()) == 0) {	/* <<<<< CHANGE THIS FORK/VFORK */
            /* child: execl() only returns if the exec failed */
            execl("./sh", "./sh", "-c", "exit", "0", (char *)NULL);
            write(2, "problem\n", 8);
            _exit(1);
        }
        if (pid > 0)
            waitpid(pid, NULL, 0);	/* reap the child before looping */
    }
    return(0);
}