Benchmarking mpsafevfs with parallel tarball extraction

From: Kris Kennaway <kris_at_obsecurity.org> Date: Fri, 6 May 2005 11:35:29 -0700

Here are my benchmark numbers for parallel tarball extraction
with and without mpsafevfs on a 12-processor E4500 running up-to-date
FreeBSD 6.0.  The kernel was built without INVARIANTS and other
debugging options, without ADAPTIVE_GIANT (which causes about a 200%
performance penalty on system time in my testing, and has marginal
impact on real or user time) and with the 4BSD scheduler (ULE causes
spontaneous reboots on this machine).  The E4500 uses the esp SCSI
controller, which runs without Giant.

The test is this:

#!/bin/sh

for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
        mkdir "$i"
        tar xfC /var/portbuild/sparc64/5/tarballs/bindist.tar "$i" &
done
# wait for all 12 background extractions so the whole run can be timed
wait

on a 2000 MB preallocated, malloc-backed md disk (the machine has 5GB
RAM).  Before each test I umount, newfs with default options (i.e. no
-U; soft updates kill performance on md by a factor of several) and
mount.  The tarball is:

# ls -l /var/portbuild/sparc64/5/tarballs/bindist.tar
-rw-r--r--  1 kris  kris  133231104 Apr 28 12:18 /var/portbuild/sparc64/5/tarballs/bindist.tar
# tar tvf /var/portbuild/sparc64/5/tarballs/bindist.tar | wc -l
    5664

(it's a copy of a sparc64 5.4-STABLE world I use to populate package
build chroots).
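
For anyone wanting to repeat the procedure, the umount/newfs/mount cycle around each timed run could be scripted roughly as follows. This is only a sketch under stated assumptions: the md unit number, the /mnt mount point, the script path and the log file are all made up for illustration, and the post does not say how the md device was actually created. The mdconfig invocation is one plausible way to get a preallocated malloc-backed md. By default the sketch only prints the commands it would run.

```sh
#!/bin/sh
# Sketch of the per-run cycle described above. Everything named here is an
# assumption for illustration: md unit 0, mount point /mnt, the script path
# and the log file. With DRY_RUN=1 (the default) commands are only printed.
DRY_RUN=${DRY_RUN:-1}
MD_UNIT=${MD_UNIT:-0}
MNT=${MNT:-/mnt}
SCRIPT=${SCRIPT:-/root/extract.sh}   # hypothetical: the extraction loop, ending in `wait`
LOG=${LOG:-/tmp/times.txt}           # hypothetical: where time(1) appends its results

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# One-time setup: a 2000 MB malloc-backed md with its storage reserved up front.
create_md() {
    run mdconfig -a -t malloc -s 2000m -o reserve -u "$MD_UNIT"
}

# Before each run: unmount, newfs with default options (no -U), mount.
prepare_fs() {
    run umount "$MNT" || true        # nothing is mounted before the first run
    run newfs "/dev/md$MD_UNIT"
    run mount "/dev/md$MD_UNIT" "$MNT"
}

# One timed run; time(1) appends its "real user sys" line to $LOG.
timed_run() {
    prepare_fs
    run /usr/bin/time -a -o "$LOG" sh -c "cd '$MNT' && sh '$SCRIPT'"
}

create_md
for n in 1 2 3; do
    timed_run
done
```

Setting DRY_RUN=0 (as root, on a machine where md unit 0 and /mnt are free) would execute the commands instead of printing them.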

A single extraction (with tarball cached) with mpsafevfs=1 takes:

       14.85 real         1.31 user        10.43 sys
       14.90 real         1.31 user        10.40 sys
       15.03 real         1.26 user        10.55 sys
       14.49 real         1.35 user        10.47 sys
       14.50 real         1.36 user        10.42 sys
       14.50 real         1.28 user        10.52 sys
       14.52 real         1.33 user        10.48 sys
       14.44 real         1.38 user        10.36 sys
       14.54 real         1.37 user        10.39 sys
       14.63 real         1.29 user        10.56 sys

mean=14.64 seconds real time

without mpsafevfs:

       14.72 real         1.39 user        10.45 sys
       14.70 real         1.40 user        10.47 sys
       14.99 real         1.41 user        10.54 sys
       15.13 real         1.48 user        10.45 sys
       15.18 real         1.40 user        10.50 sys
       14.87 real         1.64 user        10.38 sys
       14.66 real         1.42 user        10.37 sys
       14.69 real         1.49 user        10.30 sys
       14.87 real         1.45 user        10.60 sys
       14.75 real         1.47 user        10.43 sys

mean=14.86 seconds real time

x !mpsafevfs
+ mpsafevfs
+--------------------------------------------------------------------------+
|       +                                   x                              |
| +    ++ + +       +  x  xx x  x         + x  +       x   +         x    x|
||__________M________A__|_________________|_M________________|             |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10         14.66         15.18         14.87        14.856    0.18810163
+  10         14.44         15.03         14.54         14.64     0.2081666
Difference at 95.0% confidence
        -0.216 +/- 0.186404
        -1.45396% +/- 1.25474%
        (Student's t, pooled s = 0.198388)

So mpsafevfs has a slight measurable benefit even for non-concurrent
extraction.
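
The "Difference at 95.0% confidence" figures can be cross-checked by hand from the real-time columns. The sketch below is not how ministat is implemented; it just applies the usual pooled two-sample Student's t formula in awk. The function name ttest and the T95 variable are made up for this example, and the critical value 2.101 (18 degrees of freedom, from a t table) is hard-coded as an assumption that only fits two samples of 10. The x sample is taken to be the runs without mpsafevfs, since its average of 14.856 matches that set.

```sh
#!/bin/sh
# ttest X_FILE PLUS_FILE
# Each file holds time(1) output lines; only the first field (real) is used.
# Prints the difference of means (PLUS minus X), the 95% half-width and the
# pooled standard deviation. T95 is the two-sided Student's t critical value:
# 2.101 is the table value for 18 degrees of freedom (10 + 10 - 2) and has to
# be replaced for other sample sizes.
ttest() {
    awk -v t95="${T95:-2.101}" '
        NF == 0   { next }                           # skip blank lines
        FNR == NR { x[++nx] = $1; sx += $1; next }   # first file: x sample
                  { y[++ny] = $1; sy += $1 }         # second file: + sample
        END {
            mx = sx / nx; my = sy / ny
            for (i = 1; i <= nx; i++) ssx += (x[i] - mx) * (x[i] - mx)
            for (i = 1; i <= ny; i++) ssy += (y[i] - my) * (y[i] - my)
            s  = sqrt((ssx + ssy) / (nx + ny - 2))   # pooled standard deviation
            hw = t95 * s * sqrt(1 / nx + 1 / ny)     # half-width of the interval
            printf "%.3f +/- %.4f (pooled s = %.6f)\n", my - mx, hw, s
        }' "$1" "$2"
}

x=$(mktemp); plus=$(mktemp)
# real times from the two single-extraction tables
printf '%s\n' 14.72 14.70 14.99 15.13 15.18 14.87 14.66 14.69 14.87 14.75 > "$x"     # !mpsafevfs
printf '%s\n' 14.85 14.90 15.03 14.49 14.50 14.50 14.52 14.44 14.54 14.63 > "$plus"  # mpsafevfs
ttest "$x" "$plus"
rm -f "$x" "$plus"
```

Run on the twenty values above this should agree with the ministat summary: a difference of -0.216, a half-width of about 0.186 and a pooled s of about 0.1984.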

The parallel extraction without mpsafevfs:
      319.42 real        35.70 user      1547.38 sys
      317.80 real        35.41 user      1532.87 sys
      318.49 real        35.35 user      1542.23 sys
      321.82 real        35.51 user      1559.50 sys
      317.66 real        35.51 user      1566.16 sys
      318.63 real        35.64 user      1552.48 sys
      319.51 real        35.69 user      1548.99 sys
      317.79 real        35.34 user      1542.89 sys
      319.89 real        35.70 user      1536.34 sys
      318.76 real        35.24 user      1545.21 sys

with mpsafevfs:
       80.24 real        27.70 user       475.54 sys
       83.13 real        27.94 user       491.55 sys
       87.66 real        28.45 user       500.68 sys
       81.88 real        28.12 user       463.51 sys
       83.23 real        27.87 user       483.62 sys
       82.20 real        28.07 user       482.57 sys
       83.82 real        28.29 user       473.70 sys
       84.54 real        27.95 user       472.12 sys
       80.29 real        28.24 user       461.87 sys
       87.77 real        28.34 user       482.03 sys
       82.10 real        27.79 user       475.31 sys

system clock (x = mpsafevfs, + = !mpsafevfs):
+--------------------------------------------------------------------------+
| x                                                                     ++ |
| x                                                                     ++ |
|xx                                                                     +++|
|xxxx                                                                   +++|
||A|                                                                    |A |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10        461.87        500.68        482.03       478.719     11.975802
+  10       1532.87       1566.16       1547.38      1547.405     10.066401
Difference at 95.0% confidence
        1068.69 +/- 10.3942
        223.239% +/- 2.17124%
        (Student's t, pooled s = 11.0624)

wall clock (x = mpsafevfs, + = !mpsafevfs):
+--------------------------------------------------------------------------+
|                                                                        + |
|                                                                        + |
|                                                                        + |
| x                                                                      + |
| x                                                                      + |
| x                                                                      + |
|xx                                                                      + |
|xxx                                                                     + |
|xxx                                                                     ++|
||A|                                                                     A||
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  11         80.24         87.77         83.13     83.350909     2.5277241
+  10        317.66        321.82        318.76       318.977     1.2618333
Difference at 95.0% confidence
        235.626 +/- 1.85556
        282.692% +/- 2.2262%
        (Student's t, pooled s = 2.02905)

i.e. mpsafevfs shows enormous improvements in both system time and
wall clock time for the parallel case.
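
The averages and percentages in the two parallel summaries can be recomputed from the raw time(1) lines. The helper names colmean and pctdiff below are made up for this sketch. Note that the percentages are relative to the mpsafevfs (x) average, so +282.692% means the run without mpsafevfs takes about 3.8 times as long. Going by the numbers, the system-clock average of 478.719 appears to correspond to the first ten mpsafevfs rows (N=10), while the wall-clock average uses all eleven (N=11); the sketch follows that reading.

```sh
#!/bin/sh
# colmean COL [N]: mean of whitespace-separated column COL read from stdin,
# over the first N non-blank rows (all rows when N is omitted).
# In time(1) output: column 1 = real, 3 = user, 5 = sys.
colmean() {
    awk -v col="$1" -v n="${2:-0}" '
        NF && (n == 0 || cnt < n) { sum += $col; cnt++ }
        END { if (cnt) printf "%.6f\n", sum / cnt }'
}

# pctdiff BASE OTHER: (OTHER - BASE) as a percentage of BASE.
pctdiff() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.3f\n", (b - a) / a * 100 }'
}

mp='80.24 real 27.70 user 475.54 sys
83.13 real 27.94 user 491.55 sys
87.66 real 28.45 user 500.68 sys
81.88 real 28.12 user 463.51 sys
83.23 real 27.87 user 483.62 sys
82.20 real 28.07 user 482.57 sys
83.82 real 28.29 user 473.70 sys
84.54 real 27.95 user 472.12 sys
80.29 real 28.24 user 461.87 sys
87.77 real 28.34 user 482.03 sys
82.10 real 27.79 user 475.31 sys'

nomp='319.42 real 35.70 user 1547.38 sys
317.80 real 35.41 user 1532.87 sys
318.49 real 35.35 user 1542.23 sys
321.82 real 35.51 user 1559.50 sys
317.66 real 35.51 user 1566.16 sys
318.63 real 35.64 user 1552.48 sys
319.51 real 35.69 user 1548.99 sys
317.79 real 35.34 user 1542.89 sys
319.89 real 35.70 user 1536.34 sys
318.76 real 35.24 user 1545.21 sys'

mp_real=$(printf '%s\n' "$mp" | colmean 1)        # all 11 runs
mp_sys=$(printf '%s\n' "$mp" | colmean 5 10)      # first 10 runs only
nomp_real=$(printf '%s\n' "$nomp" | colmean 1)
nomp_sys=$(printf '%s\n' "$nomp" | colmean 5)

echo "wall: $mp_real vs $nomp_real, +$(pctdiff "$mp_real" "$nomp_real")%"
echo "sys:  $mp_sys vs $nomp_sys, +$(pctdiff "$mp_sys" "$nomp_sys")%"
```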

Comparing to the mean time for a single extraction (14.64 seconds
with mpsafevfs), 12 simultaneous extractions take as long as 5.69
single extractions with mpsafevfs, and 21.79 without.  This is an
effective concurrency of 2.11 (12/5.69) extractions with mpsafevfs
and 0.55 (12/21.79) without (i.e. nearly twice as bad as just running
the extractions sequentially).
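
The arithmetic in the paragraph above can be reproduced as below. The function name effective_concurrency is made up for this sketch; the inputs are the mean wall clock times from the tables (83.350909 and 318.977 seconds for the parallel runs, 14.64 seconds for a single extraction with mpsafevfs).

```sh
#!/bin/sh
# effective_concurrency JOBS PARALLEL_MEAN SINGLE_MEAN
# Expresses a parallel run in units of single extractions, then divides the
# number of jobs by that figure.
effective_concurrency() {
    awk -v jobs="$1" -v par="$2" -v single="$3" 'BEGIN {
        equiv = par / single    # parallel run measured in single extractions
        printf "%.2f single-extraction equivalents, effective concurrency %.2f\n",
               equiv, jobs / equiv
    }'
}

effective_concurrency 12 83.350909 14.64   # with mpsafevfs
effective_concurrency 12 318.977 14.64     # without mpsafevfs
```

A value below 1 means the parallel run is slower than running the same jobs one after another.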

I might be bumping into the bandwidth of md here - when I ran less
rigorous tests with lower concurrency of extractions I seemed to be
getting marginally better performance (an effective concurrency of
about 2.2 for both 3 and 10 simultaneous extractions - so at least it
doesn't seem to degrade badly).  Or this might be reflecting VFS lock
contention (of which there is certainly a lot, according to mutex
profiling traces).

Certainly for package builds on this machine I get much better
performance and lower CPU utilization if I do every package build in a
separate (swap-backed) md than with them all in a single large md,
which tells me it's not hard to saturate a single md.

Even if I am hitting another limit here that is placing an upper bound
on the performance, filesystem performance with mpsafevfs is clearly
much better than without, and we are now seeing clear benefits from
SMP on 6.0 compared to earlier versions of FreeBSD.

Kris

P.S. Big props to Jeff Roberson for making this work!  Thanks also to
Hiroki Sato for donating the E4500 and other machine resources.