Re: FreeBSD 5.3 Bridge performance take II

From: Matthew Dillon <dillon_at_apollo.backplane.com>
Date: Thu, 9 Sep 2004 01:55:01 -0700 (PDT)
:% One nice thing about using this experimental code is that I hope it will
:% allow us to reason more effectively about the extent to which improving
:% per-cpu data structures improves efficiency -- I can now much more
:% easily say "OK, what happens if we eliminate the cost of locking for
:% commonplace mbuf allocation/free".  I've also started looking at per-interface
:% caches based on the same model, which has some similar limitations (but
:% also some similar benefits), such as stuffing per-interface uma caches
:% in struct ifnet. 
:
:I.e., using per-thread UMA caches is a 30-60 minute hack that allows me to
:explore and measure the performance benefits (and costs) of several
:different approaches, including per-cpu, per-thread, and per-data
:structure/object caching without doing the full implementation up front. 
:Per-thread caching, for example, can simulate the effects of
:non-preemption and mutex avoidance in micro-benchmarking, although in the
:general case, from a macro-benchmark perspective, it suffers from a number
:of problems (including the draining, balancing, and extra storage cost
:issues).  I didn't attempt to address these problems under the assumption
:that the current implementation is a tool for exploring performance, not
:something to actually use.
:

    Well, I see some major problems with this avenue of development.
    It amounts to an end-run around existing, broken (or perceived-to-be-
    broken) APIs.

    If you don't believe that the slab allocator has severe performance
    issues then the slab allocator (aka malloc()) should simply be used
    directly.  If you do believe that the slab allocator has severe performance
    issues then the correct solution is to FIRST FIX THE SLAB ALLOCATOR.
    Until the slab allocator is fixed, the system-wide overhead will skew
    the results of any other optimization tests you try to make; i.e.,
    results from other optimizations may appear to be less effective simply
    because they are washed out by the slab allocator's overhead.

    In the same respect, the idea that a per-thread memory cache is going to
    be more efficient than a per-cpu memory cache implies that the locking
    required for a per-cpu memory cache is too expensive relative to a
    per-thread one.  If that is the implication, the solution is to fix the
    required locking.  Frankly, it should be no more expensive than a
    critical section, and a critical section protecting per-cpu data should
    be no more than a nanosecond or two more expensive than access to a
    per-thread data structure.
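
    To put that cost claim in concrete terms, here is a rough sketch of a
    critical section that is nothing more than a per-thread nesting count.
    The names (struct thread, td_critcount, curthread) are illustrative
    stand-ins, not the actual FreeBSD or DragonFly definitions:

	struct thread {
		int	td_critcount;	/* critical section nesting depth */
	};

	/* stand-in for the kernel's per-cpu "current thread" pointer */
	static struct thread thread0;
	static struct thread *curthread = &thread0;

	static inline void
	crit_enter(void)
	{
		/* plain increment, no locked/atomic instruction needed */
		curthread->td_critcount++;
	}

	static inline void
	crit_exit(void)
	{
		curthread->td_critcount--;
		/* a real kernel would check for deferred preemption here */
	}

    While td_critcount is non-zero the thread cannot be preempted or
    migrated, so per-cpu data can be touched with ordinary loads and
    stores; the only extra cost over per-thread data is the pair of
    increments.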

    Per-thread caching APIs have major design hurdles to overcome.  I've
    already listed a few of them, but there are many, many more.  For example,
    locality of reference may seem to be a slam dunk, but you actually get
    better locality of reference with a per-cpu cache.  This is true when a
    thread migrates between cpus, but also because a per-cpu cache can
    reuse a very recently used chunk of data that was just freed by some
    other thread on the same cpu... so you get the benefit even after a
    context switch (and you don't get that with a per-thread cache).  This
    case occurs quite often, especially when interrupt threads are shipping
    data to protocol threads.  I could go on and on, but I am not going to
    because I think it damn well ought to be obvious.
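
    To illustrate the reuse case, here is a hedged sketch of a per-cpu
    LIFO object cache built on the crit_enter()/crit_exit() sketch above.
    The objcache_* names and sizes are made up for the example, and
    userland malloc()/free() stand in for the slab allocator:

	#include <stdlib.h>	/* malloc()/free() stand in for the slab allocator */

	#define MAXCPU		8
	#define OBJ_SIZE	256
	#define CACHE_DEPTH	128

	struct percpu_cache {
		void	*stack[CACHE_DEPTH];	/* LIFO: hottest object on top */
		int	count;
	};

	static struct percpu_cache pcpu_cache[MAXCPU];

	/* stand-in for the kernel's cpu id; a single cpu in this sketch */
	static inline int
	mycpuid(void)
	{
		return (0);
	}

	static void *
	objcache_alloc(void)
	{
		struct percpu_cache *pc;
		void *obj = NULL;

		crit_enter();
		pc = &pcpu_cache[mycpuid()];
		if (pc->count > 0)
			obj = pc->stack[--pc->count];	/* reuse the hottest object */
		crit_exit();
		if (obj == NULL)
			obj = malloc(OBJ_SIZE);		/* miss: fall back to the slab allocator */
		return (obj);
	}

	static void
	objcache_free(void *obj)
	{
		struct percpu_cache *pc;

		crit_enter();
		pc = &pcpu_cache[mycpuid()];
		if (pc->count < CACHE_DEPTH) {
			pc->stack[pc->count++] = obj;
			crit_exit();
			return;
		}
		crit_exit();
		free(obj);				/* cache full: return to the slab allocator */
	}

    The LIFO order is exactly the case described above: when a protocol
    thread frees a buffer and the interrupt thread on the same cpu then
    allocates one, it gets back the buffer that is still hot in that cpu's
    cache.  A per-thread cache would strand the buffer in the freeing
    thread's private stack.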

    It seems clear to me that it makes little sense to spend time
    on a per-thread memory cache when a per-cpu memory cache is, just from
    an algorithmic point of view, going to be far more effective, far
    easier to manage, and able to support larger (deterministic) hysteresis
    without creating too much non-deterministic slop, and so on and so forth.

    If this is just an experiment, and therefore something that will never
    be committed, then I still don't understand why you are even wasting time
    working on it when you could be fixing the slab allocator instead.
    IMHO I do not believe that a per-thread memory cache would even come
    close to characterizing the performance benefits of a mutexless per-cpu
    cache.  While the base overhead is similar, the side effects and
    management requirements are going to be very different.  If the
    experiment is supposed to characterize the potential performance
    improvement over the slab allocator, then I believe it is already
    too flawed to be an accurate measure of that.

    Now does this mean that caches in front of the slab allocator are bad?
    No, it doesn't.  I use front-end caches in several places in 
    DragonFly and you definitely want to do the same thing in FreeBSD.
    The difference, I think, is that we only use front-end caches for
    performance-critical subsystems and when we do, we implement the
    code directly into the subsystem.  We depend on our core slab allocator
    (i.e. malloc()) far more than you seem to want to depend on yours.
    FreeBSD seems to depend on its UMA type-stable zone allocator for the
    same thing, but that has a lot of extra, unnecessary overhead.  The
    mbuf allocator is a good example of this.  You guys are making several
    extra procedural calls in the mbuf allocation path that we aren't.
    The way I see it, if it's important enough to require its own front-end
    cache over simply using the slab allocator, then you might as well go
    whole hog.
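
    As a rough illustration of what "whole hog" might look like (made-up
    names, not the real FreeBSD or DragonFly mbuf code): when the cache
    lives in the subsystem itself, the common-case allocation can be a
    small inline function with no trips through a general-purpose zone
    API and no ctor/dtor callbacks.

	#include <string.h>	/* memset() */

	/*
	 * Hypothetical inline fast path built directly into the mbuf
	 * subsystem: the common case is a critical section plus an array
	 * pop via the per-cpu cache sketched earlier, with a single call
	 * into the slab allocator only on a miss.
	 */
	static inline void *
	m_get_fast(void)
	{
		void *m = objcache_alloc();	/* per-cpu cache, slab on miss */

		if (m != NULL)
			memset(m, 0, OBJ_SIZE);	/* minimal init; real mbuf code does more */
		return (m);
	}

    Contrast that with routing every allocation and free through a layered
    zone allocator, where, roughly speaking, each operation makes several
    procedural calls before any real work happens.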

    But it is also important to make your slab allocator fast so there
    are fewer situations where you feel that you need to bypass it.  A
    front-end cache is not supposed to be a workaround for a slow
    slab allocator but instead is supposed to be an ultra-optimized
    implementation to reduce overhead in a critical subsystem.

:In doing so, my hope was to identify which areas will offer the most
:immediate performance benefits, be it simply cutting down on costly
:operations (such as the entropy harvesting code for Yarrow which appears
:to have found its way into our interrupt path), rethinking locking
:strategies, optimizing out/coalescing locking, optimizing out excess
:memory allocation, optimizing synchronization primitives with the same
:semantics, changing synchronization assumptions to offer weaker/stronger
:semantics, etc.
:...
:
:Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
:robert_at_fledge.watson.org      Principal Research Scientist, McAfee Research

    Well, I guess identifying the problem areas is good, but it isn't
    rocket science.  It should be glaringly obvious without requiring
    all that much actual testing.

    One big problem you face is that many of the performance issues in
    FreeBSD come not from any single subsystem, but from overheads
    sprinkled throughout the entire codebase.  Taken singly, these overheads
    are not significant.  A single mutex could be argued to be not all that
    significant... but having to obtain and release 11 mutexes in a code
    path *IS* significant.  A single memory allocation might not be
    significant, but having to make three or four in a critical path can be.
    And so on, and so forth.

    Testing performance with little tweaks here and there is not going to
    give you any worthwhile results, IMHO.  Just fixing one thing isn't
    going to solve the performance problem.  You have to fix the entire
    path.  It isn't JUST the mutex overhead that's the problem.  It's
    the mutex overhead, the scheduler overhead, all the myriad calls you
    have to make to pin the thread or enter a critical section (if you
    have to do it too often)... it's the atomic-access requirement for the
    per-cpu %fs:globaldata data which makes it impossible to cache
    per-cpu data.  It's the 4BSD/ULE scheduler being used to schedule
    kernel threads, it's the thread switching overhead, the microtime calls
    in the switch path, the coding requirements to deal with preemption, cpu
    migration, giant lock handling, the UMA zone allocator's callback API,
    and a dozen other things.  Taken singly, these items produce a
    fractional degradation.  Taken singly, these items aren't necessarily
    even 'bad'.  Taken together, you have... well, you have the
    situation you find yourself in now.

    To really fix the problem you need to be willing to clean up ALL of
    these subsystems.  Well, first you need to recognize that they all need
    to be cleaned up, and I gather that some FreeBSD developers still don't
    recognize that as the problem.  Once you recognize that it's a problem,
    you then need to go and do the work.  Cleaning up the subsystems is only
    the first step... it gives you a nice, solid, reasonably high-performing
    base to work with.  You still have to deal with the higher-level
    critical-path issues once you've cleaned up the subsystems.  This is
    where things like front-end caches really show their stuff.

					    -Matt
					    Matthew Dillon 
					    <dillon_at_backplane.com>