Re: Examining the VM splay tree effectiveness

From: Andre Oppermann <andre_at_freebsd.org>
Date: Fri, 01 Oct 2010 11:21:03 +0200
On 01.10.2010 06:49, Matthew Dillon wrote:
>      I don't remember the reference but I read a comprehensive comparison
>      between various indexing methods about a year ago and the splay tree
>      did considerably better than a RB-tree.  The RB-tree actually did
>      fairly poorly.

It heavily depends on the access pattern of the data structure.  Is
it lookup, modify or insert/delete dominated?  Or a mix of any of
them?  How heavily is the data structure shared across CPUs?
Without this information it is impossible to make a qualified choice.

Making general comparative statements on indexing methods without
taking the access pattern and SMP/CPU cache behavior into account
is going to lead to the wrong approach 90% of the time.  (I made
that number up. ;-)

>      Any binary tree-like structure makes fairly poor use of cpu caches.
>      Splay trees work somewhat better as long as there is some locality
>      of reference in the lookups, since the node being looked up is moved
>      to the root.  It isn't a bad trade-off.

Again, it hugely depends on how good the locality is and how expensive
the CPU cache line dirtying of the splay rotation is.  You can quickly
fall off an amortization *cliff* here.
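
To illustrate the dirtying: even a pure lookup in a splay tree rotates
the found node up to the root, i.e. it performs pointer stores along
the whole access path.  A minimal, untested sketch of one rotation
step (not the actual vm_map splay code, just the shape of it):

    struct snode {
            struct snode    *left;
            struct snode    *right;
            /* key and data follow */
    };

    /*
     * One right rotation pulling the left child x up over its parent p.
     * Note the pointer stores: a splaying lookup does this on every
     * level of the access path, so each lookup dirties the cache lines
     * of all nodes it touched, which other CPUs then have to re-fetch.
     */
    static inline struct snode *
    rotate_right(struct snode *p)
    {
            struct snode *x = p->left;

            p->left = x->right;     /* store #1 */
            x->right = p;           /* store #2 */
            return (x);             /* caller stores the new subtree root */
    }

That is where the read-mostly sharing you would like to have goes out
the window.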

I agree that binary tree structures are a bit less optimal for CPU
caches because the tree node is embedded in the data element.  On
the plus side, not many other data structures do better in that
respect.  And as long as the memory is only read it can stay cached
on multiple CPUs.  Writing to it invalidates the cache line everywhere
else and causes a high-latency memory access on the next read there.

>      On the negative side all binary-tree-like structures tend to be
>      difficult to MP-lock in a fine-grained manner (not that very many
>      structures are locked at that fine a grain anyway, but it's a data
>      point).  Splay-trees are impossible to lock at a fine-grain due to
>      the massive modifications made on any search and the movement
>      of the root, and there are MP access issues too.

I doubt that fine-grained locking of such data structures is beneficial
in many cases.  Fine-grained locking implies more locks, more bus-locked
cycles, more memory barriers and more CPU cache dirtying.  As long as
a data structure's global lock is not significantly contended, finer
grained locking doesn't buy any additional parallelism; it just
creates a lot of overhead for nothing.
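
To put a picture on it: with one uncontended lock the lookup costs a
single lock/unlock pair.  Hand-over-hand locking on a tree turns that
into one pair, and one exclusively owned cache line, per visited
level, contended or not.  A hypothetical sketch (invented names,
plain mtx(9) just for illustration, not from any of our tree code):

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    struct lnode {
            struct mtx       ln_lock;       /* set up with mtx_init() */
            int              ln_key;
            struct lnode    *ln_left, *ln_right;
    };

    /*
     * Coarse variant: mtx_lock(&tree_lock); n = lookup(root, key);
     * mtx_unlock(&tree_lock); -- one atomic pair, full stop.
     *
     * Fine-grained variant (lock coupling) below: one atomic pair and
     * one dirtied lock cache line per level of the descent.
     */
    static struct lnode *
    lookup_coupled(struct lnode *root, int key)
    {
            struct lnode *n, *next;

            if (root == NULL)
                    return (NULL);
            mtx_lock(&root->ln_lock);
            for (n = root; n->ln_key != key; n = next) {
                    next = (key < n->ln_key) ? n->ln_left : n->ln_right;
                    if (next != NULL)
                            mtx_lock(&next->ln_lock);
                    mtx_unlock(&n->ln_lock);
                    if (next == NULL)
                            return (NULL);
            }
            mtx_unlock(&n->ln_lock);        /* or hand it back locked */
            return (n);
    }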

>      --
>
>      What turned out to be the best indexing mechanism was a chained
>      hash table whose hoppers were small linear arrays instead of single
>      elements.  So instead of pointer-chaining each element you have a small
>      for() loop for 4-8 elements before you chain.  The structure being
>      indexed would NOT be integrated into the index directly, the index
>      would point at the final structure from the hopper.

This makes a lot of sense if the index is sufficiently small, let's
say one or two ints.  When you go beyond that the advantage quickly
fades away.

>      For our purposes such linear arrays would contain a pointer and
>      an indexing value in as small an element as possible (8-16 bytes),
>      the idea being that you make the absolute best use of your cache line
>      and L1 cache / memory burst.  One random access (initial hash index),
>      then linear accesses using a small indexing element, then one final
>      random access to get to the result structure and validate that
>      it's the one desired (at that point we'd be 99.9% sure that we have
>      the right structure because we have already compared the index value
>      stored in the hopper).  As a plus the initial hash index also makes
>      MP locking the base of the chains easier.

Agreed under the premise that the access pattern fits this behavior.
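
For the archive, my reading of the layout Matthew describes, as an
untested sketch (slot count and names invented here): each bucket
heads a chain of small "hopper" arrays, each slot holding only a
32-bit hash value and a pointer, so a four-slot hopper is 56 bytes on
LP64 and stays within one 64-byte cache line:

    #include <stddef.h>
    #include <stdint.h>

    #define HOPPER_SLOTS    4       /* keep a hopper within a cache line */

    struct hopper {
            uint32_t         h_key[HOPPER_SLOTS];   /* small index values */
            void            *h_obj[HOPPER_SLOTS];   /* NULL = free slot */
            struct hopper   *h_next;                /* chain to next hopper */
    };

    /* buckets[] holds the chain heads; nbuckets is a power of two. */
    static void *
    htable_lookup(struct hopper **buckets, uint32_t nbuckets, uint32_t key)
    {
            struct hopper *hp;
            int i;

            /* One random access to pick the chain head... */
            for (hp = buckets[key & (nbuckets - 1)]; hp != NULL; hp = hp->h_next) {
                    /* ...then linear, cache-friendly scans inside each hopper... */
                    for (i = 0; i < HOPPER_SLOTS; i++) {
                            if (hp->h_obj[i] != NULL && hp->h_key[i] == key) {
                                    /*
                                     * ...and the caller makes the one final
                                     * random access into the object itself
                                     * to verify the full key.
                                     */
                                    return (hp->h_obj[i]);
                            }
                    }
            }
            return (NULL);
    }

Per-chain locking then falls out naturally at the bucket array.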

>      I don't use arrayized chained hash tables in DFly either, but only
>      because it's stayed fairly low on the priority list.  cpu isn't really
>      a major issue on systems these days.  I/O is the bigger problem.
>      RB-Trees are simply extremely convenient from the programming side,
>      which is why we still use them.

Agreed, with the emphasis that lock/atomic cycles and the CPU cache
hierarchy have to be counted as part of I/O as well.

>      I was surprised that splay trees did so well vs RB-trees, I never liked
>      the memory writes splay trees do let alone the impossibility of
>      fine-grained locking.  Originally I thought the writes would make
>      performance worse, but they actually don't.  Still, if I were to change
>      any topologies now I would definitely go with the chained-hash /
>      small-linear-array / chain / small-linear-array / chain mechanic.  It
>      seems to be the clear winner.

Without first studying the access pattern and applying it to the
various data structures, this is a very speculative statement to
make.

-- 
Andre
Received on Fri Oct 01 2010 - 07:20:42 UTC
