On Fri, 11 Feb 2005 17:41:26 -0500, David Schultz <das_at_freebsd.org> wrote:

> On Fri, Feb 11, 2005, Maxim Sobolev wrote:
> > Thank you for the analysis! Looks like you have at least some valid
> > points. I've modified the code to count how many times the producer
> > calls malloc() to allocate a new slot, and got the following numbers:
> >
> > -bash-2.05b$ ./aqueue_linuxthreads -n 10000000
> > pusher started
> > poper started
> > total 237482 slots used
> > -bash-2.05b$ ./aqueue_kse -n 10000000
> > pusher started
> > poper started
> > total 403966 slots used
> > -bash-2.05b$ ./aqueue_thr -n 10000000
> > pusher started
> > poper started
> > total 223634 slots used
> > -bash-2.05b$ ./aqueue_c_r -n 10000000
> > pusher started
> > poper started
> > total 55589 slots used
> >
> > This suggests that indeed, it is unfair to compare KSE times to LT
> > times, since KSE has done almost 2x more malloc()s than LT. However,
> > as you can see, libthr has done a comparable number of allocations,
> > while c_r did about 4 times fewer, so malloc() cost alone can't fully
> > explain the difference in results.
>
> The difference in the number of mallocs may be related to the way
> mutex unlocks work. Some systems do direct handoff to the next
> waiting thread. Suppose one thread does:
>
> pthread_mutex_lock()
> pthread_mutex_unlock()
> pthread_mutex_lock()
>
> With direct handoff, the second lock operation would automatically
> cause an immediate context switch, since ownership of the mutex
> has already been transferred to the other thread. Without direct
> handoff, the thread may be able to get the lock back immediately;
> in fact, this is almost certainly what will happen on a uniprocessor.
> Since the example code has no mechanism to ensure fairness, without
> direct handoff, one of the threads could perform thousands of
> iterations before the other one wakes up, and this could explain
> all the calls to malloc().
>
> The part of this picture that doesn't fit is that I was under the
> impression that KSE uses direct handoff...

Direct handoff is probably fine for a directly contended mutex, but for
condition variables, IMHO, it makes more sense _not_ to do direct
handoff. In a standard producer/consumer model, it seems better to have
the producer work to the point that it gets flow controlled, and then
let the consumer start processing the available data: i.e., rather than
deal with 100 context switches of (produce->consume)x50, it's likely
that (produce)x50->(consume)x50 will reduce context switching and
improve caching behaviour. In other words, I'd rather not lose my
quantum just because I created some productive work for a consumer to
process: that loses many locality-of-reference benefits. I think that's
a much more realistic scenario for the use of condition variables than
the sample under discussion.

Disclaimer: This is based on instinct and limited experience rather
than rigorous research. :-)
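
To make the (produce)x50->(consume)x50 pattern concrete, below is a
minimal bounded-buffer sketch using a pthread mutex and condition
variables. It is not code from the benchmark under discussion; the
buffer size QSIZE, the item count NITEMS, and all names are
illustrative assumptions. Without direct handoff on the condvar
signal, the producer tends to keep its quantum and fill the buffer
until it blocks at the flow-control limit, at which point the consumer
drains a batch.

	/*
	 * Illustrative sketch only: bounded producer/consumer with a
	 * mutex and two condition variables.  QSIZE and NITEMS are
	 * arbitrary values chosen for the example.
	 */
	#include <pthread.h>
	#include <stdio.h>

	#define	QSIZE	50	/* flow-control limit (assumed) */
	#define	NITEMS	1000	/* total items to produce (assumed) */

	static int buf[QSIZE];
	static int head, tail, count;
	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t notempty = PTHREAD_COND_INITIALIZER;
	static pthread_cond_t notfull = PTHREAD_COND_INITIALIZER;

	static void *
	producer(void *arg)
	{
		int i;

		for (i = 0; i < NITEMS; i++) {
			pthread_mutex_lock(&lock);
			/* Block only when flow-controlled (buffer full). */
			while (count == QSIZE)
				pthread_cond_wait(&notfull, &lock);
			buf[tail] = i;
			tail = (tail + 1) % QSIZE;
			count++;
			/*
			 * Wake the consumer; without direct handoff the
			 * producer normally keeps running and continues
			 * to fill the buffer.
			 */
			pthread_cond_signal(&notempty);
			pthread_mutex_unlock(&lock);
		}
		return (NULL);
	}

	static void *
	consumer(void *arg)
	{
		int i, item;

		for (i = 0; i < NITEMS; i++) {
			pthread_mutex_lock(&lock);
			while (count == 0)
				pthread_cond_wait(&notempty, &lock);
			item = buf[head];
			head = (head + 1) % QSIZE;
			count--;
			pthread_cond_signal(&notfull);
			pthread_mutex_unlock(&lock);
			(void)item;	/* process the item here */
		}
		return (NULL);
	}

	int
	main(void)
	{
		pthread_t prod, cons;

		pthread_create(&prod, NULL, producer, NULL);
		pthread_create(&cons, NULL, consumer, NULL);
		pthread_join(prod, NULL);
		pthread_join(cons, NULL);
		printf("done\n");
		return (0);
	}

Build it by linking against the threads library (e.g. cc -pthread).
Whether the producer actually runs in long batches here depends on the
scheduler and on how the condvar/mutex implementation handles wakeups,
which is exactly the behavioural difference being debated above.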