Re: kqueue LOR

From: Attilio Rao <attilio_at_freebsd.org>
Date: Wed, 13 Dec 2006 17:01:58 +0100
2006/12/13, Bruce Evans <bde_at_zeta.org.au>:
> On Tue, 12 Dec 2006, John Baldwin wrote:
>
> > On Tuesday 12 December 2006 13:43, Suleiman Souhlal wrote:
>
> >> Why is memory barrier usage not encouraged? As you said, they can be used to
> >> reduce the number of atomic (LOCKed) operations, in some cases.
> >> ...
> >> Admittedly, they are harder to use than atomic operations, but it might
> >> still worth having something similar.
>
> How would MI code know when using memory barriers is good?  This is already
> hard to know for atomic ops -- if there would more than a couple of atomic
> ops then it is probably better to use 1 mutex lock/unlock and no atomic
> ops, since this reduces the total number of atomic ops in most cases, but
> it is hard for MI code to know how many "a couple" is.  (This also depends
> on the SMP option -- without SMP, locking is automatic so atomic ops are
> very fast but mutexes are still slow since they do a lot more than an
> atomic op.)

And this could lead to code that is not really SMP-safe, because with
memory barriers you can only manipulate (atomically) one memory word at
a time (so atomic instructions/memory barriers are not suitable for
handling multiple structure members, for example).
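To make the one-word limitation concrete, here is a hypothetical sketch (names are mine, not FreeBSD code) in C11 atomics: a single counter can be updated lock-free, but two fields that must stay consistent with each other need a lock around the pair.

```c
#include <stdatomic.h>

/* Hypothetical example: atomic ops cover a single memory word, so
 * while one counter can be bumped lock-free, two fields that must
 * change together need a lock held across both updates. */
struct stats {
	atomic_long packets;	/* one word: an atomic add is enough */
	long        bytes;	/* these two must stay consistent,   */
	long        frames;	/* so atomics alone cannot cover them */
	atomic_flag lock;	/* tiny spinlock guarding the pair   */
};

static void
stats_add(struct stats *s, long len)
{
	atomic_fetch_add(&s->packets, 1);

	/* Spin until we own the pair, then update both fields. */
	while (atomic_flag_test_and_set_explicit(&s->lock,
	    memory_order_acquire))
		;
	s->bytes  += len;
	s->frames += 1;
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}
```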

> > Memory barriers just specify ordering, they don't ensure a cache flush so
> > another CPU reads up to date values.  You can use memory barriers in
> > conjunction with atomic operations on a variable to ensure that you can
> > safely read other variables (which is what locks do).  For example, in this
>
> I thought that the acquire/release variants of atomic ops guarantee
> this.  They seem to be documented to do this, while mutexes don't seem
> to be documented to do this.  The MI (?) implementation of mutexes
> depends on atomic_cmpset_{acq,rel}_ptr() doing this.

The acquire/release semantics of the atomic instructions really only
guarantee ordering of instructions.
The fact is that even if cache synchronization is not done, I think we
can assume, on every architecture, that a CAS instruction fails if the
CPU cache holds stale values.
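The acquire/release pairing can be sketched with C11 atomics (the names below are C11's, not the kernel's atomic_*_acq_*/atomic_*_rel_* primitives, but the ordering guarantee is the same idea): a release store makes earlier plain writes visible to any thread that observes the flag with an acquire load.

```c
#include <stdatomic.h>

/* Sketch of acquire/release ordering: the release store publishes the
 * plain write to `data`; the acquire load that sees the flag set is
 * then guaranteed to see that write as well. */
static int data;
static atomic_int ready;

static void
producer(void)
{
	data = 42;				/* plain write */
	atomic_store_explicit(&ready, 1,
	    memory_order_release);		/* orders data before flag */
}

static int
consumer(void)
{
	while (!atomic_load_explicit(&ready,
	    memory_order_acquire))		/* orders flag before data */
		;
	return (data);				/* guaranteed to see 42 */
}
```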

The hard part of our mutex implementation is:
278         while (!_obtain_lock(m, tid)) {
279                 lock_profile_obtain_lock_failed(&m->mtx_object, &contested);
280                 turnstile_lock(&m->mtx_object);
281                 v = m->mtx_lock;
282
283                 /*
284                  * Check if the lock has been released while spinning for
285                  * the turnstile chain lock.
286                  */
287                 if (v == MTX_UNOWNED) {
288                         turnstile_release(&m->mtx_object);
289                         cpu_spinwait();
290                         continue;
291                 }
292
293 #ifdef MUTEX_WAKE_ALL
294                 MPASS(v != MTX_CONTESTED);
295 #else
296                 /*
297                  * The mutex was marked contested on release. This means that
298                  * there are other threads blocked on it.  Grab ownership of
299                  * it and propagate its priority to the current thread if
300                  * necessary.
301                  */
302                 if (v == MTX_CONTESTED) {
303                         m->mtx_lock = tid | MTX_CONTESTED;
304                         turnstile_claim(&m->mtx_object);
305                         break;
306                 }
307 #endif
308
309                 /*
310                  * If the mutex isn't already contested and a failure occurs
311                  * setting the contested bit, the mutex was either released
312                  * or the state of the MTX_RECURSED bit changed.
313                  */
314                 if ((v & MTX_CONTESTED) == 0 &&
315                     !atomic_cmpset_ptr(&m->mtx_lock, v, v | MTX_CONTESTED)) {
316                         turnstile_release(&m->mtx_object);
317                         cpu_spinwait();
318                         continue;
319                 }
...

_obtain_lock is a simple CAS that uses a cast of curthread (plus any
flags) as the tracking bitfield.
Now, if we assume a stale cached lock bitfield, for example at line
315, letting the CAS fail doesn't hurt too much, since the loop is
restarted.
CAS failing on every architecture (when the cached data is stale)
might be one of the fundamentals of our locking model.
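The retry-on-failed-CAS pattern can be reduced to a toy sketch (names are mine, not sys/kern/kern_mutex.c) using C11 atomics: a failed compare-and-swap, spurious or not, simply sends the loop back to re-read the lock word, so a stale value never causes incorrect ownership, only an extra iteration.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Toy CAS lock: failure restarts the loop, mirroring the retry at
 * line 315 of the mutex code above. */
#define	TLOCK_UNOWNED	((uintptr_t)0)

static void
toy_lock(atomic_uintptr_t *lk, uintptr_t tid)
{
	uintptr_t v = TLOCK_UNOWNED;

	/* On failure the CAS reloads v with the current lock word;
	 * reset the expected value and retry until we install tid. */
	while (!atomic_compare_exchange_weak(lk, &v, tid))
		v = TLOCK_UNOWNED;
}

static void
toy_unlock(atomic_uintptr_t *lk)
{
	atomic_store(lk, TLOCK_UNOWNED);
}
```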

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein
Received on Wed Dec 13 2006 - 15:03:52 UTC
