Re: proposed smp_rendezvous change

From: John Baldwin <jhb_at_freebsd.org>
Date: Tue, 17 May 2011 09:58:16 -0400
On Tuesday, May 17, 2011 9:46:34 am Andriy Gapon wrote:
> on 17/05/2011 14:56 John Baldwin said the following:
> > On 5/17/11 4:03 AM, Andriy Gapon wrote:
> >> Couldn't [Shouldn't] the whole:
> >>
> >>>>>       /* Ensure we have up-to-date values. */
> >>>>>       atomic_add_acq_int(&smp_rv_waiters[0], 1);
> >>>>>       while (smp_rv_waiters[0] < smp_rv_ncpus)
> >>>>>           cpu_spinwait();
> >>
> >> be just replaced with:
> >>
> >> rmb();
> >>
> >> Or a proper MI function that does just a read memory barrier, if rmb() is not that.
> > 
> > No, you could replace it with:
> > 
> >     atomic_add_acq_int(&smp_rv_waiters[0], 1);
> 
> What about
> (void)atomic_load_acq_int(&smp_rv_waiters[0]);
> 
> In my opinion that should ensure that the hardware posts the latest value of
> smp_rv_waiters[0] from the master CPU to memory, and that a slave CPU fetches
> it from there.  Also, because of the memory barriers inserted by store_rel on
> the master CPU and load_acq on the slave CPU, the latest values of all the
> other smp_rv_* fields should become visible to the slave CPU.

No, it doesn't quite work that way.  It wouldn't work on Alpha, for example.

All load_acq does is a load followed by a memory barrier that orders subsequent
loads after it.  The load itself is still free to return stale data.  Only a
read-modify-write operation would actually block until it could access an
up-to-date value.
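
To make the difference concrete, here is a hypothetical fragment against the
atomic(9) API (only the smp_rv_waiters name comes from the real subr_smp.c
code; the rest is illustration, not a proposed patch):

	/*
	 * (a) Acquire load: orders *later* loads after this one, but
	 * nothing forces the load itself to observe the initiator's
	 * most recent store -- it may return a stale cached value.
	 */
	(void)atomic_load_acq_int(&smp_rv_waiters[0]);

	/*
	 * (b) Read-modify-write: cannot complete until this CPU gains
	 * coherent access to the variable, so once the add has happened
	 * the initiator's store of 0 to smp_rv_waiters[0] is guaranteed
	 * to have been visible here.
	 */
	atomic_add_acq_int(&smp_rv_waiters[0], 1);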

> 
> > The key being that atomic_add_acq_int() will block (either in hardware or
> > software) until it can safely perform the atomic operation.  That means waiting
> > until the write to set smp_rv_waiters[0] to 0 by the rendezvous initiator is
> > visible to the current CPU.
> > 
> > On some platforms a write by one CPU may not post instantly to other CPUs (e.g. it
> > may sit in a store buffer).  That is fine so long as an attempt to update that
> > value atomically (using cas or a conditional-store, etc.) fails.  For those
> > platforms, the atomic(9) API is required to spin until it succeeds.
> > 
> > This is why the mtx code spins if it can't set MTX_CONTESTED for example.
> > 
> 
> Thank you for the great explanation!
> Taking sparc64 as an example, I think that atomic_load_acq uses a degenerate cas
> call, which should take care of hardware synchronization.

sparc64's load_acq() is stronger than what the MI contract for load_acq()
requires.  On ia64, which uses ld.acq, or on Alpha (originally), which used a
membar plus a simple load, the guarantees are only what I stated above (and
would not be sufficient).
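
For reference, sparc64's version is roughly the following, modulo the macro
plumbing in sys/sparc64/include/atomic.h (a paraphrase, not the literal
source).  A cas of 0 with 0 writes nothing unless the value already is 0, but
the casa instruction still has to obtain the current value of *p coherently,
which is what gives the load its extra strength:

static __inline uint32_t
atomic_load_acq_32(volatile uint32_t *p)
{
        uint32_t v;

        /* Degenerate cas: compare 0, "swap" in 0; returns the old value. */
        v = atomic_cas_acq(p, 0, 0, 32);
        return (v);
}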

Note that Alpha borrowed heavily from MIPS, and the MIPS atomic implementation
is mostly identical to the old Alpha one (using conditional stores, etc.).

The MIPS atomic_load_acq():

#define ATOMIC_STORE_LOAD(WIDTH)                        \
static __inline  uint##WIDTH##_t                        \
atomic_load_acq_##WIDTH(__volatile uint##WIDTH##_t *p)  \
{                                                       \
        uint##WIDTH##_t v;                              \
                                                        \
        v = *p;                 /* plain load; may return stale data */ \
        mips_sync();            /* barrier: orders later accesses */    \
        return (v);                                     \
}                                                       \
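
Contrast that with the read-modify-write ops in the same file, which are built
on ll/sc and retry until the conditional store succeeds, i.e. until this CPU
holds the up-to-date copy of the variable -- the "spin until it succeeds"
behavior I described above.  Paraphrased from memory from
sys/mips/include/atomic.h:

static __inline void
atomic_add_32(__volatile uint32_t *p, uint32_t v)
{
        uint32_t temp;

        __asm __volatile (
                "1:\tll %0, %3\n\t"     /* load linked: current value */
                "addu   %0, %2, %0\n\t" /* compute the new value */
                "sc     %0, %1\n\t"     /* store conditional */
                "beqz   %0, 1b\n\t"     /* store failed: retry */
                : "=&r" (temp), "=m" (*p)
                : "r" (v), "m" (*p));
}

If memory serves, the acq variant then just adds a mips_sync() after the
operation.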


-- 
John Baldwin
Received on Tue May 17 2011 - 11:58:18 UTC