Re: svn commit: r313268 - head/sys/kern [through -r313271 for atomic_fcmpset use and later: fails on PowerMac G5 "Quad Core"; -r313266 works]

From: Mateusz Guzik <mjguzik_at_gmail.com> Date: Sat, 18 Feb 2017 23:59:40 +0100 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:10 UTC

On Sat, Feb 18, 2017 at 01:58:49PM -0800, Mark Millard wrote:
> On 2017-Feb-18, at 12:58 PM, Mateusz Guzik <mjguzik at gmail.com> wrote:
> > Well either the primitive itself is buggy or the somewhat (now) unusual
> > condition of not providing the failed value (but possibly a stale one)
> > is not handled correctly in locking code.
> > 
> > That said, I would start with putting barriers "on both sides" of
> > powerpc's fcmpset for debugging purposes and if the problem persists I
> > can add some debugs to locking priitmives.
> > 
> 
> I currently have the only powerpc64 that I have access
> to for now doing a test that will likely finish tonight
> sometime (if it has no problems).
> 
> Also I'm not so familiar with powerpc64 details as to be
> able insert proper barriers and the like off the top of
> my head: It is more of a research subject for me.
> 

This was a suggestion to jhibbits_at_. Looking at the code it is not hard
to slap them in for testing purposes, or maybe there is an "obvious now
that I look at it" braino in there, or maybe he has a better idea.

Now that I wrote it I can get myself access to powerpc boxes. While I
wont be able to run bsd on them, I can hack around in userapce and see.

That's unless jhibbits_at_ steps in. I have no clue about ppc.

> It looks like contexts like __rw_wlock_hard(c,v,tid,file,line)
> now needs the caller to do an equivalent of:
> 
> __rw_wlock_hard(c,RW_READ_VALUE(rwlock2rw(c)),file,line)
> 
> in order for the code behavior to match the old behavior
> that was based on the original local-v's initialization
> before v was used:
> 
> rw = rwlock2rw(c);
> v = RW_READ_VALUE(rw); /* this line no longer exists */
> 
> This means that checking for equivalence is no longer
> local to the routine but involves checking all the
> usage of the routine.
> 

Not reading the argument locally was the entire point of introducing
fcmpset. Otherwise the 'v' argument would be a waste of time.

Some primitives can attempt grabbing the lock and if they fail, we have
the lock value to work with (e.g. check who owns the lock and see if
they are running). In particular amd64 will give us the value it found.
An explicit read requires whoever owns the cachelilne to lose the
exclusive ownership and if the lock is contended (multiple cpus doing
fcmpset), this makes the cachelilne ping-pong between cores. This
destroys performance especially on systems with many cores and
especially so with multiple numa nodes.

Other primitives don't have inline variants. This concerns read-write
locks which try to:
retry:
r = lock_value(lock);
if (!locked(r)) {
	if (!cmpset(lock, r, r + ONE_READER))
		goto retry;
}

That is, if multiple cpus try to get the lock for reading, one will fail
and willl lbe forced to compute the new value to be set. The longer the
time between attempts the more likely it is other core showed up trying
to do the same thing with the same value, causing another failed
attempt.

So here there are no inlilnes so that the time is shorter and fcmpset
alllows NOT reading the lock value explicitely - it was already provided
by hardware.

Note this is still significantly slower than it has to be in principle -
the lock can 'blilndly increment by ONE_READER and see what happens',
but that requires several changes and is a subject for another e-mail.
I'm working on it though.

-- 
Mateusz Guzik <mjguzik gmail.com>