ARM pcpu changes, was Re: [patch] i386 pmap sysmaps_pcpu[] atomic access

From: Andrew Turner <andrew_at_fubar.geek.nz> Date: Tue, 26 Feb 2013 21:06:51 +1300 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:35 UTC

On Mon, 25 Feb 2013 13:47:47 +0100
Olivier Houchard <cognet_at_ci0.org> wrote:

> On Mon, Feb 25, 2013 at 07:09:20PM +1300, Andrew Turner wrote:
> > On Thu, 21 Feb 2013 10:43:49 -0500
> > John Baldwin <jhb_at_freebsd.org> wrote:
> > 
> > > On Thursday, February 21, 2013 7:53:52 am Svatopluk Kraus wrote:
> > > > On Wed, Feb 20, 2013 at 4:22 PM, John Baldwin <jhb_at_freebsd.org>
> > > > wrote:
> > > > > On Wednesday, February 20, 2013 7:31:08 am Svatopluk Kraus
> > > > > wrote:
> > > > >> On Tue, Feb 19, 2013 at 7:51 PM, Konstantin Belousov
> > > > >> <kostikbel_at_gmail.com> wrote:
> > > > >> > On Mon, Feb 18, 2013 at 11:18:16PM +0100, Svatopluk Kraus
> > > > >> > wrote:
> > > > >> >> On Mon, Feb 18, 2013 at 9:36 PM, Konstantin Belousov
> > > > >> >> <kostikbel_at_gmail.com> wrote:
> > > > >> >> Well, I'm taking a part on porting FreeBSD to
> > > > >> >> ARM11mpcore. UP case was simple. SMP case is more complex
> > > > >> >> and rather new for me. Recently, I was solving a problem
> > > > >> >> with PCPU stuff. For example, PCPU_GET is implemented by
> > > > >> >> one instruction on i386 arch. So, a need of atomicity
> > > > >> >> with respect to interrupts can be overlooked. On
> > > > >> >> load-store archs, the implementation which works in SMP
> > > > >> >> case is not so simple. And what works in UP case (single
> > > > >> >> PCPU), not works in SMP case. Believe me, mysterious and
> > > > >> >> sporadic 'mutex not owned' assertions and others ones
> > > > >> >> caused by curthreads mess, it takes a while ...
> > > > >> > Note that PCPU_GET() is not needed to be atomic. The
> > > > >> > reason is that the code which uses its result would not be
> > > > >> > atomic (single-instruction or whatever, see below). Thus,
> > > > >> > either the preemption should not matter since the action
> > > > >> > with the per-cpu data is advisory, as is in the case of
> > > > >> > i386 pmap you noted. or some external measures should be
> > > > >> > applied in advance to the containing region (which you
> > > > >> > proposed, by bracing with sched_pin()).
> > > > >>
> > > > >> So, it's advisory in the case of i386 pmap... Well, if you
> > > > >> can live with that, I can too.
> > > > >>
> > > > >> >
> > > > >> > Also, note that it is not interrupts but preemption which
> > > > >> > is concern.
> > > > >>
> > > > >> Yes and no. In theory, yes, a preemption is a concern. In
> > > > >> FreeBSD, however, sched_pin() and critical_enter() and their
> > > > >> counterparts are implemented with help of curthread. And
> > > > >> curthread definition falls to PCPU_GET(curthread) if not
> > > > >> defined in other way. So, curthread should be atomic with
> > > > >> respect to interrupts and in general, PCPU_GET() should be
> > > > >> too. Note that spinlock_enter() definitions often (always)
> > > > >> use curthread too. Anyhow, it's defined in MD code and can
> > > > >> be defined for each arch separately.
> > > > >
> > > > > curthread is a bit magic. :)  If you perform a context switch
> > > > > during an interrupt (which will change 'curthread') you also
> > > > > change your register state. When you resume, the register
> > > > > state is also restored.  This means that while something like
> > > > > 'PCPU_GET(cpuid)' might be stale after you read it,
> > > > > 'curthread' never is.  However, it is true that actually
> > > > > reading curthread has to be atomic.  If your read of
> > > > > curthread looks like:
> > > > >
> > > > >         mov <pcpu reg>, r0
> > > > >         add <offset of pc_curthread>, r0
> > > > >         ld r0, r1
> > > > >
> > > > > Then that will indeed break.  Alpha used a fixed register for
> > > > > 'pcpu_reg' (as does ia64 IIRC).  OTOH, you might also be able
> > > > > to depend on the fact that pc_curthread is the first thing in
> > > > > 'struct pcpu' (and always will be, you could add a CTASSERT to
> > > > > future-proof).  In that case, you can remove the 'add'
> > > > > instruction and instead just do:
> > > > >
> > > > >         ld <pcpu reg>, r1
> > > > >
> > > > > which is fine.
> > > > 
> > > > Just for the record. There are three extra (coprocessor)
> > > > registers per a core in arm11mpcore (armv6k). Unfortunately
> > > > only one is Privileged. The other ones are respectively User
> > > > Read only and User Read Write. For now, we are using the
> > > > Privileged one to store pcpu pointer (curthread is correctly
> > > > set during each context switch). Thus, using a coprocessor
> > > > register means that reading of curthread is not a single
> > > > instruction implementation now. After we figured out the
> > > > curthread issue in SMP case, using a fixed (processor) register
> > > > for pcpu is an option. Meanwhile, we disable interrupts before
> > > > reading of curthread and enable them after. The same is done
> > > > for other PCPU stuff. For now we have not stable system enough
> > > > for profiling, however, when it will be, it would be
> > > > interesting to learn how various implementations of curthread
> > > > reading impact system performance.
> > > 
> > > curthread is read _a lot_, so I would expect this to hurt.  What
> > > we did on Alpha was to use a fixed register for pcpu access, but
> > > we used the equivalent of a coprocessor register to also store
> > > the value so we could set that fixed register on entry to the
> > > kernel (userland was free to use the register for its own
> > > purposes).
> > > 
> > 
> > The current code on ARM is not atomic, it loads the base address of
> > the pcpu data from the coprocessor then loads curthread from this
> > address. One solution I discussed with Olivier Houchard is to keep
> > the data in the coprocessor but to then load it into a dedicated
> > register when we enter the kernel.
> > 
> > Reading curthread would then be a single load instruction thanks to
> > ARM's ldr using an immediate offset [1], e.g. to load from r12 into
> > r1 it would be 'ldr r1, [r12]'.
> > 
> 
> Wouldn't it be doable to either use the extra coprocessor register for
> curthread, setting it when entering the kernel as John suggested, or
> just using the privileged one for curthread, and changing get_pcpu()
> to be __pcpu[cpuid], and then making sure pcpu accesses are atomic,
> either by disabling interrupts or using ldrex/strex ? I take it
> curthread (and curpcb, which we can get from curthread) is by far the
> most used, and it wouldn't hurt that badly to pessimize the other
> pcpu variables ?

Does anyone know if it is only curthread that needs to be atomic? If so
this should work. Reading the cpuid from the system currently is a
single instruction, however it appears the code will need to be reworked
for use with multiple levels of affinity, for example when there
are clusters of cores the current code will number two cores on
different clusters with the same id.

I don't think we need to use ldrex/strex to read/write the values, my
understanding is the rest should be safe to access without atomic
functions.

Andrew