Re: [patch] i386 pmap sysmaps_pcpu[] atomic access

From: Olivier Houchard <cognet_at_ci0.org>
Date: Tue, 26 Feb 2013 14:07:39 +0100
On Mon, Feb 25, 2013 at 04:26:14PM -0500, John Baldwin wrote:
> On Monday, February 25, 2013 7:47:47 am Olivier Houchard wrote:
> > On Mon, Feb 25, 2013 at 07:09:20PM +1300, Andrew Turner wrote:
> > > On Thu, 21 Feb 2013 10:43:49 -0500
> > > John Baldwin <jhb_at_freebsd.org> wrote:
> > > 
> > > > On Thursday, February 21, 2013 7:53:52 am Svatopluk Kraus wrote:
> > > > > On Wed, Feb 20, 2013 at 4:22 PM, John Baldwin <jhb_at_freebsd.org>
> > > > > wrote:
> > > > > > On Wednesday, February 20, 2013 7:31:08 am Svatopluk Kraus wrote:
> > > > > >> On Tue, Feb 19, 2013 at 7:51 PM, Konstantin Belousov
> > > > > >> <kostikbel_at_gmail.com> wrote:
> > > > > >> > On Mon, Feb 18, 2013 at 11:18:16PM +0100, Svatopluk Kraus
> > > > > >> > wrote:
> > > > > >> >> On Mon, Feb 18, 2013 at 9:36 PM, Konstantin Belousov
> > > > > >> >> <kostikbel_at_gmail.com> wrote:
> > > > > >> >> Well, I'm taking a part on porting FreeBSD to ARM11mpcore. UP
> > > > > >> >> case was simple. SMP case is more complex and rather new for
> > > > > >> >> me. Recently, I was solving a problem with PCPU stuff. For
> > > > > >> >> example, PCPU_GET is implemented by one instruction on i386
> > > > > >> >> arch. So, a need of atomicity with respect to interrupts can
> > > > > >> >> be overlooked. On load-store archs, the implementation which
> > > > > >> >> works in SMP case is not so simple. And what works in UP case
> > > > > >> >> (single PCPU), not works in SMP case. Believe me, mysterious
> > > > > >> >> and sporadic 'mutex not owned' assertions and others ones
> > > > > >> >> caused by curthreads mess, it takes a while ...
> > > > > >> > Note that PCPU_GET() is not needed to be atomic. The reason is
> > > > > >> > that the code which uses its result would not be atomic
> > > > > >> > (single-instruction or whatever, see below). Thus, either the
> > > > > >> > preemption should not matter since the action with the per-cpu
> > > > > >> > data is advisory, as is in the case of i386 pmap you noted. or
> > > > > >> > some external measures should be applied in advance to the
> > > > > >> > containing region (which you proposed, by bracing with
> > > > > >> > sched_pin()).
> > > > > >>
> > > > > >> So, it's advisory in the case of i386 pmap... Well, if you can
> > > > > >> live with that, I can too.
> > > > > >>
> > > > > >> >
> > > > > >> > Also, note that it is not interrupts but preemption which is
> > > > > >> > concern.
> > > > > >>
> > > > > >> Yes and no. In theory, yes, a preemption is a concern. In
> > > > > >> FreeBSD, however, sched_pin() and critical_enter() and their
> > > > > >> counterparts are implemented with help of curthread. And
> > > > > >> curthread definition falls to PCPU_GET(curthread) if not defined
> > > > > >> in other way. So, curthread should be atomic with respect to
> > > > > >> interrupts and in general, PCPU_GET() should be too. Note that
> > > > > >> spinlock_enter() definitions often (always) use curthread too.
> > > > > >> Anyhow, it's defined in MD code and can be defined for each arch
> > > > > >> separately.
> > > > > >
> > > > > > curthread is a bit magic. :)  If you perform a context switch
> > > > > > during an interrupt (which will change 'curthread') you also
> > > > > > change your register state. When you resume, the register state
> > > > > > is also restored.  This means that while something like
> > > > > > 'PCPU_GET(cpuid)' might be stale after you read it, 'curthread'
> > > > > > never is.  However, it is true that actually reading curthread
> > > > > > has to be atomic.  If your read of curthread looks like:
> > > > > >
> > > > > >         mov <pcpu reg>, r0
> > > > > >         add <offset of pc_curthread>, r0
> > > > > >         ld r0, r1
> > > > > >
> > > > > > Then that will indeed break.  Alpha used a fixed register for
> > > > > > 'pcpu_reg' (as does ia64 IIRC).  OTOH, you might also be able to
> > > > > > depend on the fact that pc_curthread is the first thing in
> > > > > > 'struct pcpu' (and always will be, you could add a CTASSERT to
> > > > > > future-proof).  In that case, you can remove the 'add'
> > > > > > instruction and instead just do:
> > > > > >
> > > > > >         ld <pcpu reg>, r1
> > > > > >
> > > > > > which is fine.
> > > > > 
> > > > > Just for the record. There are three extra (coprocessor) registers
> > > > > per a core in arm11mpcore (armv6k). Unfortunately only one is
> > > > > Privileged. The other ones are respectively User Read only and User
> > > > > Read Write. For now, we are using the Privileged one to store pcpu
> > > > > pointer (curthread is correctly set during each context switch).
> > > > > Thus, using a coprocessor register means that reading of curthread
> > > > > is not a single instruction implementation now. After we figured
> > > > > out the curthread issue in SMP case, using a fixed (processor)
> > > > > register for pcpu is an option. Meanwhile, we disable interrupts
> > > > > before reading of curthread and enable them after. The same is done
> > > > > for other PCPU stuff. For now we have not stable system enough for
> > > > > profiling, however, when it will be, it would be interesting to
> > > > > learn how various implementations of curthread reading impact
> > > > > system performance.
> > > > 
> > > > curthread is read _a lot_, so I would expect this to hurt.  What we
> > > > did on Alpha was to use a fixed register for pcpu access, but we used
> > > > the equivalent of a coprocessor register to also store the value so
> > > > we could set that fixed register on entry to the kernel (userland was
> > > > free to use the register for its own purposes).
> > > > 
> > > 
> > > The current code on ARM is not atomic, it loads the base address of the
> > > pcpu data from the coprocessor then loads curthread from this address.
> > > One solution I discussed with Olivier Houchard is to keep the data in
> > > the coprocessor but to then load it into a dedicated register when we
> > > enter the kernel.
> > > 
> > > Reading curthread would then be a single load instruction thanks to
> > > ARM's ldr using an immediate offset [1], e.g. to load from r12 into r1
> > > it would be 'ldr r1, [r12]'.
> > > 
> > 
> > Wouldn't it be doable to either use the extra coprocessor register for
> > curthread, setting it when entering the kernel as John suggested, or just
> > using the privileged one for curthread, and changing get_pcpu() to be
> > __pcpu[cpuid], and then making sure pcpu accesses are atomic, either by
> > disabling interrupts or using ldrex/strex ? I take it curthread (and curpcb,
> > which we can get from curthread) is by far the most used, and it wouldn't hurt
> > that badly to pessimize the other pcpu variables ?
> 
> I think you can store pcpu in the coprocessor register.  If you always copy
> it to a dedicated register on kernel entry (say r12) then 'ldr r1, [r12]' to
> read curthread should be atomic.  (Just put a CTASSERT() that
> offsetof(struct pcpu, pc_curthread) == 0 in arm's machdep.c along with a comment
> that 'curthread' depends on this.  You can then use a custom curthread function
> like we do on amd64, etc. to hardcode it as a single instruction (and mark it
> __pure2 while you are at it).
> 

Yes I was just trying to avoid dedicating a register to pcpu :)
I have no clue if performances-wise it's better to dedicate a register, or
to just optimize curthread/curpcb, and make access to other pcpu vars more
expensive.

Regards,

Olivier
Received on Tue Feb 26 2013 - 12:08:04 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:35 UTC