Re: panic: non-current pmap 0xffffa00020eab8f0 on Rpi3

From: Alan Cox <alan.l.cox_at_gmail.com>
Date: Sun, 25 Oct 2020 22:24:30 -0500

On Sat, Oct 24, 2020 at 2:38 PM Mark Johnston <markj_at_freebsd.org> wrote:

> On Fri, Oct 23, 2020 at 06:32:25PM +0200, Michal Meloun wrote:
> >
> >
> > On 19.10.2020 22:39, Mark Johnston wrote:
> > > On Fri, Oct 16, 2020 at 11:53:56AM +0200, Michal Meloun wrote:
> > >>
> > >>
> > >> On 06.10.2020 15:37, Mark Johnston wrote:
> > >>> On Mon, Oct 05, 2020 at 07:10:29PM -0700, bob prohaska wrote:
> > > >>>> Still seeing non-current pmap panics on the Pi3, this time a B+ running
> > >>>> 13.0-CURRENT (GENERIC-MMCCAM) #0 71e02448ffb-c271826(master)
> > >>>> during a -j4 buildworld.  The backtrace reports
> > >>>>
> > >>>> panic: non-current pmap 0xffffa00020eab8f0
> > >>>
> > >>> Could you show the output of "show procvm" from the debugger?
> > >>
> > >> I see the same panic too; in my case it's very rare. The typical
> > >> scenario is a rebuild of the kf5 ports (~250, 2 days of full load).
> > >> Any idea how to debug this?
> > >> Michal
> > >
> > > I suspect that there is some race involving the pmap switching in
> > > vmspace_exit(), but I can't see it.  In the example below, presumably
> > > process 22604 on CPU 0 is also exiting?  Could you show the backtrace?
> > >
> > > It would also be useful to see the value of PCPU_GET(curpmap) at the
> > > time of the panic.  I'm not sure if there's a way to get that from DDB,
> > > but I suspect it should be equal to &vmspace0->vm_pmap.
> > Mark,
> > I think I found the problem.
> > PCPU_GET() is not (and is not supposed to be) an atomic operation;
> > it expects the thread to be at least pinned.
> > This is not true for pmap_remove_pages(), so I think the KASSERT
> > is racy and should be removed (or at least covered by a
> > sched_pin()/sched_unpin() pair).
> > What do you think?
>
> I think you're right.  On amd64 curpmap is loaded using a single
> instruction so the assertion happens to work properly.  On arm64 we
> have:
>
>    0xffff0000007ff138 <+32>:      mov     x8, x18
>    0xffff0000007ff13c <+36>:      ldr     x8, [x8, #216]
>    0xffff0000007ff140 <+40>:      mov     x26, x0
>    0xffff0000007ff144 <+44>:      cmp     x8, x0
>
> Though, it looks like arm64's PCPU_GET could be modified to combine the
> first two instructions.
>
> To fix it, we could perhaps change the KASSERT to verify that pmap ==
> vmspace_pmap(curthread->td_proc->p_vmspace). ...
>

Just delete it.  It isn't useful.
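
For anyone following along, the race Michal describes can be modeled in
userland.  This is only a sketch, not kernel code: struct pcpu here is a
hypothetical cut-down stand-in, the two "CPUs" are array slots, and the
migration is simulated by hand between the two steps that correspond to the
"mov x8, x18" and "ldr x8, [x8, #216]" instructions quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-ins for the kernel structures. */
struct pmap { int id; };
struct pcpu { struct pmap *pc_curpmap; };

/*
 * Model an unpinned thread reading "curpmap" in two steps.
 * Returns true if the value it observes is not its own pmap,
 * i.e. the case in which the KASSERT would fire spuriously.
 */
bool
two_step_read_is_stale(bool migrate_between_steps)
{
	struct pmap mine = { 1 }, other = { 2 };
	/* The thread starts on CPU 0; CPU 1 runs some other process. */
	struct pcpu cpus[2] = { { &mine }, { &other } };
	struct pcpu *base;
	struct pmap *seen;

	/* Step 1 ("mov x8, x18"): fetch the per-CPU base of CPU 0. */
	base = &cpus[0];

	if (migrate_between_steps) {
		/*
		 * The thread is preempted here.  CPU 0 switches to the
		 * other process and the thread resumes on CPU 1.
		 */
		cpus[0].pc_curpmap = &other;
		cpus[1].pc_curpmap = &mine;
	}

	/* Step 2 ("ldr x8, [x8, #216]"): load through the old base. */
	seen = base->pc_curpmap;

	return (seen != &mine);
}
```

With no migration the read returns the thread's own pmap; with a migration
between the two steps it returns whatever CPU 0 is running now, even though
the thread's pmap is in fact current on the CPU it moved to.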

...  The various
> implementations of pmap_remove_pages() have different flavours of the
> same check and it would be nice to unify them.  Using sched_pin() would
> also be fine I think.
>

The useful version exists on amd64, where we verify that the pmap is only
active on the processor performing pmap_remove_pages().  The reason being
that some implementations of pmap_remove_pages(), including amd64's and
arm64's, do not use atomic RMW operations to simultaneously clear a PTE
and check the status of the dirty bit.
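
To illustrate the difference, here is a minimal userland sketch using C11
atomics.  It is not the pmap code: the pte_t type, the PTE_VALID/PTE_DIRTY
bit values and all function names are made up for the example, and the
"between" callback merely stands in for another processor setting the dirty
bit through the same mapping.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical PTE layout, for illustration only. */
#define	PTE_VALID	UINT64_C(0x1)
#define	PTE_DIRTY	UINT64_C(0x2)

typedef _Atomic(uint64_t) pte_t;

/* Atomic RMW: clear the PTE and learn its final dirty state in one step. */
bool
pte_clear_atomic(pte_t *pte)
{
	uint64_t old = atomic_exchange(pte, 0);

	return ((old & PTE_DIRTY) != 0);
}

/*
 * Separate load and store.  Anything that modifies the PTE between the
 * two is lost, so this is only safe if no other processor can be using
 * the pmap.  "between" may be NULL.
 */
bool
pte_clear_nonatomic(pte_t *pte, void (*between)(pte_t *))
{
	uint64_t old = atomic_load(pte);

	if (between != NULL)
		between(pte);	/* simulated activity on another CPU */
	atomic_store(pte, 0);

	return ((old & PTE_DIRTY) != 0);
}

/* Simulates another CPU dirtying the page through the mapping. */
void
set_dirty(pte_t *pte)
{
	atomic_fetch_or(pte, PTE_DIRTY);
}
```

If set_dirty() runs between the load and the store, pte_clear_nonatomic()
reports the page as clean and the modification is silently dropped.  That is
the case the amd64 check is meant to rule out.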

> > > I think vmspace_exit() should issue a release fence with the cmpset and
> > > an acquire fence when handling the refcnt == 1 case,
> > Yep, true, fully agree.
>
> Alan pointed out in the review that pmap_remove_pages() acquires the
> pmap lock, which I missed, so I don't think the extra barriers are
> necessary after all.