Re: smp_rendezvous runs with interrupts and preemption enabled on unicore systems

From: John Baldwin <jhb_at_freebsd.org>
Date: Wed, 2 Nov 2011 11:45:02 -0400
On Monday, October 31, 2011 7:43:03 pm Attilio Rao wrote:
> 2011/10/28  <mdf_at_freebsd.org>:
> > On Fri, Oct 28, 2011 at 8:37 AM, Ryan Stone <rysto32_at_gmail.com> wrote:
> >> I'm seeing issues on a unicore systems running a derivative of FreeBSD
> >> 8.2-RELEASE if something calls mem_range_attr_set.  It turns out that
> >> the root cause is a bug in smp_rendezvous_cpus.  The first part of
> >> smp_rendezvous_cpus attempts to short-circuit the non-SMP case(note
> >> that smp_started is never set to 1 on a unicore system):
> >>
> >>        if (!smp_started) {
> >>                if (setup_func != NULL)
> >>                        setup_func(arg);
> >>                if (action_func != NULL)
> >>                        action_func(arg);
> >>                if (teardown_func != NULL)
> >>                        teardown_func(arg);
> >>                return;
> >>        }
> >>
> >> The problem is that this runs with interrupts enabled, outside of a
> >> critical section.  My system runs with device_polling enabled with hz
> >> set to 2500, so its quite easy to wedge the system by having a thread
> >> run mem_range_attr_set.  That has to do a smp_rendezvous, and if a
> >> timer interrupt happens to go off half-way through the action_func and
> >> preempt this thread, the system ends up deadlocked(although once it's
> >> wedged, typing at the serial console stands a good chance of unwedging
> >> the system.  Go figure).
> 
> I'm not entirely sure why this exactly breaks though (do you see that
> happening with a random rendezvous callback or it is always the
> same?), because that just becames a simple function calling on cpu0,
> even if I think that there is still a bug as smp_rendezvous callbacks
> may expect to have interrupts and preemption disabled and the
> short-circuit breaks that assumption.
> 
> >> I know that smp_rendezvous was reworked substantially on HEAD, but by
> >> inspection it looks like the bug is still present, as the
> >> short-circuit behaviour is still there.
> >>
> >> I am not entirely sure of the best way to fix this.  Is it as simple
> >> as doing a spinlock_enter before setup_func and a spinlock_exit after
> >> teardown_func?  It seems to boot fine, but I'm not at all confident
> >> that I understand the nuances of smp_rendezvous to be sure that there
> >> aren't side effects that I don't know about.
> >
> > Looks like Max didn't have time to upstream this fix:
> >
> >  struct mtx smp_ipi_mtx;
> > +MTX_SYSINIT(smp_ipi_mtx, &smp_ipi_mtx, "smp rendezvous", MTX_SPIN);
> >
> > ...
> >
> >  static void
> >  mp_start(void *dummy)
> >  {
> >
> > -       mtx_init(&smp_ipi_mtx, "smp rendezvous", NULL, MTX_SPIN);
> >
> > ...
> >
> >        if (!smp_started) {
> > +               mtx_lock_spin(&smp_ipi_mtx);
> >                if (setup_func != NULL)
> >                        setup_func(arg);
> >                if (action_func != NULL)
> >                        action_func(arg);
> >                if (teardown_func != NULL)
> >                        teardown_func(arg);
> > +               mtx_unlock_spin(&smp_ipi_mtx);
> >                return;
> >        }
> 
> I don't think that we strictly need the lock here, my preferred
> solution would be (only test-compiled):
> Index: sys/kern/subr_smp.c
> ===================================================================
> --- sys/kern/subr_smp.c (revision 226972)
> +++ sys/kern/subr_smp.c (working copy)
> _at__at_ -415,13 +415,16 _at__at_ smp_rendezvous_cpus(cpuset_t map,
>  {
>         int curcpumap, i, ncpus = 0;
> 
> +       /* Look comments in the !SMP case. */
>         if (!smp_started) {
> +               spinlock_enter();
>                 if (setup_func != NULL)
>                         setup_func(arg);
>                 if (action_func != NULL)
>                         action_func(arg);
>                 if (teardown_func != NULL)
>                         teardown_func(arg);
> +               spinlock_exit();
>                 return;
>         }
> 
> _at__at_ -666,12 +669,18 _at__at_ smp_rendezvous_cpus(cpuset_t map,
>         void (*teardown_func)(void *),
>         void *arg)
>  {
> +       /*
> +        * In the !SMP case we just need to ensure the same initial 
conditions
> +        * as the SMP case.
> +        */
> +       spinlock_enter();
>         if (setup_func != NULL)
>                 setup_func(arg);
>         if (action_func != NULL)
>                 action_func(arg);
>         if (teardown_func != NULL)
>                 teardown_func(arg);
> +       spinlock_exit();
>  }
> 
>  void
> _at__at_ -681,12 +690,15 _at__at_ smp_rendezvous(void (*setup_func)(void *),
>                void *arg)
>  {
> 
> +       /* Look comments in the smp_rendezvous_cpus() case. */
> +       spinlock_enter();
>         if (setup_func != NULL)
>                 setup_func(arg);
>         if (action_func != NULL)
>                 action_func(arg);
>         if (teardown_func != NULL)
>                 teardown_func(arg);
> +       spinlock_exit();
>  }

I think this is fine, and probably correct.

-- 
John Baldwin
Received on Wed Nov 02 2011 - 17:05:02 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:19 UTC