Re: proposed smp_rendezvous change

From: John Baldwin <jhb_at_FreeBSD.org> Date: Sat, 14 May 2011 11:25:36 -0400 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:14 UTC

On 5/13/11 9:43 AM, Andriy Gapon wrote:
>
> This is a change in vein of what I've been doing in the xcpu branch and it's
> supposed to fix the issue by the recent commit that (probably unintentionally)
> stress-tests smp_rendezvous in TSC code.
>
> Non-essential changes:
> - ditch initial, and in my opinion useless, pre-setup rendezvous in
> smp_rendezvous_action()

As long as IPIs ensure all data is up to date (I think this is certainly 
true on x86) that is fine.  Presumably sending an IPI has an implicit 
store barrier on all other platforms as well?

> Essential changes (the fix):
> - re-use freed smp_rv_waiters[2] to indicate that a slave/target is really fully
> done with rendezvous (i.e. it's not going to access any members of smp_rv_*
> pseudo-structure)
> - spin on smp_rv_waiters[2] upon _entry_ to smp_rendezvous_cpus() to not re-use
> the smp_rv_* pseudo-structure too early

Hmmm, so this is not actually sufficient.  NetApp ran into a very 
similar race with virtual CPUs in BHyVe.  In their case because virtual 
CPUs are threads that can be preempted, they have a chance at a longer race.

The problem that they see is that even though the values have been 
updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2] 
to zero before one of the other CPUs notices that it has finished.

A more concrete example:

Suppose you have 3 CPUs doing a rendezvous where CPU 1 is the initiator.

All three CPUs run the rendezvous down to the point of increasing 
smp_rv_waiters[2].  CPU 1 is the last one to rendezvous for some reason 
so he notices smp_rv_waiters[2] being correct first and exits the 
rendezvous.  It immediately does a new rendezvous.  Even with your 
change, it will see that smp_rv_waiters[2] is correct and will proceed 
to clear it before starting the next rendezvous.

Meanwhile one of the other CPUs never sees the final update by CPU 1 to 
smp_rv_waiters[2], instead it sees the value go from 2 to 0 (e.g. if CPU 
1's two writes were coalesced, or in the case of the hypervisor, CPU 2 
or 3's backing thread is preempted and both writes have posted before 
the thread backing CPU 2 or 3 gets to run again).

At this point that CPU (call it CPU 2) will spin forever.  When CPU 1 
sends a second rendezvous IPI it will be held in the local APIC of CPU 2 
because CPU 2 hasn't EOI'd the first IPI, and so CPU 2 will never bump 
smp_rv_waiters[2] and the entire system will deadlock.

NetApp's solution is to add a monotonically increasing generation count 
to the rendezvous data set, which is cached in the rendezvous handler 
and to exit the last synchronization point if either smp_rv_waiters[2] 
is high enough, or the generation count has changed.

I think this would also handle the case the TSC changes have provoked. 
I'm not sure if this would be sufficient for the error case Max Laier 
has encountered.

NetApp's patch:

            Extra protection for consecutive smp_rendezvous() calls.

            We need this because interrupts are not really disabled when
            executing the smp_rendezvous_action() when running inside a
            virtual machine.

            In particular it is possible for the last cpu to increment
            smp_rv_waiters[2] so that the exit rendezvous condition is
     satisfied and then get interrupted by a hardware interrupt.

            When the execution of the interrupted vcpu continues it is
            possible that one of the other vcpus that did *not* get
     interrupted exited the old smp_rendezvous() and started a
            new smp_rendezvous(). In this case 'smp_rv_waiters[2]' is
            again reset to 0. This would mean that the vcpu that got
            interrupted will spin forever on the exit rendezvous.

            We protect this by spinning on the exit rendezvous only if
            the generation of the smp_rendezvous() matches what we
            started out with before incrementing 'smp_rv_waiters[2]'.

Differences ...

==== //private/xxxx/sys/kern/subr_smp.c#3 (text) ====

_at__at_ -127,6 +127,7 _at__at_
  static void (*volatile smp_rv_teardown_func)(void *arg);
  static void * volatile smp_rv_func_arg;
  static volatile int smp_rv_waiters[3];
+static volatile int smp_rv_generation;

  /*
   * Shared mutex to restrict busywaits between smp_rendezvous() and
_at__at_ -418,6 +419,7 _at__at_
  void
  smp_rendezvous_action(void)
  {
+    int generation;
      void* local_func_arg = smp_rv_func_arg;
      void (*local_setup_func)(void*)   = smp_rv_setup_func;
      void (*local_action_func)(void*)   = smp_rv_action_func;
_at__at_ -457,11 +459,14 _at__at_
      if (local_action_func != NULL)
          local_action_func(local_func_arg);

+    generation = atomic_load_acq_int(&smp_rv_generation);
+
      /* spin on exit rendezvous */
      atomic_add_int(&smp_rv_waiters[2], 1);
      if (local_teardown_func == smp_no_rendevous_barrier)
                  return;
-    while (smp_rv_waiters[2] < smp_rv_ncpus)
+    while (smp_rv_waiters[2] < smp_rv_ncpus &&
+           generation == smp_rv_generation)
          cpu_spinwait();

      /*
_at__at_ -565,6 +570,8 _at__at_
      /* obtain rendezvous lock */
      mtx_lock_spin(&smp_ipi_mtx);

+    atomic_add_acq_int(&smp_rv_generation, 1);
+
      /* set static function pointers */
      smp_rv_ncpus = ncpus;
      smp_rv_setup_func = setup_func;

-- 
John Baldwin