Re: proposed smp_rendezvous change

From: John Baldwin <jhb@freebsd.org>
Date: Mon, 16 May 2011 14:21:27 -0400
On Monday, May 16, 2011 1:46:33 pm Max Laier wrote:
> I'd like to propose that we move forward with fixing the race, before 
> redesigning the API - please.
> 
> We agree that there is a problem with the exit rendezvous.  Let's fix that.  
> We agree that the generation approach proposed by NetApp is the right way to 
> go?  Can we check it in, then?  Again, I'd like some comments in the code if 
> at all possible.

How about this:

Index: subr_smp.c
===================================================================
--- subr_smp.c	(revision 221939)
+++ subr_smp.c	(working copy)
@@ -111,6 +111,7 @@ static void (*volatile smp_rv_action_func)(void *a
 static void (*volatile smp_rv_teardown_func)(void *arg);
 static void *volatile smp_rv_func_arg;
 static volatile int smp_rv_waiters[3];
+static volatile int smp_rv_generation;
 
 /* 
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -311,39 +312,62 @@ restart_cpus(cpumask_t map)
 void
 smp_rendezvous_action(void)
 {
-	void* local_func_arg = smp_rv_func_arg;
-	void (*local_setup_func)(void*)   = smp_rv_setup_func;
-	void (*local_action_func)(void*)   = smp_rv_action_func;
-	void (*local_teardown_func)(void*) = smp_rv_teardown_func;
+	void *local_func_arg;
+	void (*local_setup_func)(void*);
+	void (*local_action_func)(void*);
+	void (*local_teardown_func)(void*);
+	int generation;
 
 	/* Ensure we have up-to-date values. */
 	atomic_add_acq_int(&smp_rv_waiters[0], 1);
 	while (smp_rv_waiters[0] < smp_rv_ncpus)
 		cpu_spinwait();
 
-	/* setup function */
+	/* Fetch rendezvous parameters after acquire barrier. */
+	local_func_arg = smp_rv_func_arg;
+	local_setup_func = smp_rv_setup_func;
+	local_action_func = smp_rv_action_func;
+	local_teardown_func = smp_rv_teardown_func;
+	generation = smp_rv_generation;
+
+	/*
+	 * If requested, run a setup function before the main action
+	 * function.  Ensure all CPUs have completed the setup
+	 * function before moving on to the action function.
+	 */
 	if (local_setup_func != smp_no_rendevous_barrier) {
 		if (smp_rv_setup_func != NULL)
 			smp_rv_setup_func(smp_rv_func_arg);
-
-		/* spin on entry rendezvous */
 		atomic_add_int(&smp_rv_waiters[1], 1);
 		while (smp_rv_waiters[1] < smp_rv_ncpus)
                 	cpu_spinwait();
 	}
 
-	/* action function */
 	if (local_action_func != NULL)
 		local_action_func(local_func_arg);
 
-	/* spin on exit rendezvous */
+	/*
+	 * Signal that the main action has been completed.  If a
+	 * full exit rendezvous is requested, then all CPUs will
+	 * wait here until all CPUs have finished the main action.
+	 *
+	 * Note that the write by the last CPU to finish the action
+	 * may become visible to different CPUs at different times.
+	 * As a result, the CPU that initiated the rendezvous may
+	 * exit the rendezvous and drop the lock allowing another
+	 * rendezvous to be initiated on the same CPU or a different
+	 * CPU.  In that case the exit sentinel may be cleared before
+	 * all CPUs have noticed, causing those CPUs to hang forever.
+	 * Work around this by using a generation count to notice when
+	 * this race occurs and to exit the rendezvous in that case.
+	 */
 	atomic_add_int(&smp_rv_waiters[2], 1);
 	if (local_teardown_func == smp_no_rendevous_barrier)
                 return;
-	while (smp_rv_waiters[2] < smp_rv_ncpus)
+	while (smp_rv_waiters[2] < smp_rv_ncpus &&
+	    generation == smp_rv_generation)
 		cpu_spinwait();
 
-	/* teardown function */
 	if (local_teardown_func != NULL)
 		local_teardown_func(local_func_arg);
 }
@@ -374,10 +398,11 @@ smp_rendezvous_cpus(cpumask_t map,
 	if (ncpus == 0)
 		panic("ncpus is 0 with map=0x%x", map);
 
-	/* obtain rendezvous lock */
 	mtx_lock_spin(&smp_ipi_mtx);
 
-	/* set static function pointers */
+	atomic_add_acq_int(&smp_rv_generation, 1);
+
+	/* Pass rendezvous parameters via global variables. */
 	smp_rv_ncpus = ncpus;
 	smp_rv_setup_func = setup_func;
 	smp_rv_action_func = action_func;
@@ -387,18 +412,25 @@ smp_rendezvous_cpus(cpumask_t map,
 	smp_rv_waiters[2] = 0;
 	atomic_store_rel_int(&smp_rv_waiters[0], 0);
 
-	/* signal other processors, which will enter the IPI with interrupts off */
+	/*
+	 * Signal other processors, which will enter the IPI with
+	 * interrupts off.
+	 */
 	ipi_selected(map & ~(1 << curcpu), IPI_RENDEZVOUS);
 
 	/* Check if the current CPU is in the map */
 	if ((map & (1 << curcpu)) != 0)
 		smp_rendezvous_action();
 
+	/*
+	 * If the caller did not request an exit barrier to be enforced
+	 * on each CPU, ensure that this CPU waits for all the other
+	 * CPUs to finish the rendezvous.
+	 */
 	if (teardown_func == smp_no_rendevous_barrier)
 		while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)
 			cpu_spinwait();
 
-	/* release lock */
 	mtx_unlock_spin(&smp_ipi_mtx);
 }
 

> A second commit should take care of the entry sync - which may or may not be 
> required on some architectures.  If we decide that we don't need it, we should 
> remove it.  If we keep it, we must move all the assignments from the global 
> smp_rv_* variables to the stack in smp_rendezvous_action to after the sync.  
> In any case, it would be nice to clean this up.

The patch also fixes this in the cautious fashion (not removing the early
rendezvous).
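
To make the failure mode concrete, here is a rough caller sketch (the function
names are made up; only the smp_rendezvous_cpus() usage matters).  Two
back-to-back full rendezvous calls are enough: the initiator of the first call
can see smp_rv_waiters[2] reach ncpus, return, drop smp_ipi_mtx and start the
second call (which resets smp_rv_waiters[2] to 0) while a slower CPU is still
spinning in the first exit loop.  Without the generation check that CPU could
spin there forever.

/* Illustrative fragment only; assumes kernel context. */
static void
do_invalidate(void *arg)
{
        /* Per-CPU work, runs in the IPI with interrupts disabled. */
}

static void
example(cpumask_t map)
{
        /* NULL setup/teardown still use the entry and exit barriers. */
        smp_rendezvous_cpus(map, NULL, do_invalidate, NULL, NULL);
        smp_rendezvous_cpus(map, NULL, do_invalidate, NULL, NULL);
}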

> After that, I have one more bugfix for rmlock(9) - one of the three consumers 
> of this API (that I am aware of).  As it uses a spinlock inside its IPI 
> action, we have to mark smp_rendezvous_action a critical section otherwise the 
> spin unlock might yield ... with the expected consequences.

Yes, we need to fix that.  Humm, it doesn't preempt when you do a critical_exit()
though?  Or do you use a hand-rolled critical exit that doesn't do a deferred
preemption?

Actually, I'm curious how the spin unlock inside the IPI could yield the
CPU.  Oh, is rmlock doing a wakeup inside the IPI handler?  I guess that is ok
as long as the critical_exit() just defers the preemption to the end of the IPI
handler.
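
FWIW, a minimal sketch of what you are describing (not a finished patch) would
just bracket the handler body in a critical section; whether a plain
critical_exit() is acceptable here, or whether a hand-rolled exit that skips
the deferred preemption is needed, is exactly the question above:

void
smp_rendezvous_action(void)
{
        /*
         * Stay in a critical section for the whole handler so that a
         * spin unlock (or wakeup) in the action function cannot switch
         * us out while we are still part of the rendezvous.  Any
         * pending preemption is taken when the IPI handler unwinds.
         */
        critical_enter();
        /* ... existing entry barrier, action, and exit barrier ... */
        critical_exit();
}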

> It is arguable if you should be allowed to use spinlocks in the rendevous at 
> all, but that's beside the point.

I'm less convinced this is bad since we don't use NMIs for these, so we know the
interrupted thread doesn't have any spin locks in the scheduler, etc.

-- 
John Baldwin