[RFC] kern/kern_timeout.c rewrite in progress

From: Hans Petter Selasky <hps_at_selasky.org> Date: Mon, 29 Dec 2014 21:03:24 +0100 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:54 UTC

Hi,

I recently came across a class of errors which lead me into 
investigating the "kern/kern_timeout.c" and its subsystem. From what I 
can see new features like the SMP awareness has been "added" instead of 
fully "integrated". When going into the cornercases I've uncovered that 
the internal callout statemachine can sometimes report wrong values via 
its callout_active() and callout_pending() bits to its clients, which in 
turn can make the clients behave badly. I further did an investigation 
on how the safety of callout migration between CPU's is maintained. When 
I looked into the code and found stuff like "volatile" and "while()" 
loops to figure which CPU a callout belongs I understood that such logic 
completely undermines the cleverness found in the turnstiles of mutexes 
and decided to go through all of the logic inside "kern_timeout.c". Also 
static code analysis is harder when we don't use the basic mutexes and 
condition variables available in the kernel.

First of all we need to make some driving rules for everyone:

1) A new feature called direct callbacks which execute the timer 
callbacks from the fast interrupt handler was added. All these callbacks 
_must_ be associated with a regular spinlocks, to maintain a safe 
callout_drain(). Else they should only be executed on CPU0.

2) All Giant locked callbacks should only execute on CPU0 to avoid 
congestion.

3) Callbacks using read-only locks for its callback should also only 
execute on CPU0 to avoid multiple instances pending for completion on 
multiple CPU's, because read-only locks can be entered multiple times. 
 From what I can see, there are currently no consumers of this feature 
in the kernel.

Secondly, we need a way to drain callbacks asynchronously, for example 
like in the TCP stack. Currently the TCP shutdown sequence for a 
connection looks like this:

...
void
pcb_teardown(pcb)
{
    lock(pcb);
    callout_stop(&pcb->c1);
    callout_stop(&pcb->c2);
    callout_stop(&pcb->c3);
    unlock(pcb);

    free(pcb);
}

I guess some of you see what is wrong: Use after free.

...

With a new function I've added, scenarios like this can be solved more 
elegantly and without having to fear use after free situations!

static void
pcb_callback(void *pcb)
{
   lock(pcb);
   do_free = (--(pcb->refcount) == 0);
   unlock(pcb);
   if (do_free == 0)
	return;
   destroy_mtx(pcb);
   free(pcb);
}

void
pcb_teardown(pcb)
{
    lock(pcb);
    pcb->refcount = 4;
    if (callout_drain_async(&pcb->c1, pcb_callback, pcb) == 0)
	pcb->refcount--;
    if (callout_drain_async(&pcb->c2, pcb_callback, pcb) == 0)
	pcb->refcount--;
    if (callout_drain_async(&pcb->c3, pcb_callback, pcb) == 0)
	pcb->refcount--;
    unlock(pcb);

    pcb_callback(pcb);
}

Please find attached a patch for 11-current as of today. Comments and 
feedback is appreciated.

BTW: Happy New Year everyone!

--HPS