Re: panic: Stray timeout

From: Andriy Gapon <avg_at_FreeBSD.org> Date: Mon, 31 Aug 2015 19:46:43 +0300 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:59 UTC

On 30/08/2015 22:09, Andriy Gapon wrote:
> On 30/08/2015 19:16, Konstantin Belousov wrote:
>> This is strange, I do not think that could be a right explanation of this
>> issue.  The taskqueue callout is initialized with the mutex, which means
>> that the callout_stop() caller
>> - must own the mutex;
>> - is synchronous with the callout.
>> In other words, callout cannot be running when taskqueue_cancel_timeout()
>> calls callout_stop(), it only can be dequeued but the callout function
>> is not yet called.  If callout_stop() is performed meantime, between
>> dropping callout_cpu lock, and locking the mutex, it must be not run.
> 
> Thank you for the explanation.  I am not familiar with the code and I
> misinterpreted the manual page and thought that callout_stop() might be unable
> to stop the callout even if it was initialized with the mutex.  I see my mistake
> now.

I had to look at the code, of course.
Could the following logic actually be buggy?

int
callout_reset_sbt_on(struct callout *c, sbintime_t sbt, sbintime_t precision,
    void (*ftn)(void *), void *arg, int cpu, int flags)
{
...
        if (cc_exec_curr(cc, direct) == c) {
                /*
                 * We're being asked to reschedule a callout which is
                 * currently in progress.  If there is a lock then we
                 * can cancel the callout if it has not really started.
                 */
                if (c->c_lock != NULL && cc_exec_cancel(cc, direct))
                        cancelled = cc_exec_cancel(cc, direct) = true;

I have a suspicion that the condition should check for !cc_exec_cancel:
- it seems that cc_exec_cancel is set to true when the callback execution starts
- logically, it wouldn't make sense to check if X is true and then set it to true

The code is quite hard to understand in a short time, so I could be
misinterpreting what's going on, but I'd like to check my suspicion.

If this is a bug indeed, then it might make the following scenario possible:

- softclock() is being executed in thread T1 and the control flow is in
softclock_call_cc() after CC_UNLOCK() and before lc_lock(c_lock)
- thread T2 is executing callout_reset_sbt_on on the same callout as T1
- so, I think, T2 will see cc_exec_cancel == false and thus, because of the
suspected bug, cc_exec_cancel will remain false
- T2 will proceed to callout_cc_add() and will happily (re-)schedule the callout
- after T2 releases the callout lock, T1 will continue in softclock_call_cc()
and because cc_exec_cancel is false the callout's callback function will get called

In effect, there will be an extra call of the callout.
If that's so that could explain the triggered assertion in taskqueue_timeout_func().

P.S. I am going to add an assertion in callout_cc_add that CALLOUT_ACTIVE is not
set and I will see if I can get anything interesting from it.

P.P.S. On a further look, it seems that there is a bug indeed and it's probably
just a typo that got introduced in r278469:
-               if (c->c_lock != NULL && !cc->cc_exec_entity[direct].cc_cancel)
-                       cancelled = cc->cc_exec_entity[direct].cc_cancel = true;
-               if (cc->cc_exec_entity[direct].cc_waiting) {
+               if (c->c_lock != NULL && cc_exec_cancel(cc, direct))
+                       cancelled = cc_exec_cancel(cc, direct) = true;
+               if (cc_exec_waiting(cc, direct)) {

-- 
Andriy Gapon