Re: panic: Stray timeout

From: Hans Petter Selasky <hps_at_selasky.org> Date: Mon, 31 Aug 2015 21:10:10 +0200 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:59 UTC

Hi Andriy,

On 08/31/15 18:46, Andriy Gapon wrote:
> On 30/08/2015 22:09, Andriy Gapon wrote:
>> On 30/08/2015 19:16, Konstantin Belousov wrote:
>>> This is strange, I do not think that could be a right explanation of this
>>> issue.  The taskqueue callout is initialized with the mutex, which means
>>> that the callout_stop() caller
>>> - must own the mutex;
>>> - is synchronous with the callout.
>>> In other words, callout cannot be running when taskqueue_cancel_timeout()
>>> calls callout_stop(), it only can be dequeued but the callout function
>>> is not yet called.  If callout_stop() is performed meantime, between
>>> dropping callout_cpu lock, and locking the mutex, it must be not run.
>>
>> Thank you for the explanation.  I am not familiar with the code and I
>> misinterpreted the manual page and thought that callout_stop() might be unable
>> to stop the callout even if it was initialized with the mutex.  I see my mistake
>> now.
>
> I had to look at the code, of course.
> Could the following logic actually be buggy?
>
> int
> callout_reset_sbt_on(struct callout *c, sbintime_t sbt, sbintime_t precision,
>      void (*ftn)(void *), void *arg, int cpu, int flags)
> {
> ...
>          if (cc_exec_curr(cc, direct) == c) {
>                  /*
>                   * We're being asked to reschedule a callout which is
>                   * currently in progress.  If there is a lock then we
>                   * can cancel the callout if it has not really started.
>                   */
>                  if (c->c_lock != NULL && cc_exec_cancel(cc, direct))
>                          cancelled = cc_exec_cancel(cc, direct) = true;
>
> I have a suspicion that the condition should check for !cc_exec_cancel:
> - it seems that cc_exec_cancel is set to true when the callback execution starts
> - logically, it wouldn't make sense to check if X is true and then set it to true
>
> The code is quite hard to understand in a short time, so I could be
> misinterpreting what's going on, but I'd like to check my suspicion.

While touching this topic, is the following code more readable or not 
(see the callout_restart_async function):

https://svnweb.freebsd.org/base/projects/hps_head/sys/kern/kern_timeout.c?revision=287248&view=markup

>
> If this is a bug indeed, then it might make the following scenario possible:
>
> - softclock() is being executed in thread T1 and the control flow is in
> softclock_call_cc() after CC_UNLOCK() and before lc_lock(c_lock)
> - thread T2 is executing callout_reset_sbt_on on the same callout as T1
> - so, I think, T2 will see cc_exec_cancel == false and thus, because of the
> suspected bug, cc_exec_cancel will remain false
> - T2 will proceed to callout_cc_add() and will happily (re-)schedule the callout
> - after T2 releases the callout lock, T1 will continue in softclock_call_cc()
> and because cc_exec_cancel is false the callout's callback function will get called

Correct.

>
> In effect, there will be an extra call of the callout.
> If that's so that could explain the triggered assertion in taskqueue_timeout_func().
>
> P.S. I am going to add an assertion in callout_cc_add that CALLOUT_ACTIVE is not
> set and I will see if I can get anything interesting from it.
>
> P.P.S. On a further look, it seems that there is a bug indeed and it's probably
> just a typo that got introduced in r278469:
> -               if (c->c_lock != NULL && !cc->cc_exec_entity[direct].cc_cancel)
> -                       cancelled = cc->cc_exec_entity[direct].cc_cancel = true;
> -               if (cc->cc_exec_entity[direct].cc_waiting) {
> +               if (c->c_lock != NULL && cc_exec_cancel(cc, direct))
> +                       cancelled = cc_exec_cancel(cc, direct) = true;
> +               if (cc_exec_waiting(cc, direct)) {
>

Looks like a bug to me trying to make the code more readable removed a 
not in there!

--HPS