Re: kevent has bug?

From: John-Mark Gurney <jmg_at_funkthat.com>
Date: Wed, 2 Apr 2014 09:45:43 -0700
Konstantin Belousov wrote this message on Wed, Apr 02, 2014 at 15:07 +0300:
> On Wed, Apr 02, 2014 at 04:06:16PM +0900, Kohji Okuno wrote:
> > From: John-Mark Gurney <jmg_at_funkthat.com>
> > Date: Tue, 1 Apr 2014 23:15:51 -0700
> > > Kohji Okuno wrote this message on Wed, Apr 02, 2014 at 11:45 +0900:
> > >> I think, kevent() has a bug.
> > >> I tested sample programs by attached sources.
> > >> This sample tests about EVFILT_SIGNAL.
> > >> 
> > >> I build sample programs by the following commands.
> > >> % gcc -O2 -o child child.c
> > >> % gcc -O2 -o parent parent.c
> > >> 
> > >> The expected result is the following.
> > >> % ./parent
> > >> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 
> > >> OK
> > >> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 
> > >> OK
> > >> 
> > >> But, sometimes the result was the following.
> > >> % ./parent
> > >> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 
> > >> 
> > >> This result means the number of times the signal has occured was
> > >> incorrect.
> > > 
> > > I was able to reproduce this...
> > > 
> > >> In case of EVFILT_SIGNAL, according to `man kevent', `data' retuns the
> > >> number of times the signal has occurred since the last call to
> > >> kevent(). This `data' is recorded by filt_signal() (This is f_event in
> > >> struct filterops).
> > >> 
> > >> The system call kevent()'s events are processed by kqueue_scan() in
> > >> kern_event.c. In kqueue_scan(), kn->kn_fop->f_event() is allways
> > >> called after KN_INFLUX is set to kn->kn_status.
> > >> 
> > >> On the other hand, kernel events are occured by knote() in
> > >> kern_event.c. (In EVFILT_SIGNAL, knote() is called from tdsendsignal()
> > >> in kern_sig.c.) In knote(), kn->kn_fop->f_event() is called only when
> > >> KN_INFLUX is not set in kn->kn_status.
> > >> 
> > >> In race condition between kqueue_scan() and knote(),
> > >> kn->kn_fop->f_event() from knote() may not be called, I think.

Looks like I misunderstood your point, sorry...

> > > Considering that both are called w/ a lock, that cannot happen..
> > > KN_LIST_LOCK(kn) locks the same lock that is asserted that is held
> > > by knote...
> > > 
> > >> In knote(), because the context holds knlist's lock, the context can
> > >> not sleep. So, KN_INFLUX should not be set on calling
> > >> kn->kn_fop->f_event() in kqueue_scan(), I think.
> > > 
> > > No, it needs to be set:
> > >  * Setting the KN_INFLUX flag enables you to unlock the kq that this knote
> > >  * is on, and modify kn_status as if you had the KQ lock.
> > > 
> > > As this comment says, _INFLUX allows you to unlock the KQ w/o fear that
> > > the knote will disappear out from under you causing you to dereference
> > > possibly free'd memory..
> > > 
> > > If you just tried to lock the list lock w/o unlocking the KQ lock, you
> > > could end up w/ a dead lock, as you aren't maintaining lock order
> > > properly..  The correct lock order if knlist -> kq...
> > > 
> > >> What do you think about this issue?
> > > 
> > > This is a real issue, but not due to the race you described
> > > above...
> > 
> > I beleave it's the result of the race.
> > 
> > Could you try to add printf() in knote()?
> > Please refer to attached patch.
> > 
> > 
> > > I have verified on my machine that it isn't because there is a knote
> > > waiting that isn't getting woken up, and the knote on my hung process
> > > has data == 0, so it definately lost one of the signals:
> > > (kgdb) print $14.kq_knhash[20].slh_first[0]
> > > $20 = {kn_link = {sle_next = 0x0}, kn_selnext = {sle_next = 0x0},
> > >   kn_knlist = 0xfffff8005a9c5840, kn_tqe = {tqe_next = 0xfffff801fdab4500,
> > >     tqe_prev = 0xfffff8004bb10038}, kn_kq = 0xfffff8004bb10000, kn_kevent = {
> > >     ident = 20, filter = -6, flags = 32, fflags = 0, data = 0, udata = 0x0},
> > >   kn_status = 0, kn_sfflags = 0, kn_sdata = 0, kn_ptr = {
> > >     p_fp = 0xfffff8005a9c54b8, p_proc = 0xfffff8005a9c54b8,
> > >     p_aio = 0xfffff8005a9c54b8, p_lio = 0xfffff8005a9c54b8,
> > >     p_v = 0xfffff8005a9c54b8}, kn_fop = 0xffffffff81405ef0, kn_hook = 0x0,
> > >   kn_hookid = 0}
> > > 
> > > If you want to find this yourself, you can run kgdb on a live system,
> > > switch to the thread of the parent (info threads, thread XXX), and
> > > do:
> > > frame 7
> > > 
> > > (or a frame that has td, which is struct thread *), then:
> > > print *(struct kqueue *)td->td_proc[0].p_fd[0].fd_ofiles[3].fde_file[0].f_data
> > > 
> > > This will give you the struct kqueue * of the parent, and then:
> > > print $XX.kq_knhash[0]_at_63
> > > 
> > > to figure out where the knote is in the hash, and then you can print
> > > it out yourself...
> > > 
> > > I'm going to take a look at this a bit more later... I'm thinking of
> > > using dtrace to collect the stacks where filt_signal is called, and
> > > match them up...  dtrace might even be able to get us the note's data
> > > upon return helping to make sure things got tracked properly...
> > > 
> > > Thanks for finding this bug!  Hopefully we can find a solution to it..
> 
> > diff --git a/sys/kern/kern_event.c b/sys/kern/kern_event.c
> > index b3fb23d..7791447 100644
> > --- a/sys/kern/kern_event.c
> > +++ b/sys/kern/kern_event.c
> > _at__at_ -1868,6 +1868,8 _at__at_ knote(struct knlist *list, long hint, int lockflags)
> >  		if ((kn->kn_status & KN_INFLUX) != KN_INFLUX) {
> >  			KQ_LOCK(kq);
> >  			if ((kn->kn_status & KN_INFLUX) == KN_INFLUX) {
> > +				if (hint & NOTE_SIGNAL)
> > +					printf("Aee2\n");
> >  				KQ_UNLOCK(kq);
> >  			} else if ((lockflags & KNF_NOKQLOCK) != 0) {
> >  				kn->kn_status |= KN_INFLUX;
> > _at__at_ -1886,6 +1888,10 _at__at_ knote(struct knlist *list, long hint, int lockflags)
> >  				KQ_UNLOCK(kq);
> >  			}
> >  		}
> > +		else {
> > +			if (hint & NOTE_SIGNAL)
> > +				printf("Aee1\n");
> > +		}
> >  		kq = NULL;
> >  	}
> >  	if ((lockflags & KNF_LISTLOCKED) == 0)
> 
> I think you are right, there is a race between setting influx before the
> knote lock is owned in the kqueue_scan(), and knote() itself.  The issue
> is that the influx from scan must not prevent knote() from waking up
> the waiters.

Well, it's not that its preventing waking up the waiter, but failing to
register the event on the knote because of the _INFLUX flag...

> Patch below fixed your test case for me, also tools/regression/kqueue did
> not noticed a breakage.  I tried to describe the situation in the
> comment in knote().  Also, I removed unlocked check for the KN_INFLUX
> in knote, since it seems to be an optimization for rare case, and is
> the race on its own.

Comments below...

> diff --git a/sys/kern/kern_event.c b/sys/kern/kern_event.c
> index b3fb23d..380f1ff 100644
> --- a/sys/kern/kern_event.c
> +++ b/sys/kern/kern_event.c

[...]

> _at__at_ -1506,7 +1506,7 _at__at_ retry:
>  			KQ_LOCK(kq);
>  			kn = NULL;
>  		} else {
> -			kn->kn_status |= KN_INFLUX;
> +			kn->kn_status |= KN_INFLUX | KN_SCAN;
>  			KQ_UNLOCK(kq);
>  			if ((kn->kn_status & KN_KQUEUE) == KN_KQUEUE)
>  				KQ_GLOBAL_LOCK(&kq_global, haskqglobal);

Is there a reason you don't add the KN_SCAN to the other cases in
kqueue_scan that set the _INFLUX flag?

[...]

> _at__at_ -1865,28 +1866,33 _at__at_ knote(struct knlist *list, long hint, int lockflags)
>  	 */
>  	SLIST_FOREACH(kn, &list->kl_list, kn_selnext) {
>  		kq = kn->kn_kq;
> -		if ((kn->kn_status & KN_INFLUX) != KN_INFLUX) {
> +		KQ_LOCK(kq);
> +		if ((kn->kn_status & (KN_INFLUX | KN_SCAN)) == KN_INFLUX) {
> +			/*
> +			 * Do not process the influx notes, except for
> +			 * the influx coming from the kq unlock in the
> +			 * kqueue_scan().  In the later case, we do
> +			 * not interfere with the scan, since the code
> +			 * fragment in kqueue_scan() locks the knlist,
> +			 * and cannot proceed until we finished.
> +			 */

We might want to add a marker node, and reprocess the list from the
marker node, because this might introduce other races in the code too...
but the problem with that is that knote is expected to keep the list
locked throughout the call if called w/ it already locked, and so we
can't do that, without major work... :(

I added a similar comment in knote_fork:
                 * XXX - Why do we skip the kn if it is _INFLUX?  Does this
                 * mean we will not properly wake up some notes?

and it looks like it was true...

So, upon reading the other _INFLUX cases, it looks like we should change
_SCAN to be, _CHANGING or something similar, and any place we don't end
up dropping the knote, we set this flag also...  Once such case is at
the end of kqueue_register, just before the label done_ev_add, where we
update the knote w/ new udata and other fields..  Or change the logic
of the flag, and set it for all the cases we are about to drop the
knote..

> +			KQ_UNLOCK(kq);
> +		} else if ((lockflags & KNF_NOKQLOCK) != 0) {
> +			kn->kn_status |= KN_INFLUX;
> +			KQ_UNLOCK(kq);
> +			error = kn->kn_fop->f_event(kn, hint);
>  			KQ_LOCK(kq);

I believe we can drop this unlock/lock pair as it's safe to hold the
KQ lock over f_event, we do that in knote_fork...

> -			if ((kn->kn_status & KN_INFLUX) == KN_INFLUX) {
> -				KQ_UNLOCK(kq);
> -			} else if ((lockflags & KNF_NOKQLOCK) != 0) {
> -				kn->kn_status |= KN_INFLUX;
> -				KQ_UNLOCK(kq);
> -				error = kn->kn_fop->f_event(kn, hint);
> -				KQ_LOCK(kq);
> -				kn->kn_status &= ~KN_INFLUX;
> -				if (error)
> -					KNOTE_ACTIVATE(kn, 1);
> -				KQ_UNLOCK_FLUX(kq);
> -			} else {
> -				kn->kn_status |= KN_HASKQLOCK;
> -				if (kn->kn_fop->f_event(kn, hint))
> -					KNOTE_ACTIVATE(kn, 1);
> -				kn->kn_status &= ~KN_HASKQLOCK;
> -				KQ_UNLOCK(kq);
> -			}
> +			kn->kn_status &= ~KN_INFLUX;
> +			if (error)
> +				KNOTE_ACTIVATE(kn, 1);
> +			KQ_UNLOCK_FLUX(kq);
> +		} else {
> +			kn->kn_status |= KN_HASKQLOCK;
> +			if (kn->kn_fop->f_event(kn, hint))
> +				KNOTE_ACTIVATE(kn, 1);
> +			kn->kn_status &= ~KN_HASKQLOCK;
> +			KQ_UNLOCK(kq);
>  		}
> -		kq = NULL;
>  	}
>  	if ((lockflags & KNF_LISTLOCKED) == 0)
>  		list->kl_unlock(list->kl_lockarg); 
> diff --git a/sys/sys/event.h b/sys/sys/event.h
> index bad8c9e..3b765c0 100644
> --- a/sys/sys/event.h
> +++ b/sys/sys/event.h
> _at__at_ -207,6 +207,7 _at__at_ struct knote {
>  #define KN_MARKER	0x20			/* ignore this knote */
>  #define KN_KQUEUE	0x40			/* this knote belongs to a kq */
>  #define KN_HASKQLOCK	0x80			/* for _inevent */
> +#define	KN_SCAN		0x100			/* flux set in kqueue_scan() */
>  	int			kn_sfflags;	/* saved filter flags */
>  	intptr_t		kn_sdata;	/* saved data field */
>  	union {

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."
Received on Wed Apr 02 2014 - 14:45:45 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:48 UTC