Re: thread suspension when dumping core

From: Konstantin Belousov <kostikbel@gmail.com>
Date: Thu, 9 Jun 2016 07:34:55 +0300
On Wed, Jun 08, 2016 at 11:17:44PM +0200, Jilles Tjoelker wrote:
> On Wed, Jun 08, 2016 at 04:56:35PM +0300, Konstantin Belousov wrote:
> > On Wed, Jun 08, 2016 at 06:35:08AM -0700, Mark Johnston wrote:
> > > On Wed, Jun 08, 2016 at 07:30:55AM +0300, Konstantin Belousov wrote:
> > > > On Tue, Jun 07, 2016 at 11:19:19PM +0200, Jilles Tjoelker wrote:
> > > > > I also wonder whether we may be overengineering things here. Perhaps
> > > > > the advlock sleep can simply turn off TDF_SBDRY.
> > > > Well, this was the very first patch suggested.  I would be fine with that,
> > > > but, again, out-of-tree code does not seem to be well served by such a
> > > > local solution.
> 
> > > In our particular case, we could possibly use a similar approach. In
> > > general, it seems incorrect to clear TDF_SBDRY if the thread calling
> > > sx_sleep() has any locks held. It is easy to verify that all callers of
> > > lf_advlock() are safe in this respect, but this kind of auditing is
> > > generally hard. In fact, I believe the sx_sleep that led to the problem
> > > described in D2612 is the same as the one in my case. That is, the
> > > sleeping thread may or may not hold a vnode lock depending on context.
> 
> > I do not think that in-tree code sleeps with a vnode lock held in
> > lf_advlock().  Otherwise, the system would hang in a lock cascade on
> > an attempt to obtain an advisory lock.  I think we can even assert
> > this with witness.
> 
> > There is another sleep, which Jilles mentioned, in lf_purgelocks(),
> > called from vgone(). This sleep indeed occurs under the vnode lock, and
> > as such must be non-suspendable. The sleep waits until other threads
> > leave lf_advlock() for the reclaimed vnode, which they should do in
> > deterministic time due to the wakeups issued.  So this sleep is exempt
> > from these considerations, and TDF_SBDRY there is correct.
> 
> > I am fine with either bracketing the sx_sleep() in lf_advlock() to
> > clear TDF_SBDRY (sigdeferstop()), or with the latest patch I sent,
> > which adds a temporary override of TDF_SBDRY with TDF_SRESTART. My
> > understanding is that you prefer the latter. If I do not misrepresent
> > your position, I understand why you prefer it.
> 
> The TDF_SRESTART change does fix some more problems such as umount -f
> getting stuck in lf_purgelocks().
> 
> However, it introduces some subtle issues, although they may not amount
> to a sufficient objection.
I did not read the notes below as an objection, but rather as an argument
for returning EINTR on stop attempts in lf_advlock().

> 
> Firstly, adding this closes the door on fixing signal handling for
> fcntl(F_SETLKW). Per POSIX, any caught signal interrupts
> fcntl(F_SETLKW), even if SA_RESTART is set for the signal, and the Linux
> man page documents the same. Our man page has documented that SA_RESTART
> behaves normally with fcntl(F_SETLKW) since at least FreeBSD 2.0. This
> could normally be fixed via  if (error == ERESTART) error = EINTR;  but
> that is no longer possible if there are [ERESTART] errors that should
> still restart.
I added the translation to the fcntl handler.
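
For reference, a minimal userspace sketch of the POSIX behaviour at stake;
the file path and the SIGALRM-based timing are arbitrary choices for the
illustration and are not part of the patch.  On a kernel with the
translation below it prints the message; historically the call was
silently restarted:

/*
 * Demonstrate that a caught signal interrupts fcntl(F_SETLKW) with
 * EINTR even when the handler is installed with SA_RESTART.
 */
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static void
handler(int sig)
{
	(void)sig;
}

int
main(void)
{
	struct sigaction sa;
	struct flock fl;
	pid_t child;
	int fd;

	fd = open("/tmp/lcktest", O_RDWR | O_CREAT, 0644);
	if (fd == -1)
		err(1, "open");

	memset(&fl, 0, sizeof(fl));
	fl.l_type = F_WRLCK;
	fl.l_whence = SEEK_SET;
	fl.l_start = 0;
	fl.l_len = 1;

	/* The child takes the lock and holds it, so the parent blocks. */
	if ((child = fork()) == 0) {
		fcntl(fd, F_SETLK, &fl);
		pause();
		_exit(0);
	}
	sleep(1);

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = handler;
	sa.sa_flags = SA_RESTART;	/* restart is requested, but ... */
	sigaction(SIGALRM, &sa, NULL);

	alarm(1);
	if (fcntl(fd, F_SETLKW, &fl) == -1 && errno == EINTR)
		printf("F_SETLKW returned EINTR despite SA_RESTART\n");

	kill(child, SIGKILL);
	waitpid(child, NULL, 0);
	return (0);
}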

> 
> Secondly, fcntl(F_SETLKW) restarting after a stop may actually be
> observable, contrary to what I wrote before. This is due to the fair
> queuing. Suppose thread A locked byte 1 a while ago and thread B is
> trying to lock bytes 1 and 2 right now. Then thread C will be able to
> lock byte 2 iff thread B has not blocked yet. If thread C is not
> allowed to lock byte 2 and blocks on it, the TDF_SRESTART change
> will cause it to be awakened if thread B is stopped. When thread B
> resumes, the region to be locked will be recomputed. This scenario
> unambiguously violates the POSIX requirement but I don't know how bad it
> is.
Indeed.

> 
> Note that all these threads must be in separate processes because of
> fcntl locks' strange semantics.
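
Spelled out with the required separate processes, the scenario looks
roughly like this (arbitrary file name and sleep-based timing; this only
illustrates the ordering question and is not a reliable reproducer):

/*
 * A holds byte 1; B blocks requesting bytes 1-2; C then blocks on
 * byte 2 behind B due to the fair queuing.  Stopping B with the
 * TDF_SRESTART approach withdraws B's queued request, so C may take
 * byte 2 ahead of B, which POSIX does not allow.
 */
#include <err.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static int
lock_range(int fd, int type, int cmd, off_t start, off_t len)
{
	struct flock fl;

	memset(&fl, 0, sizeof(fl));
	fl.l_type = type;
	fl.l_whence = SEEK_SET;
	fl.l_start = start;
	fl.l_len = len;
	return (fcntl(fd, cmd, &fl));
}

int
main(void)
{
	pid_t b, c;
	int fd;

	fd = open("/tmp/fairness", O_RDWR | O_CREAT, 0644);
	if (fd == -1)
		err(1, "open");

	/* "Thread" A (this process): lock byte 1 and hold it. */
	if (lock_range(fd, F_WRLCK, F_SETLK, 1, 1) == -1)
		err(1, "A: F_SETLK");

	/* B: blocks trying to lock bytes 1 and 2 (conflicts with A). */
	if ((b = fork()) == 0) {
		if (lock_range(fd, F_WRLCK, F_SETLKW, 1, 2) == 0)
			printf("B got bytes 1-2\n");
		exit(0);
	}
	sleep(1);		/* let B queue behind A */

	/* C: blocks on byte 2, queued behind B by the fair queuing. */
	if ((c = fork()) == 0) {
		if (lock_range(fd, F_WRLCK, F_SETLKW, 2, 1) == 0)
			printf("C got byte 2\n");
		exit(0);
	}
	sleep(1);

	/*
	 * Stop B.  If B's sleep returns ERESTART, its queued request
	 * is withdrawn and C can acquire byte 2 first; B's request is
	 * recomputed and re-queued when it resumes.
	 */
	kill(b, SIGSTOP);
	sleep(1);
	kill(b, SIGCONT);

	lock_range(fd, F_UNLCK, F_SETLK, 1, 1);	/* let B finish */
	waitpid(b, NULL, 0);
	waitpid(c, NULL, 0);
	return (0);
}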

So below is the next level of over-engineering this stuff. I unified
everything on sigdeferstop() to avoid blowing up the KPI. I am not sure
which modes onefs needs, so I added SIGDEFERSTOP_ERESTART even though it
is not used in the kernel.

lf_advlock() is now not stoppable at all and returns EINTR on a stop
attempt.  In the previous patches, the function only modified the
behaviour if the caller of lf_advlock() had already set TDF_SBDRY.

I considered adding a td_sbdry member to struct thread for managing sbdry
stops, but did not do so for now.  If one more flag turns out to be needed,
I will change that.
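
To make the reworked calling convention concrete, a sleep that must not
suspend but should fail with EINTR on a stop attempt now looks roughly
like this (the channel, lock, priority and wmesg are placeholders; compare
the lf_setlock() hunk below):

	int error, stops_deferred;

	/*
	 * Switch to the EINTR mode for the duration of the sleep and
	 * remember the previous flags.  A stop attempt now makes the
	 * sleep return EINTR instead of suspending the thread here.
	 */
	stops_deferred = sigdeferstop(SIGDEFERSTOP_EINTR);
	error = sx_sleep(chan, &sxlock, priority, "wmesg", 0);
	/* Restore whatever mode the caller had established. */
	sigallowstop(stops_deferred);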

diff --git a/sys/fs/fifofs/fifo_vnops.c b/sys/fs/fifofs/fifo_vnops.c
index 716faa3..fb3eb90 100644
--- a/sys/fs/fifofs/fifo_vnops.c
+++ b/sys/fs/fifofs/fifo_vnops.c
@@ -194,11 +194,10 @@ fifo_open(ap)
 		if ((ap->a_mode & FREAD) && fip->fi_writers == 0) {
 			gen = fip->fi_wgen;
 			VOP_UNLOCK(vp, 0);
-			stops_deferred = sigallowstop();
+			stops_deferred = sigdeferstop(SIGDEFERSTOP_OFF);
 			error = msleep(&fip->fi_readers, PIPE_MTX(fpipe),
 			    PDROP | PCATCH | PSOCK, "fifoor", 0);
-			if (stops_deferred)
-				sigdeferstop();
+			sigallowstop(stops_deferred);
 			vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
 			if (error != 0 && gen == fip->fi_wgen) {
 				fip->fi_readers--;
@@ -222,11 +221,10 @@ fifo_open(ap)
 		if ((ap->a_mode & FWRITE) && fip->fi_readers == 0) {
 			gen = fip->fi_rgen;
 			VOP_UNLOCK(vp, 0);
-			stops_deferred = sigallowstop();
+			stops_deferred = sigdeferstop(SIGDEFERSTOP_OFF);
 			error = msleep(&fip->fi_writers, PIPE_MTX(fpipe),
 			    PDROP | PCATCH | PSOCK, "fifoow", 0);
-			if (stops_deferred)
-				sigdeferstop();
+			sigallowstop(stops_deferred);
 			vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
 			if (error != 0 && gen == fip->fi_rgen) {
 				fip->fi_writers--;
diff --git a/sys/kern/kern_descrip.c b/sys/kern/kern_descrip.c
index 7f0ef8d..58a37f9 100644
--- a/sys/kern/kern_descrip.c
+++ b/sys/kern/kern_descrip.c
@@ -648,6 +648,16 @@ kern_fcntl(struct thread *td, int fd, int cmd, intptr_t arg)
 			PROC_UNLOCK(p->p_leader);
 			error = VOP_ADVLOCK(vp, (caddr_t)p->p_leader, F_SETLK,
 			    flp, flg);
+			/*
+			 * Do not automatically restart the lock
+			 * acquire, for two reasons.  First, POSIX
+			 * requires signal delivery to return EINTR.
+			 * Second, fairness of the lock acquisition,
+			 * ensured by queueing in lf_advlock(), would
+			 * be defeated by the retry.
+			 */
+			if (cmd == F_SETLKW && error == ERESTART)
+				error = EINTR;
 			break;
 		case F_WRLCK:
 			if ((fp->f_flag & FWRITE) == 0) {
@@ -659,6 +669,8 @@ kern_fcntl(struct thread *td, int fd, int cmd, intptr_t arg)
 			PROC_UNLOCK(p->p_leader);
 			error = VOP_ADVLOCK(vp, (caddr_t)p->p_leader, F_SETLK,
 			    flp, flg);
+			if (cmd == F_SETLKW && error == ERESTART)
+				error = EINTR;
 			break;
 		case F_UNLCK:
 			error = VOP_ADVLOCK(vp, (caddr_t)p->p_leader, F_UNLCK,
diff --git a/sys/kern/kern_lockf.c b/sys/kern/kern_lockf.c
index a0a3789..970a38d 100644
--- a/sys/kern/kern_lockf.c
+++ b/sys/kern/kern_lockf.c
@@ -1378,7 +1378,7 @@ lf_setlock(struct lockf *state, struct lockf_entry *lock, struct vnode *vp,
     void **cookiep)
 {
 	static char lockstr[] = "lockf";
-	int priority, error;
+	int error, priority, stops_deferred;
 
 #ifdef LOCKF_DEBUG
 	if (lockf_debug & 1)
@@ -1466,7 +1466,9 @@ lf_setlock(struct lockf *state, struct lockf_entry *lock, struct vnode *vp,
 		}
 
 		lock->lf_refs++;
+		stops_deferred = sigdeferstop(SIGDEFERSTOP_EINTR);
 		error = sx_sleep(lock, &state->ls_lock, priority, lockstr, 0);
+		sigallowstop(stops_deferred);
 		if (lf_free_lock(lock)) {
 			error = EDOOFUS;
 			goto out;
diff --git a/sys/kern/kern_sig.c b/sys/kern/kern_sig.c
index 75a1259..ee5d253 100644
--- a/sys/kern/kern_sig.c
+++ b/sys/kern/kern_sig.c
@@ -2596,41 +2596,81 @@ tdsigcleanup(struct thread *td)
 
 }
 
+static int
+sigdeferstop_curr_flags(int cflags)
+{
+
+	MPASS((cflags & (TDF_SEINTR | TDF_SERESTART)) == 0 ||
+	    (cflags & TDF_SBDRY) != 0);
+	return (cflags & (TDF_SBDRY | TDF_SEINTR | TDF_SERESTART));
+}
+
 /*
- * Defer the delivery of SIGSTOP for the current thread.  Returns true
- * if stops were deferred and false if they were already deferred.
+ * Defer the delivery of SIGSTOP for the current thread, according to
+ * the requested mode.  Returns previous flags, which must be restored
+ * by sigallowstop().
+ *
+ * TDF_SBDRY, TDF_SEINTR, and TDF_SERESTART flags are only set and
+ * cleared by the current thread, which allows the lock-less read-only
+ * accesses below.
  */
 int
-sigdeferstop(void)
+sigdeferstop(int mode)
 {
 	struct thread *td;
+	int cflags, nflags;
 
 	td = curthread;
-	if (td->td_flags & TDF_SBDRY)
-		return (0);
-	thread_lock(td);
-	td->td_flags |= TDF_SBDRY;
-	thread_unlock(td);
-	return (1);
+	cflags = sigdeferstop_curr_flags(td->td_flags);
+	switch (mode) {
+	case SIGDEFERSTOP_NOP:
+		nflags = cflags;
+		break;
+	case SIGDEFERSTOP_OFF:
+		nflags = 0;
+		break;
+	case SIGDEFERSTOP_SILENT:
+		nflags = (cflags | TDF_SBDRY) & ~(TDF_SEINTR | TDF_SERESTART);
+		break;
+	case SIGDEFERSTOP_EINTR:
+		nflags = (cflags | TDF_SBDRY | TDF_SEINTR) & ~TDF_SERESTART;
+		break;
+	case SIGDEFERSTOP_ERESTART:
+		nflags = (cflags | TDF_SBDRY | TDF_SERESTART) & ~TDF_SEINTR;
+		break;
+	default:
+		panic("sigdeferstop: invalid mode %x", mode);
+		break;
+	}
+	if (cflags != nflags) {
+		thread_lock(td);
+		td->td_flags = (td->td_flags & ~cflags) | nflags;
+		thread_unlock(td);
+	}
+	return (cflags);
 }
 
 /*
- * Permit the delivery of SIGSTOP for the current thread.  This does
- * not immediately suspend if a stop was posted.  Instead, the thread
- * will suspend either via ast() or a subsequent interruptible sleep.
+ * Restores the STOP handling mode, typically permitting the delivery
+ * of SIGSTOP for the current thread.  This does not immediately
+ * suspend if a stop was posted.  Instead, the thread will suspend
+ * either via ast() or a subsequent interruptible sleep.
  */
-int
-sigallowstop(void)
+void
+sigallowstop(int prev)
 {
 	struct thread *td;
-	int prev;
+	int cflags;
 
+	KASSERT((prev & ~(TDF_SBDRY | TDF_SEINTR | TDF_SERESTART)) == 0,
+	    ("sigallowstop: incorrect previous mode %x", prev));
 	td = curthread;
-	thread_lock(td);
-	prev = (td->td_flags & TDF_SBDRY) != 0;
-	td->td_flags &= ~TDF_SBDRY;
-	thread_unlock(td);
-	return (prev);
+	cflags = sigdeferstop_curr_flags(td->td_flags);
+	if (cflags != prev) {
+		thread_lock(td);
+		td->td_flags = (td->td_flags & ~cflags) | prev;
+		thread_unlock(td);
+	}
 }
 
 /*
diff --git a/sys/kern/kern_thread.c b/sys/kern/kern_thread.c
index 9af377e..c85b153 100644
--- a/sys/kern/kern_thread.c
+++ b/sys/kern/kern_thread.c
@@ -899,7 +899,7 @@ thread_suspend_check(int return_instead)
 {
 	struct thread *td;
 	struct proc *p;
-	int wakeup_swapper;
+	int wakeup_swapper, r;
 
 	td = curthread;
 	p = td->td_proc;
@@ -932,7 +932,21 @@ thread_suspend_check(int return_instead)
 		if ((td->td_flags & TDF_SBDRY) != 0) {
 			KASSERT(return_instead,
 			    ("TDF_SBDRY set for unsafe thread_suspend_check"));
-			return (0);
+			switch (td->td_flags & (TDF_SEINTR | TDF_SERESTART)) {
+			case 0:
+				r = 0;
+				break;
+			case TDF_SEINTR:
+				r = EINTR;
+				break;
+			case TDF_SERESTART:
+				r = ERESTART;
+				break;
+			default:
+				panic("both TDF_SEINTR and TDF_SERESTART");
+				break;
+			}
+			return (r);
 		}
 
 		/*
diff --git a/sys/kern/subr_trap.c b/sys/kern/subr_trap.c
index 6d1ac70..eb44087 100644
--- a/sys/kern/subr_trap.c
+++ b/sys/kern/subr_trap.c
@@ -160,7 +160,7 @@ userret(struct thread *td, struct trapframe *frame)
 	    ("userret: Returning with with pinned thread"));
 	KASSERT(td->td_vp_reserv == 0,
 	    ("userret: Returning while holding vnode reservation"));
-	KASSERT((td->td_flags & TDF_SBDRY) == 0,
+	KASSERT((td->td_flags & (TDF_SBDRY | TDF_SEINTR | TDF_SERESTART)) == 0,
 	    ("userret: Returning with stop signals deferred"));
 	KASSERT(td->td_su == NULL,
 	    ("userret: Returning with SU cleanup request not handled"));
diff --git a/sys/sys/mount.h b/sys/sys/mount.h
index f11f8f5..5438140 100644
--- a/sys/sys/mount.h
+++ b/sys/sys/mount.h
@@ -653,15 +653,15 @@ vfs_statfs_t	__vfs_statfs;
 
 #define	VFS_PROLOGUE(MP)	do {					\
 	struct mount *mp__;						\
-	int _enable_stops;						\
+	int _prev_stops;						\
 									\
 	mp__ = (MP);							\
-	_enable_stops = (mp__ != NULL &&				\
-	    (mp__->mnt_vfc->vfc_flags & VFCF_SBDRY) && sigdeferstop())
+	_prev_stops = sigdeferstop((mp__ != NULL &&			\
+	    (mp__->mnt_vfc->vfc_flags & VFCF_SBDRY) != 0) ?		\
+	    SIGDEFERSTOP_SILENT : SIGDEFERSTOP_NOP);
 
 #define	VFS_EPILOGUE(MP)						\
-	if (_enable_stops)						\
-		sigallowstop();						\
+	sigallowstop(_prev_stops);					\
 } while (0)
 
 #define	VFS_MOUNT(MP) ({						\
diff --git a/sys/sys/proc.h b/sys/sys/proc.h
index 629f7e8..1619d34 100644
--- a/sys/sys/proc.h
+++ b/sys/sys/proc.h
@@ -395,9 +395,9 @@ do {									\
 #define	TDF_NEEDRESCHED	0x00010000 /* Thread needs to yield. */
 #define	TDF_NEEDSIGCHK	0x00020000 /* Thread may need signal delivery. */
 #define	TDF_NOLOAD	0x00040000 /* Ignore during load avg calculations. */
-#define	TDF_UNUSED19	0x00080000 /* --available-- */
+#define	TDF_SERESTART	0x00080000 /* ERESTART on stop attempts. */
 #define	TDF_THRWAKEUP	0x00100000 /* Libthr thread must not suspend itself. */
-#define	TDF_UNUSED21	0x00200000 /* --available-- */
+#define	TDF_SEINTR	0x00200000 /* EINTR on stop attempts. */
 #define	TDF_SWAPINREQ	0x00400000 /* Swapin request due to wakeup. */
 #define	TDF_UNUSED23	0x00800000 /* --available-- */
 #define	TDF_SCHED0	0x01000000 /* Reserved for scheduler private use */
diff --git a/sys/sys/signalvar.h b/sys/sys/signalvar.h
index e574ec3..176bb2a 100644
--- a/sys/sys/signalvar.h
+++ b/sys/sys/signalvar.h
@@ -325,9 +325,21 @@ extern struct mtx	sigio_lock;
 #define	SIGPROCMASK_PROC_LOCKED	0x0002
 #define	SIGPROCMASK_PS_LOCKED	0x0004
 
+/*
+ * Modes for sigdeferstop().  Manages behaviour of
+ * thread_suspend_check() in the region delimited by
+ * sigdeferstop()/sigallowstop().  Must be restored to
+ * SIGDEFERSTOP_OFF before returning to userspace.
+ */
+#define	SIGDEFERSTOP_NOP	0 /* continue doing whatever is done now */
+#define	SIGDEFERSTOP_OFF	1 /* stop ignoring STOPs */
+#define	SIGDEFERSTOP_SILENT	2 /* silently ignore STOPs */
+#define	SIGDEFERSTOP_EINTR	3 /* ignore STOPs, return EINTR */
+#define	SIGDEFERSTOP_ERESTART	4 /* ignore STOPs, return ERESTART */
+
 int	cursig(struct thread *td);
-int	sigdeferstop(void);
-int	sigallowstop(void);
+int	sigdeferstop(int mode);
+void	sigallowstop(int prev);
 void	execsigs(struct proc *p);
 void	gsignal(int pgid, int sig, ksiginfo_t *ksi);
 void	killproc(struct proc *p, char *why);