Re: thread suspension when dumping core

From: Jilles Tjoelker <jilles_at_stack.nl> Date: Mon, 13 Jun 2016 00:24:52 +0200 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:05 UTC

On Thu, Jun 09, 2016 at 07:34:55AM +0300, Konstantin Belousov wrote:
> On Wed, Jun 08, 2016 at 11:17:44PM +0200, Jilles Tjoelker wrote:
> > On Wed, Jun 08, 2016 at 04:56:35PM +0300, Konstantin Belousov wrote:
> > > On Wed, Jun 08, 2016 at 06:35:08AM -0700, Mark Johnston wrote:
> > > > On Wed, Jun 08, 2016 at 07:30:55AM +0300, Konstantin Belousov wrote:
> > > > > On Tue, Jun 07, 2016 at 11:19:19PM +0200, Jilles Tjoelker wrote:
> > > > > > I also wonder whether we may be overengineering things here. Perhaps
> > > > > > the advlock sleep can simply turn off TDF_SBDRY.
> > > > > Well, this was the very first patch suggested.  I would be
> > > > > fine with that, but again, out-of-tree code seems to be not
> > > > > quite fine with that local solution.

> > > > In our particular case, we could possibly use a similar approach. In
> > > > general, it seems incorrect to clear TDF_SBDRY if the thread calling
> > > > sx_sleep() has any locks held. It is easy to verify that all callers of
> > > > lf_advlock() are safe in this respect, but this kind of auditing is
> > > > generally hard. In fact, I believe the sx_sleep that led to the problem
> > > > described in D2612 is the same as the one in my case. That is, the
> > > > sleeping thread may or may not hold a vnode lock depending on context.

> > > I do not think that in-tree code sleeps with a vnode lock held in
> > > the lf_advlock().  Otherwise, system would hang in lock cascade by
> > > an attempt to obtain an advisory lock.  I think we can even assert
> > > this with witness.

> > > There is another sleep, which Jilles mentioned, in lf_purgelocks(),
> > > called from vgone(). This sleep indeed occurs under the vnode lock, and
> > > as such must be non-suspendable. The sleep waits until other threads
> > > leave the lf_advlock() for the reclaimed vnode, and they should leave in
> > > deterministic time due to issued wakeups.  So this sleep is exempt from
> > > the considerations, and TDF_SBDRY there is correct.

> > > I am fine with either the braces around sx_sleep() in lf_advlock() to
> > > clear TDF_SBDRY (sigdeferstsop()), or with the latest patch I sent,
> > > which adds temporal override for TDF_SBDRY with TDF_SRESTART. My
> > > understanding is that you prefer the later. If I do not mis-represent
> > > your position, I understand why you do prefer that.

> > The TDF_SRESTART change does fix some more problems such as umount -f
> > getting stuck in lf_purgelocks().

> > However, it introduces some subtle issues that may not necessarily be a
> > sufficient objection.

> I did not see an objection in the notes below, but rather I read them
> as an argument to return EINTR from the stop attempts from lf_advlock().

With these changes, the question is which bugs you want to fix and which
bugs you want to leave unfixed or introduce.

* The status quo sometimes deadlocks or almost deadlocks with thread
  suspension.

* Restarting the locking syscall after thread suspension using ERESTART
  causes the suspended thread to be moved to the end of the queue and
  the locked area to be re-evaluated which violates POSIX but probably
  does not break applications.

* Interrupting the locking syscall after thread suspension with EINTR
  breaks applications with the reasonable assumption that [EINTR] can
  only occur because of signals, whenever a locking attempt is
  suspended. This requires a change to the man page and probably some
  patches to ports.

This could be avoided by allowing a thread to return to the user
boundary (but not beyond) while staying in the queue but that is a
fairly heavy API change. The thread would not be allowed to execute user
code (a signal handler) since it may take indefinitely long before the
thread would continue waiting for the lock, which matches exactly
POSIX's specification of F_SETLKW not restarting after signal handlers.

> > Firstly, adding this closes the door on fixing signal handling for
> > fcntl(F_SETLKW). Per POSIX, any caught signal interrupts
> > fcntl(F_SETLKW), even if SA_RESTART is set for the signal, and the Linux
> > man page documents the same. Our man page has documented that SA_RESTART
> > behaves normally with fcntl(F_SETLKW) since at least FreeBSD 2.0. This
> > could normally be fixed via  if (error == ERESTART) error = EINTR;  but
> > that is no longer possible if there are [ERESTART] errors that should
> > still restart.

> I added the translation to fcntl handler.

> > Secondly, fcntl(F_SETLKW) restarting after a stop may actually be
> > observable, contrary to what I wrote before. This is due to the fair
> > queuing. Suppose thread A has locked byte 1 a while ago and thread B is
> > trying to lock byte 1 and 2 right now. Then thread C will be able to
> > lock byte 2 iff thread B has not blocked yet. If thread C will not be
> > allowed to lock byte 2 and will block on it, the TDF_SRESTART change
> > will cause it to be awakened if thread B is stopped. When thread B
> > resumes, the region to be locked will be recomputed. This scenario
> > unambiguously violates the POSIX requirement but I don't know how bad it
> > is.

> Indeed.

Hmm, changing ERESTART to EINTR doesn't really solve this. In most
cases, either the application will retry with pretty much the same
effect as if ERESTART was kept, or the application will consider the
lock attempt failed and break (for example because it treats all error
conditions equally or because it treats [EINTR] as a timeout in a naive
SIGALRM-based timeout scheme).

> > Note that all these threads must be in separate processes because of
> > fcntl locks' strange semantics.

> So below is the next level of over-engineering the stuff. I made it
> unified on sigdeferstop() to avoid blowing up the KPI. I am not sure what
> modes are needed by onefs, so I added SIGDEFERSTOP_ERESTART despite it
> is not used by the kernel.

> lf_advlock() is not stoppable at all, with EINTR return.  In the previous
> patches, only if the caller of lf_advlock() already set TDF_SBDRY, the
> function modified the behaviour.

> I considered adding td_sbdry member to struct thread for managing sbdry
> stops, but did not for now.  If one more flag appear to be used, I will
> change that.

I suppose flock(2) can be restarted transparently (it is restarted after
SA_RESTART signal handlers without violating any standards, even though
temporarily taking an exclusive lock attempt from the queue may allow
shared lock attempts to proceed), so lf_advlock() can use
SIGDEFERSTOP_ERESTART which will be translated in fcntl if you want
that.

-- 
Jilles Tjoelker