Re: thread suspension when dumping core

From: Konstantin Belousov <kostikbel_at_gmail.com>
Date: Mon, 13 Jun 2016 08:03:34 +0300
On Mon, Jun 13, 2016 at 12:24:52AM +0200, Jilles Tjoelker wrote:
> On Thu, Jun 09, 2016 at 07:34:55AM +0300, Konstantin Belousov wrote:
> > On Wed, Jun 08, 2016 at 11:17:44PM +0200, Jilles Tjoelker wrote:
> > > On Wed, Jun 08, 2016 at 04:56:35PM +0300, Konstantin Belousov wrote:
> > > > On Wed, Jun 08, 2016 at 06:35:08AM -0700, Mark Johnston wrote:
> > > > > On Wed, Jun 08, 2016 at 07:30:55AM +0300, Konstantin Belousov wrote:
> > > > > > On Tue, Jun 07, 2016 at 11:19:19PM +0200, Jilles Tjoelker wrote:
> > > > > > > I also wonder whether we may be overengineering things here. Perhaps
> > > > > > > the advlock sleep can simply turn off TDF_SBDRY.
> > > > > > Well, this was the very first patch suggested.  I would be
> > > > > > fine with that, but again, out-of-tree code seems to be not
> > > > > > quite fine with that local solution.
> 
> > > > > In our particular case, we could possibly use a similar approach. In
> > > > > general, it seems incorrect to clear TDF_SBDRY if the thread calling
> > > > > sx_sleep() has any locks held. It is easy to verify that all callers of
> > > > > lf_advlock() are safe in this respect, but this kind of auditing is
> > > > > generally hard. In fact, I believe the sx_sleep that led to the problem
> > > > > described in D2612 is the same as the one in my case. That is, the
> > > > > sleeping thread may or may not hold a vnode lock depending on context.
> 
> > > > I do not think that in-tree code sleeps with a vnode lock held in
> > > > the lf_advlock().  Otherwise, system would hang in lock cascade by
> > > > an attempt to obtain an advisory lock.  I think we can even assert
> > > > this with witness.
> 
> > > > There is another sleep, which Jilles mentioned, in lf_purgelocks(),
> > > > called from vgone(). This sleep indeed occurs under the vnode lock, and
> > > > as such must be non-suspendable. The sleep waits until other threads
> > > > leave the lf_advlock() for the reclaimed vnode, and they should leave in
> > > > deterministic time due to issued wakeups.  So this sleep is exempt from
> > > > the considerations, and TDF_SBDRY there is correct.
> 
> > > > I am fine with either the braces around sx_sleep() in lf_advlock() to
> > > > clear TDF_SBDRY (sigdeferstsop()), or with the latest patch I sent,
> > > > which adds temporal override for TDF_SBDRY with TDF_SRESTART. My
> > > > understanding is that you prefer the later. If I do not mis-represent
> > > > your position, I understand why you do prefer that.
> 
> > > The TDF_SRESTART change does fix some more problems such as umount -f
> > > getting stuck in lf_purgelocks().
> 
> > > However, it introduces some subtle issues that may not necessarily be a
> > > sufficient objection.
> 
> > I did not see an objection in the notes below, but rather I read them
> > as an argument to return EINTR from the stop attempts from lf_advlock().
> 
> With these changes, the question is which bugs you want to fix and which
> bugs you want to leave unfixed or introduce.
> 
> * The status quo sometimes deadlocks or almost deadlocks with thread
>   suspension.
> 
> * Restarting the locking syscall after thread suspension using ERESTART
>   causes the suspended thread to be moved to the end of the queue and
>   the locked area to be re-evaluated which violates POSIX but probably
>   does not break applications.
> 
> * Interrupting the locking syscall after thread suspension with EINTR
>   breaks applications with the reasonable assumption that [EINTR] can
>   only occur because of signals, whenever a locking attempt is
>   suspended. This requires a change to the man page and probably some
>   patches to ports.
> 
> This could be avoided by allowing a thread to return to the user
> boundary (but not beyond) while staying in the queue but that is a
> fairly heavy API change. The thread would not be allowed to execute user
> code (a signal handler) since it may take indefinitely long before the
> thread would continue waiting for the lock, which matches exactly
> POSIX's specification of F_SETLKW not restarting after signal handlers.
What do you mean by this ?  Are you proposing to leave our lockf_entry
in the tree of the lockf graph, while allowing thread to proceed to
user boundary, and making ast handler to re-enter the lf_advlock() ?
At least, I cannot make any other sense of the text.

But IMO that would not solve the problem.  We could declare the sleep
in lf_setlock() as stoppable (by clearing TDF_SBDRY around) and that
gives the same effect without complications.  But I do not think that
this is acceptable, since our lockf_entry is in graph, and could block
arbitrary other processes participating in the adv locking while holding
kernel resources.

Of course, user thread can be suspended on boundary while already fully
owning a lock, and then we get a similar situation, at the outer look.
But it is quite different from the view on internal kernel structures,
which are in consistent state (as opposed to the granted but not taken
lock entry).

With the 'kernel-colored glasses' on, I think that spurious EINTRs are
acceptable.

> 
> > > Firstly, adding this closes the door on fixing signal handling for
> > > fcntl(F_SETLKW). Per POSIX, any caught signal interrupts
> > > fcntl(F_SETLKW), even if SA_RESTART is set for the signal, and the Linux
> > > man page documents the same. Our man page has documented that SA_RESTART
> > > behaves normally with fcntl(F_SETLKW) since at least FreeBSD 2.0. This
> > > could normally be fixed via  if (error == ERESTART) error = EINTR;  but
> > > that is no longer possible if there are [ERESTART] errors that should
> > > still restart.
> 
> > I added the translation to fcntl handler.
> 
> > > Secondly, fcntl(F_SETLKW) restarting after a stop may actually be
> > > observable, contrary to what I wrote before. This is due to the fair
> > > queuing. Suppose thread A has locked byte 1 a while ago and thread B is
> > > trying to lock byte 1 and 2 right now. Then thread C will be able to
> > > lock byte 2 iff thread B has not blocked yet. If thread C will not be
> > > allowed to lock byte 2 and will block on it, the TDF_SRESTART change
> > > will cause it to be awakened if thread B is stopped. When thread B
> > > resumes, the region to be locked will be recomputed. This scenario
> > > unambiguously violates the POSIX requirement but I don't know how bad it
> > > is.
> 
> > Indeed.
> 
> Hmm, changing ERESTART to EINTR doesn't really solve this. In most
> cases, either the application will retry with pretty much the same
> effect as if ERESTART was kept, or the application will consider the
> lock attempt failed and break (for example because it treats all error
> conditions equally or because it treats [EINTR] as a timeout in a naive
> SIGALRM-based timeout scheme).
> 
> > > Note that all these threads must be in separate processes because of
> > > fcntl locks' strange semantics.
> 
> > So below is the next level of over-engineering the stuff. I made it
> > unified on sigdeferstop() to avoid blowing up the KPI. I am not sure what
> > modes are needed by onefs, so I added SIGDEFERSTOP_ERESTART despite it
> > is not used by the kernel.
> 
> > lf_advlock() is not stoppable at all, with EINTR return.  In the previous
> > patches, only if the caller of lf_advlock() already set TDF_SBDRY, the
> > function modified the behaviour.
> 
> > I considered adding td_sbdry member to struct thread for managing sbdry
> > stops, but did not for now.  If one more flag appear to be used, I will
> > change that.
> 
> I suppose flock(2) can be restarted transparently (it is restarted after
> SA_RESTART signal handlers without violating any standards, even though
> temporarily taking an exclusive lock attempt from the queue may allow
> shared lock attempts to proceed), so lf_advlock() can use
> SIGDEFERSTOP_ERESTART which will be translated in fcntl if you want
> that.

I changed the wrapper around sx_xlock() in the lf_setlock() function to
	stops_deferred = sigdeferstop((lock->lf_flags & F_FLOCK) != 0 ?
	    SIGDEFERSTOP_ERESTART : SIGDEFERSTOP_EINTR);
	error = sx_sleep(lock, &state->ls_lock, priority, lockstr, 0);
	sigallowstop(stops_deferred);
which does what you described above for flock(2).  The translation
ERESTART->EINTR for fcntl(2) is left in place.

I looked at the important consumers of the interfaces, namely sendmail.
In mail.local, the code fails on any error from the lockf/flock/fcntl,
i.e. EINTR is not handled.  In the daemon code, EINTR causes simple
restart of the same lock request.

So answering your question in the start of the reply, I _believe_ that
the code that cares about fair grant order of adv lock, should be
similarly careful about EINTR. Broken code which fails on EINTR, cannot
be detected automatically, and really not deserves to be anxious about
for the suspension case.
Received on Mon Jun 13 2016 - 03:03:42 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:05 UTC