Re: crash in sb-concurrency tests after r216641 on x86-64/freebsd9/sb-thread

From: Jilles Tjoelker <jilles_at_stack.nl>
Date: Sat, 19 Nov 2011 17:59:08 +0100

On Wed, Oct 12, 2011 at 12:00:07AM +0000, Nali Toja wrote:
> After r216641 sbcl built with sb-thread dies on mailbox tests. It also
> dies when I try to complete a symbol in slime. The workaround seems to
> be to revert libthr to r216640.

>   http://www.freebsd.org/cgi/query-pr.cgi?pr=ports/154050
>   http://svn.freebsd.org/changeset/base/216641
>   http://www.freshports.org/lang/sbcl # or see ports/161444 for sbcl-1.0.52

> Any clue whether it's a FreeBSD bug or a SBCL bug? I've Bcc'd sbcl-bugs_at_
> in case it's the latter one.

[snip]
>   Fatal error 'thread was already on queue.' at line 222 in file /usr/src/lib/libthr/thread/thr_cond.c (errno = 0)
[snip]
>   (gdb) bt
>   #0  0x0000000800c5c7ec in _umtx_op_err () at /usr/src/lib/libthr/arch/amd64/amd64/_umtx_op_err.S:37
>   #1  0x0000000800c5423e in _thr_umtx_timedwait_uint (mtx=0x8006d4ea8, id=0, clockid=0, abstime=0x0, shared=0) at /usr/src/lib/libthr/thread/thr_umtx.c:214
>   #2  0x0000000800c5c04b in _thr_sleep (curthread=0x828d5d400, clockid=0, abstime=0x0) at /usr/src/lib/libthr/thread/thr_kern.c:212
>   #3  0x0000000800c5f5dd in cond_wait_user (cvp=0x828fdf850, mp=0x828fe0970, abstime=0x0, cancel=1) at /usr/src/lib/libthr/thread/thr_cond.c:243
>   #4  0x0000000800c5f856 in cond_wait_common (cond=0x8480f0008, mutex=0x8480f0000, abstime=0x0, cancel=1) at /usr/src/lib/libthr/thread/thr_cond.c:299
>   #5  0x0000000800c5f8b7 in __pthread_cond_wait (cond=0x8480f0008, mutex=0x8480f0000) at /usr/src/lib/libthr/thread/thr_cond.c:313
>   #6  0x00000008009e9fa0 in pthread_cond_wait_exp (p0=0x8480f0008, p1=0x8480f0000) at /usr/src/lib/libc/gen/_pthread_stubs.c:217
>   #7  0x0000000000413574 in wait_for_thread_state_change (thread=0x8480f0010, state=16) at thread.h:53
>   #8  0x00000000004133a8 in sig_stop_for_gc_handler (signal=31, info=0x847eef630, context=0x847eef2c0) at interrupt.c:1265
>   #9  0x000000000041427d in low_level_handle_now_handler (signal=31, info=0x847eef630, void_context=0x847eef2c0) at interrupt.c:1729
>   #10 0x00007ffffffff023 in ?? ()
>   #11 0x0000000000414220 in low_level_unblock_me_trampoline () at interrupt.c:1723
>   #12 0x000000100154c990 in ?? ()
>   #13 0x000000000063eaa0 in interrupt_handlers ()
>   #14 0x0000000200411d4f in ?? ()
>   #15 0x0000001003375721 in ?? ()
>   #16 0x38b485040000001a in ?? ()
>   #17 0x00000000000a81f0 in ?? ()
>   #18 0x0000000000000000 in ?? ()
>   #19 0x0000000847eef840 in ?? ()
>   #20 0x0000001003af2a2f in ?? ()
>   #21 0x0000002004e9c3e1 in ?? ()
>   #22 0x0000000800c58570 in _sigprocmask (how=Could not find the frame base for "_sigprocmask".
>   ) at /usr/src/lib/libthr/thread/thr_sig.c:584
>   Previous frame inner to this frame (corrupt stack?)
>   (gdb) f 7
>   #7  0x0000000000413574 in wait_for_thread_state_change (thread=0x8480f0010, state=16) at thread.h:53
>   53              pthread_cond_wait(thread->state_cond, thread->state_lock);
>   (gdb) l
>   48      static inline void
>   49      wait_for_thread_state_change(struct thread *thread, lispobj state)
>   50      {
>   51          pthread_mutex_lock(thread->state_lock);
>   52          while (thread->state == state)
>   53              pthread_cond_wait(thread->state_cond, thread->state_lock);
>   54          pthread_mutex_unlock(thread->state_lock);
>   55      }
>   56
>   57      extern pthread_key_t lisp_thread;

The cause of the trouble appears to be that pthread_cond_wait() is
interrupted by a signal handler, and the handler then calls
pthread_cond_wait() again (whether on the same or a different condition
variable makes no difference). In the backtrace above,
sig_stop_for_gc_handler() does this via wait_for_thread_state_change().
POSIX forbids this because pthread_cond_wait(), like most of the pthread
functions, is not async-signal-safe.

While the pre-r216641 code is not async-signal-safe either, it would
usually work fine. With the r216641 code, the second call to
pthread_cond_wait() aborts immediately with the 'thread was already on
queue' message.

The immediate issue could be fixed in libthr fairly easily by having its
existing code for postponing signal handlers also apply during
pthread_cond_wait() (a signal would still interrupt the wait). However,
this would not fix the problems caused by signal handlers interrupting
other pthread functions, which could still lead to erratic, undefined
behaviour. Therefore, it may not be desirable to do this.

An alternative is to use pthread_suspend_np(). This function waits for
the target thread to stop before returning, although the thread may stop
almost anywhere. I have not tried this, but calling it on a thread that
is in pthread_cond_wait() should be safe.

Ideally, it would not be necessary to stop all other threads while
collecting garbage, but this may be hard to fix.

-- 
Jilles Tjoelker
Received on Sat Nov 19 2011 - 15:59:12 UTC