Why don't we check for preemption when we unlend priority?

From: Ryan Stone <rysto32@gmail.com>
Date: Fri, 25 Jan 2013 15:01:14 -0500
I'm having a problem where userland threads keep running with a loaned (via
mutex priority propagation) priority even after they have released the
mutex.  This is causing a low-priority userland thread to starve
high-priority interrupt threads.  Here is one scenario (which I am seeing
on FreeBSD 8.2, but the priority lending code does not seem to have changed
in HEAD):

1) I have several swi threads that are pulling packets off of interfaces
and forwarding them.  Each swi thread is pinned to a CPU.

2) A remote user initiates an scp of a large file to this machine.

3) sshd (which handles the scp) takes a mutex on its socket

4) An interrupt or taskqueue thread belonging to the driver that is
receiving the scp traffic tries to take the same mutex.  It can't and so it
lends its priority to the sshd thread.

5) My swi thread is woken.  It is pinned to the same CPU as sshd, but it
has a lower priority than the lent priority, so it is placed on the
runqueue.

6) The sshd thread releases the mutex.  sched_unlend_prio is called to set
its priority back to a normal user priority.  However, it does not check
whether any higher-priority threads are waiting in the runqueue.

7) The sshd thread is allowed to run for its full quantum (100ms).  That
delay is easily long enough for my swi thread to fall behind, and I drop
packets.


This is similar to a problem fixed by jhb@ a while ago, which was discussed
here:
http://lists.freebsd.org/pipermail/freebsd-arch/2010-December/010835.html

However, in my case the thread being starved runs at an ithread priority,
not as a userland realtime thread.

I don't see why sched_unlend_prio can't check for preemption.  I know that
in jhb's case we didn't want to check for preemption in userret for
performance reasons.  In my case, sched_unlend_prio already holds the
relevant scheduler lock (and hopefully mutex contention is a rare case
anyway), so I don't see that the same performance argument applies.

The following proof-of-concept patch for ULE resolves the starvation for
me.  Any comments?

diff --git a/sys/kern/sched_ule.c b/sys/kern/sched_ule.c
index 01559ee..2659614 100644
--- a/sys/kern/sched_ule.c
+++ b/sys/kern/sched_ule.c
@@ -1586,6 +1586,22 @@ sched_pctcpu_update(struct td_sched *ts)
        ts->ts_ftick = ts->ts_ltick - SCHED_TICK_TARG;
 }

+static void
+sched_check_preempt(struct tdq *tdq, struct thread *td)
+{
+
+       KASSERT(TD_IS_RUNNING(td), ("thread is not running"));
+       TDQ_LOCK_ASSERT(tdq, MA_OWNED);
+       KASSERT(tdq == TDQ_CPU(td->td_sched->ts_cpu), ("tdq does not match td's cpu"));
+
+       if (tdq == TDQ_SELF()) {
+               if (td->td_priority > tdq->tdq_lowpri &&
+                   sched_shouldpreempt(tdq->tdq_lowpri, td->td_priority, 0))
+                       td->td_owepreempt = 1;
+       } else
+               tdq_notify(tdq, td);
+}
+
 /*
  * Adjust the priority of a thread.  Move it to the appropriate run-queue
  * if necessary.  This is the back-end for several priority related
@@ -1635,8 +1651,10 @@ sched_thread_priority(struct thread *td, u_char prio)
                td->td_priority = prio;
                if (prio < tdq->tdq_lowpri)
                        tdq->tdq_lowpri = prio;
-               else if (tdq->tdq_lowpri == oldpri)
+               else if (tdq->tdq_lowpri == oldpri) {
                        tdq_setlowpri(tdq, td);
+                       sched_check_preempt(tdq, td);
+               }
                return;
        }
        td->td_priority = prio;
Received on Fri Jan 25 2013 - 19:01:22 UTC