Re: sio vs sched_lock LOR (was: Re: kern/68442: panic - acquiring duplicate lock of same type: "sleepq chain")

From: Bruce Evans <bde_at_zeta.org.au> Date: Thu, 1 Jul 2004 13:54:42 +1000 (EST) · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:37:59 UTC

On Tue, 29 Jun 2004, Colin Percival wrote:

> At 08:39 29/06/2004, Daniel Lang wrote:
> >login: lock order reversal
> > 1st 0xc070a0c0 sched lock (sched lock) _at_ /usr/src/sys/kern/kern_proc.c:672
> > 2nd 0xc0745d40 sio (sio) _at_ /usr/src/sys/dev/sio/sio.c:3205
> >Stack backtrace:
> >backtrace(ffffffff,c0712948,c0712b00,c06e25c4,c07358dc) at 0xc051e066 = backtra2
> >witness_checkorder(c0745d40,9,c06b1e34,c85,3f8) at 0xc05393c4 = witness_checkor4
> >_mtx_lock_spin_flags(c0745d40,0,c06b1e34,c85,c0712498) at 0xc0516e9e = _mtx_loce
> >siocnputc(c06f9c40,63) at 0xc064375f = siocnputc+0x6b
> >cnputc(63) at 0xc05483ed = cnputc+0x4d
> >putchar(63,e5bd87e0) at 0xc053387a = putchar+0x92
> >kvprintf(c069d236,c05337e8,e5bd87e0,a,e5bd8800) at 0xc0533a87 = kvprintf+0x77
> >printf(c069d236,32cf1,0,32ce6,0,dd8,c9115828) at 0xc0533763 = printf+0x43
> >calcru(c91156e0,e5bd8ad0,e5bd8ad8,0,e5bd8a10) at 0xc051cb9e = calcru+0x1f2
> >[...]
>
> Regardless of the questionable validity of the other problems this machine
> is encountering,

The bug in calcru() was supposed to be fixed.  Anyway, printf() must work
from anywhere (including nested in itself).

> this LOR is real (and ugly).  calcru() asserts that it is
> holding sched_lock, but it can printf() about negative runtime or runtime
> going backwards.  With a serial console attached, these messages end up at
> siocnputc(), which needs sio_lock.
>
> Now, I presume that there is some reason why the locking order hard-coded
> into witness makes it illegal to pick up sio_lock while holding sched_lock,
> but I can't see it.  Maybe someone else can explain?

This is mianly just a bug in the witness order.  Console output must work
at any time, including when the "any" lock is held, so any locks used by
console output routines must be legal to pick up while holding the "any"
lock.

I had 2 different fixes for this and one related one queued:

(1) Witnessing of sio_lock can be broken by turning it off.  Note that
non-quietness of witnessing of sio lock is already turned off.

%%%
Index: sio.c
===================================================================
RCS file: /home/ncvs/src/sys/dev/sio/sio.c,v
retrieving revision 1.442
diff -u -2 -r1.442 sio.c
--- sio.c	25 Jun 2004 10:51:33 -0000	1.442
+++ sio.c	26 Jun 2004 23:11:13 -0000
_at__at_ -526,7 +644,7 _at__at_
 	while (sio_inited != 2)
 		if (atomic_cmpset_int(&sio_inited, 0, 1)) {
+			/* XXX might need MTX_QUIET instead of MTX_NOWITNESS.*/
 			mtx_init(&sio_lock, sio_driver_name, NULL,
-			    (comconsole != -1) ?
-			    MTX_SPIN | MTX_QUIET : MTX_SPIN);
+				 MTX_SPIN | MTX_NOWITNESS);
 			atomic_store_rel_int(&sio_inited, 2);
 		}
%%%

The patch also fixes some style bugs (non-non-KNF continuation indents).
It uses MTX_NOWITNESS unconditionally to avoid related problems when
the port is a debugger port.

(2) My version doesn't actually use sio_lock.  It disables interrupts for
the !SMP case and also uses home made spin locking based on atomic_cmpset()
in the SMP case.  Neither of these is witnessed, so the problem goes away
in the same way as for MTX_NOWITNESS.

(3) It is an error to acquire a lock while ddb is active, but sio console
output acquires sio_lock (if it is not already held) for ddb output.  The
lock should only be acquired if !db_active.

I'm surprised that this LOR isn't more common.  Console output is done
with sched_lock any time code that holds sched_lock is debugged.  I
normally wouldn't see the problem since I don't use WITNESS much.

Bruce