Re: stray irq13 at runtime

From: Bruce Evans <bde_at_zeta.org.au>
Date: Mon, 31 May 2004 00:31:29 +1000 (EST)

On Sun, 30 May 2004, Bruce Evans wrote:

> On Sat, 29 May 2004, Kris Kennaway wrote:
>
> > Since updating the i386 package machines the other day, they've all
> > experienced the following:
> >
> > May 29 21:24:53 <user.err> gohan28 kernel: stray irq13
> >
> > irq13: npx0                            2          0
> > stray irq13                            1          0
> >
> > This is not appearing during boot - those machines have been up for
> > hours before the interrupt occurs.

> ...
> I haven't figured out why the APIC case normally delivers both a normal
> (fast) interrupt and stray interrupt when we don't wait for the one
> interrupt that actually occurs.  One is counted as stray because it
> occurs after the bus_teardown_intr(), but both of them seem to occur
> after that.  So there seems to be a race or double counting somewhere.

I have now figured this out.  There is double counting.  Interrupts
are supposed to be counted per-device (more precisely, per group of
devices sharing an interrupt at a given time), with interrupts that
have no handler in effect being counted as for the special "stray"
device and counts being maintained until reboot for all previous
combinations of devices.  This has been broken.  Interrupts are now
counted per-vector and reported as being for the last group of devices
using the interrupt (so history is lost if the combination is changed),
and then if there are no devices already using the interrupt, interrupts
are counted again as "stray".  In this case and some others, the stray
interrupts really did come from the last group of devices causing the
interrupt, but they shouldn't be counted twice.
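
The double counting described above can be sketched roughly as follows.
This is a toy model only, not the FreeBSD interrupt code: the names
vector_count, stray_count, has_handler, deliver_irq and NVEC are all
hypothetical and chosen just for illustration.

```c
#include <stdbool.h>

#define NVEC 16		/* hypothetical number of interrupt vectors */

/* Per-vector counts, reported under the last group of devices' name. */
static unsigned long vector_count[NVEC];
/* Counts reported as "stray irqN". */
static unsigned long stray_count[NVEC];
/* Whether any handler is currently attached to the vector. */
static bool has_handler[NVEC];

/*
 * Model of the broken accounting: every interrupt is counted
 * per-vector, and one that arrives with no handler attached is
 * counted a second time as stray.
 */
static void
deliver_irq(int irq)
{
	vector_count[irq]++;
	if (!has_handler[irq])
		stray_count[irq]++;
}
```

In this model a single irq13 that arrives after the handler has been
torn down bumps both vector_count[13] and stray_count[13], so one
hardware event shows up in two lines of the statistics.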

I can duplicate your counts of 2 and 1 and explain them as follows:
- configure without "device apic" so that the other bug suite doesn't
  complicate things.  This gives initial counts of 1 for npx0 and
  stray irq13.
- run any program that causes an unmasked NPX exception.  This also
  causes an unmasked irq13 (because the recent optimization for edge
  triggering leaves irq13 enabled even when its handler has been torn
  down).  The irq13 is double-counted as for npx0 and stray irq13.
  Further unmasked NPX exceptions don't cause further irq13 because
  the first one was not properly handled.  The npx0 busy latch remains
  set, so further irq13's are masked by that although not by the PIC.
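
Why only one further irq13 shows up can be sketched with a small model
of the busy latch.  This is an illustration under stated assumptions,
not a description of the real hardware or driver: it assumes the latch
holds the irq13 line asserted until a handler clears it, and that the
edge-triggered input only sees a low-to-high transition.  The names
npx_model, npx_exception and npx_handler_ack are hypothetical.

```c
#include <stdbool.h>

struct npx_model {
	bool busy_latch;	/* set by an unmasked NPX exception */
	unsigned irq13_edges;	/* edges seen on the irq13 input */
};

/* An unmasked NPX exception tries to raise irq13. */
static void
npx_exception(struct npx_model *m)
{
	if (!m->busy_latch) {
		/* Line goes from low to high: exactly one edge. */
		m->busy_latch = true;
		m->irq13_edges++;
	}
	/* Latch already set: the line is still high, so no new edge. */
}

/*
 * What a working handler would do.  With the handler torn down this
 * never runs, so the latch stays set.
 */
static void
npx_handler_ack(struct npx_model *m)
{
	m->busy_latch = false;
}
```

Calling npx_exception() repeatedly without npx_handler_ack() leaves
irq13_edges at 1 in this model, which corresponds to the single extra
irq13 seen at runtime.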

Further irq13s for unmasked NPX exceptions don't happen for the APIC
case, although one wants to happen according to the PIC's IRR.

Summary:
- this bug really was harmless
- statistics for interrupt handling are more broken than I thought.

Bruce