Re: panic: arithmetic trap in fpurstor() in sys/i386/isa/npx.c

From: Bruce Evans <bde_at_zeta.org.au>
Date: Thu, 8 Jul 2004 18:25:51 +1000 (EST)

On Thu, 1 Jul 2004, Eric van Gyzen wrote:

> Bruce et al.:
>
> I apologize for reviving this old problem.  It became irrelevant to me for a
> few months, but now it's relevant again.
>
> Backing out rev 1.216 of vm_machdep.c fixed the problem.  I can no longer
> panic these machines.
>
> Would you still like to see the value and contents of union savefpu *addr?

I think this is the same problem that was reported in PR 68058 and
fixed in revs 1.150 and 1.151 of npx.c, so there is no need for further
investigation.  Backing out rev. 1.216 of vm_machdep.c would have
helped by removing one trigger for the problem (one at exit time), but
the problem can also be triggered by signal handling so the fixes in npx.c
should be applied to 5.2R to fix it completely (or you can get a simpler
1-line fix from the PR followup).

> Bruce Evans wrote:
> > On Thu, 19 Feb 2004, Eric van Gyzen wrote:
> > > I can reliably panic 5.2-RELEASE GENERIC running on three different AMD
> > > Athlon CPUs with:
> > >
> > >   # echo 'q()' | R --no-save
> > >
> > > R is ports/math/R-letter, and q() just tells R to quit.  This does not
> > > happen on an AthlonMP or P3 running the same kernel.  It did not happen

The known problem affects systems without SSE (or with SSE disabled).
AthlonXP and up and P3 have SSE.  Your Athlons with the problem are
presumably older.

> > > on the same three Athlon machines while running 5.1-RELEASE.  Some simple
> > > gdb debugging follows.  If you need more info, please ask; I don't debug
> > > the kernel very often, so I'm not sure what to provide.  :-/

5.1R has the known problem, but not the trigger for it at exit time.

> > Try backing out rev.1.216 of vm_machdep.c.  I don't see exactly how this
> > commit could cause the problem, but it is the only related thing that has
> > changed since 5.1, and the first part of it has several bugs (it is a
> > layering violation and is missing explicit disabling of interrupts).

It triggered the problem by not accidentally initializing the npx state at
exit time (this was its main point -- the initialization is mainly just a
pessimization), so state with a pending error in it was passed to the next
process to use the CPU; then since initialization for the new process had
become too lazy, the error in the old state bit the next process.
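
A toy user-space model of that sequence may make it easier to follow.  It
is only a sketch, not kernel code: every toy_* name is made up for
illustration, struct toy_npx is not the kernel's union savefpu, and the
model assumes (as explained later in this message) that frstor checks the
state already in the FPU for a pending unmasked exception before it loads
the new state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the part of the x87 state that matters here. */
struct toy_npx {
	uint16_t cw;	/* control word: low 6 bits mask exceptions */
	uint16_t sw;	/* status word: low 6 bits flag exceptions */
};

#define TOY_EXC_BITS	0x3Fu	/* the six x87 exception bits */
#define TOY_ZE		0x04u	/* zero-divide flag / mask bit */
#define TOY_INIT_CW	0x037Fu	/* control word after fninit: all masked */

static struct toy_npx toy_hw;	/* state currently held by the "FPU" */
static int toy_traps;		/* arithmetic traps seen so far */

/* True if the state has an exception flagged that is not masked. */
static bool
toy_pending(const struct toy_npx *s)
{
	return ((s->sw & ~s->cw & TOY_EXC_BITS) != 0);
}

/* Model of fninit: a no-wait reset to the all-masked, no-flags state. */
static void
toy_fninit(void)
{
	toy_hw.cw = TOY_INIT_CW;
	toy_hw.sw = 0;
}

/* Model of frstor: traps on the OLD state first, then loads the new one. */
static void
toy_frstor(const struct toy_npx *new_state)
{
	if (toy_pending(&toy_hw))
		toy_traps++;
	toy_hw = *new_state;
}

/*
 * An exiting process leaves an unmasked zero-divide error in the FPU.
 * The next process then lazily restores a clean state.  Returns the
 * number of traps that the next process sees.
 */
int
toy_exit_then_switch(bool init_at_exit)
{
	const struct toy_npx clean = { TOY_INIT_CW, 0 };

	toy_traps = 0;
	toy_hw.cw = TOY_INIT_CW & ~TOY_ZE;	/* zero-divide unmasked */
	toy_hw.sw = TOY_ZE;			/* zero-divide flagged */
	if (init_at_exit)
		toy_fninit();			/* behaviour before rev 1.216 */
	toy_frstor(&clean);			/* first FPU use by next process */
	return (toy_traps);
}
```

In this model toy_exit_then_switch(true) sees no trap and
toy_exit_then_switch(false) sees one, which corresponds to the difference
between 5.1R and 5.2R described above.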

> > > panic: arithmetic trap
> > > ...
> > > (kgdb) list *0xc07e07b4
> > > 0xc07e07b4 is in fpurstor (/usr/src/sys/i386/isa/npx.c:986).
> > > [snip]
> > >
> > > (kgdb) list 976,987
> > > 976     static void
> > > 977     fpurstor(addr)
> > > 978             union savefpu *addr;
> > > 979     {
> > > 980
> > > 981     #ifdef CPU_ENABLE_SSE
> > > 982             if (cpu_fxsr)
> > > 983                     fxrstor(addr);
> > > 984             else
> > > 985     #endif
> > > 986                     frstor(addr);
> > > 987     }

The simplest fix is to add an fnclex() to the else clause here (non-SSE
non-fxsr case only).  Since you found that it was the frstor() and not
the fxrstor() that trapped, it is clear that you saw the problem on old
Athlons (or CPU_ENABLE_SSE is not configured).
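
Roughly, the simple variant would look like the sketch below.  This is not
the change committed in revs 1.150 and 1.151 of npx.c.  To make it
compilable outside the kernel, fnclex(), frstor() and fxrstor() are
replaced by hypothetical stubs that only record the call order, union
savefpu is an opaque placeholder, and call_log, note() and run_fpurstor()
are invented helpers.

```c
#include <string.h>

#define CPU_ENABLE_SSE 1	/* assumed configured so both branches compile */

union savefpu { char opaque[512]; };	/* placeholder, not the real layout */

static int cpu_fxsr;		/* nonzero when fxsave/fxrstor are usable */
static char call_log[64];	/* records which stubs ran, in order */

static void
note(const char *name)
{
	strcat(call_log, name);
	strcat(call_log, ";");
}

/* Stubs standing in for the inline-asm wrappers in the kernel. */
static void fnclex(void) { note("fnclex"); }
static void frstor(union savefpu *addr) { (void)addr; note("frstor"); }
static void fxrstor(union savefpu *addr) { (void)addr; note("fxrstor"); }

static void
fpurstor(union savefpu *addr)
{
#ifdef CPU_ENABLE_SSE
	if (cpu_fxsr)
		fxrstor(addr);
	else
#endif
	{
		/* Discard pending exceptions so that frstor cannot trap on them. */
		fnclex();
		frstor(addr);
	}
}

/* Helper for trying the sketch: returns the recorded call order. */
const char *
run_fpurstor(int fxsr)
{
	static union savefpu state;

	call_log[0] = '\0';
	cpu_fxsr = fxsr;
	fpurstor(&state);
	return (call_log);
}
```

With the stubs, run_fpurstor(0) records fnclex before frstor, and
run_fpurstor(1) records only fxrstor, so the cpu_fxsr path is unchanged.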

> > frstor() can only cause an arithmetic trap on broken CPUs.  I doubt
> > that Athlons are that broken, so this trap is mysterious.  frstor()
> > doesn't even trap for plain i386's; it may cause a bogus IRQ13 which
> > the kernel has to be careful not to turn into an arithmetic trap.

Actually, frstor() is broken as designed and can trap on all CPUs and
NPX's that have it.  fxrstor() works right, so there is no problem in
the cpu_fxsr case.

Bruce
Received on Thu Jul 08 2004 - 06:25:56 UTC