Re: 8.0RC2 amd64 - kernel panic running make buildworld

From: Kai Gallasch <gallasch_at_free.de>
Date: Wed, 4 Nov 2009 01:17:16 +0100
On Tue, 03 Nov 2009 10:42:40 +0000
Gavin Atkinson <gavin_at_FreeBSD.org> wrote:

> On Sat, 2009-10-31 at 23:15 +0100, Kai Gallasch wrote:
> > Hi.
> > 
> > I installed 8.0RC2-amd64 on an 8-core opteron server a few days ago.
> > 
> > When I try to do a make buildworld or make buildkernel the server
> > reboots without any message left in the logs. The same happens
> > when building bigger ports (for example ruby18 or perl58)

> First place I think I'd start is by running memtest86 on the machine
> overnight.  This sounds like a possible hardware issue to me; it would
> be good to see if we can confirm that that is the case.

I will do so tomorrow. I have already taken the following actions to
rule out a hardware problem:

- ran several passes with diagnostic software from the manufacturer
- reset BIOS settings to default
- upgraded BIOS to newest release
- booted server from 2 year old backup BIOS
- took out the only pair of RAM modules that was different from the
  rest of the modules
- installed FreeBSD 7.2-STABLE on the server to try to reproduce the
  kernel panic (no panic with 7.2)
- installed 8.0-BETA4 (crash)

Besides, the server was in production with 7.2 for some time without
showing any such problems.

> > Now my idea was to install the old 8.0-BETA4 and upgrade to RC2
> > through makeworld + buildkernel (gdb+witness). But no luck. When
> > trying to upgrade to RC2 the 8.0-BETA4 also crashes. At least
> > 8.0-BETA4 has debug + witness active in the GENERIC kernel.
> > 
> > So below some debug output of 8.0-BETA4 crashing. Has a vfs/ffs LOR
> > problem with the BETA4 already been fixed?
> 
> The debug output you included was just lock order reversals, and
> doesn't seem to be related to your crash.

Sorry for causing possible confusion about this. I realized this after
my mail was already out.

> I think 8.0-BETA4 still had the debugger compiled in (you can test by
> pressing ctrl-alt-escape on the console; if you do drop to the
> debugger, give the "c" command to continue).
> 
> If the debugger is compiled in, then the spontaneous reboot without
> dropping to the debugger suggests even more that it may be hardware
> related.  If you do get to the debugger, a copy of all of the messages
> on screen and the output of the "bt" command would be very useful.
> When you do your kernel recompile, please include full debugging,
> including WITNESS, INVARIANTS, KDB, DDB etc.

In the meantime I managed to install a RELENG_8 world + GENERIC
kernel with all debug options enabled on the crashing server. (I mounted
/usr/src and /usr/obj over NFS on another server running 8.0RC1 and
did buildworld + buildkernel over there.)
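
As an illustration only (this is not the configuration used on the
server in question), a debug kernel config along the lines Gavin asks
for could be generated roughly like this. The helper name
write_debug_kernconf and the ident DEBUG are assumptions made for the
sketch; the option names are the standard FreeBSD ones.

```shell
#!/bin/sh
# Sketch: write a kernel config that layers debugging options on GENERIC.
# The function name and the ident "DEBUG" are assumptions for illustration.
write_debug_kernconf() {
    cat > "$1" <<'EOF'
include GENERIC
ident   DEBUG

makeoptions DEBUG=-g        # build the kernel with debug symbols
options KDB                 # kernel debugger framework
options DDB                 # interactive debugger on the console
options INVARIANTS          # extra run-time consistency checks
options INVARIANT_SUPPORT   # support code required by INVARIANTS
options WITNESS             # lock order verification
EOF
}

# Typical use on a FreeBSD amd64 source tree:
#   write_debug_kernconf /usr/src/sys/amd64/conf/DEBUG
#   cd /usr/src && make buildkernel KERNCONF=DEBUG
```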

So now I have a debug kernel available with dumpdev + dumpdir defined.
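
For reference, crash dump settings of this kind normally live in
/etc/rc.conf. A minimal sketch follows; the values shown are common
choices and are assumptions, not copied from the affected server.

```shell
# /etc/rc.conf fragment (plain sh variable assignments).
# Both values are assumptions for illustration.
dumpdev="AUTO"        # dump to the first swap device listed in /etc/fstab
dumpdir="/var/crash"  # directory where savecore(8) stores the dump at boot
```

The kernel writes the dump to dumpdev when it panics, and savecore(8)
copies it into dumpdir on the next boot, so a machine that resets
without panicking leaves nothing in either place.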

Here are my latest findings on this issue:

- In about 80% of the cases, running makeworld crashes the server
  without a crashdump being written to dumpdir. The server just reboots.
- In about 20% of the cases, makeworld gets stuck in a non-terminating
  process that uses 100% CPU. This process cannot be killed. When
  makeworld is restarted, the server reboots again.
- It makes no difference whether makeworld runs with -j1 or -j8; the
  result is the same.
 
> It depends what the bug is to be honest.  So far there isn't really
> enough information to determine the cause, and therefore there isn't
> really enough info for a PR.

Mark Atkinson also commented on my mail and gave this
hint: "If vm.pmap.pg_ps_enabled is 1 in 8.0-rc2, you might try
rebooting with vm.pmap.pg_ps_enabled=0 in /boot/loader.conf and try
another buildworld."

So I thought why not and just tried it - and surprise:

Setting vm.pmap.pg_ps_enabled=0 in loader.conf resolves my problem
with 8.0RC2 crashing when doing a makeworld.

After a successful buildworld and buildkernel I rebooted the server
again with the vm.pmap.pg_ps_enabled=0 line commented out, and the
problem was there again. Then I set vm.pmap.pg_ps_enabled=0 again in
loader.conf, rebooted and ran make buildworld: no problem.

Seems to be deterministic. With vm.pmap.pg_ps_enabled=1 the server
crashes without being able to write crashdumps to dumpdev (at least on
this specific ProLiant DL385 G2 server).
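
For anyone who wants to try the same workaround, the change amounts to
one line in /boot/loader.conf. The sketch below shows one way to apply
it idempotently; set_loader_tunable is a hypothetical helper written
for this note, not a FreeBSD utility, and it takes the file path as an
argument so it can be tried on a copy first.

```shell
#!/bin/sh
# Sketch: force a tunable to a given value in a loader.conf-style file.
# set_loader_tunable is a hypothetical helper name, not a FreeBSD tool.
set_loader_tunable() {
    conf=$1; name=$2; value=$3
    tmp="${conf}.tmp.$$"
    # Keep every line that does not already assign this tunable.
    if [ -f "$conf" ]; then
        awk -F= -v n="$name" '$1 != n' "$conf" > "$tmp"
    else
        : > "$tmp"
    fi
    # Append the new assignment in loader.conf's name="value" form.
    printf '%s="%s"\n' "$name" "$value" >> "$tmp"
    mv "$tmp" "$conf"
}

# On the affected machine (as root), then reboot:
#   set_loader_tunable /boot/loader.conf vm.pmap.pg_ps_enabled 0
# After the reboot, confirm the value the kernel is using:
#   sysctl vm.pmap.pg_ps_enabled
```

Commented-out lines are left untouched, so an earlier
"#vm.pmap.pg_ps_enabled=0" entry stays in the file next to the new one.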

--Kai.


-- 
You need more time; and you probably always will.
Received on Tue Nov 03 2009 - 23:17:36 UTC
