Re: 8.0RC2 amd64 - kernel panic running make buildworld

From: Kai Gallasch <gallasch_at_free.de> Date: Sat, 14 Nov 2009 02:21:21 +0100

On Fri, 13 Nov 2009 15:55:42 +0200,
Andriy Gapon <avg_at_icyb.net.ua> wrote:

> on 13/11/2009 15:48 Kai Gallasch said the following:
> > On Fri, 13 Nov 2009 10:08:45 +0200,
> > Andriy Gapon <avg_at_icyb.net.ua> wrote:
> >> Kai,
> >> I have a hunch, could you please try the following _sledgehammer_
> >> patch (only kernel build/install is needed):
> >> diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
> >> index 44b71f3..a456609 100644
> >> --- a/sys/amd64/amd64/pmap.c
> >> +++ b/sys/amd64/amd64/pmap.c
> >> @@ -2981,6 +2981,7 @@ setpte:
> >>  	 * Map the superpage.
> >>  	 */
> >>  	pde_store(pde, PG_PS | newpde);
> >> +	pmap_invalidate_all(pmap);
> >>
> >>  	pmap_pde_promotions++;
> >>  	CTR2(KTR_PMAP, "pmap_promote_pde: success for va %#lx"
> >>
> >> This will slow down an act of promotion to a superpage, but should
> >> not have any visible impact on overall performance.
> > 
> > Andriy,
> > 
> > I tried the patch with hw.mca.enabled="1", rebuilt the kernel
> > (although normally I never build kernels on Friday the 13th :-) and
> > ran buildworld -j8 five times in a row. No sign of a machine check
> > exception, no reboot.
> 
> I think that this is good news.
> This is not a fix, but the fact that it helps should help us find a
> proper solution.

Hi. The patch did help the server survive a make buildworld.

But now I have had another machine check exception on this server. It
happened with your patch active and vm.pmap.pg_ps_enabled="1". I was
copying data from a remote server over an NFS mount to the unstable
server. The destination was a local ZFS filesystem.

----------------

sonnenkraft:~ # MCA: CPU 7 UNCOR PCC OVER DTLB L1 error
MCA: Address 0xff800d860000

Fatal trap 28: machine check trap while in kernel mode
cpuid = 7; apic id = 07
instruction pointer	= 0x20:0xffffffff80e5f0b2
stack pointer	        = 0x28:0xffffff8241f8d7d0
frame pointer	        = 0x28:0xffffff8241f8da40
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, IOPL = 0
current process		= 0 (spa_zio_1)
[thread pid 0 tid 100193 ]
Stopped at      lzjb_compress+0x162:    leal    0x1(%rdx),%edi
db> bt
Tracing pid 0 tid 100193 td 0xffffff000732aab0
lzjb_compress() at lzjb_compress+0x162
zio_compress_data() at zio_compress_data+0xbe
zio_write_bp_init() at zio_write_bp_init+0xc2
zio_execute() at zio_execute+0x77
zio_ready() at zio_ready+0x124
zio_execute() at zio_execute+0x77
taskq_run() at taskq_run+0x13
taskqueue_run() at taskqueue_run+0x91
taskqueue_thread_loop() at taskqueue_thread_loop+0x3f
fork_exit() at fork_exit+0x12a
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffff8241f8dd30, rbp = 0 ---

----------------

After this I tried copying to the local ZFS filesystem over NFS again,
and again got an exception.

With vm.pmap.pg_ps_enabled="0" set in loader.conf and the server
rebooted, it survives the NFS copy and stays stable.

--Kai.