Re: ALPHA4 panic in VM

From: Mark Johnston <markj_at_freebsd.org>
Date: Wed, 19 Sep 2018 17:20:34 -0400
On Wed, Sep 19, 2018 at 02:11:56PM -0700, Steve Kargl wrote:
> On Wed, Sep 19, 2018 at 05:02:11PM -0400, Mark Johnston wrote:
> > On Wed, Sep 19, 2018 at 01:01:52PM -0700, Steve Kargl wrote:
> > > I have the kernel and core file if more information is needed.
> > > 
> > > % cat info.2
> > > Dump header from device: /dev/ada0p3
> > >   Architecture: amd64
> > >   Architecture Version: 2
> > >   Dump Length: 2348281856
> > >   Blocksize: 512
> > >   Compression: none
> > >   Dumptime: Wed Sep 19 12:29:59 2018
> > >   Hostname: troutmask.apl.washington.edu
> > >   Magic: FreeBSD Kernel Dump
> > >   Version String: FreeBSD 12.0-ALPHA4 #0 r338505: Thu Sep  6 13:45:34 PDT 2018
> > >     kargl_at_troutmask.apl.washington.edu:/usr/obj/usr/src/amd64.amd64/sys/SPEW
> > >   Panic String: page fault
> > >   Dump Parity: 2676008548
> > >   Bounds: 2
> > >   Dump Status: good
> > > 
> > > % more core.txt.2
> > > Fatal trap 12: page fault while in kernel mode
> > > cpuid = 1; apic id = 11
> > > fault virtual address   = 0xffffb8000719a428
> > 
> > This seems to be the result of a bit-flip.  cred is 0xffffb8000719a400,
> > which is almost but not quite in the direct map.  In particular we have:
> > 
> > (kgdb) frame 10
> > #10 0xffffffff8083e07d in vm_object_destroy (object=<optimized out>) at /usr/src/sys/vm/vm_object.c:703
> > 703                     swap_release_by_cred(object->charge, object->cred);
> > (kgdb) p object
> > $8 = <optimized out>
> > (kgdb) p *(vm_object_t)$r13
> > $9 = {
> > ...
> >   cred = 0xffffb8000719a400,
> >   charge = 28672,
> >   umtx_data = 0x0
> > }
> > (kgdb) p *(struct ucred *)0xfffff8000719a400
> > $10 = {
> >   cr_ref = 5737, 
> >   cr_uid = 1001, 
> >   cr_ruid = 1001, 
> >   cr_svuid = 1001, 
> >   cr_ngroups = 7, 
> >   cr_rgid = 1001, 
> >   cr_svgid = 1001, 
> >   cr_uidinfo = 0xfffff80007285500, 
> >   cr_ruidinfo = 0xfffff80007285500, 
> >   cr_prison = 0xffffffff80a9de10 <prison0>, 
> > ... <more sane-looking ucred fields>
> > 
> > That is, flipping one of the bits in the fault address leads me to a
> > valid ucred.  This could in principle be the result of a software bug,
> > but I'd be more inclined to suspect the hardware.
> 
> Mark,
> 
> Thanks for looking into the problem.  This system has
> been running for probably 2 years or so without issues.
> I guess it's time to pull out memtest86+ (or similar)
> to see if hardware is starting to fail.

I'm not sure whether you're using ECC RAM, but if not, the system is
susceptible to silent random bit flips.
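
For anyone following along, the single-bit claim can be checked by XORing the two pointers from the kgdb session above. This is only a minimal sketch: the three addresses are copied from this thread, while the helper name differing_bits and the constant names are made up for illustration.

```python
# Addresses copied from the kgdb output quoted in this thread.
BAD_CRED = 0xffffb8000719a400    # object->cred as found in the core file
GOOD_CRED = 0xfffff8000719a400   # address at which a sane-looking ucred lives
FAULT_ADDR = 0xffffb8000719a428  # "fault virtual address" from core.txt.2


def differing_bits(a: int, b: int) -> list[int]:
    """Return the positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


flipped = differing_bits(BAD_CRED, GOOD_CRED)
fault_offset = FAULT_ADDR - BAD_CRED

print("differing bit positions:", flipped)
print("fault offset from cred: ", hex(fault_offset))
```

The two pointers differ in exactly one bit (bit 46), and the fault address sits 0x28 bytes past the corrupted pointer, which is consistent with a field of the ucred being dereferenced through the bad pointer.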
Received on Wed Sep 19 2018 - 19:20:40 UTC
