Re: Fatal trap 12: page fault on Acer Chromebook 720 (peppy)

From: Michael Gmelin <freebsd_at_grem.de> Date: Wed, 15 Aug 2018 00:51:06 +0200

On Wed, 6 Jun 2018 01:06:25 +0200
Michael Gmelin <freebsd_at_grem.de> wrote:

> On Tue, 5 Jun 2018 16:11:35 +0300
> Konstantin Belousov <kostikbel_at_gmail.com> wrote:
> 
> > On Mon, Jun 04, 2018 at 11:17:56PM +0200, Michael Gmelin wrote:  
> > > 
> > > 
> > > On Mon, 4 Jun 2018 14:06:55 +0300
> > > Konstantin Belousov <kostikbel_at_gmail.com> wrote:
> > >     
> > > > On Mon, Jun 04, 2018 at 12:46:32AM +0200, Michael Gmelin
> > > > wrote:    
> > > > > 
> [...]
> > > > > > > > > This machine comes with it by default (my model was
> > > > > > > > > delivered with SeaBIOS 20131018_145217-build121-m2).
> > > > > > > > > So I didn't flash anything (didn't feel like bricking
> > > > > > > > > it). 
> > > > > > > > > >           
> > > > > > > > > > > > kernel trap 12 with interrupts disabled
> > > > > > > > > > > > 
> > > > > > > > > > > > Fatal trap 12: page fault while in kernel mode
> > > > > > > > > > > > cpuid = 0; apic id = 00
> > > > > > > > > > > > fault virtual address    = 0xfffff80001000000
> > > > > > > > > > > > fault code               = supervisor write data, protection violation
> > > > > > > > > > > > instruction pointer      = 0x20:0xffffffff8102955f
> > > > > > > > > > > > stack pointer            = 0x28:0xffffffff82a79be0
> > > > > > > > > > > > frame pointer            = 0x28:0xffffffff82a79c10
> > > > > > > > > > > > code segment             = base 0x0, limit 0xfffff, type 0x1b
> > > > > > > > > > > >                          = DPL 0, pres 1, long 1, def32 0, gran 1
> > > > > > > > > > > > processor eflags         = resume, IOPL = 0
> > > > > > > > > > > > current process          = 0 ()
> > > > > > > > > > > > [ thread pid 0 tid 0 ]
> > > > > > > > > > > > Stopped at      native_start_all_aps+0x08f: movq %rax,(%rsi)
> > > > > > > > > > Look up the source line number for this address.
> > > > > > > > > >           
> > > > > > > > > 
> > > > > > > > > I guess that's sys/amd64/amd64/support.S line 854 (in
> > > > > > > > > rdmsr), called by native_start_all_aps. Any additional
> > > > > > > > > hints how I can track it down?          
> > > > > > > > > Why did you decide that this is rdmsr_safe()? First,
> > > > > > > > > native_start_all_aps() does not call rdmsr; second, the
> > > > > > > > > ddb report clearly indicates that the fault occurred
> > > > > > > > > accessing the DMAP in native_start_all_aps().
> > > > > > > > 
> > > > > > > > Just look up the source line by the address
> > > > > > > > native_start_all_aps+0x08f.        
> > > > > > > 
> > > > > > > > Okay, according to kgdb this should be here:
> > > > > > > 
> > > > > > > https://svnweb.freebsd.org/base/head/sys/amd64/amd64/mp_machdep.c?revision=333368&view=markup#l369
> > > > > > > 
> > > > > > > 364
> > > > > > > 365    /* Create the initial 1GB replicated page tables */
> > > > > > > 366    for (i = 0; i < 512; i++) {
> > > > > > > 367            /* Each slot of the level 4 pages points to
> > > > > > > the same level 3 page */ 368            pt4[i] =
> > > > > > > (u_int64_t)(uintptr_t)(mptramp_pagetables + PAGE_SIZE);
> > > > > > > 369 pt4[i] |= PG_V | PG_RW | PG_U; 370
> > > > > > > 371            /* Each slot of the level 3 pages points to
> > > > > > > the same level 2 page */ 372            pt3[i] =
> > > > > > > (u_int64_t)(uintptr_t)(mptramp_pagetables + (2 *
> > > > > > > PAGE_SIZE)); 373            pt3[i] |= PG_V | PG_RW | PG_U;
> > > > > > > 374 375            /* The level 2 page slots are mapped
> > > > > > > with 2MB pages for 1GB. */ 376            pt2[i] = i * (2
> > > > > > > * 1024 * 1024); 377            pt2[i] |= PG_V | PG_RW |
> > > > > > > PG_PS | PG_U; 378    }
> > > > > > > 
> > > > > > > -m        
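
For anyone reading along, here is a minimal userspace sketch of what that
loop builds. It is not the kernel code: the arrays stand in for the three
page-table pages, tables_pa is a hypothetical physical address for them,
and the PG_* values are the usual x86-64 page table entry bits.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE	UINT64_C(4096)

/* x86-64 page table entry bits. */
#define PG_V	UINT64_C(0x001)	/* valid */
#define PG_RW	UINT64_C(0x002)	/* writable */
#define PG_U	UINT64_C(0x004)	/* user accessible */
#define PG_PS	UINT64_C(0x080)	/* 2MB page at level 2 */

/* Stand-ins for the level 4, level 3 and level 2 page-table pages. */
static uint64_t pt4[512], pt3[512], pt2[512];

/*
 * Fill the three tables the way the quoted loop does. tables_pa is a
 * hypothetical, page-aligned physical address of the level 4 page; the
 * level 3 and level 2 pages are assumed to follow it directly.
 */
static void
fill_tables(uint64_t tables_pa)
{
	uint64_t i;

	assert((tables_pa & (PAGE_SIZE - 1)) == 0);
	for (i = 0; i < 512; i++) {
		/* Every level 4 slot points to the same level 3 page. */
		pt4[i] = (tables_pa + PAGE_SIZE) | PG_V | PG_RW | PG_U;
		/* Every level 3 slot points to the same level 2 page. */
		pt3[i] = (tables_pa + 2 * PAGE_SIZE) | PG_V | PG_RW | PG_U;
		/* 512 slots of 2MB each map the first 1GB. */
		pt2[i] = (i * (2 * 1024 * 1024)) | PG_V | PG_RW | PG_PS | PG_U;
	}
}
```

In this model the faulting instruction corresponds to one of the stores
into the tables, i.e. a write to the memory chosen for them.
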
> > > > > > > You have a fault on write due to the read-only mapping of
> > > > > > > the portion of the direct map which maps the kernel text.
> > > > > > > It is consistent with the faulting address.  It is not
> > > > > > > clear whether this is something new on your machine, or
> > > > > > > whether the kernel text was silently corrupted before,
> > > > > > > since ro protection is somewhat recent.
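
A small sketch of the address arithmetic behind "consistent with the
faulting address". It assumes that the amd64 direct map starts at
0xfffff80000000000; that value is my assumption and is not stated anywhere
in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed base of the amd64 direct map (not taken from this thread). */
#define DMAP_MIN_ADDRESS	UINT64_C(0xfffff80000000000)

/* Translate a direct-map virtual address back to a physical address. */
static uint64_t
dmap_to_phys(uint64_t va)
{
	assert(va >= DMAP_MIN_ADDRESS);
	return (va - DMAP_MIN_ADDRESS);
}
```

Under that assumption, dmap_to_phys(0xfffff80001000000) is 0x1000000, so
the faulting write went to physical address 16 MB through the direct map.
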
> > > > > > 
> > > > > > > It seems that mp_bootaddress() selected a bad place for
> > > > > > > the bootstrap page tables. Moreover, we do not include the
> > > > > > > kernel text in the physmem[] array, so it is not clear
> > > > > > > how this happened. This code was also changed recently.
> > > > > > > 
> > > > > > > Can you add a print of the physmap[] array somewhere
> > > > > > > before the panic, to see what the kernel's idea of the
> > > > > > > available memory is? This is already done if you have a
> > > > > > > serial console and set the debug.late_console tunable to
> > > > > > > 0.      
> > > > > 
> > > > > This is a sad little machine without any kind of serial
> > > > > console.
> > > > > 
> > > > > Physmap looks like this after calling getmemsize():
> > > > > 
> > > > > [0]: 0x10000
> > > > > [1]: 0x30000
> > > > > [2]: 0x40000
> > > > > [3]: 0x9e000
> > > > > [4]: 0x100000
> > > > > [5]: 0xf00000
> > > > > [6]: 0x1003000
> > > > > [7]: 0x7bf7a000
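
The values read as [base, end) pairs. Here is a minimal sketch of that
interpretation, assuming 4 KB pages; the helper names chunk_bytes() and
chunk_pages() are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE	4096	/* assumed 4 KB pages */

/* physmap[] as printed above: consecutive [base, end) pairs. */
static const uint64_t physmap[] = {
	0x10000, 0x30000,
	0x40000, 0x9e000,
	0x100000, 0xf00000,
	0x1003000, 0x7bf7a000,
};

/* Size in bytes of the chunk whose base sits at even index i. */
static uint64_t
chunk_bytes(size_t i)
{
	assert(i % 2 == 0 && i + 1 < sizeof(physmap) / sizeof(physmap[0]));
	return (physmap[i + 1] - physmap[i]);
}

/* Size in pages of the chunk whose base sits at even index i. */
static uint64_t
chunk_pages(size_t i)
{
	return (chunk_bytes(i) / PAGE_SIZE);
}
```

With these values the first two pairs come out as 131072 bytes (32 pages)
and 385024 bytes (94 pages).
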
> > > > > 
> > > > > Physical memory chunks logged in cpu_startup are:
> > > > > 
> > > > > > 0x0000000000010000 - 0x000000000002ffff, 131072 bytes (32 pages)
> > > > > > 0x0000000000040000 - 0x000000000009dfff, 385024 bytes (94 pages)
> > > > > These two chunk reports are consistent with physmap[0-1,
> > > > > 2-3].   
> > > > > > 0x0000000000100000 - 0x00000000001fffff, 1048576 bytes (256 pages)
> > > > > > 0x0000000002c00000 - 0x0000000075467fff, 1921417216 bytes (469096 pages)
> > > > > > 0x0000000100000000 - 0x00000001005e7fff, 6193152 bytes (1512 pages)
> > > > > But these three look completely unrelated to the rest of the
> > > > > physmap, except perhaps physmap[4].  We allocate boot pages
> > > > > from the top of the last physmap chunk, but I am certain that we
> > > > > do not consume that much memory for boot to make physmap[7] from
> > > > > the last reported address.
> > > > 
> > > > Are you sure that there are no typos  in the values above ?    
> > > 
> > > Double checked the numbers. I changed it a bit more,
> > > so that debug output appears all on one page. Please see here for
> > > the results:
> > > 
> > > https://gist.github.com/grembo/cebb9f7e2a98c37a51bee1e508f7d890    
> > Ok, I have a guess what is going on.  Does the result of the quirks
> > end up as hw.physmem tunable passed to kernel ?  It seems that there
> > is physmap[] element pointing outside the DMAP-mapped region.
> > 
> > Perhaps print the dmap limit too, to see whether I am on the right
> > track.  
> 
> I didn't print the dmap limit yet, but I tested your patch:
> 
> > 
> > Try the following change.   It lacks i386 bits.
> > 
> > diff --git a/sys/amd64/amd64/machdep.c b/sys/amd64/amd64/machdep.c
> > index e5c69ed91fa..bd6bbf04006 100644
> > --- a/sys/amd64/amd64/machdep.c
> > +++ b/sys/amd64/amd64/machdep.c
> > @@ -1254,7 +1254,7 @@ getmemsize(caddr_t kmdp, u_int64_t first)
> >  	 * in real mode mode (e.g. SMP bare metal).
> >  	 */
> >  	if (init_ops.mp_bootaddress)
> > -		init_ops.mp_bootaddress(physmap, &physmap_idx);
> > +		init_ops.mp_bootaddress(physmap, &physmap_idx, first);
> >  
> >  	/*
> >  	 * Maxmem isn't the "maximum memory", it's one larger than the
> > diff --git a/sys/amd64/amd64/mp_machdep.c b/sys/amd64/amd64/mp_machdep.c
> > index 30146142087..292a6cefa91 100644
> > --- a/sys/amd64/amd64/mp_machdep.c
> > +++ b/sys/amd64/amd64/mp_machdep.c
> > @@ -103,7 +103,8 @@ static int	start_ap(int apic_id);
> >   * Calculate usable address in base memory for AP trampoline code.
> >   */
> >  void
> > -mp_bootaddress(vm_paddr_t *physmap, unsigned int *physmap_idx)
> > +mp_bootaddress(vm_paddr_t *physmap, unsigned int *physmap_idx,
> > +    vm_paddr_t dmap_limit)
> >  {
> >  	unsigned int i;
> >  	bool allocated;
> > @@ -117,8 +118,9 @@ mp_bootaddress(vm_paddr_t *physmap, unsigned int *physmap_idx)
> >  		 * store the initial page tables. Note that it needs to be
> >  		 * aligned to a page boundary.
> >  		 */
> > -		if (physmap[i] >= GiB(4) ||
> > -		    (physmap[i + 1] - round_page(physmap[i])) < (PAGE_SIZE * 3))
> > +		if (physmap[i] >= GiB(4) || physmap[i + 1] -
> > +		    round_page(physmap[i]) < PAGE_SIZE * 3 ||
> > +		    physmap[i + 1] - PAGE_SIZE * 3 > dmap_limit)
> >  			continue;
> >  
> >  		allocated = true;
> > diff --git a/sys/amd64/include/smp.h b/sys/amd64/include/smp.h
> > index 2ecfe62cf9f..24f0580fe51 100644
> > --- a/sys/amd64/include/smp.h
> > +++ b/sys/amd64/include/smp.h
> > @@ -58,7 +58,7 @@ void	invlpg_pcid_handler(void);
> >  void	invlrng_invpcid_handler(void);
> >  void	invlrng_pcid_handler(void);
> >  int	native_start_all_aps(void);
> > -void	mp_bootaddress(vm_paddr_t *, unsigned int *);
> > +void	mp_bootaddress(vm_paddr_t *, unsigned int *, vm_paddr_t);
> >  
> >  #endif /* !LOCORE */
> >  #endif /* SMP */
> > diff --git a/sys/x86/include/init.h b/sys/x86/include/init.h
> > index 880cabaa949..58bbe0a5fd6 100644
> > --- a/sys/x86/include/init.h
> > +++ b/sys/x86/include/init.h
> > @@ -41,7 +41,7 @@ struct init_ops {
> >  	void	(*early_clock_source_init)(void);
> >  	void	(*early_delay)(int);
> >  	void	(*parse_memmap)(caddr_t, vm_paddr_t *, int *);
> > -	void	(*mp_bootaddress)(vm_paddr_t *, unsigned int *);
> > +	void	(*mp_bootaddress)(vm_paddr_t *, unsigned int *, vm_paddr_t);
> >  	int	(*start_all_aps)(void);
> >  	void	(*msi_init)(void);
> >  };
> 
> With the patch I could boot without problems and the machine appears
> to be stable (I ran some high-load and memory-intensive tests). By the
> way, the machine only has 2 GB of RAM, even though 4 GB are reported
> on boot; usable memory appears to be reported correctly.
> 
> Thanks,
> Michael
> 

Hi,

Reviving this old thread, since I just updated to r337818 and a similar
problem is happening again. Since the fix in r334799
(https://svnweb.freebsd.org/base?view=revision&revision=334799, review
https://reviews.freebsd.org/D15675), machdep.c and mp_machdep.c have
been touched again, so maybe this is related.

Please see the screenshot of the panic below:
https://gist.github.com/grembo/78d0f2a100dd4f16775b85a118769658

This is me not digging any deeper, hoping that this is something
obvious. Please let me know if you need more input.

Thanks,
Michael

-- 
Michael Gmelin