Re: Fatal trap 12: page fault on Acer Chromebook 720 (peppy)

From: Michael Gmelin <freebsd_at_grem.de>
Date: Mon, 20 Aug 2018 00:45:12 +0200
On Sun, 19 Aug 2018 19:16:42 +0300
Konstantin Belousov <kostikbel_at_gmail.com> wrote:

> On Sun, Aug 19, 2018 at 04:59:51PM +0200, Michael Gmelin wrote:
> > 
> > 
> > On Fri, 17 Aug 2018 10:02:08 +0100
> > John Baldwin <jhb_at_FreeBSD.org> wrote:
> >   
> > > On 8/17/18 9:54 AM, Michael Gmelin wrote:  
> > > > 
> > > >     
> > > >> On 17. Aug 2018, at 08:17, John Baldwin <jhb_at_FreeBSD.org>
> > > >> wrote: 
> > > >>> On 8/16/18 1:58 PM, Michael Gmelin wrote:
> > > >>>
> > > >>>    
> > > >>>> On 15. Aug 2018, at 15:55, Konstantin Belousov
> > > >>>> <kostikbel_at_gmail.com> wrote:
> > > >>>>> On Wed, Aug 15, 2018 at 03:52:37PM +0200, Michael Gmelin
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>    
> > > >>>>>>> On 15. Aug 2018, at 15:04, Konstantin Belousov
> > > >>>>>>> <kostikbel_at_gmail.com> wrote:
> > > >>>>>>>
> > > >>>>>>> On Wed, Aug 15, 2018 at 12:51:06AM +0200, Michael Gmelin
> > > >>>>>>> wrote:
> > > >>>>>>> Reviving this old thread, since I just updated to
> > > >>>>>>> r337818 and a similar problem is happening again. Since
> > > >>>>>>> the fix in r334799 (review
> > > >>>>>>> https://reviews.freebsd.org/D15675) (mp_)machdep.c have
> > > >>>>>>> been touched, so maybe this is related
> > > >>>>>>> (https://svnweb.freebsd.org/base?view=revision&revision=334799).
> > > >>>>>>>
> > > >>>>>>> Please see the screenshot of the panic below:
> > > >>>>>>> https://gist.github.com/grembo/78d0f2a100dd4f16775b85a118769658
> > > >>>>>>>
> > > >>>>>>> This is me not digging any deeper, hoping that this is
> > > >>>>>>> something obvious. Please let me know if you need more
> > > >>>>>>> input.    
> > > >>>>>>
> > > >>>>>> I do not see how recent mp_machdep.c changes could affect
> > > >>>>>> this. Can you try newest kernel but old loader ?    
> > > >>>>>
> > > >>>>> I will try (but that will take a while). Oh, also, it still
> > > >>>>> boots in safe mode/with smp disabled.
> > > >>>>
> > > >>>> Right, this is because the access to that address through
> > > >>>> DMAP is only needed when configuring AP startup resources.
> > > >>>>
> > > >>>> Also, I think it is safe to suggest that the bisect is
> > > >>>> needed.    
> > > >>>
> > > >>> Using an older loader didn't help, but I identified the
> > > >>> problem:
> > > >>>
> > > >>> https://svnweb.freebsd.org/base?view=revision&revision=334952
> > > >>>
> > > >>> modified the code you introduced in
> > > >>>
> > > >>> https://svnweb.freebsd.org/base?view=revision&revision=334799
> > > >>>
> > > >>> By correcting units to pages it also broke booting the
> > > >>> Chromebook as a side effect - so the previous fix just worked
> > > >>> due to a bug it seems.
> > > >>>
> > > >>> Is there an easy way to output the content of physmap at that
> > > >>> point (debug.late_console=0 doesn't work) - like an existing
> > > >>> buffer I could use, or would this be more elaborate (I did
> > > >>> something complicated last time but didn't save it, so any
> > > >>> simple solution would be preferred).    
> > > >>
> > > >> How about reverting the commit for now so you get a working
> > > >> console and print out the physmap array values along with
> > > >> Maxmem later in the boot (or just use kgdb to examine them
> > > >> once the system is running)?   
> > > > 
> > > > This is before the system has a working console (part of calling
> > > > getmem...), disabling late console makes it hang, physmap
> > > > changes afterwards, so running kgdb later doesn't help. Last
> > > > time I kept a copy of physmap and logged it later to know the
> > > > original content. I can do that again, I just thought maybe
> > > > there is a simple mechanism I'm not aware of that would save
> > > > me some time.    
> > > 
> > > I thought we only modified phys_avail[], but saving a copy of
> > > physmap[] and dumping it from kgdb is probably the simplest thing
> > > to do.
> > >   
> > 
> > Okay, so I had some time to investigate a bit more:
> > 
> > Before calling init_ops.mp_bootaddress in getmemsize (machdep.c),
> > physmap looks like this:
> > 
> > physmap_idx: 8
> > i mem atop
> > 0 0x0 0x0
> > 1 0x30000 0x30
> > 2 0x40000 0x40
> > 3 0x9e400 0x9e
> > 4 0x100000 0x100
> > 5 0xf00000 0xf00
> > 6 0x1000000 0x1000
> > 7 0x7bf7a000 0x7bf7a
> > 8 0x100000000 0x100000
> > 9 0x100600000 0x100600
> > 10 0x0 0x0
> > Maxmem: 0x100600000 0x100600
> > 
> > Without using atop (the "buggy" version that actually boots without
> > crashing), the loop in mp_bootaddress looks like this:
> > 
> > i, physmap[i], physmap[i + 1], atop(physmap[i + 1]), Maxmem
> > 8 0x100000000 0x100600000 0x100600 0x100600 
> > 6 0x1000000 0x7bf7a000 0x7bf7a 0x100600 
> > 4 0x100000 0xf00000 0xf00 0x100600 
> > 2 0x40000 0x9e400 0x9e 0x100600 
> > 
> > And physmap looks like this afterwards:
> > 
> > physmap_idx: 8
> > i mem atop
> > 0 0x0 0x0
> > 1 0x30000 0x30
> > 2 0x43000 0x43 <-- here
> > 3 0x9e400 0x9e
> > 4 0x100000 0x100
> > 5 0xf00000 0xf00
> > 6 0x1000000 0x1000
> > 7 0x7bf7a000 0x7bf7a
> > 8 0x100000000 0x100000
> > 9 0x100600000 0x100600
> > 10 0x0 0x0
> > mptramp_pagetables is 0x40000
> > 
> > So a three page gap was made at 0x40000 (atop(idx 2) is now 0x43
> > instead of 0x40)
> > 
> > In the current version (using atop), the loop in mp_bootaddress
> > looks like this:
> > 
> > i, physmap[i], physmap[i + 1], atop(physmap[i + 1]), Maxmem
> > 8 0x100000000 0x100600000 0x100600 0x100600 
> > 6 0x1000000 0x7bf7a000 0x7bf7a 0x100600 
> > 
> > And physmap looks like this afterwards:
> > 
> > physmap_idx: 8
> > i mem atop
> > 0 0x0 0x0
> > 1 0x30000 0x30
> > 2 0x40000 0x40
> > 3 0x9e400 0x9e
> > 4 0x100000 0x100
> > 5 0xf00000 0xf00
> > 6 0x1003000 0x1003 <-- here
> > 7 0x7bf7a000 0x7bf7a
> > 8 0x100000000 0x100000
> > 9 0x100600000 0x100600
> > 10 0x0 0x0
> > mptramp_pagetables: 0x1000000
> > 
> > So a three page gap was made at 0x1000000 (atop(idx 6) is now
> > 0x1003 instead of 0x1000)
> > 
> > When changing the code to require a page below 0x1000:
> > 
> >   if (physmap[i] >= GiB(4) || physmap[i + 1] -
> >       round_page(physmap[i]) < PAGE_SIZE * 3 ||
> >       atop(physmap[i + 1]) > Maxmem
> >       || atop(physmap[i + 1]) > 0x1000) // <--- this
> >       continue;
> > 
> > The system boots just fine. It uses page 0x100
> > for the bootstrap code in this case:
> > 
> > i, physmap[i], physmap[i + 1], atop(physmap[i + 1]), Maxmem
> > 8 0x100000000 0x100600000 0x100600 0x100600 
> > 6 0x1000000 0x7bf7a000 0x7bf7a 0x100600 
> > 4 0x100000 0xf00000 0xf00 0x100600 
> > 
> > Physmap looks like this:
> > physmap_idx: 8
> > i mem atop
> > 0 0x0 0x0
> > 1 0x30000 0x30
> > 2 0x40000 0x40
> > 3 0x9e400 0x9e
> > 4 0x103000 0x103 <-- here
> > 5 0xf00000 0xf00
> > 6 0x1000000 0x1000
> > 7 0x7bf7a000 0x7bf7a
> > 8 0x100000000 0x100000
> > 9 0x100600000 0x100600
> > 10 0x0 0x0
> > mptramp_pagetables: 0x100000
> > 
> > So for some reason it's crashing when using pages 0x1000 - 0x1003
> > for the bootstrap code, while it boots okay when using 0x40 - 0x43
> > and 0x100 - 0x103.
> > 
> > Any ideas?  
> I in fact misread the page fault state decoding in your photo.
> It is, curiously, a protection violation on write rather than a
> non-present page access.
> 
> Compile ddb into your kernel, then on fault do
> db> x/x dmaplimit
> db> x/x dmaplimit+4
> db> show pte <fault virtual address>  

This was a bit more complicated, as the keyboard doesn't work in ddb at
that point (neither internal, nor USB).

I ended up hacking sys/ddb/db_script.c to execute these commands on
kdb.enter.trap (tunable support for scripting would be cool).

Anyway, dmaplimit is 40000000, dmaplimit+4 is 1

See here for a screenshot (also including the output of "show pte
0xfffff80001000000"):

https://gist.github.com/grembo/78d0f2a100dd4f16775b85a118769658#file-ddb1-png

> 
> Also show me the verbose dmesg lines with CPU features identification.
> 

CPU: Intel(R) Celeron(R) 2955U _at_ 1.40GHz (1396.80-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x40651  Family=0x6  Model=0x45  Stepping=1
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x4ddaebbf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,SDBG,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,MOVBE,POPCNT,TSCDLT,XSAVE,OSXSAVE,RDRAND>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x21<LAHF,ABM>
  Structured Extended Features=0x2603<FSGSBASE,TSCADJ,ERMS,INVPCID,NFPUSG>
  XSAVE Features=0x1<XSAVEOPT>
  VT-x: (disabled in BIOS) PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory  = 4301258752 (4102 MB)
avail memory = 1907445760 (1819 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <CORE   COREBOOT>
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)
random: unblocking device.
ioapic0 <Version 2.0> irqs 0-39 on motherboard
Launching APs: 1
Timecounter "TSC" frequency 1396795536 Hz quality 1000

-m

-- 
Michael Gmelin
Received on Sun Aug 19 2018 - 20:45:16 UTC