On Sat, May 09, 2020 at 11:33:40PM +0300, Andriy Gapon wrote:
> On 09/05/2020 19:50, Konstantin Belousov wrote:
> > On Sat, May 09, 2020 at 07:16:27PM +0300, Andriy Gapon wrote:
> >> On 09/05/2020 19:13, Konstantin Belousov wrote:
> >>> On Sat, May 09, 2020 at 06:52:24PM +0300, Andriy Gapon wrote:
> >>>> On 08/05/2020 19:15, Konstantin Belousov wrote:
> >>>>> On Fri, May 08, 2020 at 06:53:24PM +0300, Andriy Gapon wrote:
> >>>>>>
> >>>>>> I have a reproducible panic with a custom kernel without option NUMA while using
> >>>>>> amdgpu driver from linuxkpi-based drm:
> >>>>>>
> >>>>>> panic: address 41ec00000 beyond the last segment
> >>>>>>
> >>>>>> I did some quick debugging and the panic happens when Xorg server tries to
> >>>>>> access a frame buffer (or something like that). There is a page fault that gets
> >>>>>> satisfied by ttm with a fictitious page.
> >>>>>>
> >>>>>> The stack trace is:
> >>>>>> #11 0xffffffff808031a3 in panic (fmt=0xffffffff8119a998 <cnputs_mtx>
> >>>>>> "5\003ʀ\377\377\377\377") at /usr/devel/git/motil/sys/kern/kern_shutdown.c:839
> >>>>>> #12 0xffffffff80bbc552 in pmap_enter (pmap=<optimized out>, va=34504441856,
> >>>>>> m=<optimized out>, prot=<optimized out>, flags=<optimized out>, psind=<optimized
> >>>>>> out>) at /usr/devel/git/motil/sys/amd64/amd64/pmap.c:6035
> >>>>>> #13 0xffffffff80b288be in vm_fault_populate (fs=<optimized out>) at
> >>>>>> /usr/devel/git/motil/sys/vm/vm_fault.c:519
> >>>>>> #14 vm_fault_allocate (fs=<optimized out>) at
> >>>>>> /usr/devel/git/motil/sys/vm/vm_fault.c:1032
> >>>>>> #15 vm_fault (map=<optimized out>, vaddr=<optimized out>, fault_type=<optimized
> >>>>>> out>, fault_flags=<optimized out>, m_hold=<optimized out>) at
> >>>>>> /usr/devel/git/motil/sys/vm/vm_fault.c:1342
> >>>>>> #16 0xffffffff80b26e7e in vm_fault_trap (map=0xfffffe0017cd39e8,
> >>>>>> vaddr=<optimized out>, fault_type=<optimized out>, fault_flags=0,
> >>>>>> signo=0xfffffe00a810dbc4, ucode=0xfffffe00a810dbc0) at
> >>>>>> /usr/devel/git/motil/sys/vm/vm_fault.c:589
> >>>>>> #17 0xffffffff80bcf89c in trap_pfault (frame=0xfffffe00a810dc00,
> >>>>>> usermode=<optimized out>, signo=<optimized out>, ucode=0xffffffff80853250
> >>>>>> <putchar>) at /usr/devel/git/motil/sys/amd64/amd64/trap.c:821
> >>>>>> #18 0xffffffff80bceeec in trap (frame=0xfffffe00a810dc00) at
> >>>>>> /usr/devel/git/motil/sys/amd64/amd64/trap.c:34
> >>>>>>
> >>>>>>
> >>>>>> The line number in pmap_enter() is incorrect, I guess because of optimizations.
> >>>>>> The assert seems to be reached via pmap_enter -> CHANGE_PV_LIST_LOCK_TO_PHYS ->
> >>>>>> PHYS_TO_PV_LIST_LOCK -> pa_index().
> >>>>>>
> >>>>>> The panic is correct in that the page is fictitious and its physical address is
> >>>>>> beyond the end of real physical memory.
> >>>>>> It seems that NUMA PHYS_TO_PV_LIST_LOCK() is aware of such pages, but !NUMA one
> >>>>>> is not.
> >>>>>
> >>>>> I think you can remove this assert. pa_index() is always taken by
> >>>>> % NPV_LIST_LOCKS, because fictitious mappings are not promoted.
> >>>>>
> >>>>> Try that and commit if it works for you.
> >>>>
> >>>> I tried this change:
> >>>> diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
> >>>> index 4deed86a76d1a..b834b7f0388b7 100644
> >>>> --- a/sys/amd64/amd64/pmap.c
> >>>> +++ b/sys/amd64/amd64/pmap.c
> >>>> @@ -345,7 +345,7 @@ pmap_pku_mask_bit(pmap_t pmap)
> >>>> #define NPV_LIST_LOCKS MAXCPU
> >>>>
> >>>> #define PHYS_TO_PV_LIST_LOCK(pa) \
> >>>> - (&pv_list_locks[pa_index(pa) % NPV_LIST_LOCKS])
> >>>> + (&pv_list_locks[((pa) >> PDRSHIFT) % NPV_LIST_LOCKS])
> >>>> #endif
> >>>>
> >>>> #define CHANGE_PV_LIST_LOCK_TO_PHYS(lockp, pa) do { \
> >>>>
> >>>> It fixed the original problem, but I got a new panic.
> >>>> "DI already started" in pmap_remove() -> pmap_delayed_invl_start_u().
> >>>> I guess that !NUMA variant does not get much testing, so I'll probably just
> >>>> stick with the default.
> >>> Why didn't you just remove the KASSERT from pa_index?
> >>
> >> Well, I thought it might be useful in the NUMA case.
> >> pa_index() definition is shared between both cases.
> > Maybe define the macro twice, for NUMA/non-NUMA. The non-NUMA case
> > does not need the assert, because users take it mod NPV_LIST_LOCKS.
> >
>
> I still don't see how that could help with "DI already started" panic.

Maybe not, or maybe it would help because of pmap_delayed_invl_genp().
But I would worry more about this 'already started' issue, because
this must not happen.

Can you remove the assert from the macro and provide a backtrace of the
'DI already started' panic?

Received on Sat May 09 2020 - 18:47:51 UTC
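For readers following the argument that the non-NUMA lookup needs no range assert because the index is always reduced mod NPV_LIST_LOCKS, here is a minimal userland sketch of that arithmetic. It is not kernel code: pv_lock_index() is a hypothetical helper name, the value 256 for NPV_LIST_LOCKS is only an assumed stand-in for MAXCPU, and PDRSHIFT is taken as 21 (2 MB superpages on amd64). It mirrors the expression from the diff quoted in the thread.

```c
#include <assert.h>
#include <stdint.h>

#define PDRSHIFT        21      /* 2 MB superpage shift on amd64 */
#define NPV_LIST_LOCKS  256     /* assumed stand-in for MAXCPU */

/*
 * Hypothetical helper: compute the PV list lock slot for a physical
 * address the way the patched non-NUMA macro does. The address is
 * never compared against the end of physical memory; the modulo
 * alone keeps the result inside the lock array.
 */
static unsigned
pv_lock_index(uint64_t pa)
{
	unsigned idx = (unsigned)((pa >> PDRSHIFT) % NPV_LIST_LOCKS);

	/* Holds for any pa, including fictitious pages. */
	assert(idx < NPV_LIST_LOCKS);
	return (idx);
}
```

Calling pv_lock_index(0x41ec00000ULL) with the address from the panic message still gives a slot below NPV_LIST_LOCKS, which is why a bounds check on the physical address adds nothing for this lookup. This says nothing about the separate "DI already started" panic.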