Re: CURRENT crashes with nvidia GPU BLOB : vm_radix_insert: key 23c078 is already present

From: Gary Jennejohn <gljennjohn_at_googlemail.com>
Date: Sat, 10 Aug 2013 10:37:05 +0200
On Fri, 9 Aug 2013 10:12:37 -0700
David Wolfskill <david_at_catwhisker.org> wrote:

> On Fri, Aug 09, 2013 at 07:32:51AM +0200, O. Hartmann wrote:
> > ...
> > > > On 8 August 2013 11:10, O. Hartmann <ohartman_at_zedat.fu-berlin.de>
> > > > wrote:
> > > > > The most recent CURRENT doesn't work with the x11/nvidia-driver
> > > > > (which is at 319.25 in the ports and 325.15 from nVidia).
> > > > >
> > > > > After build- and installworld AND successfully rebuilding port
> > > > > x11/nvidia-driver, the system crashes immediately after a reboot
> > > > > as soon as the kernel module nvidia.ko gets loaded (in my
> > > > > case, I load nvidia.ko via /etc/rc.conf.local since the nVidia
> > > > > BLOB doesn't load cleanly every time when loaded
> > > > > from /boot/loader.conf).
> > > > >
> > > > > The crash occurs on systems with default compilation options set
> > > > > while building world and with settings like -O3 -march=native. It
> > > > > doesn't matter.
> > > > >
> > > > > FreeBSD and the port x11/nvidia-driver have been compiled with
> > > > > CLANG.
> > > > >
> > > > > Most recent FreeBSD revision still crashing is r254097.
> > > > >
> > > > > When vmcore is saved, I always see something like
> > > > >
> > > > > savecore: reboot after panic: vm_radix_insert: key 23c078 is
> > > > > already present
> > > > >
> > > > >
> > > > > Does anyone have any idea what's going on?
> > > > >
> > > > > Thanks for helping in advance,
> > > > >
> > > > > Oliver
> > > 
> > > I'm seeing a complete deadlock on my T520 with today's current and
> > > latest portsnap'd versions of ports for the nvidia-driver updates.
> > > 
> > > A little bisection and help from others seems to point the finger at
> > > Jeff's r254025
> > > 
> > > I'm getting a complete deadlock on X starting, but loading the module
> > > seems to have no ill effects.
> > > 
> > > Sean
> > 
> > Right, I also loaded the module via /boot/loader.conf and it loads
> > cleanly. I start xdm and then the deadlock occurs.
> > 
> > I tried recompiling the whole xorg suite via "portmaster -f xorg xdm",
> > it took a while, but no effect, still dying.
> > .....
> 
> Sorry to be rather late to the party; the Internet connection I'm using
> at the moment is a bit flaky.  (I'm out of town.)
> 
> I managed to get head/i386 @r254135 built and booting ... by removing
> the "options DEBUG_MEMGUARD" from my kernel.
> 
> However, that merely prevented a (very!) early panic, and got me to the
> point where trying to start xdm with the x11/nvidia-driver as the
> display driver causes an immediate reboot (no crash dump, despite
> 'dumpdev="AUTO"' in /etc/rc.conf).  No drop to debugger, either.
> 
> Booting & starting xdm with the nv driver works -- that's my present
> environment as I am typing this.
> 
> However, the panic with DEBUG_MEMGUARD may offer a clue.  Unfortunately,
> it's early enough that screen lock/scrolling doesn't work, and I only
> had the patience to write down part of the panic information.  (This is
> on my laptop; no serial console, AFAICT -- and no device to capture the
> output if I did, since I'm not at home.)
> 
> The top line of the screen (at the panic) reads:
> 
> s/kern/subr_vmem.c:1050
> 
> The backtrace has the expected stuff near the top (about kbd, panic, and
> memguard stuff); just below that is:
> 
> vmem_alloc(c1226100,6681000,2,c1820cc0,3b5,...) at 0xc0ac5673=vmem_alloc+0x53/frame 0xc1820ca0
> 
> Caveat: that was hand-transcribed from the screen to paper, then
> hand-transcribed from paper to this email message.  And my highest grade
> in "Penmanship" was a D+.
> 
> Be that as it may, here's the relevant section of subr_vmem.c with line
> numbers (cut/pasted, so tabs get munged):
> 
>    1039 /*
>    1040  * vmem_alloc: allocate resource from the arena.
>    1041  */
>    1042 int
>    1043 vmem_alloc(vmem_t *vm, vmem_size_t size, int flags, vmem_addr_t *addrp)
>    1044 {
>    1045         const int strat __unused = flags & VMEM_FITMASK;
>    1046         qcache_t *qc;
>    1047 
>    1048         flags &= VMEM_FLAGS;
>    1049         MPASS(size > 0);
>    1050         MPASS(strat == M_BESTFIT || strat == M_FIRSTFIT);
>    1051         if ((flags & M_NOWAIT) == 0)
>    1052                 WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL, "vmem_alloc");
>    1053
>    1054         if (size <= vm->vm_qcache_max) {
>    1055                 qc = &vm->vm_qcache[(size - 1) >> vm->vm_quantum_shift];
>    1056                 *addrp = (vmem_addr_t)uma_zalloc(qc->qc_cache, flags);
>    1057                 if (*addrp == 0)
>    1058                         return (ENOMEM);
>    1059                 return (0);
>    1060         }
>    1061
>    1062         return vmem_xalloc(vm, size, 0, 0, 0, VMEM_ADDR_MIN, VMEM_ADDR_MAX,
>    1063             flags, addrp);
>    1064 }
> 
> 
> This is at r254025.
> 

The REINPLACE_CMD at line 160 of nvidia-driver/Makefile is incorrect.

How do I know that?  Because I made a patch which results in a working
nvidia-driver-319.32 with r254050.  That's what I'm running right now.

Here's the patch (loaded with :r in vi, so all spaces etc. are correct):

--- src/nvidia_subr.c.orig	2013-08-09 11:32:26.000000000 +0200
+++ src/nvidia_subr.c	2013-08-09 11:33:23.000000000 +0200
@@ -945,7 +945,7 @@
         return ENOMEM;
     }
 
-    address = kmem_alloc_contig(kernel_map, size, flags, 0,
+    address = kmem_alloc_contig(kmem_arena, size, flags, 0,
             sc->dma_mask, PAGE_SIZE, 0, attr);
     if (!address) {
         status = ENOMEM;
@@ -994,7 +994,7 @@
         os_flush_cpu_cache();
 
     if (at->pte_array[0].virtual_address != NULL) {
-        kmem_free(kernel_map,
+        kmem_free(kmem_arena,
                 at->pte_array[0].virtual_address, at->size);
         malloc_type_freed(M_NVIDIA, at->size);
     }
@@ -1021,7 +1021,7 @@
     if (at->attr != VM_MEMATTR_WRITE_BACK)
         os_flush_cpu_cache();
 
-    kmem_free(kernel_map, at->pte_array[0].virtual_address,
+    kmem_free(kmem_arena, at->pte_array[0].virtual_address,
             at->size);
     malloc_type_freed(M_NVIDIA, at->size);
 
@@ -1085,7 +1085,7 @@
     }
 
     for (i = 0; i < count; i++) {
-        address = kmem_alloc_contig(kernel_map, PAGE_SIZE, flags, 0,
+        address = kmem_alloc_contig(kmem_arena, PAGE_SIZE, flags, 0,
                 sc->dma_mask, PAGE_SIZE, 0, attr);
         if (!address) {
             status = ENOMEM;
@@ -1139,7 +1139,7 @@
     for (i = 0; i < count; i++) {
         if (at->pte_array[i].virtual_address == 0)
             break;
-        kmem_free(kernel_map,
+        kmem_free(kmem_arena,
                 at->pte_array[i].virtual_address, PAGE_SIZE);
         malloc_type_freed(M_NVIDIA, PAGE_SIZE);
     }
@@ -1169,7 +1169,7 @@
         os_flush_cpu_cache();
 
     for (i = 0; i < count; i++) {
-        kmem_free(kernel_map,
+        kmem_free(kmem_arena,
                 at->pte_array[i].virtual_address, PAGE_SIZE);
         malloc_type_freed(M_NVIDIA, PAGE_SIZE);
     }

The primary differences are
1) kmem_arena is used instead of kernel_map everywhere.  (Note that the
   REINPLACE_CMD substitutes kernel_arena, which is wrong.)
2) kmem_free is kept as previously; do NOT use kva_free.

To use the patch:
1) delete or comment out the 4 lines starting at line 160 in the
   port's Makefile
2) run ``make patch''
3) cd work/NVIDIA-FreeBSD-x86_64-319.32/src
4) patch < [wherever the patch is]
5) cd ../../..
6) make deinstall install clean
7) kldunload the old nvidia.ko
8) kldload the new nvidia.ko
9) start X

-- 
Gary Jennejohn
Received on Sat Aug 10 2013 - 06:37:11 UTC
