Re: Kernel crash during video transcoding

From: Alexandre Levy <a13xlevy_at_gmail.com> Date: Sun, 16 Aug 2020 16:28:33 +0100 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:24 UTC

Hi,

I looked at the crash dump and the code more closely:

#18 0xffffffff82be1c5f in i915_gem_fault (dummy=<optimized out>,
vmf=<optimized out>)
    at
/usr/ports/graphics/drm-devel-kmod/work/drm-kmod-drm_v5.3_4/drivers/gpu/drm/i915/gem/i915_gem_mman.c:367
(kgdb) p area->vm_obj->lock
$43 = {lock_object = {lo_name = 0xffffffff8112c767 "vm object", lo_flags =
627245056, lo_data = 0, lo_witness = 0xfffff8045f575800}, rw_lock =
18446741878623409920}

So vm_obj is not NULL and has a rw_lock member

Now at intel_freebsd.c:193 (frame #17) the driver calls
vm_page_busy_acquire(m, VM_ALLOC_WAITFAIL). 'm' is the page grabbed from
vm_obj of the calling frame.

The panic occurs in kern_rwlock.c:270 in frame #15 when
calling rw_wowner(rwlock2rw(c)) so something goes wrong either in rw_wowner
or in rwlock2rw.

Looking at rwlock2rw() :

/*
 * Return the rwlock address when the lock cookie address is provided.
 * This functionality assumes that struct rwlock* have a member named
rw_lock.
 */
#define rwlock2rw(c)    (__containerof(c, struct rwlock, rw_lock))

I think this one is just extracting out the rw_lock member of the passed in
struct. However I don't understand what the cookie address is about due to
my lack of knowledge on kernel locking concepts. So maybe something is
wrong with the cookie or the rw_lock value itself.

Looking at rw_wowner() :

/*
 * Return a pointer to the owning thread if the lock is write-locked or
 * NULL if the lock is unlocked or read-locked.
 */

#define lv_rw_wowner(v)                         \
    ((v) & RW_LOCK_READ ? NULL :                    \
     (struct thread *)RW_OWNER((v)))

#define rw_wowner(rw)   lv_rw_wowner(RW_READ_VALUE(rw))

I don't think that one could cause a panic but again I'm not experienced
enough to be sure, it seems this either returns the thread that owns the
lock or NULL if no thread owns it.

The is also the fact that the driver calls vm_page_busy_acquire with the
VM_ALLOC_WAITFAIL flag which is defined in vm_page.h as :

#define VM_ALLOC_WAITFAIL   0x0010  /* (acf) Sleep and return error */

Could this be the reason of the panic as in we try to lock, then cannot and
eventually just return an error without retrying ? There is the flag
VM_ALLOC_WAITOK that says /* (acf) Sleep and retry */. Should I try to
patch intel_freebsd.c to use this flag instead ?

Thanks.

Le sam. 15 août 2020 à 20:35, Alexandre Levy <a13xlevy_at_gmail.com> a écrit :

> Hi,
>
> I could finally generate a crash dump even with a black screen, I had to
> guess I was in the crash handler and I type "dump" and enter which worked.
> The driver logs "[drm] Cannot find any crtc or sizes" which I guess is the
> reason why I couldn't see anything on my screen.
>
> Back to the initial problem, I could start a kgdb session, loaded the
> i915kms.ko symbols and here are the results :
>
> (kgdb) bt
> #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
> #1  doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:394
> #2  0xffffffff8049c26a in db_dump (dummy=<optimized out>,
> dummy2=<unavailable>, dummy3=<unavailable>, dummy4=<unavailable>) at
> /usr/src/sys/ddb/db_command.c:575
> #3  0xffffffff8049c02c in db_command (last_cmdp=<optimized out>,
> cmd_table=<optimized out>, dopager=1) at /usr/src/sys/ddb/db_command.c:482
> #4  0xffffffff8049bd9d in db_command_loop () at
> /usr/src/sys/ddb/db_command.c:535
> #5  0xffffffff8049f048 in db_trap (type=<optimized out>, code=<optimized
> out>) at /usr/src/sys/ddb/db_main.c:270
> #6  0xffffffff80c1b374 in kdb_trap (type=3, code=0, tf=<optimized out>) at
> /usr/src/sys/kern/subr_kdb.c:699
> #7  0xffffffff8100ca98 in trap (frame=0xfffffe00d7567300) at
> /usr/src/sys/amd64/amd64/trap.c:576
> #8  <signal handler called>
> #9  kdb_enter (why=0xffffffff811d5de0 "panic", msg=<optimized out>) at
> /usr/src/sys/kern/subr_kdb.c:486
> #10 0xffffffff80bd00be in vpanic (fmt=<optimized out>, ap=<optimized out>)
> at /usr/src/sys/kern/kern_shutdown.c:902
> #11 0xffffffff80bcfe53 in panic (fmt=0xffffffff81c8c7c8 <cnputs_mtx>
> "\b\214\031\201\377\377\377\377") at /usr/src/sys/kern/kern_shutdown.c:839
> #12 0xffffffff8100cee7 in trap_fatal (frame=0xfffffe00d7567600, eva=0) at
> /usr/src/sys/amd64/amd64/trap.c:915
> #13 0xffffffff8100c360 in trap (frame=0xfffffe00d7567600) at
> /usr/src/sys/amd64/amd64/trap.c:212
> #14 <signal handler called>
> #15 _rw_wowned (c=0x2659c92217d5aa52) at
> /usr/src/sys/kern/kern_rwlock.c:270
> #16 0xffffffff80ec23ed in vm_page_busy_acquire (m=0xfffffe00040ff9e8,
> allocflags=16) at /usr/src/sys/vm/vm_page.c:884
> #17 0xffffffff82b4e980 in remap_io_mapping (vma=0xfffff80315148300,
> addr=<optimized out>, pfn=<optimized out>, size=<optimized out>,
> iomap=<optimized out>)
>     at
> /usr/ports/graphics/drm-devel-kmod/work/drm-kmod-drm_v5.3_4/drivers/gpu/drm/i915/intel_freebsd.c:193
> #18 0xffffffff82be1c5f in i915_gem_fault (dummy=<optimized out>,
> vmf=<optimized out>)
>     at
> /usr/ports/graphics/drm-devel-kmod/work/drm-kmod-drm_v5.3_4/drivers/gpu/drm/i915/gem/i915_gem_mman.c:367
> #19 0xffffffff82cb5ddf in linux_cdev_pager_populate
> (vm_obj=0xfffff80368501420, pidx=<optimized out>, fault_type=<optimized
> out>, max_prot=<optimized out>,
>     first=0xfffffe00d7567868, last=0xfffffe00d7567888) at
> /usr/src/sys/compat/linuxkpi/common/src/linux_compat.c:554
> #20 0xffffffff80ea9e8f in vm_pager_populate (object=0x2659c92217d5aa52,
> pidx=18446741874754451944, fault_type=0, max_prot=0 '\000',
> first=<optimized out>, last=<optimized out>)
>     at /usr/src/sys/vm/vm_pager.h:172
> #21 vm_fault_populate (fs=<optimized out>) at
> /usr/src/sys/vm/vm_fault.c:444
> #22 vm_fault_allocate (fs=<optimized out>) at
> /usr/src/sys/vm/vm_fault.c:1028
> #23 vm_fault (map=<optimized out>, vaddr=<optimized out>,
> fault_type=<optimized out>, fault_flags=<optimized out>, m_hold=<optimized
> out>) at /usr/src/sys/vm/vm_fault.c:1338
> #24 0xffffffff80ea98ee in vm_fault_trap (map=0xfffffe00c0f539e8,
> vaddr=<optimized out>, fault_type=<optimized out>, fault_flags=0,
> signo=0xfffffe00d7567ac4,
>     ucode=0xfffffe00d7567ac0) at /usr/src/sys/vm/vm_fault.c:585
> #25 0xffffffff8100d0de in trap_pfault (frame=0xfffffe00d7567b00,
> usermode=<optimized out>, signo=<optimized out>, ucode=0xffffffff81d1de80
> <w_locklistdata+160624>)
>     at /usr/src/sys/amd64/amd64/trap.c:817
> #26 0xffffffff8100c72c in trap (frame=0xfffffe00d7567b00) at
> /usr/src/sys/amd64/amd64/trap.c:340
> #27 <signal handler called>
> #28 0x000000080296659a in ?? ()
> Backtrace stopped: Cannot access memory at address 0x7fffffffbf38
> (kgdb) list *0xffffffff82be1c5f
> 0xffffffff82be1c5f is in i915_gem_fault
> (/usr/ports/graphics/drm-devel-kmod/work/drm-kmod-drm_v5.3_4/drivers/gpu/drm/i915/gem/i915_gem_mman.c:367).
> 362             ret = i915_vma_pin_fence(vma);
> 363             if (ret)
> 364                     goto err_unpin;
> 365
> 366             /* Finally, remap it using the new GTT offset */
> 367             ret = remap_io_mapping(area,
> 368                                    area->vm_start +
> (vma->ggtt_view.partial.offset << PAGE_SHIFT),
> 369                                    (ggtt->gmadr.start +
> vma->node.start) >> PAGE_SHIFT,
> 370                                    min_t(u64, vma->size, area->vm_end
> - area->vm_start),
> 371                                    &ggtt->iomap);
> (kgdb) list *0xffffffff82b4e980
> 0xffffffff82b4e980 is in remap_io_mapping
> (/usr/ports/graphics/drm-devel-kmod/work/drm-kmod-drm_v5.3_4/drivers/gpu/drm/i915/intel_freebsd.c:193).
> 188                 pidx++, pa += PAGE_SIZE) {
> 189     retry:
> 190                     m = vm_page_grab(vm_obj, pidx, VM_ALLOC_NOCREAT);
> 191                     if (m == NULL) {
> 192                             m = PHYS_TO_VM_PAGE(pa);
> 193                             if (!vm_page_busy_acquire(m,
> VM_ALLOC_WAITFAIL))
> 194                                     goto retry;
> 195                             if (vm_page_insert(m, vm_obj, pidx)) {
> 196                                     vm_page_xunbusy(m);
> 197                                     VM_OBJECT_WUNLOCK(vm_obj);
> (kgdb) list *0xffffffff80ec23ed
> 0xffffffff80ec23ed is in vm_page_busy_acquire
> (/usr/src/sys/vm/vm_page.c:884).
> 879                     if (vm_page_tryacquire(m, allocflags))
> 880                             return (true);
> 881                     if ((allocflags & VM_ALLOC_NOWAIT) != 0)
> 882                             return (false);
> 883                     if (obj != NULL)
> 884                             locked = VM_OBJECT_WOWNED(obj);
> 885                     else
> 886                             locked = false;
> 887                     MPASS(locked || vm_page_wired(m));
> 888                     if (_vm_page_busy_sleep(obj, m, m->pindex,
> "vmpba", allocflags,
>
> It seems like the problem occured when calling vm_page_busy_acquire(m,
> VM_ALLOC_WAITFAIL) where m might be a NULL pointer ? I am very new to
> kernel debugging so not sure where to go from there.
>
> Thanks.
>
> Le lun. 10 août 2020 à 12:04, Hans Petter Selasky <hps_at_selasky.org> a
> écrit :
>
>> Hi,
>>
>> On 2020-08-10 12:59, Alexandre Levy wrote:
>> > Looking at the code, the error happens during the call to
>> VM_OBJECT_WLOCK
>> > (memory page locking for write ?) in the intel_freebsd.c (see [1]
>> below).
>> > I'm out for a few days but I'll try to dig more into it when I'm back
>> next
>> > weekend although I have no experience in the drm-devel-kmod codebase. In
>> > the meantime if you have any suggestions on debugging this further I'm
>> > happy to follow them.
>>
>> The problem is likely that the vm_obj is NULL.
>>
>> I think I recall that this function is special and can only be called
>> from a certain context, unlike in Linux. Will need the full backtrace
>> with line numbers in order to debug this.
>>
>> --HPS
>>
>