Re: Panic in ieee80211 tx mgmt timeout

From: Bernhard Schmidt <bschmidt_at_freebsd.org>
Date: Wed, 29 Jun 2011 12:41:16 +0200

On Wednesday, June 29, 2011 10:53:41 Stefan Esser wrote:
> Am 29.06.2011 10:03, schrieb Adrian Chadd:
> > On 29 June 2011 14:03, Bernhard Schmidt <bschmidt_at_freebsd.org> wrote:
> >> Its name is ieee80211_tx_mgt_timeout; it is used to track AUTH/ASSOC
> >> requests. AFAIK there is even a similar PR about that.
> 
> Sorry, I manually entered the panic message, since dumps were not
> working on my system at the time of that panic.
> 
> >> Adrian, have you got an AP set up to drop either an AUTH or an ASSOC
> >> response frame?
> 
> I've got a number of AUTH -> SCAN transition lost messages for wlan0,
> seconds to minutes apart:
> 
> Jun 28 21:16:17 kernel: wlan0: ieee80211_new_state_locked: pending AUTH
> -> SCAN transition lost
> Jun 28 21:34:46 kernel: wlan0: ieee80211_new_state_locked: pending AUTH
> -> SCAN transition lost
> Jun 28 21:36:33 kernel: wlan0: ieee80211_new_state_locked: pending AUTH
> -> SCAN transition lost
> Jun 28 21:45:14 kernel: wlan0: ieee80211_new_state_locked: pending AUTH
> -> SCAN transition lost
> Jun 28 21:45:44 kernel: wlan0: ieee80211_new_state_locked: pending AUTH
> -> SCAN transition lost
> 
> The setup is easy to reproduce, my rc.conf contained:
> 
> wlans_ath0="wlan0"
> ifconfig_ath0="down"
> ifconfig_wlan0="down"
> wpa_supplicant_enable="YES"

Strip the last three lines, and don't ever fiddle around with ath0
directly. As written, this configuration always starts wpa_supplicant.

> This system used to be connected via ath0, but recently was moved to a
> place where Ethernet is available. The panics started only after WLAN
> was not used anymore. I might disable wpa_supplicant, since it is not
> required in the current situation, but did not try whether that helps
> prevent the panic.
> 
> > Tell me how and I'll set it up.
> > 
> > A panic at that point in the function indicates maybe ni is NULL?
> > or ni->vap is now NULL, maybe?
> 
> I recreated the panic, this time with kernel dumps correctly configured
> (thanks for the hint, Scott). The panic message is:
> 
> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; apic id = 00
> fault virtual address   = 0xffffff809c7a1000
> fault code              = supervisor read data, page not present
> instruction pointer     = 0x20:0xffffffff805e1851
> stack pointer           = 0x28:0xffffff8000288ab0
> frame pointer           = 0x28:0xffffff8000288b60
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 11 (swi4: clock)
> 
> Traceback:
> 
> #10 0xffffffff805e1851 in ieee80211_tx_mgt_timeout (arg=0xffffff809c7a1000)
>     at ../../../net80211/ieee80211_output.c:2487
> 
> This indicates that an invalid argument is passed and assigned to
> "*ni", which causes the page fault when "ni" is dereferenced to obtain
> "*vap".

The problem here seems to be wpa_supplicant. It can try to associate
at any point in time, which results in the BSS node (ni) being
destroyed even though it might still be referenced somewhere (in this
case by the management timeout or, more precisely, ath's TX queue).
Not clearing that reference (or not stopping whatever is using it) is
the bug here. Now, how do we figure out who the caller is? Do you have
the complete backtrace?
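
To make the lifetime problem concrete, here is a minimal userland sketch
in C. It is not net80211 code: struct node, struct timer, and every
function below are hypothetical stand-ins. It assumes a plain
reference-count scheme in which an armed timer holds its own reference
on the node and releases it when the timer fires or is cancelled, so
the callback never sees a freed pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a net80211 node; not the real structure. */
struct node {
    int refcnt;     /* outstanding references */
    int vap_state;  /* stand-in for state reached through the node */
};

/* Hypothetical one-shot timer that remembers its callback argument. */
struct timer {
    bool armed;
    void (*fn)(void *);
    struct node *arg;
};

static int nodes_freed;   /* bookkeeping so the sketch can be checked */
static int timeouts_run;

static struct node *node_alloc(void)
{
    struct node *ni = calloc(1, sizeof(*ni));
    if (ni == NULL)
        abort();
    ni->refcnt = 1;       /* the creator owns the first reference */
    return ni;
}

static struct node *node_ref(struct node *ni)
{
    ni->refcnt++;
    return ni;
}

static void node_unref(struct node *ni)
{
    if (--ni->refcnt == 0) {
        free(ni);
        nodes_freed++;
    }
}

/* Timeout callback: reads through the node, as the real handler would. */
static void mgt_timeout(void *arg)
{
    struct node *ni = arg;
    (void)ni->vap_state;  /* valid because the timer holds a reference */
    timeouts_run++;
}

/* Arming takes a reference on behalf of the pending callback. */
static void timer_arm(struct timer *t, void (*fn)(void *), struct node *ni)
{
    t->fn = fn;
    t->arg = node_ref(ni);
    t->armed = true;
}

/* Cancelling drops the timer's reference and clears the stale pointer. */
static void timer_cancel(struct timer *t)
{
    if (!t->armed)
        return;
    t->armed = false;
    node_unref(t->arg);
    t->arg = NULL;
}

/* Firing runs the callback, then drops the timer's reference. */
static void timer_fire(struct timer *t)
{
    if (!t->armed)
        return;
    t->armed = false;
    t->fn(t->arg);
    node_unref(t->arg);
    t->arg = NULL;
}

/* Exercises both teardown orders; returns 0 if the counts match. */
static int demo(void)
{
    struct timer t = {0};

    /* Case 1: stop the timer first, then destroy the node. */
    struct node *a = node_alloc();
    timer_arm(&t, mgt_timeout, a);
    timer_cancel(&t);
    node_unref(a);        /* last reference, node is freed */
    timer_fire(&t);       /* no-op, timer is no longer armed */

    /* Case 2: the owner lets go while the timer is still armed. */
    struct node *b = node_alloc();
    timer_arm(&t, mgt_timeout, b);
    node_unref(b);        /* timer's reference keeps b alive */
    timer_fire(&t);       /* callback runs, then b is freed */

    return (nodes_freed == 2 && timeouts_run == 1) ? 0 : 1;
}
```

Calling demo() from a main() walks through both orders. Either
approach (cancelling the timer before the node goes away, or letting
the timer pin the node) avoids the dangling pointer; which one fits the
real code depends on who armed the timeout, hence the question about
the backtrace.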

-- 
Bernhard