Re: Panic in ieee80211 tx mgmt timeout

From: Stefan Esser <se_at_freebsd.org>
Date: Wed, 29 Jun 2011 10:53:41 +0200

On 29.06.2011 10:03, Adrian Chadd wrote:
> On 29 June 2011 14:03, Bernhard Schmidt <bschmidt_at_freebsd.org> wrote:
>> Its name is ieee80211_tx_mgt_timeout, used to track AUTH/ASSOC
>> requests. Afaik there is even a similar PR about that.

Sorry, I manually entered the panic message, since dumps were not
working on my system at the time of that panic.

>> Adrian, you've got a AP set up to drop either a AUTH or ASSOC
>> response frame?

I've got a number of AUTH -> SCAN transition lost messages for wlan0,
seconds to minutes apart:

Jun 28 21:16:17 kernel: wlan0: ieee80211_new_state_locked: pending AUTH -> SCAN transition lost
Jun 28 21:34:46 kernel: wlan0: ieee80211_new_state_locked: pending AUTH -> SCAN transition lost
Jun 28 21:36:33 kernel: wlan0: ieee80211_new_state_locked: pending AUTH -> SCAN transition lost
Jun 28 21:45:14 kernel: wlan0: ieee80211_new_state_locked: pending AUTH -> SCAN transition lost
Jun 28 21:45:44 kernel: wlan0: ieee80211_new_state_locked: pending AUTH -> SCAN transition lost

The setup is easy to reproduce; my rc.conf contained:

wlans_ath0="wlan0"
ifconfig_ath0="down"
ifconfig_wlan0="down"
wpa_supplicant_enable="YES"

This system used to be connected via ath0, but recently was moved to a
place where Ethernet is available. The panics started only after WLAN
was not used anymore. I might disable wpa_supplicant, since it is not
required in the current situation, but did not try whether that helps
prevent the panic.

> Tell me how and I'll set it up.
> 
> A panic at that point in the function indicates maybe ni is NULL?
> or ni->vap is now NULL, maybe?

I recreated the panic, this time with kernel dumps correctly configured
(thanks for the hint, Scott). The panic message is:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0xffffff809c7a1000
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff805e1851
stack pointer           = 0x28:0xffffff8000288ab0
frame pointer           = 0x28:0xffffff8000288b60
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 11 (swi4: clock)

Traceback:

#10 0xffffffff805e1851 in ieee80211_tx_mgt_timeout (arg=0xffffff809c7a1000)
    at ../../../net80211/ieee80211_output.c:2487

This indicates that an invalid argument is passed and assigned to "ni",
which causes the page fault when "ni" is dereferenced to obtain "vap".
Note that the fault virtual address equals "arg" (0xffffff809c7a1000).

I'm afraid that the assumption in the comment (about the timeout being
safe to use) does not really hold:

static void
ieee80211_tx_mgt_timeout(void *arg)
{
        struct ieee80211_node *ni = arg;
        struct ieee80211vap *vap = ni->ni_vap;

        if (vap->iv_state != IEEE80211_S_INIT &&
            (vap->iv_ic->ic_flags & IEEE80211_F_SCAN) == 0) {
                /*
                 * NB: it's safe to specify a timeout as the reason here;
                 *     it'll only be used in the right state.
                 */
                ieee80211_new_state(vap, IEEE80211_S_SCAN,
                        IEEE80211_SCAN_FAIL_TIMEOUT);
        }
}

If "vap" is valid during one invocation of that function, I'd expect it
to at least be a pointer to valid kernel memory after the timeout.
I.e., the value found by dereferencing it may be stale, but the pointer
itself should at least not cause a page fault. (???)

The compressed core.txt is 27KB, the compressed vmcore is 800MB. I might
be able to find a place to upload the vmcore file to, but since I'm
currently on a DSL with only 672KBit/s upstream, it would take me some 3
hours to upload to a better connected server (and I'd like to avoid
doing that, if not essential for debugging).

The core.txt is small enough to send by mail. Let me know if you think
it helps you understand the problem.

I'm willing to support debugging, e.g. by placement of printfs in my
kernel for the timeout handler and the creation and destruction of *vap
structures.

After removal of "wlans_ath0=wlan0" the system will most probably be
stable, but I did not specifically test this case (i.e. ath0 configured,
but no wlan0 created). I do know that an "ifconfig down" of ath0 and
wlan0 suffices; probably an "ifconfig wlan0 down" alone would be enough.

So, I know how to avoid the panic, but I think it is still important to
find the cause.

Thank you for looking into this!

Best regards, STefan