Re: iwn crashes in current (r282269)

From: Adrian Chadd <adrian_at_freebsd.org>
Date: Sat, 2 May 2015 02:03:52 -0700

Hi,

On 2 May 2015 at 00:02, Poul-Henning Kamp <phk_at_phk.freebsd.dk> wrote:
> May  2 01:01:34 critter kernel: iwn0: device timeout
> May  2 01:01:34 critter kernel: firmware: 'iwn6000g2afw' version 0: 677296 bytes loaded at 0xffffffff81f880c0
> May  2 01:01:34 critter kernel: iwn0: iwn_read_firmware: ucode rev=0x12a80601
> May  2 01:01:40 critter kernel: iwn0: iwn_tx_data: m=0xfffff80236fe8500: seqno (9550) (78) != ring index (0) !
> May  2 01:01:40 critter kernel: iwn0: iwn_intr: fatal firmware error
> May  2 01:01:40 critter kernel: iwn0: iwn_panicked: controller panicked, iv_state = 5; resetting...
> May  2 01:01:40 critter kernel: firmware: 'iwn6000g2afw' version 0: 677296 bytes loaded at 0xffffffff81f880c0
> May  2 01:01:40 critter kernel: iwn0: iwn_read_firmware: ucode rev=0x12a80601
>
> And then the machine hung.
>
> No further details, as the screen-blanker was on.

So there's something odd with iwn and sequence number allocation.
What's supposed to happen here is that:

* net80211 handles sequence number allocation;
* then A-MPDU is negotiated;
* then the driver handles sequence number allocations.

The firmware requires that for 11n transmit, each frame goes into a
ring slot that's seqno % 256. It's not an arbitrary slot. It'll panic
otherwise, like you saw above.
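
To make that constraint concrete, here's a minimal sketch of the check.
It is not the iwn(4) code; TX_RING_SLOTS, expected_slot() and
slot_matches() are names made up for illustration, and the ring size of
256 is taken from the "seqno % 256" rule above. The numbers come from
the log: 9550 % 256 is 78, which is the "(78)" printed next to the
seqno, and the ring index was 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed ring size, taken from the "seqno % 256" rule in the text. */
#define TX_RING_SLOTS 256

/* Ring slot the firmware expects an aggregation frame to occupy. */
static inline unsigned
expected_slot(uint16_t seqno)
{
	return seqno % TX_RING_SLOTS;
}

/* True if the frame would land in the slot the firmware expects. */
static inline bool
slot_matches(uint16_t seqno, unsigned ring_cur)
{
	return expected_slot(seqno) == ring_cur;
}
```

With the values from the log, slot_matches(9550, 0) is false, which is
the mismatch iwn_tx_data() complained about just before the firmware
error.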

Now, something's upsetting it. It may be a noisy environment leading
to BAR frame transmissions and eventual tear-down of the A-MPDU state,
leading to net80211 taking over sequence number allocation again. I
fixed a whole bunch of those races in the ath(4) driver when I implemented
11n and found there's no locking at all going on there. :( It could
also be something inside net80211 that's advancing the sequence number
space, even though A-MPDU is enabled.

There's only a couple of places where ni_txseqs is updated in
net80211. If it were getting updated there, it should be obvious. But
it does do a check to see if AMPDU is enabled and running, and none of
that is consistently locked.

iwn_addba_response() sets ni_txseqs[] for the tid to be whatever was
negotiated during the aggregation negotiation (ADDBA) and then sets
the initial ring slot id to be whatever the starting sequence number
is ('ssn' in *_ampdu_tx_start()). iwn_tx_data() does do sequence
number allocation there. It's possible we're seeing races where
aggregation is being torn down during active transmit and the state is
all mucked up.

I recall seeing issues in ath(4) where there were some packets queued
between sending out the initial aggregation negotiation and it being
negotiated, which meant some packets would go out with sequence
numbers /after/ what was initially negotiated during ADDBA. I.e.:

* you're at seq X, and you negotiate ADDBA at seq X;
* you queue a bunch of transmit frames, seq X -> X + n;
* peer says "ADDBA acceptable, starting seq X";
* the next frame you transmit comes from seq X + n + 1, but the other
peer is confused.

Here it may show up as:

* you negotiate seq X via addba;
* you queue a bunch more frames via the normal transmit path;
* you get the addba response, set initial ssn to X;
* the 'cur' pointer here in the ring is now X % 256, but the next
frame you transmit is (X + n) % 256, and stuff is out of alignment.
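
The suspected sequence can be modelled in a few lines. This is only a
toy model of the hypothesis, not driver code: struct tx_model,
queue_legacy(), addba_response() and misalignment() are hypothetical
names, and it assumes the ring pointer is set from the negotiated ssn
while the sequence counter is left wherever the normal transmit path
pushed it.

```c
#include <stdint.h>

#define RING_SLOTS 256	/* from the "seqno % 256" rule */
#define SEQ_SPACE  4096	/* 802.11 sequence numbers are 12 bits */

/* Toy model of one TID's transmit state (hypothetical, not iwn's). */
struct tx_model {
	uint16_t next_seq;	/* next sequence number to hand out */
	unsigned ring_cur;	/* next ring slot the driver will fill */
};

/* n frames go out via the normal path while ADDBA is still pending. */
static void
queue_legacy(struct tx_model *m, unsigned n)
{
	m->next_seq = (m->next_seq + n) % SEQ_SPACE;
}

/*
 * ADDBA response arrives: the ring pointer is set from the negotiated
 * ssn, but next_seq is not touched (the assumed bug).
 */
static void
addba_response(struct tx_model *m, uint16_t ssn)
{
	m->ring_cur = ssn % RING_SLOTS;
}

/* Slots between where the next frame must go and ring_cur; 0 = aligned. */
static unsigned
misalignment(const struct tx_model *m)
{
	return (m->next_seq % RING_SLOTS + RING_SLOTS - m->ring_cur) %
	    RING_SLOTS;
}
```

Start at seq 100, call queue_legacy(&m, 5), then addba_response(&m, 100):
misalignment(&m) comes out as 5. If nothing is queued in between it is 0,
which would explain why this only shows up some of the time.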

So, would someone please help see if that's the case? That'd be really
helpful. :)

-adrian