Re: em interrupt storm

From: Michael Vince <mv_at_roq.com>
Date: Thu, 24 Nov 2005 15:22:11 +1100

Scott Long wrote:

> Michael Vince wrote:
>
>> Kris Kennaway wrote:
>>
>>> On Tue, Nov 22, 2005 at 08:54:49PM -0800, John Polstra wrote:
>>>  
>>>
>>>> On 23-Nov-2005 Kris Kennaway wrote:
>>>>  
>>>>
>>>>> I am seeing the em driver undergoing an interrupt storm whenever the
>>>>> amr driver receives interrupts.  In this case I was running newfs on
>>>>> the amr array and em0 was not in use:
>>>>>
>>>>>   28 root        1 -68 -187     0K     8K CPU1   1   0:32 53.98% irq16: em0
>>>>>   36 root        1 -64 -183     0K     8K RUN    1   0:37 27.75% irq24: amr0
>>>>>
>>>>> # vmstat -i
>>>>> interrupt                          total       rate
>>>>> irq1: atkbd0                           2          0
>>>>> irq4: sio0                           199          1
>>>>> irq6: fdc0                            32          0
>>>>> irq13: npx0                            1          0
>>>>> irq14: ata0                           47          0
>>>>> irq15: ata1                          931          5
>>>>> irq16: em0                       6321801      37187
>>>>> irq24: amr0                        28023        164
>>>>> cpu0: timer                       337533       1985
>>>>> cpu1: timer                       337285       1984
>>>>> Total                            7025854      41328
>>>>>
>>>>> When newfs finished (i.e. amr was idle), em0 stopped storming.
>>>>>
>>>>> MPTable: <INTEL    SE7520BD22  >
>>>>>     
>>>>
>>>>
>>>> This is the dreaded interrupt aliasing problem that several of us have
>>>> experienced with this chipset.  High-numbered interrupts alias down to
>>>> interrupts in the range 16..19 (or maybe 16..23), a multiple of 8 less
>>>> than the original interrupt.
>>>>
>>>> Nobody knows what causes it, and nobody knows how to fix it.
>>>>   
>>>
>>>
>>>
>>> This would be good to document somewhere so that people either don't
>>> accidentally buy this hardware, or know what to expect when they run
>>> it.
>>>
>>> Kris
>>>  
>>>
>> This is Intel's latest server chipset design, and Dell is putting 
>> that chipset in all their servers.
>> Luckily I haven't seen the problem on any of my Dell servers (as 
>> long as I am looking at this right).
>>
>> This server has been running for a long time.
>> vmstat -i
>> interrupt                          total       rate
>> irq1: atkbd0                           6          0
>> irq4: sio0                         23433          0
>> irq6: fdc0                            10          0
>> irq8: rtc                     2631238611        128
>> irq13: npx0                            1          0
>> irq14: ata0                           99          0
>> irq16: uhci0                  1507608958         73
>> irq18: uhci2                    42005524          2
>> irq19: uhci1                           3          0
>> irq23: atapci0                       151          0
>> irq46: amr0                     41344088          2
>> irq64: em0                    1513106157         73
>> irq0: clk                     2055605782         99
>> Total                         7790932823        379
>>
>> This one just transferred over 8 gigs of data in 77 seconds with around 
>> 1000 simultaneous TCP connections under a load of 35. Both seem OK.
>> vmstat -i
>> interrupt                          total       rate
>> irq4: sio0                           315          0
>> irq13: npx0                            1          0
>> irq14: ata0                           47          0
>> irq16: uhci0                     2894669          2
>> irq18: uhci2                      977413          0
>> irq23: ehci0                           3          0
>> irq46: amr0                       883138          0
>> irq64: em0                       2890414          2
>> cpu0: timer                   2763566717       1999
>> cpu3: timer                   2763797300       1999
>> cpu1: timer                   2763551479       1999
>> cpu2: timer                   2763797870       1999
>> Total                        11062359366       8004
>>
>> Mike
>>
>>
>
> Looks like at least some of your interrupts are being aliased to
> irq16, which just happens to be USB (uhci) in this case.  Note that the
> rate is the same between irq64 and irq16, and the totals are pretty
> close.  If you don't need USB, I'd suggest turning it off.
>
> Scott

Most of my Dell servers occasionally use the USB ports to serial out via 
tip, using a usb2serial cable with the uplcom driver into another 
server's real serial port (sio), so it's not really an option to 
disable USB.

How much do you think it affects performance if the USB device is 
actually rarely used?

I also have a 6-stable machine and noticed that the vmstat -i output 
lists em0 and uhci0 together on irq16, but em0 isn't used at all; em2 
and em3 are the active ones. It doesn't seem reasonable that my USB 
serial usage would be that high for irq16. Could it be that em2 and em3 
are also going through irq16?
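
The aliasing rule John described (a high-numbered interrupt showing up a
multiple of 8 lower, somewhere in 16..19 or maybe 16..23) can be turned
into a quick arithmetic check. This is only a sketch: the function name
alias_candidates and the default range 16..23 are assumptions made for
illustration, not anything taken from the kernel or from chipset
documentation.

```python
def alias_candidates(irq, low=16, high=23, step=8):
    """Return IRQs in [low, high] that lie a positive multiple of `step` below `irq`.

    The range and step follow the rule quoted in this thread; both are
    assumptions, since nobody here knows the real cause of the aliasing.
    """
    return [target for target in range(low, high + 1)
            if irq > target and (irq - target) % step == 0]


# IRQ numbers taken from the vmstat -i listings in this thread.
for irq in (24, 46, 64):
    print(irq, "->", alias_candidates(irq))
```

Under that rule irq24 and irq64 would both land on irq16, while irq46
would land on irq22.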

vmstat -i
interrupt                          total       rate
irq4: sio0                           228          0
irq14: ata0                           47          0
irq16: em0 uhci0                  917039         11
irq18: uhci2                       54823          0
irq23: ehci0                           3          0
irq46: amr0                        45998          0
irq64: em2                        898628         11
lapic0: timer                  159140889       1999
Total                          161057655       2024
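
Scott's test (same rate on both lines, totals pretty close) can be applied
mechanically to a listing like the one above. The following is a rough
sketch, not a diagnostic tool: the helper names parse_vmstat_i and
suspect_aliases, the 5% tolerance, and the 16..23 target range are all
assumptions chosen for illustration. A match only suggests aliasing; it
does not prove it.

```python
import re

# Sample input: the 6-stable vmstat -i output from this message.
SAMPLE = """\
interrupt                          total       rate
irq4: sio0                           228          0
irq14: ata0                           47          0
irq16: em0 uhci0                  917039         11
irq18: uhci2                       54823          0
irq23: ehci0                           3          0
irq46: amr0                        45998          0
irq64: em2                        898628         11
lapic0: timer                  159140889       1999
Total                          161057655       2024
"""

# Matches only "irqN: devices  total  rate" lines; timers and Total are skipped.
LINE = re.compile(r"^irq(\d+):\s+(.*?)\s+(\d+)\s+(\d+)\s*$")


def parse_vmstat_i(text):
    """Map IRQ number -> (device names, total, rate)."""
    rows = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            irq, devs, total, rate = m.groups()
            rows[int(irq)] = (devs, int(total), int(rate))
    return rows


def suspect_aliases(rows, tolerance=0.05, low=16, high=23):
    """Find (high_irq, dev, low_irq, dev) pairs that fit the aliasing pattern.

    A pair is flagged when the low IRQ is in [low, high], the IRQs differ
    by a positive multiple of 8, the rates are equal, and the totals are
    within `tolerance` of each other (relative to the larger total).
    """
    suspects = []
    for hi_irq, (hi_dev, hi_total, hi_rate) in sorted(rows.items()):
        for lo_irq, (lo_dev, lo_total, lo_rate) in sorted(rows.items()):
            if not (low <= lo_irq <= high and hi_irq > lo_irq):
                continue
            if (hi_irq - lo_irq) % 8 != 0:
                continue
            biggest = max(hi_total, lo_total)
            if hi_rate == lo_rate and abs(hi_total - lo_total) <= tolerance * biggest:
                suspects.append((hi_irq, hi_dev, lo_irq, lo_dev))
    return suspects


print(suspect_aliases(parse_vmstat_i(SAMPLE)))
```

On this listing the only pair flagged is irq64 (em2) against irq16
(em0 uhci0).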

Mike
Received on Thu Nov 24 2005 - 03:22:14 UTC
