Re: Interrupt storm with MSI in combination with em1

From: Daan Vreeken <Daan_at_vehosting.nl>
Date: Fri, 6 May 2011 17:02:42 +0200
On Thursday 05 May 2011 22:22:15 Jack Vogel wrote:
> On Thu, May 5, 2011 at 1:17 PM, Daan Vreeken <Daan_at_vehosting.nl> wrote:
> > Hi Peter,
> >
> > On Thursday 05 May 2011 21:28:02 Peter Jeremy wrote:
> > > On 2011-May-05 13:22:59 +0200, Daan Vreeken <Daan_at_vehosting.nl> wrote:
> > > >Not yet. I'll reboot the machine later today when I have physical
> > > > access to it to check the BIOS version. I'll keep you informed as
> > > > soon as I get another storm going.
> > >
> > > Depending on the quality of your BIOS (competence of the vendor), you
> > > might find that kenv(8) reports the BIOS version without needing a
> > > reboot.
> > > (Look at smbios.bios.* in the output).
...
> > smbios.bios.version="0303   "
...
> > Version "0402" is the latest and greatest, so it's time to upgrade.
> > According to Asus it "Improves system stability", so let's see if this
> > 'cures' IRQ 16.
>
> Cool, thanks for the update! Good luck.

I've updated the BIOS and let the machine run with MSI/MSI-X enabled. After 
about 3 hours of uptime the storm is back.

Here are the first few lines of "top -S" output:

	last pid: 33218;  load averages:  0.47,  0.35,  0.33    up 0+03:52:10  16:42:52
	317 processes: 6 running, 289 sleeping, 22 waiting
	CPU:  0.4% user,  0.0% nice,  0.5% system, 11.6% interrupt, 87.5% idle
	Mem: 280M Active, 176M Inact, 1797M Wired, 8572K Cache, 32M Buf, 5545M Free
	Swap: 500M Total, 500M Free
	PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
	11 root             4 171 ki31     0K    64K CPU0    0 893:17 351.95% idle
	12 root            23 -80    -     0K   368K WAIT    2  18:37 50.39% intr

One core is spending half its time handling interrupts.
/var/log/messages doesn't show any new messages since the storm 
started. "vmstat -i" now shows:

	# vmstat -i
	interrupt                          total       rate
	irq3: uart1                       917384         63
-->	irq16: ehci0                   809547235      55608
	irq23: ehci1                     1751385        120
	cpu0:timer                      16380717       1125
	irq256: em0:rx 0                 1651907        113
	irq257: em0:tx 0                 1495708        102
	irq258: em0:link                       3          0
	irq259: em1:rx 0                  397227         27
	irq260: em1:tx 0                  257865         17
	irq261: em1:link                       6          0
	irq262: re0                        10549          0
	irq263: ahci0                     290926         19
	cpu1:timer                       1160008         79
	cpu3:timer                        763939         52
	cpu2:timer                       4120133        283
	irq272: hdac0                     819282         56
	Total                          839564274      57670

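For anyone who wants to pick the offending source out of output like this automatically, here is a minimal Python sketch. It is not part of the original report, and the script and function names are hypothetical. It assumes the three-column layout shown above (source name, total, rate), skips the header and "Total" lines, and ranks sources by rate. Fed the numbers above, it puts irq16 (ehci0) at roughly 96% of all interrupts per second.

```python
import re
import sys

# Matches one data line of "vmstat -i": a free-form source name (which may
# contain spaces, e.g. "irq256: em0:rx 0"), then the total count and the rate.
LINE_RE = re.compile(r"^(?P<name>.+?)\s+(?P<total>\d+)\s+(?P<rate>\d+)$")


def parse_vmstat_i(text):
    """Return (name, total, rate) tuples, skipping the header and Total line."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # header ("interrupt total rate") or blank line
        name = m.group("name")
        if name == "Total":
            continue
        rows.append((name, int(m.group("total")), int(m.group("rate"))))
    return rows


def top_sources(rows, n=3):
    """Return the n busiest sources as (name, rate, percent of summed rate)."""
    # Sum of the per-line rates; this can differ slightly from the Total
    # line because each rate is rounded individually.
    total_rate = sum(rate for _, _, rate in rows)
    if total_rate == 0:
        return []
    ranked = sorted(rows, key=lambda r: r[2], reverse=True)
    return [(name, rate, 100.0 * rate / total_rate)
            for name, _, rate in ranked[:n]]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 rank_intr.py FILE  (FILE holds saved "vmstat -i" output)
    with open(sys.argv[1]) as f:
        for name, rate, share in top_sources(parse_vmstat_i(f.read())):
            print(f"{name:24s} {rate:8d}/s {share:5.1f}%")
```

Saving the output first ("vmstat -i > /tmp/intr.txt") and then running the script on that file keeps the sketch free of any assumptions about how it is invoked on the machine itself.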
Apart from spending far too much time handling interrupts, the machine works 
fine, so I'll let it run in case anyone wants me to try something on it.

As a next step in isolating the problem, I could build a kernel with 
MSI/MSI-X enabled but with a modified 'em' driver that doesn't attach the 
MSI/MSI-X interrupts. That would show whether the problem is really related 
to the network cards or not.
If anyone has a better idea, I'm all ears :)


Regards,
-- 
Daan Vreeken
VEHosting
http://VEHosting.nl
tel: +31-(0)40-7113050 / +31-(0)6-46210825
KvK nr: 17174380
Received on Fri May 06 2011 - 13:02:49 UTC