Re: regression: msk0 watchdog timeout and interrupt storm

From: Boris Samorodov <bsam_at_passap.ru>
Date: Sun, 09 Feb 2014 20:56:21 +0400
06.02.2014 21:12, Boris Samorodov wrote:
> 06.02.2014 06:00, Yonghyeon PYUN wrote:
>> On Sat, Feb 01, 2014 at 12:18:59PM +0400, Boris Samorodov wrote:
>>> Hi Yonghyeon and All,
>>>
>>> (this time it's a CURRENT issue)
>>>
>>> 31.10.2013 17:33, Boris Samorodov wrote:
>>>> 30.10.2013 06:16, Yonghyeon PYUN wrote:
>>>>> On Tue, Oct 29, 2013 at 05:38:27PM +0400, Boris Samorodov wrote:
>>>>
>>>>>> From time to time I use a notebook and boot FreeBSD from a USB
>>>>>> stick. FreeBSD 9.2-i386 works OK. So I tried
>>>>>> FreeBSD 10.0-i386 BETA2, and the network adapter works for
>>>>>> some 10-15 seconds and then stops with the diagnostic message
>>>>>> "msk0:watchdog timeout". I've found a similar case on
>>>>>> freebsd-current@ with no workaround. Yes, there is an
>>>>>> interrupt storm as well.
>>>>>
>>>>> There have been no functional changes for a very long time, so I'm
>>>>> not sure what's going on here.  I've attached the local change I
>>>>> have at this moment, but I'm afraid it won't address the issue above.
>>>>>
>>>>> I recall jhb also reported an interrupt storm in the past, but the
>>>>> root cause has not been identified yet.  Could you change msk_intr()
>>>>> and let me know which interrupt is firing?
>>>>
>>>> I've yet to organize a build.
>>>>
>>>>>> Here is some additional info:
>>>>>> -----
>>>>>> mskc0@pci0:3:0:0:       class=0x020000 card=0xff501179 chip=0x435511ab rev=0x12 hdr=0x00
>>>>>>     vendor     = 'Marvell Technology Group Ltd.'
>>>>>>     device     = '88E8040T PCI-E Fast Ethernet Controller'
>>>>>>     class      = network
>>>>>>     subclass   = ethernet
>>>>>>     cap 01[48] = powerspec 3  supports D0 D1 D2 D3  current D0
>>>>>>     cap 05[5c] = MSI supports 1 message, 64 bit enabled with 1 message
>>>>>>     cap 10[c0] = PCI-Express 2 legacy endpoint max data 128(128) link x1(x1)
>>>>>>                  speed 2.5(2.5) ASPM disabled(L0s/L1)
>>>>>>     ecap 0001[100] = AER 1 0 fatal 0 non-fatal 1 corrected
>>>>>>     ecap 0003[130] = Serial 1 b8b063ffff681e00
>>>>>> -----
>>>>
>>>> Meanwhile, some more investigation: "vmstat -i" for calm and storm:
>>>> -----
>>>> interrupt                          total       rate
>>>> irq1: atkbd0                        1025          2
>>>> irq9: acpi0                          204          0
>>>> irq14: ata0                          327          0
>>>> irq16: uhci0+                        246          0
>>>> irq20: hpet0                       22472         52
>>>> irq23: uhci2 ehci1                 10341         24
>>>> irq256: hdac0                         52          0
>>>> irq257: mskc0                        258          0
>>>> irq258: ahci0                        221          0
>>>> Total                              35146         81
>>>> -----
>>>> interrupt                          total       rate
>>>> irq1: atkbd0                        1508          2
>>>> irq9: acpi0                          234          0
>>>> irq14: ata0                          409          0
>>>> irq16: uhci0+                        246          0
>>>> irq20: hpet0                       72288        131
>>>> irq23: uhci2 ehci1                 10846         19
>>>> irq256: hdac0                         52          0
>>>> irq257: mskc0                    4419760       8021
>>>> irq258: ahci0                        221          0
>>>> Total                            4505564       8177
>>>> -----
>>>>
>>>> And "vmstat -w1" for calm and storm:
>>>> -----
>>>>  procs      memory      page                    disks     faults         cpu
>>>>  r b w     avm    fre   flt  re  pi  po    fr  sr mm0 ad0   in   sy   cs us sy id
>>>>  0 0 0  206928  956040   277   0   2   0   330   4   0   0  117  476  454  0  1 99
>>>>  0 0 0  206928  956036     0   0   0   0     8   4   0   0   50  123  137  0  0 100
>>>>  0 0 0  206928  956036     0   0   0   0     0   4   0   0   47  120   92  0  1 99
>>>>  0 0 0  206928  956036     0   0   0   0     0   4   0   0   43  123  119  0  1 99
>>>>  0 0 0  206928  956036     0   0   0   0     0   4   0   0   55  132  123  0  1 99
>>>>  0 0 0  206928  956004     0   0   0   0     0   4   0   0   68  123  185  0  1 99
>>>>  0 0 0  206928  956036     0   0   0   0     8   4   0   0   86  123  266  0  1 99
>>>>  0 0 0  206928  956036     0   0   0   0     0   4   0   0   44  125  124  0  0 100
>>>>  0 0 0  206928  956036     0   0   0   0     0   4   0   0   64  128  164  0  1 99
>>>>  0 0 0  206928  956036     0   0   0   0     0   4   0   0   42  131  101  0  1 99
>>>> -----
>>>>  procs      memory      page                    disks     faults         cpu
>>>>  r b w     avm    fre   flt  re  pi  po    fr  sr mm0 ad0   in   sy   cs us sy id
>>>>  0 0 0  213648  954676   104   0   1   0   121   4   0   0 22299  204 44262  0 10 90
>>>>  0 0 0  213648  954672     0   0   0   0     8   4   0   0 112259  123 222379  0 44 56
>>>>  0 0 0  213648  954672     0   0   0   0     0   4   0   0 111792  123 221489  0 43 57
>>>>  0 0 0  213648  954672     1   0   0   0     0   4   0   0 109887  183 217754  0 43 57
>>>>  0 0 0  213648  954668     2   0   0   0     0   4   0   0 109543  146 216963  0 44 56
>>>>  0 0 0  213648  954668     0   0   0   0     0   4   0   0 110142  123 218187  0 45 55
>>>>  0 0 0  213648  954660   472   0   0   0   474   4   0   0 109340  717 216674  0 42 57
>>>>  0 0 0  213648  954656     2   0   0   0     0   4   0   0 109459  147 216831  0 43 57
>>>>  0 0 0  213648  954656     0   0   0   0     0   4   0   0 109462  131 216827  0 43 57
>>>>  0 0 0  213648  954656     0   0   0   0     0   4   0   0 109454  123 216803  0 42 58
>>>> -----
>>>>
>>>> Dmesg is here: ftp://ftp.wart.ru/pub/misc/tos.dmesg.boot.txt .
>>>>
>>>> BTW, some more observations. While downloading a file the system
>>>> hits the watchdog timeout rather quickly, but the system keeps
>>>> working. If I try to upload files the system works much longer (for
>>>> a couple of minutes) but then freezes. Ctrl-Alt-Esc does not respond.
>>>> Only a cold restart works.
>>>
>>> I've successfully upgraded to 10.0-RELEASE. Then I tried CURRENT
>>> (verbose dmesg is here: ftp://ftp.wart.ru/pub/misc/dmesg.boot.a300.txt )
>>> and I got watchdog timeouts. The situation is very similar
>>> (see the previous diagnostics). The only differences: uploads happen
>>> very quickly, and the machine does not freeze and operates well.
>>>
>>> This time I have sources and can test patches (if any) rather
>>> quickly.
>>>
>>
>> There is no driver code difference between CURRENT and
>> 10.0-RELEASE.  If you don't encounter watchdog timeouts on
>> 10.0-RELEASE I have no idea what's going on there.
>> I recall a couple of users are seeing msk(4) watchdog timeouts on
>> 10.0-RELEASE/CURRENT so I started to think about r234666 which was
>> not merged to stable/9 and stable/8.
>>
>> Could you back out r234666 and let me know whether it makes any
>> difference for you?
> 
> Thank you!
> 
> That was it. The system survived svn up of /usr/src, rebuild/reinstall
> and almost 25000 patches were downloaded by portsnap.
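
The calm and storm "vmstat -i" snapshots quoted above can also be compared
mechanically. The following is only a minimal sketch, not something used in
this thread: the function names, the abbreviated sample data and the
100000-interrupt threshold are illustrative assumptions. It subtracts the
cumulative totals of two snapshots and reports the sources that grew
suspiciously fast.

```python
import re

# Abbreviated "vmstat -i" snapshots copied from the thread (calm, then storm).
CALM = """\
irq20: hpet0                       22472         52
irq23: uhci2 ehci1                 10341         24
irq257: mskc0                        258          0
"""
STORM = """\
irq20: hpet0                       72288        131
irq23: uhci2 ehci1                 10846         19
irq257: mskc0                    4419760       8021
"""

# source name, cumulative total, average rate
LINE = re.compile(r"^(irq\d+: .+?)\s+(\d+)\s+(\d+)\s*$")


def parse_totals(text):
    """Map each interrupt source to its cumulative interrupt count."""
    totals = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            totals[match.group(1)] = int(match.group(2))
    return totals


def storm_suspects(before, after, threshold=100_000):
    """Return sources whose total grew by more than `threshold` (assumed value)."""
    old, new = parse_totals(before), parse_totals(after)
    growth = {name: new[name] - old.get(name, 0) for name in new}
    return {name: delta for name, delta in growth.items() if delta > threshold}


if __name__ == "__main__":
    print(storm_suspects(CALM, STORM))
```

With the sample data above only irq257 (mskc0) exceeds the threshold, with a
growth of 4419502 interrupts between the two snapshots.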

Some additional info. As of r261651 on CURRENT the driver works for me
if I:
. disable multi-core in the BIOS (so kern.smp.cpus: 1);
. do not load the driver from /boot/loader.conf (i.e. use the driver
  built into the kernel);
. disable WITNESS* and INVARIANTS* (GENERIC does not work even with a
  single CPU).
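
A kernel configuration along the lines of the last two items might look like
the fragment below. It is only a sketch: the ident MSKTEST is a made-up name,
and it assumes that "WITNESS*" and "INVARIANTS*" stand for the four debugging
options listed here.

```
# Hypothetical ident, chosen for illustration only.
include   GENERIC
ident     MSKTEST

# Assumed expansion of "WITNESS*" and "INVARIANTS*".
nooptions WITNESS
nooptions WITNESS_SKIPSPIN
nooptions INVARIANTS
nooptions INVARIANT_SUPPORT
```

Since GENERIC already contains "device msk", the driver stays built into the
kernel and no if_msk_load line is needed in /boot/loader.conf.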

-- 
WBR, Boris Samorodov (bsam)
FreeBSD Committer, http://www.FreeBSD.org The Power To Serve
Received on Sun Feb 09 2014 - 15:56:34 UTC
