Re: vmx0: watchdog timeout on queue 2, no interrupts on BSP

From: Patrick Kelsey <pkelsey_at_freebsd.org>
Date: Sun, 21 Jul 2019 16:32:04 -0400
> On Jul 21, 2019, at 4:17 PM, Andriy Gapon <avg_at_freebsd.org> wrote:
> 
>> On 20/07/2019 20:08, Patrick Kelsey wrote:
>> 
>> 
>> On Fri, Jul 19, 2019 at 10:07 AM Andriy Gapon <avg_at_freebsd.org> wrote:
>> 
>> 
>>    Recently we experienced a strange problem.
>>    We noticed a lot of these messages in the logs:
>>    vmx0: watchdog timeout on queue 2
>>    (always queue 2)
>>    Also, we noticed that connections to some end points did not work at all
>>    while others worked without problems.  I assume that that was because
>>    specific flows got assigned to that queue 2.
>> 
>>    Further investigation showed that none of the interrupts assigned to the
>>    BSP had ever fired (since boot, of course).  That included vmx0:rx2 and
>>    vmx0:tx2, but also interrupts for other drivers.
>> 
>>    Trying to get more information I rebooted the system and the problem
>>    disappeared.
>> 
>>    Has anyone seen anything like that?
>>    Any thoughts on possible causes?
>>    Any suggestions on what to check if/when the problem recurs?
>> 
>>    Thanks!
>> 
>> 
>> If you are running head at or after r347221 or stable/12 at or after
>> r349112, then this could be due to
>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239118 (see Comment 4
>> - short story is that an iflib change has broken the vmx driver).
> 
> I am not sure if that bug could lead to all interrupts on the core
> getting disabled (for all drivers), and right at the boot time.

I am not sure either, but it’s the kind of bug that breaks the design of the vmx driver badly enough that its state can become corrupted to the point where the kernel can panic.  I haven’t fully analyzed the potential scope of the memory corruption / hardware state corruption that can occur (because the fix for the issue is already apparent), so I am not ruling out effects that reach beyond the device and the driver itself.

If you are saying that zero vmx queue interrupts have occurred anywhere in the system, then I would rule out any connection to this bug, because a prerequisite for the corruption is that at least one such interrupt has occurred.
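
One way to check that condition is to look at the per-interrupt totals.  The sketch below is only an illustration: it assumes the counters come from something like `vmstat -ia` (the `-a` is assumed to be needed so that interrupts with a zero count are listed), and the sample text, IRQ numbers, counts, and helper names are all made up for the example.

```python
import re

# Hypothetical sample in the general shape of `vmstat -ia` output on FreeBSD.
# The IRQ numbers, names, and counts are invented for illustration only.
SAMPLE = """\
interrupt                          total       rate
irq1: atkbd0                           6          0
cpu0:timer                       1234567        100
irq256: vmx0:tx0                    4821          0
irq257: vmx0:rx0                   90412          3
irq260: vmx0:tx2                       0          0
irq261: vmx0:rx2                       0          0
Total                            1329806        103
"""

# A data line is: <name, may contain spaces> <total> <rate>
LINE_RE = re.compile(r"^(?P<name>.+?)\s+(?P<total>\d+)\s+(?P<rate>\d+)\s*$")


def parse_interrupt_totals(text):
    """Map each interrupt name to its total count; the header line is skipped."""
    totals = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            totals[m.group("name")] = int(m.group("total"))
    return totals


def queue_totals(totals, ifname="vmx0"):
    """Keep only the entries that belong to the given interface's queues."""
    return {name: n for name, n in totals.items() if ifname + ":" in name}


def any_queue_interrupt_fired(totals, ifname="vmx0"):
    """True if at least one queue interrupt of the interface has a nonzero total."""
    return any(n > 0 for n in queue_totals(totals, ifname).values())


def silent_queues(totals, ifname="vmx0"):
    """Names of the interface's queue interrupts that have never fired."""
    return sorted(name for name, n in queue_totals(totals, ifname).items() if n == 0)


if __name__ == "__main__":
    totals = parse_interrupt_totals(SAMPLE)
    print("any vmx0 queue interrupt fired:", any_queue_interrupt_fired(totals))
    print("queues that never fired:", silent_queues(totals))
```

On a live system the sample text would be replaced by the real command output.  If the first check comes back false for every vmx interface, the condition above (zero vmx queue interrupts anywhere) holds.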

-Patrick
Received on Sun Jul 21 2019 - 18:32:10 UTC
