> On Jul 21, 2019, at 4:17 PM, Andriy Gapon <avg_at_freebsd.org> wrote:
>
>> On 20/07/2019 20:08, Patrick Kelsey wrote:
>>
>>
>> On Fri, Jul 19, 2019 at 10:07 AM Andriy Gapon <avg_at_freebsd.org
>> <mailto:avg_at_freebsd.org>> wrote:
>>
>>
>> Recently we experienced a strange problem.
>> We noticed a lot of these messages in the logs:
>> vmx0: watchdog timeout on queue 2
>> (always queue 2)
>> Also, we noticed that connections to some end points did not work at all
>> while others worked without problems. I assume that that was because
>> specific flows got assigned to that queue 2.
>>
>> Further investigation has shown that none of interrupts assigned to the
>> BSP has ever fired (since boot, of course). That included vmx0:rx2 and
>> vmx0:tx2. But also interrupts for other drivers as well.
>>
>> Trying to get more information I rebooted the system and the problem
>> disappeared.
>>
>> Has anyone seen anything like that?
>> Any thoughts on possible causes?
>> Any suggestions what to check if/when the problem reoccurs?
>>
>> Thanks!
>>
>>
>> If you are running head at or after r347221 or stable/12 at or after
>> r349112, then this could be due to
>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239118 (see Comment 4
>> - short story is that an iflib change has broken the vmx driver).
>
> I am not sure if that bug could lead to all interrupts on the core
> getting disabled (for all drivers), and right at the boot time.

I am not sure either, but it’s the kind of bug that breaks the design of the vmx driver in such a way that its state can get corrupted to the point where the kernel can panic. I haven’t fully analyzed the potential scope of memory corruption / hardware state corruption that can occur (because the fix for the issue is already apparent), so I am freely considering it to include elements beyond the device and driver itself.
If you are saying that zero vmx queue interrupts have occurred anywhere in the system, then I would rule out any connection to this bug, as a prerequisite for the corruption to occur is having at least one such interrupt.

-Patrick

Received on Sun Jul 21 2019 - 18:32:10 UTC