Re: kernel panic caused by virtualbox(?)

From: Don Lewis <truckman_at_FreeBSD.org> Date: Thu, 11 Aug 2016 15:22:44 -0700 (PDT)

On 11 Aug, Konstantin Belousov wrote:
> On Wed, Aug 10, 2016 at 04:47:15PM -0700, Don Lewis wrote:
>> On 10 Aug, Jung-uk Kim wrote:
>> > On 08/09/16 05:12 AM, Konstantin Belousov wrote:
>> >> On Mon, Aug 08, 2016 at 04:44:20PM -0700, Don Lewis wrote:
>> >>> On  8 Aug, Konstantin Belousov wrote:
>> >>>> On Mon, Aug 08, 2016 at 10:22:44AM -0700, John Baldwin wrote:
>> >>>>> On Thursday, August 04, 2016 05:10:29 PM Don Lewis wrote:
>> >>>>>> Reposted to -current to get some more eyes on this ...
>> >>>>>>
>> >>>>>> I just got a kernel panic when I started up a CentOS 7 VM in virtualbox.
>> >>>>>> The host is:
>> >>>>>> 	FreeBSD 12.0-CURRENT #17 r302500 GENERIC amd64
>> >>>>>> The virtualbox version is:
>> >>>>>> 	virtualbox-ose-5.0.26
>> >>>>>> 	virtualbox-ose-kmod-5.0.26_1
>> >>>>>>
>> >>>>>> The panic message is:
>> >>>>>>
>> >>>>>> panic: Unregistered use of FPU in kernel
>> >>>>>> cpuid = 1
>> >>>>>> KDB: stack backtrace:
>> >>>>>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe085a55d030
>> >>>>>> vpanic() at vpanic+0x182/frame 0xfffffe085a55d0b0
>> >>>>>> kassert_panic() at kassert_panic+0x126/frame 0xfffffe085a55d120
>> >>>>>> trap() at trap+0x7ae/frame 0xfffffe085a55d330
>> >>>>>> calltrap() at calltrap+0x8/frame 0xfffffe085a55d330
>> >>>>>> --- trap 0x16, rip = 0xffffffff827dd3a9, rsp = 0xfffffe085a55d408, rbp = 0xfffffe085a55d430 ---
>> >>>>>> g_pLogger() at 0xffffffff827dd3a9/frame 0xfffffe085a55d430
>> >>>>>> g_pLogger() at 0xffffffff8274e5c7/frame 0x3
>> >>>>>> KDB: enter: panic
>> >>>>>>
>> >>>>>> Since g_pLogger is a symbol in vboxdrv.ko, it looks like virtualbox is
>> >>>>>> the trigger.
>> >>>>>>
>> >>>>>> There are no symbols for the virtualbox kmods, possibly because I
>> >>>>>> installed them as an upgrade using packages (built with the same source
>> >>>>>> tree version) instead of by using PORTS_MODULES in make.conf, so ports
>> >>>>>> kgdb didn't have anything useful to say about what happened before the
>> >>>>>> trap.
>> >>>>>>
>> >>>>>> This panic is very repeatable.  I just got another one when starting the
>> >>>>>> same VM, but this time the two calls before the trap were
>> >>>>>> null_bug_bypass().  Hmm, that symbol is in nullfs ...
>> >>>>>>
>> >>>>>> I don't see this with a Windows 7 VM.
>> >>>>>>
>> >>>>>> All of the virtualbox kmod files are compiled with -mno-mmx -mno-sse
>> >>>>>> -msoft-float -mno-aes -mno-avx
>> >>>> Your disassembly listed the fxrstor instruction as the one failing, or
>> >>>> did I misremember? This is most likely some context switch code, either
>> >>>> by the virtual machine or erroneously executed guest code. It is not a
>> >>>> spontaneous use of the FPU, but more likely something different. Can you
>> >>>> confirm?
>> >>>>
>> >>>> In either case, I do not remember any KBI changes around PCB layout or
>> >>>> fpu_enter() KPI recently.
>> >>>>
>> >>>>>
>> >>>>> I suspect head packages are quite likely built against a "wrong" KBI
>> >>>>> and are too fragile to use for kmods vs compiling from ports. :-/  I would
>> >>>>> try a built-from-ports kmod to see if the panics go away.
>> >>>>
>> >>>> FWIW, I will commit the following change shortly. Since third-party
>> >>>> modules break the invariant, either due to bugs (ndis wrappers) or
>> >>>> possibly due to KBI breakage, it is worthwhile to have the detection enabled
>> >>>> for production kernels.
>> >>>
>> >>> Interesting ... I tried running virtualbox on recent 10.3-STABLE with a
>> >>> GENERIC kernel and the guest seemed to operate properly.  Then I enabled
>> >>> INVARIANTS and got the panic.  I suspect that is why nobody has stumbled
>> >>> across this before.
>> >>>
>> >> This is yet another reason to promote the KASSERT to a full panic.
>> >> I expect that the vbox source lacks fpu_kern_enter() calls around the
>> >> FPU state restoration.
>> > 
>> > Unfortunately, the code is in MI source as it is unnecessary for
>> > supported OSes (read: FreeBSD is not supported) and it's not easy to
>> > inject fpu_kern_enter()/fpu_kern_leave() calls there. :-(
>> 
>> It's a headache, but our ports can use patch files for that sort of
>> thing ...
> 
> Note that it is, most likely, completely useless to wrap a single
> FXRSTOR instruction in fpu_kern_enter()/fpu_kern_leave() braces.  The purpose of
> the instruction is to load ('legacy', as they call it, no AVX+) FPU state
> into the machine context.  If you put fpu_kern_leave() right after
> the instruction, the context is flushed.

Since it looks like the code is preparing to re-enter the guest, calling
fpu_kern_leave() right after the instruction doesn't make sense.

> There must be some larger scope where the braces do make sense.  And since
> some other OSes do require similar precautions around the in-kernel FPU
> access, I suspect that there should be some common place to put our KPI
> calls.

CPUMSetGuestXcr0() is the first stack frame.  It wouldn't seem to make
sense to call fpu_kern_enter() unless ASMXRstor() is going to be called,
and the tests for that are right before the call.  However, the comments
above this function say:

 * Will load additional state if the FPU state is already loaded (in ring-0 &
 * raw-mode context).

so it does look like something wasn't done before we got to this point.
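
To make the point about bracket scope concrete, here is a rough userland model. Everything in it is a hypothetical stand-in: the model_* names are not the real fpu_kern_enter()/fpu_kern_leave() KPI or any VirtualBox function, and the FPU state is reduced to a single int. It only illustrates why bracketing the lone restore instruction gains nothing, while bracketing the whole load-and-run-guest region would.

```c
/* Userland model only; none of these are real FreeBSD or VirtualBox symbols. */
#include <assert.h>
#include <stdbool.h>

struct model_cpu {
    bool fpu_registered; /* true between model_fpu_enter() and model_fpu_leave() */
    int  fpu_regs;       /* stand-in for the live FPU register contents */
    int  saved_regs;     /* previous owner's state, saved on enter */
};

/* Stand-in for registering in-kernel FPU use: save the current state. */
static void model_fpu_enter(struct model_cpu *c)
{
    c->saved_regs = c->fpu_regs;
    c->fpu_registered = true;
}

/* Stand-in for leaving: whatever was loaded in between is flushed. */
static void model_fpu_leave(struct model_cpu *c)
{
    assert(c->fpu_registered);
    c->fpu_regs = c->saved_regs;
    c->fpu_registered = false;
}

/* Stand-in for FXRSTOR; returns false where an INVARIANTS kernel would panic. */
static bool model_fxrstor(struct model_cpu *c, int guest_state)
{
    if (!c->fpu_registered)
        return false;
    c->fpu_regs = guest_state;
    return true;
}

/* Stand-in for running the guest: report the FPU state it would observe. */
static int model_run_guest(const struct model_cpu *c)
{
    return c->fpu_regs;
}

/* Braces around the single instruction: the guest state is gone before use. */
static int narrow_bracket(struct model_cpu *c, int guest_state)
{
    model_fpu_enter(c);
    (void)model_fxrstor(c, guest_state);
    model_fpu_leave(c);
    return model_run_guest(c);
}

/* Braces around the larger scope: the guest runs with its own state loaded. */
static int wide_bracket(struct model_cpu *c, int guest_state)
{
    model_fpu_enter(c);
    (void)model_fxrstor(c, guest_state);
    int seen = model_run_guest(c);
    model_fpu_leave(c);
    return seen;
}
```

In this model, narrow_bracket() hands the guest the host's state, and wide_bracket() hands it the guest's own state while still restoring the host's afterwards. Where the equivalent wide scope sits in the real vbox code is exactly the open question.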