Re: svn commit: r313268 - head/sys/kern [through -r313271 for atomic_fcmpset use and later: fails on PowerMac G5 "Quad Core"; -r313266 works]

From: Mark Millard <markmi_at_dsl-only.net> Date: Mon, 20 Feb 2017 18:36:15 -0800

On 2017-Feb-20, at 3:35 PM, Mateusz Guzik <mjguzik at gmail.com> wrote:

> On Mon, Feb 20, 2017 at 03:10:44PM -0800, Mark Millard wrote:
>> On 2017-Feb-20, at 2:58 PM, Mark Millard <markmi at dsl-only.net> wrote:
>> 
>>> On 2017-Feb-20, at 11:10 AM, Mateusz Guzik <mjguzik at gmail.com> wrote:
>>> 
>>>> On Sat, Feb 18, 2017 at 04:18:05AM -0800, Mark Millard wrote:
>>>>> [Note: I experiment with clang based powerpc64 builds,
>>>>> reporting problems that I find. Justin is familiar
>>>>> with this, as is Nathan.]
>>>>> 
>>>>> I tried to update the PowerMac G5 (a so-called "Quad Core")
>>>>> that I have access to from head -r312761 to -r313864 and
>>>>> ended up with random panics and hang ups in fairly short
>>>>> order after booting.
>>>>> 
>>>>> Some approximate bisecting for the kernel led to:
>>>>> (sometimes getting part way into a buildkernel attempt
>>>>> for a different version before a failure happens)
>>>>> 
>>>>> -r313266: works (just before use of atomic_fcmpset)
>>>>> vs.
>>>>> -r313271: fails (last of the "use atomic_fcmpset" check-ins)
>>>>> 
>>>>> (I did not try -r313268 through -r313270 as the use was
>>>>> gradually added.)
>>>>> 
>>>>> So I'm currently running a -r313864 world with a -r313266
>>>>> kernel.
>>>>> 
>>>>> No kernel that I tried that was from before -r313266 had the
>>>>> problems.
>>>>> 
>>>>> Any kernel that I tried that was from after -r313271 had the
>>>>> problems.
>>>>> 
>>>>> Of course I did not try them all in either direction. :)
>>>>> 
>>>> 
>>>> I found that spin mutexes were not properly handling this, fixed in
>>>> r313996.
>>>> 
>>>> Locally I added an if (cpu_tick() % 2) return (0); snippet to amd64
>>>> fcmpset to simulate failures. Everything works, while it would easily
>>>> fail without the patch.
>>>> 
>>>> That said, I hope this concludes the 'missing check for not-reread value
>>>> of failed fcmpset' saga.
>>>> 
>>>> -- 
>>>> Mateusz Guzik <mjguzik gmail.com>
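
A minimal userspace sketch of the failure mode discussed above, assuming
C11 <stdatomic.h> in place of the kernel's atomic(9) primitives. The names
sketch_fcmpset, sketch_lock, LOCK_UNOWNED and inject_counter are
hypothetical, the injected failure uses a counter rather than the
tick-based check quoted above, and none of this is the code from r313996.
It only illustrates why a caller has to re-examine the value handed back
by a failed fcmpset instead of assuming the lock has an owner.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LOCK_UNOWNED ((uintptr_t)0)	/* hypothetical "unlocked" value */

static unsigned inject_counter;		/* drives the simulated failures */

/*
 * Model of fcmpset: on success store `desired` and return true; on
 * failure copy the observed value into *expected and return false.
 * Every other call fails on purpose without touching the lock word,
 * so *expected can still hold the unowned value after a failure.
 */
static bool
sketch_fcmpset(_Atomic uintptr_t *dst, uintptr_t *expected, uintptr_t desired)
{
	if (inject_counter++ % 2 == 0) {
		*expected = atomic_load(dst);	/* report what is there */
		return (false);			/* simulated failure */
	}
	return (atomic_compare_exchange_strong(dst, expected, desired));
}

/*
 * Try to take the lock for `tid`. Returns the number of attempts it
 * took, or -1 if the lock turned out to be owned by someone else.
 */
static int
sketch_lock(_Atomic uintptr_t *lock, uintptr_t tid)
{
	uintptr_t v = LOCK_UNOWNED;
	int attempts = 0;

	assert(lock != NULL);
	for (;;) {
		attempts++;
		if (sketch_fcmpset(lock, &v, tid))
			return (attempts);
		if (v == LOCK_UNOWNED)
			continue;	/* failed, but nobody owns it: retry */
		return (-1);		/* owned: a real lock would spin or sleep */
	}
}
```

With an unowned lock word, the first attempt fails by injection, the
caller sees the unowned value again and retries, and the second attempt
succeeds. A caller that treated every failure as "owned" would go off to
wait for an owner that does not exist.
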
>>> 
>>> I tried to update from -r313864 to -r313999 in my amd64 context
>>> (a VirtualBox machine under macOS) but it now crashes late in
>>> the boot sequence (after it processes a dump if I make one but
>>> before I can log in).
>>> 
>>> This update was via my usual explicit svnlite update; buildworld
>>> buildkernel; etc. production style build of world and kernel,
>>> including use of MALLOC_PRODUCTION.
>>> 
>>> The window shows:
>>> 
>>> _vm_map_lock+0xf
>>> vm_map_wire+0x32
>>> rtR0MemObjNativeLockInMap+0x8c
>>> rtR0MemObjNativeLockUser+0x51
>>> RTR0MemObjLockUserTag+0x231
>>> vbglR0HGCMInternalPreprocessCall+0x65d
>>> VbglR0HGCMInternalCall+0x17c
>>> vgdrvIoCtl_HGCMCall+0x43f
>>> VGDrvCommonIoCtl+0x261
>>> vgdrvFreeBSDIOCtl+0x2cd
>>> devfs_ioctl+0xae
>>> VOP_IOCTL_APV+0x88
>>> vn_ioctl+0x161
>>> devfs_ioctl_f+0x1f
>>> kern_ioctl+0x280
>>> sys_ioctl+0x13f
>>> amd64_syscall+0x397
>>> Xfast_syscall+0xfb
>> 
>> More detail from booting with the -r313864 kernel.old
>> and using kgdb on what the dump produced:
>> 
>> # kgdb kernel.debug /var/crash/vmcore.
>> /var/crash/vmcore.0    /var/crash/vmcore.last
>> # kgdb kernel.debug /var/crash/vmcore.0
>> GNU gdb 6.1.1 [FreeBSD]
>> Copyright 2004 Free Software Foundation, Inc.
>> GDB is free software, covered by the GNU General Public License, and you are
>> welcome to change it and/or distribute copies of it under certain conditions.
>> Type "show copying" to see the conditions.
>> There is absolutely no warranty for GDB.  Type "show warranty" for details.
>> This GDB was configured as "amd64-marcel-freebsd"...
>> 
>> Unread portion of the kernel message buffer:
>> <118>Starting vboxservice.
>> <118>VBoxService 5.1.14 r112924 (verbosity: 0) freebsd.amd64 (Jan 20 2017 18:37:45) release log
>> <118>00:00:00.000120 main     Log opened 2017-02-20T22:38:46.348080000Z
>> <118>00:00:00.000162 main     OS Product: FreeBSD
>> <118>00:00:00.000171 main     OS Release: 12.0-CURRENT
>> <118>00:00:00.000180 main     OS Version: FreeBSD 12.0-CURRENT  r313999M
>> <118>00:00:00.000192 main     Executable: /usr/local/sbin/VBoxService
>> <118>00:00:00.000194 main     Process ID: 609
>> <118>00:00:00.000196 main     Package type: BSD_64BITS_GENERIC (OSE)
>> 
>> 
>> Fatal trap 12: page fault while in kernel mode
>> cpuid = 2; apic id = 02
>> fault virtual address   = 0xd6
>> fault code              = supervisor read data, page not present
>> instruction pointer     = 0x20:0xffffffff80d4ebaf
>> stack pointer           = 0x28:0xfffffe0122e2bef0
>> frame pointer           = 0x28:0xfffffe0122e2bf00
>> code segment            = base 0x0, limit 0xfffff, type 0x1b
>>                        = DPL 0, pres 1, long 1, def32 0, gran 1
>> processor eflags        = interrupt enabled, resume, IOPL = 0
>> current process         = 609 (VBoxService)
>> 
> 
> 
> 
>> #9  0xffffffff80eb6be1 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:236
>> #10 0xffffffff80d4ebaf in _vm_map_lock (map=0x1, file=0x0, line=0) at /usr/src/sys/vm/vm_map.c:501
> 
> The function is:
> void
> _vm_map_lock(vm_map_t map, const char *file, int line)
> {
> 
>        if (map->system_map)
>                mtx_lock_flags_(&map->system_mtx, 0, file, line);
>        else
>                sx_xlock_(&map->lock, file, line);
>        map->timestamp++;
> }
> 
> system_map is at offset 0xd5, thus the faulting address of 0xd6 with map
> address of 1 looks like the backtrace is correct. But this suggests the
> bug is unrelated to my changes and there is a chance there is no bug in
> the first place.
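
The address arithmetic can be checked with a small C sketch. struct
sketch_map is a hypothetical stand-in that only reproduces the 0xd5
offset quoted above; it is not the real struct vm_map layout, and
sketch_fault_address is likewise an illustrative name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: only the offset of system_map matters here. */
struct sketch_map {
	unsigned char pad[0xd5];	/* stand-in for the preceding fields */
	unsigned char system_map;	/* the field read by _vm_map_lock() */
};

static_assert(offsetof(struct sketch_map, system_map) == 0xd5,
    "sketch layout must place system_map at offset 0xd5");

/* Address touched when reading map->system_map through a given pointer. */
static uintptr_t
sketch_fault_address(uintptr_t map_addr)
{
	return (map_addr + offsetof(struct sketch_map, system_map));
}
```

Passing the bogus map pointer value 0x1 from frame #10 gives
0x1 + 0xd5 = 0xd6, which matches the reported fault virtual address.
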
> 
> Please make sure that the virtualbox module is recompiled against proper
> source tree. If the problem persists, please bisect. The range is not
> big.
> 
> Offhand I don't see what can cause the failure in question (and chances
> are there is no bug if the KBI changed and the module was not recompiled).
> 
>> #11 0xffffffff80d51ea2 in vm_map_wire (map=<value optimized out>, start=4534272, end=4538368, flags=1) at /usr/src/sys/vm/vm_map.c:2534
>> #12 0xffffffff8265291c in rtR0MemObjNativeLockInMap () from /boot/modules/vboxguest.ko
>> #13 0xffffffff82652881 in rtR0MemObjNativeLockUser () from /boot/modules/vboxguest.ko
>> #14 0xffffffff8264ec01 in RTR0MemObjLockUserTag () from /boot/modules/vboxguest.ko
>> #15 0xffffffff82624afd in vbglR0HGCMInternalPreprocessCall () from /boot/modules/vboxguest.ko
>> #16 0xffffffff8262411a in VbglR0HGCMInternalCall () from /boot/modules/vboxguest.ko
>> #17 0xffffffff8261ec4f in vgdrvIoCtl_HGCMCall () from /boot/modules/vboxguest.ko
>> #18 0xffffffff8261d221 in VGDrvCommonIoCtl () from /boot/modules/vboxguest.ko

I do not expect that the kernel binary interface deliberately changed
between -r313864 and -r313999. Until the attempted update of amd64
(which I always do first) the amd64 and arm64 were running:

. . . 12.0-CURRENT FreeBSD 12.0-CURRENT  r313864M  . . . 1200021 1200021

I've not noticed an update to 1200022 yet.

[It turned out that for powerpc64 I had to use -r313266 for the
kernel when I tried to update to -r313864. This does mix 1200020
and 1200021. But 1200021 was removal of support for things I do
not have involved, and the combination has seemed okay so far.]

I've decided to do a round of port upgrades (to -r434493),
although virtualbox client has not been updated. I'll force a
rebuild before I'm done.

It turns out that llvm39 is now required for what I choose to
have and its build ran out of RAM/swap as I had things configured.
So I've adjusted to have the VM have more RAM assigned and I'm not
starting lumina but just using the console for now. We will see.

Note: I always manually start lumina and so it was not
involved in the boot problem: it was just a basic console
style context at all times for the boot crash.

Overall it will be a while before I have a works vs. fails
pair that are significantly closer together.

-- 
Mateusz Guzik <mjguzik gmail.com>