Re: head -r338804 boots threadripper 1950X fine; head -r338810+ do not; -r338807 seems implicated

From: Toomas Soome <tsoome_at_me.com>
Date: Mon, 22 Oct 2018 12:27:40 +0300
> On 22 Oct 2018, at 06:30, Warner Losh <imp_at_bsdimp.com> wrote:
> 
> On Sun, Oct 21, 2018 at 9:28 PM Warner Losh <imp_at_bsdimp.com> wrote:
> 
>> 
>> 
>> On Sun, Oct 21, 2018 at 8:57 PM Mark Millard via freebsd-stable <
>> freebsd-stable_at_freebsd.org> wrote:
>> 
>>> [I built based on WITHOUT_ZFS= for other reasons. But,
>>> after installing the build, Hyper-V based boots are
>>> working.]
>>> 
>>> On 2018-Oct-20, at 2:09 AM, Mark Millard <marklmi at yahoo.com> wrote:
>>> 
>>>> On 2018-Oct-20, at 1:39 AM, Mark Millard <marklmi at yahoo.com> wrote:
>>>> 
>>>>> I attempted to jump from head -r334014 to -r339076
>>>>> on a threadripper 1950X board and the boot fails.
>>>>> This is both native booting and under Hyper-V,
>>>>> same machine and root file system in both cases.
>>>> 
>>>> I did my investigation under Hyper-V after seeing
>>>> a boot failure native.
>>>> 
>>>> Looks like the native failure is even earlier,
>>>> before db> is even possible, possibly during
>>>> early loader activity.
>>>> 
>>>> So this report is really for running under
>>>> Hyper-V: -r338804 boots and -r338810 does
>>>> not. By contrast -r334804 does not boot native.
>>>> (But I've little information for that context.)
>>>> 
>>>> Sorry for the confusion. I rushed the report
>>>> in hopes of getting to sleep. It was not to be.
>>>> 
>>>>> It fails just after the FreeBSD/SMP lines,
>>>>> reporting "kernel trap 9 with interrupts disabled".
>>>>> 
>>>>> It fails in pmap_force_invalidate_cache_range at
>>>>> a clflush (%rax) instruction that produces a
>>>>> "Fatal trap 9: general protection fault while
>>>>> in kernel mode". cpuid = 0; apic id = 00
>>>>> 
>>>>> I used kernel.txz files from:
>>>>> 
>>>>> https://artifact.ci.freebsd.org/snapshot/head/r*/amd64/amd64/
>>>>> 
>>>>> to narrow the range of kernel builds for working -> failing
>>>>> and got:
>>>>> 
>>>>> -r338804 boots fine
>>>>> (no amd64 kernel builds between to try)
>>>>> -r338810+ fails (any that I tried, anyway)
>>>>> 
>>>>> In that range is -r338807 :
>>>>> 
>>>>> QUOTE
>>>>> Author: kib
>>>>> Date: Wed Sep 19 19:35:02 2018
>>>>> New Revision: 338807
>>>>> URL:
>>>>> https://svnweb.freebsd.org/changeset/base/338807
>>>>> 
>>>>> 
>>>>> Log:
>>>>> Convert x86 cache invalidation functions to ifuncs.
>>>>> 
>>>>> This simplifies the runtime logic and reduces the number of
>>>>> runtime-constant branches.
>>>>> 
>>>>> Reviewed by: alc, markj
>>>>> Sponsored by:        The FreeBSD Foundation
>>>>> Approved by: re (gjb)
>>>>> Differential revision:
>>>>> https://reviews.freebsd.org/D16736
>>>>> 
>>>>> Modified:
>>>>> head/sys/amd64/amd64/pmap.c
>>>>> head/sys/amd64/include/pmap.h
>>>>> head/sys/dev/drm2/drm_os_freebsd.c
>>>>> head/sys/dev/drm2/i915/intel_ringbuffer.c
>>>>> head/sys/i386/i386/pmap.c
>>>>> head/sys/i386/i386/vm_machdep.c
>>>>> head/sys/i386/include/pmap.h
>>>>> head/sys/x86/iommu/intel_utils.c
>>>>> END QUOTE
>>>>> 
>>>>> There do seem to be changes associated with
>>>>> clflush(...) use. Looking at:
>>>>> 
>>>>> 
>>>>> https://svnweb.freebsd.org/base/head/sys/amd64/amd64/pmap.c?annotate=339432
>>>>> 
>>>>> it appears that pmap_force_invalidate_cache_range has not
>>>>> changed since -r338807.
>>>>> 
>>>>> It seems that -r338806 and -r338810 would be unlikely
>>>>> contributors.
>>>> 
>>> 
>>> I went after my native-boot loader problem first because I
>>> could switch kernels via the loader for booting FreeBSD under
>>> Hyper-V. Switching loaders is more of a problem.
>>> 
>>> In order to avoid the loader-time crash I switched to building
>>> and installing based on WITHOUT_ZFS= . I've had no active use of
>>> ZFS in years. (The old official-build loaders that worked were
>>> non-ZFS ones.)
>>> 
>>> This took care of the native-boot loader-crash --and, to my
>>> surprise, also the Hyper-V-boot kernel-time crash.
>>> 
>>> My private builds now boot the 1950X in both contexts just
>>> fine.
>>> 
>>> During my early investigation I did pick up specific changes
>>> from after -r339076 that seemed to be tied to Ryzen and such.
>>> (They made no difference to the boot problems at the time
>>> but I saw no reason to remove them.)
>>> 
>>> # uname -apKU
>>> FreeBSD FBSDFSSD 12.0-ALPHA8 FreeBSD 12.0-ALPHA8 #5 r339076:339432M: Sun
>>> Oct 21 16:44:25 PDT 2018     markmi_at_FBSDFSSD:/usr/obj/amd64_clang/amd64.amd64/usr/src/amd64.amd64/sys/GENERIC-NODBG
>>> amd64 amd64 1200084 1200084
>> 
>> 
> (stupid gmail)
> 
> The phrase "no active use" bothers me. What does that mean? Are there any
> ZFS pools or any disks that have any whiff of a ZFSish thing on them at all?
> Clearly, there's something in the zfs boot loader that's freaking out over
> something on your system, but absent that information I can't help you.
> 

It would help to see the output of the loader's lsdev -v command. Also, if you can, test the boot loader under UEFI: for example, get to the loader prompt via a USB or CD boot and capture the same lsdev -v output there. I am interested in the sector size information and in whether the UEFI loader has the same issue. If it does, I’d like to see the output of these commands:

zpool status
zpool import
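
If it is easier to attach files than to paste terminal output, something like the following might do. It is only a rough sketch: it assumes a root shell on the running system, and the output directory and file names are arbitrary examples, not anything the tools require.

```shell
#!/bin/sh
# Sketch: save the output of the two zpool commands to files for a reply.
# The directory below is an arbitrary example; set OUTDIR to change it.
outdir="${OUTDIR:-/tmp/zpool-report}"
mkdir -p "$outdir"

for sub in status import; do
    # Capture stderr too, so any error message ends up in the file.
    # "|| true" keeps the loop going if a command fails or is missing.
    zpool "$sub" > "$outdir/zpool-$sub.txt" 2>&1 || true
done

ls -l "$outdir"
```

The lsdev -v output has to be captured separately, since it comes from the loader prompt rather than from a running system.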

thanks,
toomas
Received on Mon Oct 22 2018 - 07:28:24 UTC
