Re: head -r338804 boots threadripper 1950X fine; head -r338810+ do not; -r338807 seems implicated

From: Mark Millard <marklmi_at_yahoo.com>
Date: Sun, 21 Oct 2018 23:24:43 -0700
On 2018-Oct-21, at 8:30 PM, Warner Losh <imp at bsdimp.com> wrote:

> On Sun, Oct 21, 2018 at 9:28 PM Warner Losh <imp at bsdimp.com> wrote:
> 
> On Sun, Oct 21, 2018 at 8:57 PM Mark Millard via freebsd-stable <freebsd-stable_at_freebsd.org> wrote:
>> [I built based on WITHOUT_ZFS= for other reasons. But,
>> after installing the build, Hyper-V based boots are
>> working.]
>> 
>> On 2018-Oct-20, at 2:09 AM, Mark Millard <marklmi at yahoo.com> wrote:
>> 
>> > On 2018-Oct-20, at 1:39 AM, Mark Millard <marklmi at yahoo.com> wrote:
>> > 
>> >> I attempted to jump from head -r334014 to -r339076
>> >> on a threadripper 1950X board and the boot fails.
>> >> This is both native booting and under Hyper-V,
>> >> same machine and root file system in both cases.
>> > 
>> > I did my investigation under Hyper-V after seeing
>> > a boot failure native.
>> > 
>> > Looks like the native failure is even earlier,
>> > before db> is even possible, possibly during
>> > early loader activity.
>> > 
>> > So this report is really for running under
>> > Hyper-V: -r338804 boots and -r338810 does
>> > not. By contrast -r334804 does not boot native.
>> > (But I've little information for that context.)
>> > 
>> > Sorry for the confusion. I rushed the report
>> > in hopes of getting to sleep. It was not to be.
>> > 
>> >> It fails just after the FreeBSD/SMP lines,
>> >> reporting "kernel trap 9 with interrupts disabled".
>> >> 
>> >> It fails in pmap_force_invalidate_cache_range at
>> >> a clflush (%rax) instruction that produces a
>> >> "Fatal trap 9: general protection fault while
>> >> in kernel mode". cpuid = 0; apic id = 00
>> >> 
>> >> I used kernel.txz files from:
>> >> 
>> >> https://artifact.ci.freebsd.org/snapshot/head/r*/amd64/amd64/
>> >> 
>> >> to narrow the range of kernel builds for working -> failing
>> >> and got:
>> >> 
>> >> -r338804 boots fine
>> >> (no amd64 kernel builds between to try)
>> >> -r338810+ fails (any that I tried, anyway)
>> >> 
>> >> In that range is -r338807 :
>> >> 
>> >> QUOTE
>> >> Author: kib
>> >> Date: Wed Sep 19 19:35:02 2018
>> >> New Revision: 338807
>> >> URL: https://svnweb.freebsd.org/changeset/base/338807
>> >> 
>> >> Log:
>> >> Convert x86 cache invalidation functions to ifuncs.
>> >> 
>> >> This simplifies the runtime logic and reduces the number of
>> >> runtime-constant branches.
>> >> 
>> >> Reviewed by: alc, markj
>> >> Sponsored by: The FreeBSD Foundation
>> >> Approved by: re (gjb)
>> >> Differential revision: https://reviews.freebsd.org/D16736
>> >> 
>> >> Modified:
>> >> head/sys/amd64/amd64/pmap.c
>> >> head/sys/amd64/include/pmap.h
>> >> head/sys/dev/drm2/drm_os_freebsd.c
>> >> head/sys/dev/drm2/i915/intel_ringbuffer.c
>> >> head/sys/i386/i386/pmap.c
>> >> head/sys/i386/i386/vm_machdep.c
>> >> head/sys/i386/include/pmap.h
>> >> head/sys/x86/iommu/intel_utils.c
>> >> END QUOTE
>> >> 
>> >> There do seem to be changes associated with
>> >> clflush(...) use. Looking at:
>> >> 
>> >> https://svnweb.freebsd.org/base/head/sys/amd64/amd64/pmap.c?annotate=339432
>> >> 
>> >> it appears that pmap_force_invalidate_cache_range has not
>> >> changed since -r338807.
>> >> 
>> >> It seems that -r338806 and -r338810 would be unlikely
>> >> contributors.
>> > 
>> 
>> I went after my native-boot loader problem first because I
>> could switch kernels via the loader for booting FreeBSD under
>> Hyper-V. Switching loaders is more of a problem.
>> 
>> In order to avoid the loader-time crash I switched to building
>> and installing based on WITHOUT_ZFS= . I've had no active use of
>> ZFS in years. (The old official-build loaders that worked were
>> non-ZFS ones.)
>> 
>> This took care of the native-boot loader-crash --and, to my
>> surprise, also the Hyper-V-boot kernel-time crash.
>> 
>> My private builds now boot the 1950X in both contexts just
>> fine.
>> 
>> During my early investigation I did pick up specific changes
>> from after -r339076 that seemed to be tied to Ryzen and such.
>> (They made no difference to the boot problems at the time
>> but I saw no reason to remove them.)
>> 
>> # uname -apKU
>> FreeBSD FBSDFSSD 12.0-ALPHA8 FreeBSD 12.0-ALPHA8 #5 r339076:339432M: Sun Oct 21 16:44:25 PDT 2018     markmi_at_FBSDFSSD:/usr/obj/amd64_clang/amd64.amd64/usr/src/amd64.amd64/sys/GENERIC-NODBG  amd64 amd64 1200084 1200084
>> 
>> (stupid gmail) 
> 
> The phrase "no active use" bothers me. What does that mean? Are there any ZFS pools, or any disks that have any whiff of a ZFSish thing on them at all? Clearly, there's something in the zfs boot loader that's freaking out over something on your system, but absent that information I can't help you.

No ZFS pools: Strictly UFS for FreeBSD file systems
for the last few years, UFS before I had access to
the 1950X system.

I've never before bothered to use WITHOUT_ZFS= in
my builds. So the system had the ZFS support,
such as kernel modules, over all the time that
this system had been in use.

Prior to the recent versions I saw no such problems.
But the default loader was not ZFS capable.


As seen in the under-Hyper-V use-context:

# gpart show -p
=>       40  937703008    da0  GPT  (447G)
         40       1024  da0p1  freebsd-boot  (512K)
       1064  746586112  da0p2  freebsd-ufs  (356G)
  746587176   31457280  da0p3  freebsd-swap  (15G)
  778044456  159383552  da0p4  freebsd-swap  (76G)
  937428008     275040         - free -  (134M)

=>       40  937703008    da1  GPT  (447G)
         40       1024  da1p1  freebsd-boot  (512K)
       1064  369098752  da1p2  freebsd-ufs  (176G)
  369099816  406846424  da1p3  freebsd-swap  (194G)
  775946240  130024488         - free -  (62G)
  905970728   31457280  da1p4  freebsd-swap  (15G)
  937428008     275040         - free -  (134M)

=>       40  419430320    da2  GPT  (200G)
         40       4056         - free -  (2.0M)
       4096  419426263  da2p1  freebsd-ufs  (200G)
  419430359          1         - free -  (512B)

=>        40  2000409184    da3  GPT  (954G)
          40        1024  da3p1  freebsd-boot  (512K)
        1064  2000408159  da3p2  freebsd-ufs  (954G)
  2000409223           1         - free -  (512B)

So no ZFS pools.
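
For anyone wanting to repeat that check without reading the listing by
eye, here is a minimal sketch. It assumes the column layout shown in
the gpart output above; the function name list_zfs_parts is my own
invention, not part of FreeBSD. It only looks at partition types, so
it cannot rule out stale ZFS labels inside a partition of some other
type.

```shell
#!/bin/sh
# Hypothetical helper (not a FreeBSD tool): reads `gpart show -p`
# output on stdin and prints "<partition> <type>" for every partition
# whose type field contains "zfs" (for example freebsd-zfs).
# Header lines ("=> ...") and "- free -" lines are skipped.
list_zfs_parts() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && $3 != "-" && $4 ~ /zfs/ {
        print $3, $4
    }'
}
```

Usage would be: gpart show -p | list_zfs_parts
Empty output means no partition is typed as ZFS. Running zpool import
with no arguments is a separate way to look for importable pools.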

The above context never had the ZFS-capable loader
problem but did have the kernel problem. I was
booting the 356G freebsd-ufs partition: the only
one that I have updated the FreeBSD version on
so far.


With FreeBSD booted natively, more drives are seen in
gpart show, some not from/for FreeBSD. But the
above drives are present and I was booting from
the same partition of the same drive: the 356G
freebsd-ufs partition. Still no ZFS pools
anywhere:

# gpart show -p
=>        34  4000797293    nvd0  GPT  (1.9T)
          34      262144  nvd0p1  ms-reserved  (128M)
      262178        2014          - free -  (1.0M)
      264192  3600451584  nvd0p2  ms-basic-data  (1.7T)
  3600715776   400081551          - free -  (191G)

=>       40  937703008    nvd1  GPT  (447G)
         40       1024  nvd1p1  freebsd-boot  (512K)
       1064  746586112  nvd1p2  freebsd-ufs  (356G)
  746587176   31457280  nvd1p3  freebsd-swap  (15G)
  778044456  159383552  nvd1p4  freebsd-swap  (76G)
  937428008     275040          - free -  (134M)

=>       40  937703008    nvd2  GPT  (447G)
         40       1024  nvd2p1  freebsd-boot  (512K)
       1064  369098752  nvd2p2  freebsd-ufs  (176G)
  369099816  406846424  nvd2p3  freebsd-swap  (194G)
  775946240  130024488          - free -  (62G)
  905970728   31457280  nvd2p4  freebsd-swap  (15G)
  937428008     275040          - free -  (134M)

=>        34  2000409197    nvd3  GPT  (954G)
          34        2014          - free -  (1.0M)
        2048     1021952  nvd3p1  ms-recovery  (499M)
     1024000      202752  nvd3p2  efi  (99M)
     1226752       32768  nvd3p3  ms-reserved  (16M)
     1259520  1859119104  nvd3p4  ms-basic-data  (886G)
  1860378624   140030607          - free -  (67G)

=>        40  2000409184    nvd4  GPT  (954G)
          40        1024  nvd4p1  freebsd-boot  (512K)
        1064  2000408159  nvd4p2  freebsd-ufs  (954G)
  2000409223           1          - free -  (512B)

=>        63  2000409201    ada0  MBR  (954G)
          63        1985          - free -  (993K)
        2048        4096  ada0s1  linux-data  (2.0M)
        6144     2093056          - free -  (1.0G)
     2099200  1998309376  ada0s2  linux-lvm  (953G)
  2000408576         688          - free -  (344K)

=>        34  2000409197    ada1  GPT  (954G)
          34      262144  ada1p1  ms-reserved  (128M)
      262178  2000147053          - free -  (954G)

=>        34  2000409197    ada2  GPT  (954G)
          34      262144  ada2p1  ms-reserved  (128M)
      262178  2000147053          - free -  (954G)

=>        34  1953497022    da0  GPT  (932G)
          34      262144  da0p1  ms-reserved  (128M)
      262178        2014         - free -  (1.0M)
      264192  1953230848  da0p2  ms-basic-data  (931G)
  1953495040        2016         - free -  (1.0M)

=>       1  60062499    da1  MBR  (29G)
         1        31         - free -  (16K)
        32  60062468  da1s1  fat32lba  (29G)

The 356G freebsd-ufs partition is the only one
of the freebsd-ufs partitions updated so far.

This is the context that had the problem with
the ZFS-capable loaders --but no later kernel
problem when a not-ZFS-capable loader was used
(via copying over an older one --until I did the
WITHOUT_ZFS= build/install).

As for the ZFS-capable loader: Maybe it has
problems when it sees one or more of:

ms-reserved (on GPT)
ms-basic-data (on GPT) (NTFS file system)
ms-recovery (on GPT)
efi (on GPT)
linux-data (on MBR)
linux-lvm (on MBR)
fat32lba (on MBR)

(given that none of these is available in
the Hyper-V context as the virtual machine
has been configured).

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)
Received on Mon Oct 22 2018 - 04:24:48 UTC