Re: Ryzen public erratas

From: Mike Tancsa <mike_at_sentex.net>
Date: Wed, 13 Jun 2018 16:41:02 -0400
On 6/13/2018 6:35 AM, Konstantin Belousov wrote:
> Today I noted that AMD published the public errata document for Ryzens,
> https://developer.amd.com/wp-content/resources/55449_1.12.pdf
> 
> Some of the issues listed there look quite relevant to the potential
> hangs that some people still experience with the machines.  I wrote
> a script which should apply the recommended workarounds to the erratas
> that I find interesting.
> 
> To run it, kldload cpuctl, then apply the latest firmware update to your
> CPU, then run the following shell script.  Comments indicate the errata
> number for the workarounds.

Hi,
	
tl;dr:  The microcode changes seem to fix a hard lockup I was able to
reliably reproduce back in Feb.



The BIOS on my AMD box is pretty up to date. I think it has the same
microcode as what's in ports.  x86info -a shows

root_at_ryzenbsd11:/home/mdtancsa # x86info -a | grep -i microc
Microcode patch level: 0x8001137
root_at_ryzenbsd11:/home/mdtancsa #

Running the microcode update does not change the reported patch level:


root_at_ryzenbsd11:/home/mdtancsa # /usr/local/etc/rc.d/microcode_update onestart
Updating CPU Microcode...
Done.
root_at_ryzenbsd11:/home/mdtancsa # x86info -a | grep -i microc
Microcode patch level: 0x8001137
root_at_ryzenbsd11:/home/mdtancsa #

However, dmesg after the microcode update has this additional line:

 AMD Extended Feature Extensions ID EBX=0x1007<CLZERO,IRPerf,XSaveErPtr>




CPU: AMD Ryzen 5 1600X Six-Core Processor            (3593.36-MHz K8-class CPU)
  Origin="AuthenticAMD"  Id=0x800f11  Family=0x17  Model=0x1  Stepping=1
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x7ed8320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
  AMD Features2=0x35c233ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,SKINIT,WDT,TCE,Topology,PCXC,PNXC,DBE,PL2I,MWAITX>
  Structured Extended Features=0x209c01a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SHA>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  SVM: NP,NRIP,VClean,AFlush,DAssist,NAsids=32768
  TSC: P-state invariant, performance statistics
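
The before/after patch-level check above could be scripted roughly as
follows. This is only a sketch: patch_level and compare_levels are
hypothetical helper names, and it assumes x86info prints a line of the
form "Microcode patch level: 0x..." as it does on this box.

```shell
#!/bin/sh
# Hypothetical helper: read x86info-style output on stdin and print the
# first microcode patch level found.
patch_level() {
    sed -n 's/^Microcode patch level: *//p' | head -n 1
}

# Hypothetical helper: say whether two patch levels differ.
compare_levels() {
    if [ "$1" = "$2" ]; then
        echo "unchanged: $1"
    else
        echo "changed: $1 -> $2"
    fi
}

# Intended use (as root), left commented out so nothing runs on its own:
#   before=$(x86info -a | patch_level)
#   /usr/local/etc/rc.d/microcode_update onestart
#   after=$(x86info -a | patch_level)
#   compare_levels "$before" "$after"
```

An unchanged level, as seen here, would suggest the BIOS had already
loaded the same microcode revision.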

I ran the script

root_at_ryzenbsd11:/home/mdtancsa # cat fix.sh
#!/bin/sh

# Enable workarounds for erratas listed in
# https://developer.amd.com/wp-content/resources/55449_1.12.pdf

# 1057, 1109
sysctl machdep.idle_mwait=0
sysctl machdep.idle=hlt

for x in /dev/cpuctl*; do
        # 1021
        cpucontrol -m '0xc0011029|=0x2000' $x
        # 1033
        cpucontrol -m '0xc0011020|=0x10' $x
        # 1049
        cpucontrol -m '0xc0011028|=0x10' $x
        # 1095
        cpucontrol -m '0xc0011020|=0x200000000000000' $x
        echo $x
done
root_at_ryzenbsd11:/home/mdtancsa # sh ./fix.sh
machdep.idle_mwait: 1 -> 0
machdep.idle: acpi -> hlt
/dev/cpuctl0
/dev/cpuctl1
/dev/cpuctl10
/dev/cpuctl11
/dev/cpuctl2
/dev/cpuctl3
/dev/cpuctl4
/dev/cpuctl5
/dev/cpuctl6
/dev/cpuctl7
/dev/cpuctl8
/dev/cpuctl9
root_at_ryzenbsd11:/home/mdtancsa #
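
To confirm the writes took effect, each MSR could be read back and the
bit tested. A minimal sketch, not something from the original script:
the helper names are hypothetical, and it assumes "cpucontrol -m MSR
DEVICE" prints one line like "MSR 0xc0011029: 0x00000000 0x00002000"
(high 32 bits, then low 32 bits) and that sh does 64-bit arithmetic.

```shell
#!/bin/sh
# Succeed if every bit of mask $2 is set in value $1 (hex or decimal).
mask_is_set() {
    [ $(( $1 & $2 )) -eq $(( $2 )) ]
}

# Turn one line of cpucontrol read output into a single 64-bit value.
# ASSUMPTION: fields are "MSR <msr>: <high32> <low32>".
msr_value() {
    set -- $1                       # split the line into words
    echo $(( ($3 << 32) | $4 ))
}

# Read one MSR from one cpuctl device and test the mask.
# usage: check_msr DEVICE MSR MASK   (needs root and cpuctl loaded)
check_msr() {
    line=$(cpucontrol -m "$2" "$1") || return 2
    mask_is_set "$(msr_value "$line")" "$3"
}

# Intended use, left commented out so nothing touches the hardware here:
#   for x in /dev/cpuctl*; do
#       check_msr "$x" 0xc0011029 0x2000            || echo "$x: 1021 not set"
#       check_msr "$x" 0xc0011020 0x10              || echo "$x: 1033 not set"
#       check_msr "$x" 0xc0011028 0x10              || echo "$x: 1049 not set"
#       check_msr "$x" 0xc0011020 0x200000000000000 || echo "$x: 1095 not set"
#   done
```

The masks are the same ones the script ORs in, so a silent run would
mean all four bits are set on every core.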

Using a FreeBSD stable from back in Feb, I was able to crash Ryzen- and
Epyc-based systems
(https://lists.freebsd.org/pipermail/freebsd-stable/2018-February/088439.html)
by generating a lot of traffic between the hypervisor and guests.  The
same tests on an Intel-based box ran just fine.

e.g. start 3 guests in bhyve (amd64) and run combos of iperf3 between
them.  It would not take long before the box hard locked, i.e.
blank screen, no crash dump etc.
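
A rough sketch of how such iperf3 combos might be driven from the host.
This is not the exact test that was run: the function names and guest
names are hypothetical, and it assumes each guest is reachable over ssh
and already has "iperf3 -s" listening.

```shell
#!/bin/sh
# Print every ordered pair "client server" of distinct guests.
guest_pairs() {
    for src in "$@"; do
        for dst in "$@"; do
            [ "$src" = "$dst" ] || echo "$src $dst"
        done
    done
}

# Run one iperf3 client per pair, one after another.
# ASSUMPTIONS: ssh access to each guest; "iperf3 -s" running on each.
# usage: run_load SECONDS GUEST...
run_load() {
    secs=$1
    shift
    guest_pairs "$@" | while read -r src dst; do
        # -n keeps ssh from swallowing the remaining pair list on stdin
        ssh -n "$src" iperf3 -c "$dst" -t "$secs"
    done
}

# Intended use, with made-up guest names:
#   run_load 600 guest1 guest2 guest3
```

Running the pairs sequentially keeps each iperf3 server to one client at
a time; a heavier load would need extra server ports per guest.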

With the latest microcode update, I have been running the same sort of
tests and so far so good. I will let them run overnight to see if things
are now stable on STABLE.

	---Mike






> 
> Please report the results.  If the script helps, I will code the kernel
> change to apply the workarounds.
> 
> #!/bin/sh
> 
> # Enable workarounds for erratas listed in
> # https://developer.amd.com/wp-content/resources/55449_1.12.pdf
> 
> # 1057, 1109
> sysctl machdep.idle_mwait=0
> sysctl machdep.idle=hlt
> 
> for x in /dev/cpuctl*; do
> 	# 1021
> 	cpucontrol -m '0xc0011029|=0x2000' $x
> 	# 1033
> 	cpucontrol -m '0xc0011020|=0x10' $x
> 	# 1049
> 	cpucontrol -m '0xc0011028|=0x10' $x
> 	# 1095
> 	cpucontrol -m '0xc0011020|=0x200000000000000' $x
> done
> 


-- 
-------------------
Mike Tancsa, tel +1 519 651 3400 x203
Sentex Communications, mike_at_sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada
Received on Wed Jun 13 2018 - 18:41:06 UTC
