Re: svn commit: r313268 - head/sys/kern [through -r313271 for atomic_fcmpset use and later: fails on PowerMac G5 "Quad Core"; -r313266 works]

From: Mark Millard <markmi_at_dsl-only.net>
Date: Sat, 25 Feb 2017 09:58:39 -0800
On 2017-Feb-25, at 5:49 AM, Mark Millard <markmi_at_dsl-only.net> wrote:

> On 2017-Feb-25, at 1:05 AM, Mark Millard <markmi_at_dsl-only.net> wrote:
> 
>> On 2017-Feb-24, at 11:46 PM, Mark Millard <markmi at dsl-only.net> wrote:
>> 
>>> On 2017-Feb-24, at 8:25 PM, Mark Millard <markmi at dsl-only.net> wrote:
>>> 
>>>> On 2017-Feb-24, at 4:23 PM, Mateusz Guzik <mjguzik at gmail.com> wrote:
>>>>> 
>>>>> On Tue, Feb 21, 2017 at 01:37:25AM -0800, Mark Millard wrote:
>>>>>> [Back to the powerpc64 context.]
>>>>>> 
>>>>>> On 2017-Feb-20, at 11:10 AM, Mateusz Guzik <mjguzik at gmail.com> wrote:
>>>>>> 
>>>>>>> On Sat, Feb 18, 2017 at 04:18:05AM -0800, Mark Millard wrote:
>>>>>>>> [Note: I experiment with clang based powerpc64 builds,
>>>>>>>> reporting problems that I find. Justin is familiar
>>>>>>>> with this, as is Nathan.]
>>>>>>>> 
>>>>>>>> I tried to update the PowerMac G5 (a so-called "Quad Core")
>>>>>>>> that I have access to from head -r312761 to -r313864 and
>>>>>>>> ended up with random panics and hang ups in fairly short
>>>>>>>> order after booting.
>>>>>>>> 
>>>>>>>> Some approximate bisecting for the kernel led to:
>>>>>>>> (sometimes getting part way into a buildkernel attempt
>>>>>>>> for a different version before a failure happens)
>>>>>>>> 
>>>>>>>> -r313266: works (just before use of atomic_fcmpset)
>>>>>>>> vs.
>>>>>>>> -r313271: fails (last of the "use atomic_fcmpset" check-ins)
>>>>>>>> 
>>>>>>>> (I did not try -r313268 through -r313270 as the use was
>>>>>>>> gradually added.)
>>>>>>>> 
>>>>>>>> So I'm currently running a -r313864 world with a -r313266
>>>>>>>> kernel.
>>>>>>>> 
>>>>>>>> No kernel that I tried that was from before -r313266 had the
>>>>>>>> problems.
>>>>>>>> 
>>>>>>>> Any kernel that I tried that was from after -r313271 had the
>>>>>>>> problems.
>>>>>>>> 
>>>>>>>> Of course I did not try them all in either direction. :)
>>>>>>>> 
>>>>>>> 
>>>>>>> I found that spin mutexes were not properly handling this, fixed in
>>>>>>> r313996.
>>>>>>> 
>>>>>>> Locally I added an if (cpu_tick() % 2) return (0); snippet to amd64
>>>>>>> fcmpset to simulate failures. Everything works, while it would easily
>>>>>>> fail without the patch.
>>>>>>> 
>>>>>>> That said, I hope this concludes the 'missing check for not-reread value
>>>>>>> of failed fcmpset' saga.
>>>>>>> 
>>>>>>> -- 
>>>>>>> Mateusz Guzik <mjguzik gmail.com>
>>>>>> 
>>>>>> -r313999 is an improvement for powerpc64: it boots and I can
>>>>>> log in on the old PowerMac G5 so-called "Quad Core".
>>>>>> 
>>>>>> But, e.g., buildworld buildkernel eventually hangs and later
>>>>>> the powerpc64 panics for "spin lock held too long".
>>>>>> 
>>>>> 
>>>>> All right, play time is over.
>>>>> 
>>>>> Can you please:
>>>>> 1. verify r313254 is stable for you
>>>>> 2. apply https://people.freebsd.org/~mjg/patches/complete-locks.diff and
>>>>> https://people.freebsd.org/~mjg/.junk/ppc.diff on top of it and retry
>>>>> the test?
>>>>> 
>>>>> This is a workaround which effectively disables the powerpc-specific
>>>>> primitive and makes it use a cmpset wrapper instead. I don't have the
>>>>> hardware to test right now and my attempts to boot in qemu also failed.
>>>>> 
>>>>> That said, it does not look like there are general fcmpset bugs left and
>>>>> the remaining issue seems powerpc-specific.
>>>>> 
>>>>> If this works, I'll commit the workaround for the time being as in a few
>>>>> weeks I'd like to start merging the work back to stable/11.
>>>>> 
>>>>> -- 
>>>>> Mateusz Guzik <mjguzik gmail.com>
>>>> 
>>>> I've started a self-hosted powerpc64 -r313254 build
>>>> based on running the -r313266 kernel. (The context I
>>>> sometimes do cross builds in is tied up with other
>>>> things. -r313266 is what my prior bisection came up
>>>> with as the last apparently-working kernel at the
>>>> time.)
>>>> 
>>>> So it will be a while before I have a -r313254 in
>>>> place to try: the self-hosted build takes longer
>>>> and so will not be installed for a while.
>>>> 
>>>> To judge stability I'll probably have -r313254 build
>>>> the patched update that you want me to test, initially
>>>> doing a cleanworld. So that too will take a while.
>>>> 
>>>> (The above wording presumes all goes well.)
>>>> 
>>>> I'll let you know as I go along if I run into anything
>>>> interesting.
>>>> 
>>>> 
>>>> My builds are rebuilding both world and kernel since
>>>> what turns into /usr/include/sys/* has changes in your
>>>> patch.
>>>> 
>>>> The builds are without MALLOC_PRODUCTION but are
>>>> otherwise not debug builds.
>>>> 
>>>> 
>>>> I've not seen anything indicating that anyone has
>>>> been trying TARGET_ARCH=powerpc. I've been trying
>>>> TARGET_ARCH=powerpc64 .
>>>> 
>>>> While I do not have access to a true
>>>> TARGET_ARCH=powerpc machine currently, such a build
>>>> can be used on a PowerMac G5 so-called "Quad Core".
>>>> So I could eventually build and try such on the one
>>>> powerpc family machine that I currently have access
>>>> to.
>>>> 
>>>> clang 3.9.1 has a significant code generation problem
>>>> for TARGET_ARCH=powerpc and so I'd have to use
>>>> a gcc 4.2.1 based build for that sort of experiment.
>>>> (There is no xtoolchain for 32-bit powerpc.)
>>>> 
>>>> I use clang 3.9.1 or xtoolchain for
>>>> TARGET_ARCH=powerpc64 and have been using clang 3.9.1
>>>> in recent times. My primary powerpc family use has
>>>> been to experiment with building based on the
>>>> modern libc++ and reporting issues discovered in the
>>>> attempts. This explains the clang/xtoolchain context.
>>>> 
>>>> clang 3.9.1 has major problems for C++ exception
>>>> handling for both powerpc64 and powerpc but a
>>>> lot of FreeBSD is independent of throwing C++
>>>> exceptions. By contrast, xtoolchain-based builds work
>>>> for C++ exception handling but lib32 fails
>>>> to operate when built by an xtoolchain build.
>>> 
>>> -r313254 had no trouble booting or building
>>> the patched version or anything else involved
>>> in getting there or installing.
>>> 
>>> But the patched version failed quickly just
>>> attempting cleanworld's recursive remove. (So
>>> it did boot and let me log in.) The panic
>>> description was:
>>> 
>>> panic: vn_finished_secondary_write: neg cnt
>>> 
>>> 
>>> The sources that are different from svn's -r313254
>>> are (some tied to arm64 experiments, most everything
>>> else tied to powerpc64 and/or powerpc, those not
>>> from your patches are long standing from my
>>> investigations or from Justin H.):
>>> 
>>> # svnlite status /usr/src | sort
>>> . . . (ignoring the ? lines) . . .
>>> M       /usr/src/bin/sh/jobs.c
>>> M       /usr/src/bin/sh/miscbltin.c
>>> M       /usr/src/contrib/llvm/lib/Target/PowerPC/PPCInstrInfo.td
>>> M       /usr/src/contrib/llvm/tools/lld/ELF/Target.cpp
>>> M       /usr/src/lib/csu/powerpc64/Makefile
>>> M       /usr/src/libexec/rtld-elf/Makefile
>>> M       /usr/src/sys/arm/arm/gic.c
>>> M       /usr/src/sys/boot/ofw/Makefile.inc
>>> M       /usr/src/sys/boot/powerpc/Makefile.inc
>>> M       /usr/src/sys/boot/powerpc/kboot/Makefile
>>> M       /usr/src/sys/boot/uboot/Makefile.inc
>>> M       /usr/src/sys/conf/kmod.mk
>>> M       /usr/src/sys/ddb/db_main.c
>>> M       /usr/src/sys/ddb/db_script.c
>>> M       /usr/src/sys/kern/init_main.c
>>> M       /usr/src/sys/kern/kern_condvar.c
>>> M       /usr/src/sys/kern/kern_lock.c
>>> M       /usr/src/sys/kern/kern_lockstat.c
>>> M       /usr/src/sys/kern/kern_mutex.c
>>> M       /usr/src/sys/kern/kern_rwlock.c
>>> M       /usr/src/sys/kern/kern_sx.c
>>> M       /usr/src/sys/kern/kern_synch.c
>>> M       /usr/src/sys/kern/kern_thread.c
>>> M       /usr/src/sys/kern/subr_lock.c
>>> M       /usr/src/sys/kern/vfs_default.c
>>> M       /usr/src/sys/kern/vfs_subr.c
>>> M       /usr/src/sys/powerpc/include/atomic.h
>>> M       /usr/src/sys/powerpc/ofw/ofw_machdep.c
>>> M       /usr/src/sys/sys/lock.h
>>> M       /usr/src/sys/sys/lockmgr.h
>>> M       /usr/src/sys/sys/lockstat.h
>>> M       /usr/src/sys/sys/mutex.h
>>> M       /usr/src/sys/sys/rwlock.h
>>> M       /usr/src/sys/sys/sdt.h
>>> M       /usr/src/sys/sys/sx.h
>>> M       /usr/src/sys/sys/systm.h
>> 
>> To recover from the problem and again have a buildworld
>> buildkernel present I've booted based on:
>> 
>> A) The -r313254 kernel without your patches (kernel.old).
>> B) The -r313254 world (which had your patches in its
>>  build).
>> 
>> I've reverted /usr/src/ to not have your patches
>> (but it still has my earlier ones from prior activity).
>> 
>> I repeated the cleanworld to let it finish after its
>> prior failure (it had failed during SSD trim activity).
>> 
>> I've started buildworld buildkernel (with -j 4 as is
>> normal for my context).
>> 
>> So far this combination seems to be working fine. This
>> suggests that the sys/sys/*.h files that ended up in
>> /usr/include/sys/ and the sys/powerpc/include/atomic.h
>> that ended up in /usr/include/machine/ were not problems
>> as used in the world code --since those uses are still in
>> place in the binaries being used. Only the kernel
>> binaries seem to be a problem (not necessarily all of
>> them).
> 
> Unfortunately it eventually got a panic for a Data Storage
> Interrupt.
> 
> I may be unable to do a self-hosted build to get things
> back to normal. 

I tried simply starting another buildworld buildkernel after
booting and it did complete. Installing and rebooting worked
fine.

So apparently whatever was going on for the Data Storage
Interrupt is fairly rare.

Thus the PowerMac G5 so-called "Quad Core" is back to
-r313254 without your patches. (The "Quad Core" really has
two processors, each with 2 cores.)
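
For anyone following the quoted fcmpset discussion: below is a minimal
sketch of the kind of acquire loop involved, assuming C11 atomics as a
stand-in for the kernel primitives. All sketch_* names are hypothetical
and this is not the FreeBSD code. The fault injection is modeled on the
quoted "fail every other call" test: the failed call returns 0 without
refreshing the expected value, so the caller has to look at the value
again instead of assuming that somebody else owns the lock.

```c
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_UNOWNED ((uintptr_t)0)

/* Call counter used only for the hypothetical fault injection below. */
static unsigned sketch_calls;

/*
 * Hypothetical fcmpset-style primitive: returns nonzero on success.
 * A real failure stores the observed value into *expected; an injected
 * (spurious) failure leaves *expected untouched.
 */
static int
sketch_fcmpset(_Atomic uintptr_t *p, uintptr_t *expected, uintptr_t desired)
{
	/* Injected failure on every odd-numbered call. */
	if (++sketch_calls % 2 != 0)
		return (0);
	return (atomic_compare_exchange_strong(p, expected, desired));
}

/*
 * Acquire loop that tolerates spurious failure: after a failed attempt
 * it re-checks the value it holds rather than assuming the lock is
 * owned. Returns the number of fcmpset attempts that were needed.
 */
static unsigned
sketch_lock(_Atomic uintptr_t *lock, uintptr_t tid)
{
	uintptr_t v = SKETCH_UNOWNED;
	unsigned attempts = 0;

	for (;;) {
		if (v == SKETCH_UNOWNED) {
			attempts++;
			if (sketch_fcmpset(lock, &v, tid))
				return (attempts);
			/* v may still be SKETCH_UNOWNED here: retry. */
			continue;
		}
		/* Owned by someone else: reload and look again. */
		v = atomic_load(lock);
	}
}
```

Calling sketch_lock() on an unowned lock from a single thread takes two
attempts with this injection: the first fails spuriously, the second
succeeds. If the lock is already owned, the loop spins until another
thread releases it.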

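The quoted workaround is described as making fcmpset "use a cmpset
wrapper instead". I am not reproducing the actual patch here. The
following is only a rough sketch of what such a wrapper could look like,
again assuming C11 atomics and hypothetical sketch_* names.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical cmpset-style primitive: returns nonzero on success and
 * does not tell the caller which value it actually found.
 */
static int
sketch_cmpset(_Atomic uint64_t *p, uint64_t expected, uint64_t desired)
{
	return (atomic_compare_exchange_strong(p, &expected, desired));
}

/*
 * fcmpset-style wrapper built on top of cmpset: on failure it re-reads
 * the target and reports that value through *expected.
 */
static int
sketch_fcmpset_via_cmpset(_Atomic uint64_t *p, uint64_t *expected,
    uint64_t desired)
{
	if (sketch_cmpset(p, *expected, desired))
		return (1);
	/* Separate load: the value may have changed again by now. */
	*expected = atomic_load(p);
	return (0);
}
```

Because the reload is a separate step, the value handed back can even
equal the original expectation, so callers still need to re-check it
before deciding what to do next.
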
===
Mark Millard
markmi at dsl-only.net
Received on Sat Feb 25 2017 - 16:58:43 UTC