Re: arm64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context) [Corrected subject: arm64!]

From: Mark Millard <markmi_at_dsl-only.net>
Date: Tue, 14 Mar 2017 18:18:56 -0700
On 2017-Mar-14, at 4:44 PM, Bernd Walter <ticso_at_cicely7.cicely.de> wrote:

> On Tue, Mar 14, 2017 at 03:28:53PM -0700, Mark Millard wrote:
>> [test_check() between the fork and the wait/sleep prevents the
>> failure from occurring. Even a small access to the memory at
>> that stage prevents the failure. Details follow.]
> 
> Maybe a stupid question, since you might have written it somewhere.
> What medium do you swap to?
> I've seen broken firmware on microSD cards doing silent data
> corruption for some access patterns.

The root filesystem is on a USB SSD on a powered hub.

Only the kernel is from the microSD card.

I have several examples of the USB SSD model and have
never observed such problems in any other context.


The original issue that started this investigation
has been reported by several people on the lists:

Failed assertion: "tsd_booted"

on arm64 specifically, no other contexts so far as
I know. Earlier I had discovered that:

A) I could trigger the messages by forcing a swap-in of
   instances of sh or su that had been swapped out earlier.

B) The core dumps showed that a large memory region
   containing the global tsd_booted had turned into all
   zero bytes. The assert is just exposing one of those
   zeros. (tsd_booted is from jemalloc, which is in a
   .so that is loaded.)

This prompted me to look for simpler contexts involving
swapping that also show memory corruption.

So far I've only managed to produce corrupted memory when
fork and later swapping are both involved. Being a shared
library global is not a requirement for the problem,
although such contexts can have an issue. I've not made a
simpler example of that yet, although I tried.

I have not explored vfork, rfork, or any other alternatives.

So far all failure examples end up with zeroed memory when
the memory does not match the original pattern from before
the fork. At least that is what the core dumps show for all
examples that I've looked at.
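
To illustrate the distinction drawn above between "zeroed"
memory and other kinds of mismatch, here is a minimal sketch
of a pattern check. It is not the ~110 line program from
this thread. The names pattern_byte, fill_pattern,
classify_region, and the region_state enum are hypothetical,
and the pattern itself is an assumption, chosen so that no
expected byte is ever zero and a zeroed byte cannot be
mistaken for intact data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of a checked region. */
enum region_state {
    REGION_INTACT,       /* every byte matches the pattern */
    REGION_ZEROED,       /* every mismatching byte is zero */
    REGION_OTHER_DAMAGE  /* some mismatching byte is nonzero */
};

/* Assumed pattern: values cycle through 1..255, never 0. */
unsigned char pattern_byte(size_t i)
{
    return (unsigned char)(i % 255 + 1);
}

/* Write the pattern into buf[0..n-1] (done before the fork). */
void fill_pattern(unsigned char *buf, size_t n)
{
    assert(buf != NULL);
    for (size_t i = 0; i < n; i++)
        buf[i] = pattern_byte(i);
}

/* Compare buf against the pattern and report what kind of
 * mismatch, if any, was found. */
enum region_state classify_region(const unsigned char *buf, size_t n)
{
    assert(buf != NULL);
    size_t mismatched = 0;
    size_t zero_mismatched = 0;

    for (size_t i = 0; i < n; i++) {
        if (buf[i] != pattern_byte(i)) {
            mismatched++;
            if (buf[i] == 0)
                zero_mismatched++;
        }
    }
    if (mismatched == 0)
        return REGION_INTACT;
    if (zero_mismatched == mismatched)
        return REGION_ZEROED;
    return REGION_OTHER_DAMAGE;
}
```

A caller would run fill_pattern once before the fork and
classify_region afterwards. The failures described here
would all show up as REGION_ZEROED under this scheme.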

See bugzilla 217138 and 217239. In some respects this example
is more analogous to the 217239 context as I remember.

My tests on amd64, armv6 (really -mcpu=cortex-a7 so armv7),
and powerpc64 have never produced any problems, including
never getting the failed assertion. Only arm64. (But I've
access to only one arm64 system, a Pine64+ 2GB.)

Prior to this I tracked down a different arm64 problem to
the fork_trampoline code (for the child process) modifying
a system register but in a place allowing interrupts that
could also change the value. Andrew Turner fixed that
one at the time.

For this fork/swapping kind of issue I'm not sure that
I'll be able to do more than provide the simpler
example and the steps that I used. It does not seem
likely that I will isolate the internal stage(s) and
specific problem(s) down to the code level of detail.

But whatever is found needs to explain the contrast:
an access to the memory after the fork but before the
swap prevents the failing behavior. So what I've got
so far hopefully provides some hints to someone.
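
The shape of the experiment (fill memory, fork, optionally
touch the memory, pause long enough for swapping to happen,
then check) can be sketched roughly as below. This is a
simplified illustration and not the actual test program:
expected_byte, count_bad, and fork_then_check are
hypothetical names, the use of malloc for the region is an
assumption, and forcing the processes to swap out during
the pause is left to whatever external means is used on the
test system. The touch_after_fork flag stands in for the
test_check() call mentioned in the quoted text.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumed pattern: values cycle through 1..255, never 0. */
unsigned char expected_byte(size_t i)
{
    return (unsigned char)(i % 255 + 1);
}

/* Count bytes that differ from the pattern. The volatile
 * qualifier keeps the reads from being optimized away when
 * the result is ignored. */
size_t count_bad(const volatile unsigned char *p, size_t n)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++)
        if (p[i] != expected_byte(i))
            bad++;
    return bad;
}

/* Returns 0 if parent and child both still see the pattern,
 * 1 if either saw a mismatch, -1 on a setup failure. */
int fork_then_check(size_t n, int touch_after_fork,
                    unsigned pause_seconds)
{
    assert(n > 0);
    unsigned char *region = malloc(n);
    if (region == NULL)
        return -1;
    for (size_t i = 0; i < n; i++)
        region[i] = expected_byte(i);

    pid_t pid = fork();
    if (pid < 0) {
        free(region);
        return -1;
    }

    /* Optional read access in both processes, after the fork
     * and before the pause. */
    if (touch_after_fork)
        (void)count_bad(region, n);

    /* Window during which swap-out would be forced externally. */
    sleep(pause_seconds);

    int bad = count_bad(region, n) != 0;
    if (pid == 0)
        _exit(bad ? 1 : 0);   /* child reports via exit status */

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    free(region);
    if (waited < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        bad = 1;
    return bad;
}
```

On a system without the problem both settings of
touch_after_fork should return 0. The behavior described in
this message would correspond to a nonzero result only when
touch_after_fork is 0 and swapping occurred during the pause.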

===
Mark Millard
markmi at dsl-only.net
Received on Wed Mar 15 2017 - 00:19:00 UTC
