Re: arm64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context) [Corrected subject: arm64!]

From: Mark Millard <markmi_at_dsl-only.net>
Date: Thu, 16 Mar 2017 02:07:23 -0700
On 2017-Mar-15, at 11:07 PM, Scott Bennett <bennett at sdf.org> wrote:

> Mark Millard <markmi at dsl-only.net> wrote:
> 
>> [Something strange happened to the automatic CC: fill-in for my original
>> reply. Also, I should have mentioned that if a variant of my test
>> program is made that does not fork, the swapping works fine.]
>> 
>> On 2017-Mar-15, at 9:37 AM, Mark Millard <markmi at dsl-only.net> wrote:
>> 
>>> On 2017-Mar-15, at 6:15 AM, Scott Bennett <bennett at sdf.org> wrote:
>>> 
>>>>   On Tue, 14 Mar 2017 18:18:56 -0700 Mark Millard
>>>> <markmi at dsl-only.net> wrote:
>>>>> On 2017-Mar-14, at 4:44 PM, Bernd Walter <ticso_at_cicely7.cicely.de> wrote:
>>>>> 
>>>>>> On Tue, Mar 14, 2017 at 03:28:53PM -0700, Mark Millard wrote:
>>>>>>> [test_check() between the fork and the wait/sleep prevents the
>>>>>>> failure from occurring. Even a small access to the memory at
>>>>>>> that stage prevents the failure. Details follow.]
>>>>>> 
>>>>>> Maybe a stupid question, since you might have written it somewhere.
>>>>>> What medium do you swap to?
>>>>>> I've seen broken firmware on microSD cards doing silent data
>>>>>> corruption for some access patterns.
>>>>> 
>>>>> The root filesystem is on a USB SSD on a powered hub.
>>>>> 
>>>>> Only the kernel is from the microSD card.
>>>>> 
>>>>> I have several examples of the USB SSD model and have
>>>>> never observed such problems in any other context.
>>>>> 
>>>>> [remainder of irrelevant material deleted  --SB]
>>>> 
>>>>   You gave a very long-winded non-answer to Bernd's question, so I'll
>>>> repeat it here.  What medium do you swap to?
>>> 
>>> My wording of:
>>> 
>>> The root filesystem is on a USB SSD on a powered hub.
>>> 
>>> was definitely poor. It should have explicitly mentioned the
>>> swap partition too:
>>> 
>>> The root filesystem and swap partition are both on the same
>>> USB SSD on a powered hub.
>>> 
>>> More detail from dmesg -a for usb:
>>> 
>>> usbus0: 12Mbps Full Speed USB v1.0
>>> usbus1: 480Mbps High Speed USB v2.0
>>> usbus2: 12Mbps Full Speed USB v1.0
>>> usbus3: 480Mbps High Speed USB v2.0
>>> ugen0.1: <Generic OHCI root HUB> at usbus0
>>> uhub0: <Generic OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus0
>>> ugen1.1: <Allwinner EHCI root HUB> at usbus1
>>> uhub1: <Allwinner EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1
>>> ugen2.1: <Generic OHCI root HUB> at usbus2
>>> uhub2: <Generic OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus2
>>> ugen3.1: <Allwinner EHCI root HUB> at usbus3
>>> uhub3: <Allwinner EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus3
>>> . . .
>>> uhub0: 1 port with 1 removable, self powered
>>> uhub2: 1 port with 1 removable, self powered
>>> uhub1: 1 port with 1 removable, self powered
>>> uhub3: 1 port with 1 removable, self powered
>>> ugen3.2: <GenesysLogic USB2.0 Hub> at usbus3
>>> uhub4 on uhub3
>>> uhub4: <GenesysLogic USB2.0 Hub, class 9/0, rev 2.00/90.20, addr 2> on usbus3
>>> uhub4: MTT enabled
>>> uhub4: 4 ports with 4 removable, self powered
>>> ugen3.3: <OWC Envoy Pro mini> at usbus3
>>> umass0 on uhub4
>>> umass0: <OWC Envoy Pro mini, class 0/0, rev 2.10/1.00, addr 3> on usbus3
>>> umass0:  SCSI over Bulk-Only; quirks = 0x0100
>>> umass0:0:0: Attached to scbus0
>>> . . .
>>> da0 at umass-sim0 bus 0 scbus0 target 0 lun 0
>>> da0: <OWC Envoy Pro mini 0> Fixed Direct Access SPC-4 SCSI device
>>> da0: Serial Number <REPLACED>
>>> da0: 40.000MB/s transfers
>>> 
>>> (Edited a bit because there is other material interlaced, even
>>> internal to some lines. Also: I removed the serial number of the
>>> specific example device.)
> 
>     Thank you.  That presents a much clearer picture.
>>> 
>>>>   I will further note that any kind of USB device cannot automatically
>>>> be trusted to behave properly.  USB devices are notorious, for example,
>>>> 
>>>>  [reasons why deleted  --SB]
>>>> 
>>>>   You should identify where you page/swap to and then try substituting
>>>> a different device for that function as a test to eliminate the possibility
>>>> of a bad storage device/controller.  If the problem still occurs, that
>>>> means there still remains the possibility that another controller or its
>>>> firmware is defective instead.  It could be a kernel bug, it is true, but
>>>> making sure there is no hardware or firmware error occurring is important,
>>>> and as I say, USB devices should always be considered suspect unless and
>>>> until proven innocent.
>>> 
>>> [FYI: This is a ufs context, not a zfs one.]
> 
>     Right.  It's only a Pi, after all. :-)

It is a Pine64+ 2GB, not an rpi3.

>>> 
>>> I'm aware of such things. Nothing so far suggests that the USB
>>> devices I can replace are a problem; otherwise I'd not be going
>>> down this path. I only have access to the one arm64 device (a
>>> Pine64+ 2GB), so I've no ability to substitution-test what is on
>>> that board.
> 
>     There isn't even one open port on that hub that you could plug a
> flash drive into temporarily to be the paging device?

Why do you think that I've never tried alternative devices? It is
just that the result was no evidence that my usually-in-use SSD
has a special/local problem: the behavior continues across all
such contexts when the Pine64+ 2GB is involved. (Again, I have
not had access to an alternative to the one arm64 board, which
limits my substitution-testing possibilities.)

Why would you expect a Flash drive to be better than another SSD
for such testing? (The SSD that I usually use even happens to be
a USB 3.0 SSD, capable of USB 3.0 speeds in USB 3.0 contexts. So
is the hub that I usually use for that matter.)

> You could then
> try your tests before returning to the normal configuration.  If there
> isn't an open port, then how about plugging a second hub into one of
> the first hub's ports and moving the displaced device to the second
> hub?  A flash drive could then be plugged in.  That kind of configuration
> is obviously a bad idea for the long run, but just to try your tests it
> ought to work well enough.

I have access to more SSDs that I can use than I do to Flash drives. I
see no reason to specifically use a Flash drive.

> (BTW, if a USB storage device containing a
> paging area drops off-line even momentarily and the system needs to use
> it, that is the beginning of the end, even though it may take up to a few
> minutes for everything to lock up.

The system does not lock up, even days or weeks later, after dozens
of experiments that show memory corruption failures over those
days. The only processes showing memory corruption so far are those
that were the parent or child of a fork, were later swapped out to
have zero RES(ident memory), and were then even later swapped back
in.

The context has no such issues. You are inventing problems that do
not exist in my context. That is why none of my list submittals
mention such problems: they did not occur.

> You probably won't be able to do an
> orderly shutdown, but will instead have to crash it with the power switch.
> In the case of something like a Pi, this is an unpleasant fact of life,
> to be sure.)

Such things did not occur and have nothing to do with my context so far.

>     I think I buy your arguments, given the evidence you've collected
> thus far, including what you've added below.  I just like to eliminate
> possibilities that are much simpler to deal with before facing nastinesses
> like bugs in the VM subsystem. :-)

When I started this I found no evidence of device-specific problems.
My investigation activity goes back to long before my list submittals.

And I repeat: Other people have reported the symptoms that started
this investigation. They did so before I ever started my activities.
They were using none of the specific devices that I have access to.
Likely the types of devices were frequently even different, such as
an rpi3 instead of a Pine64+ 2GB, or a different USB drive. I was
able to reproduce the symptoms that they reported.

>>> It would be neat if some folks used my code to test other arm64
>>> contexts and reported the results. I'd be very interested.
>>> (This is easier to do on devices that do not have massive
>>> amounts of RAM, which may limit the range of devices or
>>> device configurations that are reasonable to test.)
>>> 
>>> Note that other people using other devices have reported
>>> the behavior that started this investigation. I can reproduce the
>>> behavior that they reported, although I've not seen anyone else
>>> listing specific steps that lead to the problem or ways to tell
>>> if the symptom is going to happen before it actually does. Nor
>>> have I seen any other core dump analysis. (I have bugzilla
>>> submittals 217138 and 217239 tied to symptoms others have
>>> reported as well as this test program material.)
>>> 
>>> Also, considering that for my test program I can control which pages
>>> get the zeroed-problem by read-accessing even one byte of any 4K
>>> Byte page that I want to make work normally, doing so in the child
>>> process of the fork, between the fork and the sleep/swap-out, it does
>>> not suggest USB-device-specific behavior. The read-access is changing
>>> the status of the page in some way as far as I can tell.
>>> 
>>> (Such read-accesses in the parent process make no difference to the
>>> behavior.)
>> 
>> I should have noted another comparison/contrast between
>> having memory corruption and not in my context:
>> 
>> I've tried variants of my test program that do not fork but
>> just sleep for 60s to allow me to force the swap-out. I
>> did this before adding fork and before using
>> partial_test_check, for example. I gradually added things
>> apparently involved in the reports others had made
>> until I found a combination that produced a memory
>> corruption test failure.
>> 
>> These tests, which did not involve fork, found no problems with
>> the memory content after the swap-in.
>> 
>> For my test program it appears that fork-before-swap-out
>> or the like is essential to having the problem occur.
>> 
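The prevention amounts to a read-access loop like the following in
the child, between the fork and the sleep. (This is an illustrative
sketch, not the code from the actual test program; the function name
and the explicit 4096 are just for exposition here.)

#include <stddef.h>

// Touching one byte of each 4K Byte page of the region in the
// child, between the fork and the sleep/swap-out, is what keeps
// that page from coming back zeroed after the later swap-in.
static void touch_pages(volatile const unsigned char *region, size_t size)
{
    volatile unsigned char sink = 0;
    size_t offset;

    for (offset = 0u; offset < size; offset += 4096u)
        sink += region[offset]; // one-byte read access per 4K page

    (void) sink;
}

Doing the same read accesses in the parent process instead makes no
difference to the behavior, as noted above.
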
>     A comment about terminology seems in order here.  It bothers
> me considerably to see you writing "swap out" or "swapping" where
> it seems like you mean to write "page out" or "paging".  A BSD
> system whose swapping mechanism gets activated has already waded
> very deeply into the quicksand and frequently cannot be gotten out
> in a reasonable amount of time even with manual assistance.  It is
> often quicker to crash it, reboot, and wait for the fsck(8) cleanups
> to complete.  Orderly shutdowns, even of the kind that results from
> a quick poke to the power button, typically get mired in the same
> mess that already has the system in knots.  Also, BSD systems since
> 3.0BSD, unlike older AT&T (pre-SysVR2.3) systems, do not swap in,
> just out.  A swapped out process, once the system determines that it
> has adequate resources again to attempt to run the process, will have
> the interrupted text page paged in and the rest will be paged in by
> the normal mechanism of page faults and page-in operations.  I assume
> you must already know all this, which is a large part of why it grates
> on me that you appear to be using the wrong terms.

You apparently did not read any of the material about how the test
is done or are unfamiliar with what "stress -m 1 --vm-bytes 1800M"
does when there is only 2GB of RAM. I am deliberately inducing
swapping in other processes, including the 2 from my test program
(after the fork), not just paging. (stress is a port, not part of
the base system.)

When I say swap-out and swap-in I mean it.

From the source code of my test program:

            sleep(60);

            // During this sleep, manually force this process to
            // swap out. I use something like:

            // stress -m 1 --vm-bytes 1800M

            // in another shell and ^C'ing it after top
            // shows the swapped status desired. 1800M
            // just happened to work on the Pine64+ 2GB
            // that I was using. I watch with top -PCwaopid .

That type of stress run uses about 1.8 GiBytes after a bit,
which is enough to cause the swapping of other processes,
including the two that I am testing (post-fork). (Some RAM
is in use already before the stress run, which explains not
needing 2 GiBytes to be in use by stress.)

Look at a "top -PCwaopid" display: there are columns for
RES(ident memory) and SWAP. I cause my 2 test processes to
show zero RES and everything under SWAP, starting sometime
during the 60s sleep/wait.
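
For reference, the overall shape of the test around that sleep is
roughly the following. (This is a minimal sketch to show the flow,
not the actual ~110 line program; the region size, pattern value,
and names here are illustrative only.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

#define REGION_SIZE (256u * 1024u * 1024u) // illustrative size
#define PATTERN 0xA5

int main(void)
{
    unsigned char *region = malloc(REGION_SIZE);
    size_t i, corrupted = 0;
    pid_t pid;

    if (region == NULL) { perror("malloc"); return 1; }

    memset(region, PATTERN, REGION_SIZE); // known non-zero pattern

    pid = fork();
    if (pid == -1) { perror("fork"); return 1; }

    // While both processes sleep, run in another shell something like
    //     stress -m 1 --vm-bytes 1800M
    // and ^C it once "top -PCwaopid" shows both processes with zero
    // RES and everything under SWAP; then wait for the swap-in.
    sleep(60);

    // After the swap-in, check whether any byte lost the pattern.
    // The failures show up as whole 4K Byte pages having been zeroed.
    for (i = 0; i < REGION_SIZE; i++)
        if (region[i] != PATTERN)
            corrupted++;

    printf("%s: %zu corrupted byte(s)\n",
           pid == 0 ? "child" : "parent", corrupted);

    if (pid != 0)
        wait(NULL);

    return corrupted != 0 ? 1 : 0;
}

The read-access workaround sketched earlier would go in the child
right after the fork.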

Why would I cause swapping? Because buildworld causes such
swap-outs at times when there is only 2 GBytes of RAM,
including of processes that forked earlier, and as a result
the corrupted-memory problems show up later in some processes
that were swapped out at the time. The build eventually
stops due to process failures tied to the memory corruption
in the failing processes. (At least that is what my testing
strongly suggests.)

But that is a very complicated context to use for analysis or
testing of the problem. My test program, used with stress, is
vastly simpler and quicker to set up and test. That kind of
simpler reproduction is what I was trying to find.

I want the Pine64+ 2GB to work well enough that buildworld (-j 4)
completes correctly without having to restart the build, even
when everything has to be rebuilt. So I'm trying to find and
provide enough evidence to help someone fix the problems that are
observed to block such buildworld activity.

Again: others have reported such arm64 problems on the lists
before I ever got into this activity. The evidence is that
the issues are not a local property of my environment.

Swapping is supposed to work. I can do buildworld (-j 4)
on armv6 (really -mcpu=cortex-a7 so armv7-a) and the
swapping it causes works fine. This is true for both a
bpim3 (2 GiBytes of RAM) and an rpi2 (1 GiByte of RAM,
so even more swapping). On a powerpc64 with 16 GiBytes
I've built things that caused 26 GiBytes of swap to be
in use some of the time (during 4 ld's running in
parallel), with lots of processes having zero for
RES(ident memory) and all their space listed under SWAP
in a "top -PCwaopid" display. This too has no problems
with swapping of previously forked processes (or of any
other processes).

For the likes of a Pine64+ 2GB to be "self hosted"
for source-code-based updates, swapping of previously
forked processes must work, and currently such
swapping is unreliable.

===
Mark Millard
markmi at dsl-only.net
Received on Thu Mar 16 2017 - 08:07:27 UTC
