On 2017-Mar-14, at 4:44 PM, Bernd Walter <ticso_at_cicely7.cicely.de> wrote: > On Tue, Mar 14, 2017 at 03:28:53PM -0700, Mark Millard wrote: >> [test_check() between the fork and the wait/sleep prevents the >> failure from occurring. Even a small access to the memory at >> that stage prevents the failure. Details follow.] > > Maybe a stupid question, since you might have written it somewhere. > What medium do you swap to? > I've seen broken firmware on microSD cards doing silent data > corruption for some access patterns. The root filesystem is on a USB SSD on a powered hub. Only the kernel is from the microSD card. I have several examples of the USB SSD model and have never observed such problems in any other context. The original issue that started this investigation has been reported by several people on the lists: Failed assertion: "tsd_booted" on arm64 specifically, no other contexts so far as I know. Earlier I had discovered that: A) I could use a swap-in to cause the messages from instances of sh or su that had swapped out earlier. B) The core dumps showed that a large memory region containing the global tsd_booted had all turned to be zero bytes. The assert is just exposing one of those zeros. (tsd_booted is from jemalloc that is in a .so that is loaded.) This prompted me to look for simpler contexts involving swapping that also show memory corruption. So far I've only managed to produce corrupted memory when fork and later swapping are both involved. Being a shared library global is not a requirement for the problem, although such contexts can have an issue. I've not made a simpler example of that yet, although I tried. I have not explored vfork, rfork, or any other alternatives. So far all failure examples end up with zeroed memory when the memory does not match the original pattern from before the fork. At least that is what the core dumps show for all examples that I've looked at. See bugzilla 217138 and 217239. In some respects this example is more analogous to the 217239 context as I remember. My tests on amd64, armv6 (really -mcpu=cortex-a7 so armv7), and powerpc64 have never produced any problems, including never getting the failed assertion. Only arm64. (But I've access to only one arm64 system, a Pine64+ 2GB.) Prior to this I tracked down a different arm64 problem to the fork_trampline code (for the child process) modifying a system register but in a place allowing interrupts that could also change the value. Andrew Turner fixed that one at the time. For this fork/swapping kind of issue I'm not sure that I'll be able to do more than provide the simpler example and the steps that I used. My isolating the internal stage(s) and specific problem(s) at the code level of detail does not seem likely. But whatever is found needs to be able to explain the contrast with an access after the fork but before the swap preventing the failing behavior. So what I've got so far hopefully does provide some hints to someone. === Mark Millard markmi at dsl-only.netReceived on Wed Mar 15 2017 - 00:19:00 UTC
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:10 UTC