[Something strange happened to the automatic CC: fill-in for my
original reply. Also, I should have mentioned that, for my test
program, if a variant is made that does not fork, the swapping works
fine.]

On 2017-Mar-15, at 9:37 AM, Mark Millard <markmi at dsl-only.net> wrote:

> On 2017-Mar-15, at 6:15 AM, Scott Bennett <bennett at sdf.org> wrote:
>
>> On Tue, 14 Mar 2017 18:18:56 -0700 Mark Millard
>> <markmi at dsl-only.net> wrote:
>>> On 2017-Mar-14, at 4:44 PM, Bernd Walter <ticso at cicely7.cicely.de> wrote:
>>>
>>>> On Tue, Mar 14, 2017 at 03:28:53PM -0700, Mark Millard wrote:
>>>>> [test_check() between the fork and the wait/sleep prevents the
>>>>> failure from occurring. Even a small access to the memory at
>>>>> that stage prevents the failure. Details follow.]
>>>>
>>>> Maybe a stupid question, since you might have written it somewhere.
>>>> What medium do you swap to?
>>>> I've seen broken firmware on microSD cards doing silent data
>>>> corruption for some access patterns.
>>>
>>> The root filesystem is on a USB SSD on a powered hub.
>>>
>>> Only the kernel is from the microSD card.
>>>
>>> I have several examples of the USB SSD model and have
>>> never observed such problems in any other context.
>>>
>>> [remainder of irrelevant material deleted --SB]
>>
>> You gave a very long-winded non-answer to Bernd's question, so I'll
>> repeat it here. What medium do you swap to?
>
> My wording of:
>
> The root filesystem is on a USB SSD on a powered hub.
>
> was definitely poor. It should have explicitly mentioned the
> swap partition too:
>
> The root filesystem and swap partition are both on the same
> USB SSD on a powered hub.
>
> More detail from dmesg -a for usb:
>
> usbus0: 12Mbps Full Speed USB v1.0
> usbus1: 480Mbps High Speed USB v2.0
> usbus2: 12Mbps Full Speed USB v1.0
> usbus3: 480Mbps High Speed USB v2.0
> ugen0.1: <Generic OHCI root HUB> at usbus0
> uhub0: <Generic OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus0
> ugen1.1: <Allwinner EHCI root HUB> at usbus1
> uhub1: <Allwinner EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1
> ugen2.1: <Generic OHCI root HUB> at usbus2
> uhub2: <Generic OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus2
> ugen3.1: <Allwinner EHCI root HUB> at usbus3
> uhub3: <Allwinner EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus3
> . . .
> uhub0: 1 port with 1 removable, self powered
> uhub2: 1 port with 1 removable, self powered
> uhub1: 1 port with 1 removable, self powered
> uhub3: 1 port with 1 removable, self powered
> ugen3.2: <GenesysLogic USB2.0 Hub> at usbus3
> uhub4 on uhub3
> uhub4: <GenesysLogic USB2.0 Hub, class 9/0, rev 2.00/90.20, addr 2> on usbus3
> uhub4: MTT enabled
> uhub4: 4 ports with 4 removable, self powered
> ugen3.3: <OWC Envoy Pro mini> at usbus3
> umass0 on uhub4
> umass0: <OWC Envoy Pro mini, class 0/0, rev 2.10/1.00, addr 3> on usbus3
> umass0: SCSI over Bulk-Only; quirks = 0x0100
> umass0:0:0: Attached to scbus0
> . . .
> da0 at umass-sim0 bus 0 scbus0 target 0 lun 0
> da0: <OWC Envoy Pro mini 0> Fixed Direct Access SPC-4 SCSI device
> da0: Serial Number <REPLACED>
> da0: 40.000MB/s transfers
>
> (Edited a bit because there is other material interlaced, even
> internal to some lines. Also: I removed the serial number of the
> specific example device.)
>
>> I will further note that any kind of USB device cannot automatically
>> be trusted to behave properly. USB devices are notorious, for example,
>> for momentarily dropping off-line and then immediately reconnecting. (ZFS
>> reacts very poorly to such events, BTW.) This misbehavior can be caused
>> by either processor involved, i.e., the one controlling either the
>> upstream or the downstream device. Hubs are really bad about this, but
>> any USB device can be guilty. You may have a defective storage device,
>> its controller may be defective, or any controller in the chain all the
>> way back to the motherboard may be defective or, not defective, but
>> corrupted by having been connected to another device with corrupted
>> (infected) firmware that tries to flash itself into the firmware flash
>> chips in its potential victim.
>> Flash memory chips, spinning disks, or {S,}{D,}RAM chips can be
>> defective. Without parity bits, the devices may return bad data and lie
>> about the presence of corrupted data. That, for example, is where ZFS
>> is better than any kind of classical RAID, because ZFS keeps checksums on
>> everything, so it has a reasonable chance of detecting corruption even
>> without parity support and, if there is any redundancy in the pool or the
>> data set, of fixing the bad data. Even having parity generally
>> allows only the detection of single-bit errors, not their repair.
>> You should identify where you page/swap to and then try substituting
>> a different device for that function as a test to eliminate the
>> possibility of a bad storage device/controller. If the problem still
>> occurs, the possibility remains that another controller or its
>> firmware is defective instead. It could be a kernel bug, it is true, but
>> making sure there is no hardware or firmware error occurring is
>> important, and, as I say, USB devices should always be considered
>> suspect unless and until proven innocent.
>
> [FYI: This is a ufs context, not a zfs one.]
>
> I'm aware of such things. There is no evidence so far suggesting that
> the USB devices I can replace are a problem. Otherwise I'd not be
> going down this path. I only have access to the one arm64 device (a
> Pine64+ 2GB), so I've no ability to substitution-test what is on that
> board.
>
> It would be neat if some folks used my code to test other arm64
> contexts and reported the results. I'd be very interested.
> (This is easier to do on devices that do not have massive
> amounts of RAM, which may limit the range of devices or
> device configurations that are reasonable to test.)
>
> There is also the fact that other people, using other devices, have
> reported the behavior that started this investigation. I can produce
> the behavior that they reported, although I've not seen anyone else
> listing specific steps that lead to the problem or ways to tell
> whether the symptom is going to happen before it actually does. Nor
> have I seen any other core dump analysis. (I have bugzilla
> submittals 217138 and 217239 tied to symptoms others have
> reported, as well as this test program material.)
>
> Also, for my test program, I can control which pages get the
> zeroed-page problem: read-accessing even one byte of any 4K Byte
> page that I want to make work normally, doing so in the child
> process of the fork, between the fork and the sleep/swap-out, is
> enough. That does not suggest USB-device-specific behavior. As far
> as I can tell, the read-access is changing the status of the page
> in some way.
>
> (Such read-accesses in the parent process make no difference to the
> behavior.)
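For concreteness, here is a minimal sketch of the overall shape of
such a test. This is not my actual test program: the region size,
the fill pattern, and the 60s sleep below are just placeholders for
illustration.

/*
 * Minimal sketch (placeholder sizes and names): fill a region with
 * a known pattern, fork, read-access one byte per 4K Byte page in
 * the child, sleep long enough for the swap-out to be forced, then
 * check the pattern after the swap-in.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

#define REGION_SIZE (1UL << 30)  /* 1 GiByte: placeholder size */
#define PAGE_BYTES 4096UL

int
main(void)
{
        unsigned char *region;
        size_t i;
        pid_t pid;

        region = malloc(REGION_SIZE);
        if (region == NULL)
                return (1);
        memset(region, 0xA5, REGION_SIZE); /* known non-zero pattern */

        pid = fork();
        if (pid == -1)
                return (1);
        if (pid == 0) {
                /*
                 * Child: read-accessing even one byte of a page here
                 * is what makes that page work normally; pages not
                 * touched here are the ones that can come back zeroed.
                 */
                volatile unsigned char sink;
                for (i = 0; i < REGION_SIZE; i += PAGE_BYTES)
                        sink = region[i];
                (void)sink;

                sleep(60); /* window in which to force the swap-out */

                for (i = 0; i < REGION_SIZE; i++)
                        if (region[i] != 0xA5) {
                                printf("mismatch at byte %zu: 0x%02x\n",
                                    i, region[i]);
                                return (1);
                        }
                printf("no corruption observed\n");
                return (0);
        }
        wait(NULL);
        return (0);
}

With the child's read loop present, all pages survive in my testing;
with it removed, or only partially covering the region, the untouched
pages are the ones that can come back zeroed.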
I should have noted another comparison/contrast between having the
memory corruption and not in my context: I've tried variants of my
test program that do not fork but just sleep for 60s to allow me to
force the swap-out. I did this before adding fork and before using
partial_test_check, for example. I gradually added things apparently
involved in the reports others had made until I found a combination
that produced a memory corruption test failure.

The tests that did not involve fork found no problems with the memory
content after the swap-in. For my test program, it appears that
fork-before-swap-out or the like is essential to having the problem
occur.

===
Mark Millard
markmi at dsl-only.net