John Baldwin wrote:
> On Wednesday 28 November 2007 08:51:38 am Søren Schmidt wrote:
>> John Baldwin wrote:
>>> On Wednesday 28 November 2007 02:45:16 am Søren Schmidt wrote:
>>>
>>>> John Baldwin wrote:
>>>>
>>>>> FYI, I've seen weird in-memory corruption with machines with the HT1000_S1
>>>>> atapci device. In all the cases I've seen so far, a single page is corrupted
>>>>> with garbage and the page happens to be used by UMA to hold credentials
>>>>> including proc0's credentials. I've seen this corruption (trashed creds for
>>>>> proc0 and other creds in that page) on many of the same boxes (Dell 1435's
>>>>> IIRC) running on 6.2. I've tried switching the HT1000_S1 to use SWKSMIO
>>>>> rather SWKS100 as I mentioned to you in an earlier e-mail (the Linux driver
>>>>> uses equivalent of SWKSMIO FWIW) but don't have any conclusive tests on that.
>>>>>
>>>> OK, seems the chipset has some real problems, I have digged through all
>>>> the (very little) docs and info I got from serverworks back when, and
>>>> the only thing I can find is that the chips doesn't support MSI in any
>>>> shape or fashion or it will do really strange things.
>>>> Now on my system it seems to be disabled but I'm not sure yet how its
>>>> determined to be that way. Would be worth for you guys to check what the
>>>> sysctl's "hw.pci.enable_msi" and "hw.pci.enable_msix" are set to.
>>>> I havn't looked into this yet, but I'm pretty sure we added MSI support
>>>> in the 6.2 -> 7.0 timeframe, so that might have uncovered this chipset
>>>> bug, and possibly the Promise data corruption one as well.
>>>>
>>> The ata driver doesn't use MSI (no calls to pci_msi_count or pci_msi_alloc,
>>> etc.), so this isn't an issue. Also, the boxes I've seen the corruption on
>>> already have MSI disabled (it's still disabled by default in 6.x).
>>>
>> OK, its must be *totally* disabled not just for ATA but for everything
>> on those chipsets or they'll barf all over the place.
>> If we do that already we need to look into other places.
>> However, if we are dealing with in-memory corruption this is going to
>> get "interesting"....
>> Does that also happen if nothing uses DMA ?
>
> Again, on the machines I'm seeing this on it was totally disabled. I don't think
> I can totally disable DMA (NICs etc. must use DMA) on the machines and since they
> are in production and I only see the corruption as an after-effect when the boxes
> panic or deadlock for another reason I'm not easily able to reproduce this. Also,
> we do disable MSI for devices behind HT2000 chipsets because of a chip bug, but
> not on HT1000 currently. However, MSI isn't on on 6.x anyway.

It does make sense that the problem with RELENG_7 is a memory corruption
issue; mine just seems to be triggered when I start using my PCI-X ATA
devices. At boot, random processes start sig11'ing all over the place,
not to mention horrible corruption of my data.

Here's some testing I've just done with various FreeBSD versions on my
HT1000 Tyan S3950 motherboard. It all went pretty badly... Hm, I guess we
never tried 6.2 on this. It works way worse than 7.0. All this testing is
with a disk attached to the onboard SATA controller, no Marvell PCI-X
controller involved.

7.0-BETA3 amd64 and i386: the onboard SATA controller causes massive data
corruption and things sig11 (like init and the "sh" that's running
/etc/rc), but the system runs well when you use a SiI3114 4-port SATA
32-bit PCI card.

6.2-R amd64: won't boot past "Timecounters tick every 1.000 msec" with
ACPI enabled. With ACPI disabled, after "Timecounters tick" it'll spew
stray irq 7, stray irq 7, not logging any more, and then it'll run at a
snail's pace. In 3-4 minutes it didn't manage to finish reading the
mfsroot and start sysinstall, so I didn't go any further. I tried with
the SATA controller disabled and with the on-board "em" NICs disabled;
neither had any effect on that problem.

6.3-BETA2 amd64: boots with ACPI enabled, but the sysinstall console is
sluggish, indicating it's having the same problem, just not as severely.
Sysinstall almost made it through, but finally failed with a g_vfs_alloc
error while trying to write out the kernel distributions from CD to hard
drive. top was showing interrupt usage at 70%.

I booted that almost-completed 6.3-BETA2 installation with the SMP kernel
for the hell of it. top showed 35% interrupt usage and it ran a little
faster, but things sig11'ed and there was massive data corruption on the
disk. fsck (not background fsck) even went crazy and ran out of swap
while trying to fsck /usr, etc. etc.

I tried to boot HEAD from today, but unfortunately it hung at the boot
loader and wouldn't boot the kernel, so not much to say on that.

-- 
TerraNovaNet Internet Services - Key Largo, FL
Voice: (305)453-4011 x101   Fax: (305)451-5991
http://www.terranova.net/
----------------------------------------------
Life's not fair, but the root password helps.

Received on Wed Nov 28 2007 - 19:43:03 UTC