I'm fairly certain I know what the problem is. The (de)compress
functions allocate their own memory completely independently of the
ARC limits. The allocations are blocking, so the system will try to
page in an attempt to provide the requested memory.

Cheers,
Kip

On Fri, May 29, 2009 at 10:44 AM, Larry Rosenman <ler_at_lerctr.org> wrote:
> On Thu, 28 May 2009, Larry Rosenman wrote:
>
>> On Thu, 28 May 2009, Kip Macy wrote:
>>
>>> On Tue, May 26, 2009 at 5:04 AM, Larry Rosenman <ler_at_lerctr.org> wrote:
>>>>
>>>> On Mon, 25 May 2009, Larry Rosenman wrote:
>>>>
>>>>> On Mon, 25 May 2009, Larry Rosenman wrote:
>>>>>
>>>>>> after looking at the code, never mind the "don't call doadump", so
>>>>>> we'll get the textdump.
>>>>>>
>>>>>> Thanks rwatson for the textdump stuff!
>>>>>>
>>>>> Here is current stats before we crash. Does any of this look totally
>>>>> out of line?
>>>>>
>>>> It crashed again, but did *NOT* make it into ddb enough to do the
>>>> textdump.
>>>>
>>>> It was hung with the backtrace (looks like the same, but I couldn't
>>>> scroll the screen back).
>>>>
>>>> Ideas?
>>>>
>>>> I'm really concerned that there is a problem.
>>>>
>>>
>>> - Type of disks?
>>
>> 6 SATA Seagate 400GB (5) / 500 GB (1).
>>
>> ATA channel 0:
>>     Master: acd0 <Memorex DVD+-RAM 510L v1/MWS7> ATA/ATAPI revision 7
>>     Slave:  no device present
>> ATA channel 2:
>>     Master: ad4 <ST3400620AS/3.AAJ> SATA revision 2.x
>>     Slave:  no device present
>> ATA channel 3:
>>     Master: ad6 <ST3400620AS/3.AAJ> SATA revision 2.x
>>     Slave:  no device present
>> ATA channel 4:
>>     Master: ad8 <ST3500630AS/3.AAE> SATA revision 2.x
>>     Slave:  no device present
>> ATA channel 5:
>>     Master: ad10 <ST3400620AS/3.AAJ> SATA revision 2.x
>>     Slave:  no device present
>> ATA channel 6:
>>     Master: ad12 <ST3400620AS/3.AAJ> SATA revision 2.x
>>     Slave:  no device present
>> ATA channel 7:
>>     Master: ad14 <ST3400620AS/3.AAJ> SATA revision 2.x
>>     Slave:  no device present
>>>
>>> - Size of zpools?
>>
>> All 6.
>>
>>   pool: vault
>>  state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>>         corruption. Applications may be affected.
>> action: Restore the file in question if possible. Otherwise restore the
>>         entire pool from backup.
>>    see: http://www.sun.com/msg/ZFS-8000-8A
>>  scrub: none requested
>> config:
>>
>>         NAME      STATE   READ  WRITE  CKSUM
>>         vault     ONLINE     0      0      0
>>         raidz1    ONLINE     0      0      0
>>         ad6       ONLINE     0      0      0
>>         ad8       ONLINE     0      0      0
>>         ad10      ONLINE     0      0      0
>>         ad12      ONLINE     0      0      0
>>         ad14      ONLINE     0      0      0
>>         ad4s1f    ONLINE     0      0      0
>>         ad4s1e    ONLINE     0      0      0
>>         ad4s1d    ONLINE     0      0      0
>>
>> errors: 10 data errors, use '-v' for a list
>>
>>
>>   pool: vault
>>  state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>>         corruption. Applications may be affected.
>> action: Restore the file in question if possible. Otherwise restore the
>>         entire pool from backup.
>>    see: http://www.sun.com/msg/ZFS-8000-8A
>>  scrub: none requested
>> config:
>>
>>         NAME      STATE   READ  WRITE  CKSUM
>>         vault     ONLINE     0      0      0
>>         raidz1    ONLINE     0      0      0
>>         ad6       ONLINE     0      0      0
>>         ad8       ONLINE     0      0      0
>>         ad10      ONLINE     0      0      0
>>         ad12      ONLINE     0      0      0
>>         ad14      ONLINE     0      0      0
>>         ad4s1f    ONLINE     0      0      0
>>         ad4s1e    ONLINE     0      0      0
>>         ad4s1d    ONLINE     0      0      0
>>
>> errors: Permanent errors have been detected in the following files:
>>
>>         /usr/local/sbin/p4d
>>         /var/db/bacula/borg-dir.conmsg
>>         vault/usr/obj:<0x16c3a>
>>         vault/usr/obj:<0x169bb>
>>         /usr/obj/usr/src/lib/libc/random.o
>>
>>>
>>> - Compression enabled?
>>
>> Yes.
>>
>
> Ok, it just crashed. Unfortunately, I'm at work and the box is at home.
>
> I did have my script running every minute of that entire boot.
>
> What I saw was a full backup running, and then we started paging, and then
> the backup jobs got pager errors, and were killed.
>
> I'm not sure what else went on, so I restarted the bacula daemons that
> got killed, and was in the bacula console when it died.
>
> I'll see if I can get a cell-phone camera shot of the console.
>
> I'll also tar up the vmstat outputs and put them on my web server.
>
> What other forensics should I get? Bear in mind the system is probably
> locked up with no dump taken :(
>
> --
> Larry Rosenman                     http://www.lerctr.org/~ler
> Phone: +1 512-248-2683             E-Mail: ler_at_lerctr.org
> US Mail: 430 Valona Loop, Round Rock, TX 78681-3893
>

--
When bad men combine, the good must associate; else they will fall one
by one, an unpitied sacrifice in a contemptible struggle.

    Edmund Burke

Received on Fri May 29 2009 - 16:29:16 UTC
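
The diagnosis at the top of this message (compression buffers allocated
outside the ARC limits, with blocking allocations that force paging) can
be made concrete with a toy model. This is only a sketch in Python, not
ZFS or FreeBSD code: the class ToyMemory, its method names, and every
size used are hypothetical, picked purely for illustration.

```python
class ToyMemory:
    """Toy accounting model: a limited cache plus unaccounted scratch buffers."""

    def __init__(self, physical, cache_limit):
        self.physical = physical        # total RAM in arbitrary units (assumed)
        self.cache_limit = cache_limit  # stand-in for a configured cache ceiling
        self.cache_used = 0
        self.scratch_used = 0

    def cache_insert(self, size):
        # Accounted path: the cache never grows past its limit
        # (eviction is modelled by simply clamping the total).
        self.cache_used = min(self.cache_used + size, self.cache_limit)

    def scratch_alloc(self, size):
        # Unaccounted path: a blocking allocation that always succeeds
        # and never consults cache_limit.
        self.scratch_used += size

    def paged_out(self):
        # Whatever does not fit in physical memory has to be paged.
        return max(0, self.cache_used + self.scratch_used - self.physical)


mem = ToyMemory(physical=1024, cache_limit=512)
for _ in range(10):          # ten concurrent I/Os, each with its own buffer
    mem.cache_insert(128)
    mem.scratch_alloc(128)

print(mem.cache_used, mem.scratch_used, mem.paged_out())
```

In this model the cache stays at its limit, yet total demand still
exceeds physical memory, because the scratch buffers are never counted
against that limit.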