Re: ZFS Crash

From: Kip Macy <kmacy_at_freebsd.org>
Date: Fri, 29 May 2009 11:29:15 -0700
I'm fairly certain I know what the problem is. The (de)compress
functions allocate their own memory completely independently of the
ARC limits. The allocations are blocking, so the system will try to
page in an attempt to provide the requested memory.
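
The accounting gap described above can be illustrated with a toy model.
This is only a sketch under stated assumptions, not the ZFS code: the
class, its fields, and the numbers are all hypothetical. The one point
it captures is that cache inserts are checked against a limit, while
scratch buffers for (de)compression are not, so the combined demand can
exceed physical memory and force paging.

```python
from dataclasses import dataclass

MIB = 1 << 20


@dataclass
class MemoryModel:
    """Toy accounting model; every name and number here is hypothetical."""

    physical: int          # bytes of RAM assumed available
    arc_limit: int         # cap the cache enforces on itself
    arc_used: int = 0      # bytes charged against arc_limit
    scratch_used: int = 0  # (de)compress buffers, not charged to the cache

    def cache_insert(self, nbytes: int) -> bool:
        """Cache allocation: refused once the cache limit would be exceeded."""
        if self.arc_used + nbytes > self.arc_limit:
            return False
        self.arc_used += nbytes
        return True

    def scratch_alloc(self, nbytes: int) -> None:
        """Blocking-style allocation: never refused, never checks arc_limit."""
        self.scratch_used += nbytes

    def overcommit(self) -> int:
        """Bytes demanded beyond physical memory; above 0 means paging."""
        return max(0, self.arc_used + self.scratch_used - self.physical)


def demo() -> int:
    """Fill the cache to its limit, then add scratch buffers on top."""
    model = MemoryModel(physical=1024 * MIB, arc_limit=768 * MIB)
    while model.cache_insert(128 * MIB):
        pass  # stops by itself once the cache limit is reached
    for _ in range(4):
        model.scratch_alloc(128 * MIB)  # succeeds regardless of the limit
    return model.overcommit() // MIB


if __name__ == "__main__":
    print(demo(), "MiB over physical memory")
```

In this model the cache stops growing at its own limit, yet the total
still ends up above physical memory because the scratch buffers are
never counted against that limit.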


Cheers,
Kip

On Fri, May 29, 2009 at 10:44 AM, Larry Rosenman <ler_at_lerctr.org> wrote:
> On Thu, 28 May 2009, Larry Rosenman wrote:
>
>> On Thu, 28 May 2009, Kip Macy wrote:
>>
>>> On Tue, May 26, 2009 at 5:04 AM, Larry Rosenman <ler_at_lerctr.org> wrote:
>>>>
>>>> On Mon, 25 May 2009, Larry Rosenman wrote:
>>>>
>>>>> On Mon, 25 May 2009, Larry Rosenman wrote:
>>>>>
>>>>>> after looking at the code, never mind the "don't call doadump", so
>>>>>> we'll get the textdump.
>>>>>>
>>>>>> Thanks rwatson for the textdump stuff!
>>>>>>
>>>>> Here are the current stats before we crash.  Does any of this look totally
>>>>> out of line?
>>>>>
>>>> It crashed again, but did *NOT* make it into ddb enough to do the
>>>> textdump.
>>>>
>>>> It was hung with the backtrace (looks like the same, but I couldn't
>>>> scroll the screen back).
>>>>
>>>> Ideas?
>>>>
>>>> I'm really concerned that there is a problem.
>>>>
>>>>
>>>>
>>>
>>>
>>> - Type of disks?
>>
>> 6 SATA Seagate 400GB (5) / 500 GB (1).
>>
>>
>> ATA channel 0:
>>   Master: acd0 <Memorex DVD+-RAM 510L v1/MWS7> ATA/ATAPI revision 7
>>   Slave:       no device present
>> ATA channel 2:
>>   Master:  ad4 <ST3400620AS/3.AAJ> SATA revision 2.x
>>   Slave:       no device present
>> ATA channel 3:
>>   Master:  ad6 <ST3400620AS/3.AAJ> SATA revision 2.x
>>   Slave:       no device present
>> ATA channel 4:
>>   Master:  ad8 <ST3500630AS/3.AAE> SATA revision 2.x
>>   Slave:       no device present
>> ATA channel 5:
>>   Master: ad10 <ST3400620AS/3.AAJ> SATA revision 2.x
>>   Slave:       no device present
>> ATA channel 6:
>>   Master: ad12 <ST3400620AS/3.AAJ> SATA revision 2.x
>>   Slave:       no device present
>> ATA channel 7:
>>   Master: ad14 <ST3400620AS/3.AAJ> SATA revision 2.x
>>   Slave:       no device present
>>>
>>>
>>> - Size of zpools?
>>
>> All 6.
>>
>>  pool: vault
>> state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>>        corruption.  Applications may be affected.
>> action: Restore the file in question if possible.  Otherwise restore the
>>        entire pool from backup.
>>  see: http://www.sun.com/msg/ZFS-8000-8A
>> scrub: none requested
>> config:
>>
>>        NAME        STATE     READ WRITE CKSUM
>>        vault       ONLINE       0     0     0
>>          raidz1    ONLINE       0     0     0
>>            ad6     ONLINE       0     0     0
>>            ad8     ONLINE       0     0     0
>>            ad10    ONLINE       0     0     0
>>            ad12    ONLINE       0     0     0
>>            ad14    ONLINE       0     0     0
>>          ad4s1f    ONLINE       0     0     0
>>          ad4s1e    ONLINE       0     0     0
>>          ad4s1d    ONLINE       0     0     0
>>
>> errors: 10 data errors, use '-v' for a list
>>
>>
>>  pool: vault
>> state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>>        corruption.  Applications may be affected.
>> action: Restore the file in question if possible.  Otherwise restore the
>>        entire pool from backup.
>>  see: http://www.sun.com/msg/ZFS-8000-8A
>> scrub: none requested
>> config:
>>
>>        NAME        STATE     READ WRITE CKSUM
>>        vault       ONLINE       0     0     0
>>          raidz1    ONLINE       0     0     0
>>            ad6     ONLINE       0     0     0
>>            ad8     ONLINE       0     0     0
>>            ad10    ONLINE       0     0     0
>>            ad12    ONLINE       0     0     0
>>            ad14    ONLINE       0     0     0
>>          ad4s1f    ONLINE       0     0     0
>>          ad4s1e    ONLINE       0     0     0
>>          ad4s1d    ONLINE       0     0     0
>>
>> errors: Permanent errors have been detected in the following files:
>>
>>       /usr/local/sbin/p4d
>>       /var/db/bacula/borg-dir.conmsg
>>       vault/usr/obj:<0x16c3a>
>>       vault/usr/obj:<0x169bb>
>>       /usr/obj/usr/src/lib/libc/random.o
>>
>>>
>>>
>>> - Compression enabled?
>>
>> Yes.
>>
>>
>>
>
> Ok, it just crashed.  Unfortunately, I'm at work and the box is at home.
>
> I did have my script running every minute of that entire boot.
>
> What I saw was a full backup running, and then we started paging, and then
> the backup jobs got pager errors, and were killed.
>
> I'm not sure what else went on, so I restarted the bacula daemons that
> got killed, and was in the bacula console when it died.
>
> I'll see if I can get a cell-phone camera shot of the console.
>
> I'll also tar up the vmstat outputs and put them on my web server.
>
> What other forensics should I get?  Bear in mind the system is probably
> locked up with no dump taken :(
>
>
> --
> Larry Rosenman                     http://www.lerctr.org/~ler
> Phone: +1 512-248-2683                 E-Mail: ler_at_lerctr.org
> US Mail: 430 Valona Loop, Round Rock, TX 78681-3893
>



-- 
When bad men combine, the good must associate; else they will fall one
by one, an unpitied sacrifice in a contemptible struggle.

    Edmund Burke
Received on Fri May 29 2009 - 16:29:16 UTC
