Re: Fatal double fault in ZFS with yesterday's CURRENT

From: Steven Hartland <killing_at_multiplay.co.uk>
Date: Sat, 3 May 2014 19:04:40 +0100
> "Steven Hartland" <killing_at_multiplay.co.uk> wrote:
> 
> > From: "Fabian Keil" <freebsd-listen_at_fabiankeil.de>
> > 
> > > After updating my laptop to yesterday's CURRENT (r265216),
> > > I got the following fatal double fault on boot:
> > > http://www.fabiankeil.de/bilder/freebsd/kernel-panic-r265216/
> > > 
> > > My previous kernel was based on r264721.
> > >
> > > I'm using a couple of custom patches, some of them are ZFS-related
> > > and thus may be part of the problem (but worked fine for months).
> > > I'll try to reproduce the panic without the patches tomorrow.
> > >
> > 
> > You're seeing a stack overflow in the new ZFS queuing code, which I
> > believe is being triggered by a lack of support for TRIM in one of
> > your devices, something Xin reported to me yesterday.
> > 
> > I committed a fix last night for failing TRIM requests processing
> > slowly, so you could try updating to r265253 or later and see if
> > that helps.
> 
> Thanks. The hard disk is indeed unlikely to support TRIM requests,
> but I can still reproduce the problem with a kernel based on r265255.

Thanks for testing. I suspect it's still a numbers game with how many
items are outstanding in the queue, and now that free / TRIM requests
are also queued it's triggering the failure.
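
For anyone following along, the failure mode is the classic one where a
synchronously failing I/O runs its completion handler in the same stack
frame, and the completion handler then issues the next queued request,
so a long backlog turns into deep recursion rather than a loop. Below is
a minimal userland sketch of that pattern as I understand it from this
thread -- hypothetical names, not the actual ZFS vdev queue code:

/*
 * Illustration only -- not the ZFS code.  A request that fails
 * immediately (like a TRIM on a device without support) completes
 * synchronously, and the completion handler dispatches the next
 * queued request, so every backlog entry adds a stack frame.
 * Kernel stacks are only a few pages, so far fewer frames are
 * needed there than in this userland demo.  Compile with -O0;
 * optimisers can shrink or eliminate some of these frames.
 */
#include <stdio.h>

static long pending = 10 * 1000 * 1000;	/* simulated I/O backlog */
static long frames;

static void issue_next(void);

static void
io_done(int error)
{
	(void)error;
	issue_next();	/* completion dispatches the next request */
	frames++;	/* work after the call keeps this frame live */
}

static void
issue_next(void)
{
	if (pending-- <= 0)
		return;
	io_done(-1);	/* fails immediately: completion runs in the
			 * same stack frame, recursing instead of
			 * looping */
}

int
main(void)
{
	issue_next();	/* with a big enough backlog this overflows */
	printf("unwound %ld frames\n", frames);
	return (0);
}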

If you're just on an HDD, try setting the following in /boot/loader.conf
as a temporary workaround:
vfs.zfs.trim.enabled=0
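
To confirm the tunable took effect after the next boot, and to check
whether the disk advertises TRIM at all (assuming it's an ATA device on
the ada path, as your dmesg suggests):

sysctl vfs.zfs.trim.enabled
camcontrol identify ada0

The identify output should include a data set management (DSM/TRIM)
line showing whether the drive supports it.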

> > I still need to investigate the stack overflow more directly; it
> > appears to be caused by the new ZFS queuing code when things are
> > running slowly and there's a large backlog of IOs.
> >
> > In the meantime, I would be interested to know your config there, so
> > zpool layout and hardware.
> 
> The system is a Lenovo ThinkPad R500:
> http://www.nycbug.org/index.cgi?action=dmesgd&do=view&dmesgid=2449
> 
> I'm booting from UFS; the panic occurs while the pool is being imported.
> 
> The pool is located on a single geli-encrypted slice:
> 
> fk_at_r500 ~ $ zpool status tank
>   pool: tank
>  state: ONLINE
>   scan: scrub repaired 0 in 4h11m with 0 errors on Sat Mar 22 18:25:01 2014
> config:
> 
>  NAME           STATE     READ WRITE CKSUM
>  tank           ONLINE       0     0     0
>    ada0s1d.eli  ONLINE       0     0     0
> 
> errors: No known data errors
> 
> Maybe geli fails TRIM requests differently.

That helps. Xin also reported the issue with geli, and that's what I'm
testing with. I believe this is a factor because it significantly slows
things down, again meaning more items in the queues, but I've only
managed to trigger it once here as the machine I'm using is pretty quick.
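
If you get a chance before the next crash, watching gstat -d while the
pool imports might be informative; the -d flag adds BIO_DELETE
statistics per provider, so it should show whether the TRIMs are
reaching ada0s1d.eli and how quickly they're failing.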

I'll continue looking at this ASAP.

    Regards
    Steve