Re: Fatal double fault in ZFS with yesterday's CURRENT

From: Fabian Keil <freebsd-listen_at_fabiankeil.de>
Date: Sun, 4 May 2014 08:57:00 +0200

"Steven Hartland" <killing_at_multiplay.co.uk> wrote:

> > "Steven Hartland" <killing_at_multiplay.co.uk> wrote:
> > 
> > > From: "Fabian Keil" <freebsd-listen_at_fabiankeil.de>
> > > 
> > > > After updating my laptop to yesterday's CURRENT (r265216),
> > > > I got the following fatal double fault on boot:
> > > > http://www.fabiankeil.de/bilder/freebsd/kernel-panic-r265216/
> > > > 
> > > > My previous kernel was based on r264721.
> > > >
> > > > I'm using a couple of custom patches, some of them are ZFS-related
> > > > and thus may be part of the problem (but worked fine for months).
> > > > I'll try to reproduce the panic without the patches tomorrow.
> > > >
> > > 
> > > You're seeing a stack overflow in the new ZFS queuing code, which I
> > > believe is being triggered by lack of support for TRIM in one of
> > > your devices, something Xin reported to me yesterday.
> > > 
> > > I committed a fix for failing TRIM requests processing slowly last
> > > night so you could try updating to after r265253 and see if that
> > > helps.
> > 
> > Thanks. The hard disk is indeed unlikely to support TRIM requests,
> > but I can still reproduce the problem with a kernel based on r265255.
> 
> Thanks for testing. I suspect it's still a numbers game with how many items
> are outstanding in the queue, and now that free / TRIM requests are also
> queued, it's triggering the failure.
> 
> If you're just on an HDD, try setting the following in /boot/loader.conf as
> a temporary workaround:
> vfs.zfs.trim.enabled=0

That worked, thanks.
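
For anyone else hitting this, a minimal sketch of applying the workaround
without adding the line twice. The helper name set_trim_tunable is
hypothetical and not something from this thread; it only appends the
tunable when no line in the file already sets it, and leaves any existing
setting untouched.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): append the workaround tunable
# to a loader.conf-style file unless a line already sets it.
set_trim_tunable() {
    conf="$1"
    if grep -q '^vfs\.zfs\.trim\.enabled=' "$conf" 2>/dev/null; then
        # An existing setting is left as-is; edit it by hand if it is not 0.
        echo "already set in $conf, leaving it alone"
    else
        printf 'vfs.zfs.trim.enabled=0\n' >> "$conf"
        echo "added vfs.zfs.trim.enabled=0 to $conf"
    fi
}
```

Run as root, this would be called as "set_trim_tunable /boot/loader.conf".
The tunable is read at boot, so it only takes effect after a reboot, at
which point "sysctl vfs.zfs.trim.enabled" should report 0.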

> > > I still need to investigate the stack overflow more directly, which
> > > appears to be caused by the new ZFS queuing code when things are
> > > running slowly and there's a large backlog of I/Os.
> > >
> > > I would be interested to know your config there, so zpool layout and
> > > hardware, in the meantime.
> > 
> > The system is a Lenovo ThinkPad R500:
> > http://www.nycbug.org/index.cgi?action=dmesgd&do=view&dmesgid=2449
> > 
> > I'm booting from UFS, the panic occurs while the pool is being imported.
> > 
> > The pool is located on a single geli-encrypted slice:
> > 
> > fk_at_r500 ~ $zpool status tank
> >   pool: tank
> >  state: ONLINE
> >   scan: scrub repaired 0 in 4h11m with 0 errors on Sat Mar 22 18:25:01 2014
> > config:
> > 
> >  NAME           STATE     READ WRITE CKSUM
> >  tank           ONLINE       0     0     0
> >    ada0s1d.eli  ONLINE       0     0     0
> > 
> > errors: No known data errors
> > 
> > Maybe geli fails TRIM requests differently.
> 
> That helps. Xin also reported the issue with geli and that's what I'm testing
> with. I believe this is a factor because it significantly slows things down,
> again meaning more items in the queues, but I've only managed to trigger it
> once here as the machine I'm using is pretty quick.

It probably doesn't make a difference, but my system is rather old
and thus I'm still using geli version 3 for ada0s1d.eli while
geli init nowadays defaults to geli version 7.

The system certainly is also slow, though.

Fabian