Re: ffs_truncate3 panics

From: Rick Macklem <rmacklem_at_uoguelph.ca> Date: Sat, 11 Aug 2018 12:05:25 +0000

Konstantin Belousov wrote:
>On Thu, Aug 09, 2018 at 08:38:50PM +0000, Rick Macklem wrote:
>> >BTW, does NFS server use extended attributes ?  What for ?  Can you, please,
>> >point out the code which does this ?
>> For the pNFS service, there are two system namespace extended attributes for
>> each file stored on the service.
>> pnfsd.dsfile - Stores where the data for the file is. Can be displayed by the
>>      pnfsdsfile(8) command.
>>
>> pnfsd.dsattr - Cached attributes that change when a file is written (size, mtime,
>> change) so that the MDS doesn't have to do a Getattr on the data server for every client Getattr.
>>
>
>My reading of the nfsd code + ffs extattr handling reminds me that you
>already reported this issue some time ago.  I suspected ufs_balloc() at
>that time.
Yes. I had almost forgotten about them, because I have been testing with a
couple of machines (not big, but amd64 with a few Gbytes of RAM) and they
never hit the panic(). Recently, I've been using the 256Mbyte i386 and started
seeing them again.

>Now I think that the situation with the stray buffers hanging on the
>queue is legitimate, ffs_extread() might create such buffer and release
>it to a clean queue, then removal of the file would see inode with no
>allocated ext blocks but with the buffer.
>
>I think the easiest way to handle it is to always flush buffers and pages
>in the ext attr range, regardless of the number of allocated ext blocks.
>Patch below was not tested.
[patch deleted for brevity]
Well, the above sounds reasonable, but the patch didn't help.
Here's a small portion of the log from a test run last night.
- First, a couple of things about the printf()s. When they start with "CL=<N>",
  the printf() is at the start of ffs_truncate(). "<N>" is a static counter of calls to
  ffs_truncate(), so "same value" indicates same call.
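
For anyone who wants to reproduce the tagging, here is a minimal user-space
sketch of the idea, not the actual debug patch. trace_truncate() and
trunc_calls are hypothetical names, and the real printf()s print more fields.
The point is that the counter is sampled once into a local at entry, so every
later printf() in the same call carries the same number.

```c
#include <stdio.h>

/*
 * User-space sketch of the call-tagging trick. trace_truncate() is a
 * hypothetical stand-in for the instrumented ffs_truncate().
 */
static unsigned long trunc_calls;	/* static counter, bumped once per call */

/* Format the entry and late trace lines for one call; returns the call id. */
static unsigned long
trace_truncate(char *buf, size_t len, int extsize, int clean_bufs)
{
	unsigned long id = ++trunc_calls;	/* sample once, reuse below */
	int n;

	n = snprintf(buf, len, "CL=%lu boclean=%d diextsiz=%d\n",
	    id, clean_bufs, extsize);
	/* ... the body of the function would run here ... */
	if (n > 0 && (size_t)n < len)
		snprintf(buf + n, len - n, "FFST3=%lu boclean=%d\n",
		    id, clean_bufs);
	return (id);
}
```

In the kernel, an unsynchronized increment of a static could hand the same
number to two concurrent calls, so an atomic increment would be safer if the
tags must be unique.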

CL=31816 flags=0xc00 vtyp=1 bodirty=0 boclean=1 diextsiz=320
buf at 0x429f260
b_flags = 0x20001020<vmio,reuse,cache>, b_xflags=0x2<clean>, b_vflags=0x0
b_error = 0, b_bufsize = 4096, b_bcount = 4096, b_resid = 0
b_bufobj = (0xfa3f734), b_data = 0x4c90000, b_blkno = -1, b_lblkno = -1, b_dep = 0
b_kvabase = 0x4c90000, b_kvasize = 32768

CL=34593 flags=0xc00 vtyp=1 bodirty=0 boclean=1 diextsiz=320
buf at 0x429deb0
b_flags = 0x20001020<vmio,reuse,cache>, b_xflags=0x2<clean>, b_vflags=0x0
b_error = 0, b_bufsize = 4096, b_bcount = 4096, b_resid = 0
b_bufobj = (0xfd3da94), b_data = 0x5700000, b_blkno = -1, b_lblkno = -1, b_dep = 0
b_kvabase = 0x5700000, b_kvasize = 32768

FFST3=34593 vtyp=1 bodirty=0 boclean=1
buf at 0x429deb0
b_flags = 0x20001020<vmio,reuse,cache>, b_xflags=0x2<clean>, b_vflags=0x0
b_error = 0, b_bufsize = 4096, b_bcount = 4096, b_resid = 0
b_bufobj = (0xfd3da94), b_data = 0x5700000, b_blkno = -1, b_lblkno = -1, b_dep = 0
b_kvabase = 0x5700000, b_kvasize = 32768
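
As a cross-check on the b_flags value in the records above, here is a small
user-space sketch that decodes 0x20001020 into the three names shown. The bit
values are my assumption of what sys/buf.h defines for B_VMIO, B_REUSE and
B_CACHE; decode_bflags() is a hypothetical helper, not a kernel function.

```c
#include <stdio.h>
#include <string.h>

/* Bit values assumed from FreeBSD's sys/buf.h; only these three are listed. */
#define B_CACHE	0x00000020u
#define B_REUSE	0x00001000u
#define B_VMIO	0x20000000u

static const struct {
	unsigned int bit;
	const char *name;
} flag_names[] = {
	{ B_VMIO, "vmio" }, { B_REUSE, "reuse" }, { B_CACHE, "cache" },
};

/*
 * Write a comma-separated list of the known flag names set in flags into
 * out (len must be at least 1). Returns any bits that were not recognized.
 */
static unsigned int
decode_bflags(unsigned int flags, char *out, size_t len)
{
	size_t i;

	out[0] = '\0';
	for (i = 0; i < sizeof(flag_names) / sizeof(flag_names[0]); i++) {
		if ((flags & flag_names[i].bit) == 0)
			continue;
		if (out[0] != '\0')
			strncat(out, ",", len - strlen(out) - 1);
		strncat(out, flag_names[i].name, len - strlen(out) - 1);
		flags &= ~flag_names[i].bit;
	}
	return (flags);
}
```

With those values, 0x20001020 decodes to exactly those three flags with no
bits left over, i.e. a clean, cached VMIO buffer.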

So, the first one is what typically happens and there would be no panic().
The second/third would be a panic(), since the one that starts with "FFST3"
is a printf() that replaces the panic() call.
- Looking at the second/third, the number at the beginning is the same, so it is
  the same call, but for some reason, between the start of the function and
  where the ffs_truncate3 panic() test is, di_extsize has been set to 0, but the
  buffer is still there (or has been re-created there by another thread?).
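
To make the condition concrete, here is a user-space model of the
ffs_truncate3 test as I read it. It is a sketch under assumptions: struct
trunc_state and ffs_truncate3_would_fire() are hypothetical names, the exact
condition in the tree may differ, and I am assuming the truncation is to
length 0 on a UFS2 file system, which the log above does not show.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the fields the check looks at. */
struct trunc_state {
	long	length;		/* requested new length (assumed 0 here) */
	bool	is_ufs2;	/* assumed true for these records */
	int	di_extsize;	/* ext attr area size in the dinode */
	int	bo_dirty_cnt;	/* "bodirty" in the log */
	int	bo_clean_cnt;	/* "boclean" in the log */
};

/* True when buffers remain although nothing should be left to back them. */
static bool
ffs_truncate3_would_fire(const struct trunc_state *s)
{
	return (s->length == 0 &&
	    (!s->is_ufs2 || s->di_extsize == 0) &&
	    (s->bo_dirty_cnt > 0 || s->bo_clean_cnt > 0));
}
```

At entry for CL=34593 (diextsiz=320, boclean=1) this model does not fire. By
the FFST3 point, with di_extsize at 0 and boclean still 1, it does.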

Looking at the code, I can't see how this could happen, since there is a vinvalbuf()
call after the only place in the code that sets di_extsize to 0, from what I can see?
I am going to add printf()s after the vinvalbuf() calls, to make sure they are
happening and getting rid of the buffer.

If another thread could somehow (re)create the buffer concurrently with the
ffs_truncate() call, that would explain it, I think?

Just a wild guess, but I suspect softdep_slowdown() is flipping, due to the small
size of the machine and this makes the behaviour of ffs_truncate() confusing.

I'll post again when I have more info.
Thanks for looking at it, rick