zfs locking vs vnode locking [Was: zfs/vfs lockups, via poudriere]

From: Andriy Gapon <avg_at_FreeBSD.org>
Date: Mon, 24 Nov 2014 23:12:33 +0200
On 23/11/2014 18:57, Sean Bruno wrote:
> 31071 100995 rm               -                mi_switch+0xe1 sleepq_wait+0x3a sleeplk+0x18d __lockmgr_args+0x9ab vop_stdlock+0x3c VOP_LOCK1_APV+0xab _vn_lock+0x43 zfs_lookup+0x45d zfs_freebsd_lookup+0x6d VOP_CACHEDLOOKUP_APV+0xa1 vfs_cache_lookup+0xd6 VOP_LOOKUP_APV+0xa1 lookup+0x5a1 namei+0x534 kern_rmdirat+0x8d amd64_syscall+0x3fb Xfast_syscall+0xfb
> 31075 100693 mv               -                mi_switch+0xe1 sleepq_wait+0x3a sleeplk+0x18d __lockmgr_args+0xd5d vop_stdlock+0x3c VOP_LOCK1_APV+0xab _vn_lock+0x43 vputx+0x28a zfs_rename_unlock+0x3e zfs_freebsd_rename+0xe39 VOP_RENAME_APV+0xab kern_renameat+0x4a6 amd64_syscall+0x3fb Xfast_syscall+0xfb

The stack traces alone are not sufficient to analyze the problem; that would 
require examining the relevant vnodes and vnode locks.  But I believe that I 
have seen reports of this kind of problem before, and I think that I understand 
what is going on.  As part of my job, I tried to develop a fix [*] for this 
problem and have had some positive feedback on it.

But the fix is not just a few lines of changes; it is a lot of modifications to 
a lot of files.  Besides, my changes alter quite a bit more code than the bare 
minimum required to fix the problem, although even that minimum would be a 
sizable amount of changes.

So, for now I would like to describe the problem as I understand it.  Some 
general background on the FreeBSD VFS and its differences from the Solaris VFS 
[**] can be useful, but is not really required.

I'll try to explain by example.  If we look at any mature and "native" FreeBSD 
filesystem with read-write support - ffs, maybe tmpfs - then we can make the 
following observations.  In most of the vnode operation implementations there 
are no calls to vnode locking functions.  E.g. for an operation like vop_remove 
the two vnodes in question are already locked at the VFS layer.  In some cases 
VOPs do locking, but it is very trivial: e.g. in vop_create the newly created 
vnode must be returned locked.
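To illustrate, here is a minimal sketch - a hypothetical example_remove, not 
actual ffs or tmpfs code - of what such a remove VOP looks like: both vnodes 
arrive locked from the VFS layer, so the implementation only asserts that fact 
and does no vnode locking of its own.

#include <sys/param.h>
#include <sys/vnode.h>

static int
example_remove(struct vop_remove_args *ap)
{
	struct vnode *dvp = ap->a_dvp;	/* directory vnode, locked by the VFS caller */
	struct vnode *vp = ap->a_vp;	/* target vnode, also locked by the VFS caller */

	/* No vn_lock() calls here; we only assert what the VFS guarantees. */
	ASSERT_VOP_ELOCKED(dvp, "example_remove");
	ASSERT_VOP_ELOCKED(vp, "example_remove");

	/* ... remove the directory entry and adjust the link count ... */
	return (0);
}
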
Naturally, if we look at the VFS code we see a lot of vnode locking for various 
purposes.  Like locking the vnodes for a vop_remove call.  Or locking vnodes 
during their life cycle management, so that, for example, a vnode is not 
destroyed while there is an ongoing operation on it.  Also, we can see locking 
in the VFS namei / lookup implementation where we need to hold onto a directory 
vnode while looking up a child vnode by name.
But there are two vnode operation implementations where we can see a non-trivial 
vnode locking "dance".  Those are vop_rename and vop_lookup.  Anyone is welcome 
to take a cursory look at the first hundred or so lines of ufs_rename().
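To give a flavor of the lookup part of that dance, here is a simplified sketch 
(a hypothetical example_lookup_dotdot helper, not the actual ufs code) of the 
".." case: the directory being searched (dvp) comes in locked, but ".." is 
above it in the tree, so to respect the parent-before-child lock order dvp has 
to be unlocked temporarily while the parent is fetched and locked.

#include <sys/param.h>
#include <sys/mount.h>
#include <sys/vnode.h>

static int
example_lookup_dotdot(struct vnode *dvp, ino_t parent_ino, struct vnode **vpp)
{
	struct vnode *vp;
	int error;

	ASSERT_VOP_LOCKED(dvp, __func__);
	VOP_UNLOCK(dvp, 0);		/* avoid a parent-child lock order reversal */
	error = VFS_VGET(dvp->v_mount, parent_ino, LK_EXCLUSIVE, &vp);
	vn_lock(dvp, LK_EXCLUSIVE | LK_RETRY);	/* re-lock the directory we were given */
	if (error != 0)
		return (error);
	*vpp = vp;			/* the ".." vnode is returned locked */
	return (0);
}

The rename case is even hairier, because up to four vnodes may have to be 
locked in a deadlock-free order, which is why ufs_rename() starts with such a 
long locking preamble.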

The point of the above observations is that both VFS and a filesystem driver do 
vnode locking.  And, thus, both VFS and the driver must cooperate by using the 
same locking protocol.

Now, if we look at the ZFS ZPL code and most prominently at zfs_rename() we see 
that there is quite a bit of locking going on there, e.g. see zfs_rename_lock, 
but the locks in question are all ZFS internal locks.  We do not see the vnode 
locks.

From this comes a suspicion, or even a conclusion, that ZFS currently does not 
use the same vnode locking protocol that is expected from any filesystem driver.

There is a weird form of redundancy between the fine-grained ZFS locks that got 
ported over and the FreeBSD vnode locks.  In some cases the ZFS locks are always 
uncontested, because the vnode locks held at the VFS level already ensure 
serialized access.  In other cases there is no protection at all, because one 
thread may be in VFS code, which uses the vnode locks, while another thread is 
in ZFS code, which uses the ZFS locks, and thus there is no real synchronization 
between those threads.
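A tiny userland illustration of that second case is below - with made-up lock 
names, and pthread mutexes standing in for the kernel locks: both threads 
modify the same "directory", but each takes a different lock, so they do not 
exclude each other at all.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t vnode_lock = PTHREAD_MUTEX_INITIALIZER;	/* stands in for the VFS vnode lock */
static pthread_mutex_t zfs_name_lock = PTHREAD_MUTEX_INITIALIZER; /* stands in for a ZFS-internal lock */
static int directory_entries;	/* shared state that both paths modify */

static void *
vfs_path(void *arg)
{
	/* e.g. a remove coming through the VFS: only the vnode lock is held */
	pthread_mutex_lock(&vnode_lock);
	directory_entries--;
	pthread_mutex_unlock(&vnode_lock);
	return (NULL);
}

static void *
zfs_path(void *arg)
{
	/* e.g. a rename inside ZFS: only the internal lock is held */
	pthread_mutex_lock(&zfs_name_lock);
	directory_entries++;	/* races with vfs_path(), nothing orders the two */
	pthread_mutex_unlock(&zfs_name_lock);
	return (NULL);
}

int
main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, vfs_path, NULL);
	pthread_create(&t2, NULL, zfs_path, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	printf("directory_entries = %d\n", directory_entries);
	return (0);
}
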

My solution to this problem was to completely eliminate (at least) the following 
ZFS locks:
	kmutex_t    z_lock;     /* znode modification lock */
	krwlock_t   z_parent_lock;  /* parent lock for directories */
	krwlock_t   z_name_lock;    /* "master" lock for dirent locks */
	zfs_dirlock_t   *z_dirlocks;    /* directory entry lock list */
and to ensure that the proper vnode locking protocol is followed.
That required substantial changes to the rename and lookup code.

Finally, this is not really a suggestion to test or discuss my changes, but 
rather a call to discuss the problem and other possible ways to fix it.
I do not preclude any options, including making changes to our VFS (and thus to 
all the filesystems) :-)

[*]	https://github.com/avg-I/freebsd/compare/wip/hc/zfs-fbsd-vfs
[**]	https://clusterhq.com/blog/complexity-freebsd-vfs-using-zfs-example-part-1-2/
	https://clusterhq.com/blog/complexity-freebsd-vfs-using-zfs-example-part-2/
-- 
Andriy Gapon