Re: dev_lock() contention for fdesc syscalls -- possible fix

From: Konstantin Belousov <kostikbel@gmail.com>
Date: Mon, 10 Nov 2014 10:34:57 +0200
On Mon, Nov 10, 2014 at 02:49:39AM +0100, Luigi Rizzo wrote:
> It was noticed that there is huge dev_lock() contention when multiple
> processes do a poll(), even on independent file descriptors.
> 
> It turns out that not just poll() but most syscalls on file descriptors
> (as opposed to sockets) in sys/fs/devfs/devfs_vnops.c, including
> devfs_poll_f(), devfs_ioctl_f() and read/write, share the problem,
> as they use the following pattern
> 
>     devfs_poll_f() {
>         ...
>         devfs_fp_check(fp, ...) -->
>             kern/kern_conf.c :: devvn_refthread(fp->f_vnode, ...) -->
>                 dev_lock();
>                 dev = vp->v_rdev; // lock on vp ?
>                 ... check that dev != NULL
>                 atomic_add_long(&dev->si_threadcount, 1);
>                 dev_unlock();
>         dsw->d_poll()
>         dev_relthread() -->
>             atomic_subtract_rel_long(&dev->si_threadcount, 1);
>    }
> 
> 
> I believe that dev_lock() in devvn_refthread() is protecting
>         dev = vp->v_rdev
> (the cdev entry 'dev' cannot go away for the reasons stated below).
> 
> However, looking at the places where vp->v_rdev is modified, it turns
> out that this only happens while holding VI_LOCK(vp) -- the places are
> devfs_allocv() and devfs_reclaim().
> There is one place in zfs where the vnode is created and v_rdev is
> set without holding a lock, but there nobody else can dereference it yet.
> 
> As a consequence I believe that if in devfs_fp_check() we replace
> dev_lock() / dev_unlock() with VI_LOCK(vp) / VI_UNLOCK(vp),
> we make sure that we can safely dereference vp->v_rdev, and the
> cdev cannot go away because the vnode has a reference to it.
> (The counter uses an atomic instruction, so the release is lockless.)
A vnode reference, like the cdev reference obtained by dev_ref(),
does not provide any protection there.  v_rdev is only
coincidentally protected by the vnode interlock.

If you look at a larger part of the code, you will note that the dev
mutex is held even after v_rdev is dereferenced.  The real protection
it provides is against a race with destroy_dev(l)(), which could
invalidate dev->si_devsw at any moment when neither a device thread
reference nor the dev mutex is held.  So your patch breaks device
destruction.
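The invariant described above can be modeled outside the kernel.  The
sketch below is hypothetical userspace code (a pthread mutex stands in
for dev_lock(), plain fields for si_devsw and si_threadcount); it shows
why the "devsw still valid?" check and the thread-count increment must
happen under the same lock that destruction takes before invalidating
the devsw:

```c
#include <pthread.h>
#include <stddef.h>

/*
 * Userspace model (not the kernel code) of the protocol followed by
 * devvn_refthread()/dev_relthread() against destroy_dev().
 */
struct model_cdev {
	pthread_mutex_t	lock;		/* stands in for dev_lock() */
	const void	*devsw;		/* stands in for si_devsw */
	long		threadcount;	/* stands in for si_threadcount */
};

/* Take a "device thread reference"; fails once destruction has begun. */
static int
model_refthread(struct model_cdev *dev)
{
	int ok;

	pthread_mutex_lock(&dev->lock);
	ok = (dev->devsw != NULL);
	if (ok)
		dev->threadcount++;	/* check and bump are one step */
	pthread_mutex_unlock(&dev->lock);
	return (ok);
}

/*
 * Drop the reference.  The real dev_relthread() uses a lock-free
 * atomic_subtract_rel_long(); the mutex here is a simplification.
 */
static void
model_relthread(struct model_cdev *dev)
{
	pthread_mutex_lock(&dev->lock);
	dev->threadcount--;
	pthread_mutex_unlock(&dev->lock);
}

/* Begin destruction: clear devsw so no new references can be taken. */
static void
model_destroy(struct model_cdev *dev)
{
	pthread_mutex_lock(&dev->lock);
	dev->devsw = NULL;
	pthread_mutex_unlock(&dev->lock);
	/* the real destroy_dev() then waits for threadcount to drain */
}
```

Swapping the mutex in model_refthread() for a per-vnode lock, as the
patch does, leaves model_destroy() racing against the check-and-bump,
which is exactly the breakage described above.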

> 
> This should be enough to remove the contention.
If you never call destroy_dev() for the devfs node, you could use the
MAKEDEV_ETERNAL flag for make_dev_credf(), which indicates that there
is no risk of a race with destroy_dev() for the created node.

Usually, code which can be compiled into the kernel statically but also
loaded as a module uses the MAKEDEV_ETERNAL_KLD flag, which takes care
of the fact that a module must still call destroy_dev() on unload.
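For concreteness, a minimal sketch of the make_dev_credf() usage
described above, for a driver whose node is never destroyed at runtime
(kernel-only code; the names example_cdevsw, example_dev and
example_attach are hypothetical, not from the thread):

```c
/* Hypothetical attach path; example_cdevsw is assumed to be defined. */
static struct cdev *example_dev;

static int
example_attach(void)
{
	/*
	 * MAKEDEV_ETERNAL_KLD expands to MAKEDEV_ETERNAL when the code
	 * is compiled statically into the kernel, and to 0 when built
	 * as a loadable module, since a module that can be unloaded
	 * must still call destroy_dev().
	 */
	example_dev = make_dev_credf(MAKEDEV_ETERNAL_KLD, &example_cdevsw,
	    0 /* unit */, NULL /* cred */, UID_ROOT, GID_WHEEL, 0600,
	    "example");
	return (0);
}
```

With the eternal flag set, dev_refthread() knows the devsw cannot be
invalidated underneath it, which is what removes the need for the
contended dev mutex on that path.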


> 
> Anyone care to review/test the patch below ?
Yes, there are people who care.

> (dev_refthread() still has the single dev_lock() contention;
> I don't know how to address that yet)
> 
> 	cheers
> 	luigi
> 
> Index: /home/luigi/FreeBSD/head/sys/kern/kern_conf.c
> ===================================================================
> --- /home/luigi/FreeBSD/head/sys/kern/kern_conf.c	(revision 273452)
> +++ /home/luigi/FreeBSD/head/sys/kern/kern_conf.c	(working copy)
> @@ -224,10 +224,11 @@
>  	}
>  
>  	csw = NULL;
> -	dev_lock();
> +	ASSERT_VI_UNLOCKED(vp, __func__);
> +	VI_LOCK(vp); // dev_lock();
>  	dev = vp->v_rdev;
>  	if (dev == NULL) {
> -		dev_unlock();
> +		VI_UNLOCK(vp); // dev_unlock();
>  		return (NULL);
>  	}
>  	cdp = cdev2priv(dev);
> @@ -236,7 +237,7 @@
>  		if (csw != NULL)
>  			atomic_add_long(&dev->si_threadcount, 1);
>  	}
> -	dev_unlock();
> +	VI_UNLOCK(vp); // dev_unlock();
>  	if (csw != NULL) {
>  		*devp = dev;
>  		*ref = 1;
> _______________________________________________
> freebsd-current@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"
Received on Mon Nov 10 2014 - 07:35:05 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:53 UTC