Re: dev_lock() contention for fdesc syscalls -- possible fix

From: Luigi Rizzo <rizzo_at_iet.unipi.it> Date: Mon, 10 Nov 2014 10:44:12 +0100 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:53 UTC

On Mon, Nov 10, 2014 at 10:34:57AM +0200, Konstantin Belousov wrote:
> On Mon, Nov 10, 2014 at 02:49:39AM +0100, Luigi Rizzo wrote:
> > It was noticed that there is huge dev_lock() contention when multiple
> > processes do a poll() even on independent file descriptors.
> > 
> > Turns out that not just poll but most syscalls on file descriptors
> > (as opposed to sockets) in sys/fs/devfs/devfs_vnops.c including
> > devfs_poll_f(), devfs_ioctl_f() and read/write share the problem
> > as they use the following pattern
> > 
> >     devfs_poll_f() {
> >         ...
> >         devfs_fp_check(fp, ...) -->
> >             kern/kern_conf.c :: devvn_refthread(fp->f_vnode, ...) -->
> >                 dev_lock();
> >                 dev = vp->v_rdev; // lock on vp ?
> >                 ... check that dev != NULL
> >                 atomic_add_long(&dev->si_threadcount, 1);
> >                 dev_unlock();
> >         dsw->d_poll()
> >         dev_relthread() -->
> >             atomic_subtract_rel_long(&dev->si_threadcount, 1);
> >    }
> > 
> > 
> > I believe that dev_lock() in devvn_refthread() is protecting
> >         dev = vp->v_rdev
> > (the cdev entry 'dev' cannot go away for the reasons stated below).
> > 
> > However looking at places where vp->v_rdev is modified, turns out
> > that it only happens when holding VI_LOCK(vp) -- the places are
> > devfs_allocv() and devfs_reclaim().
> > There is one place in zfs where the vnode is created and v_rdev
> > is set without holding a lock, so nobody can dereference it.
> > 
> > As a consequence i believe that if in devfs_fp_check() we replace
> > dev_lock() / dev_unlock() with VI_LOCK(vp) / VI_UNLOCK(vp),
> > we make sure that we can safely dereference vp->v_rdev, and the
> > cdev cannot go away because the vnode has a reference to it.
> > The counter uses an atomic instruction (so the release is lockless)
> Vnode reference, as well as cdev reference, which is obtained by
> dev_ref(), do not provide any protection there.  v_rdev is only
> coincidentally protected by the vnode interlock.
> 
> If you look at larger part of the code, you would note that dev mutex
> is held even after v_rdev is dereferenced.  The real protection it
> provides is against race with destroy_dev(l)(), which could invalidate
> dev->so_devsw at any moment when either device thread reference is
> not held, or dev mutex is not held.  So your patch breaks the
> device destruction.

I see. Thanks for the clarification.

Would it help to rewrite the part of devvn_refthread as follows:

    devvn_refthread() {
	// protect vp->v_rdev dereference and dev disappearing
	VI_LOCK(vp);
	dev = vi->v_rdev;
	.. check that dev != NULL
	// protect the race on dev->si_devsw
	atomic_add_long(&dev->si_threadcount, 1);
	VI_UNLOCK(vp);
	... appropriate memory barrier
	csw = dev->si_devsw;
	if (csw != NULL) {
		// we won the race, even if destroy_devl() runs
		// it will keep the cdevsw alive until si_threadcount goes to 0
		*devp = dev;
		*ref = 1;
	} else {
		// we lost
		*ref = 0;
	}
	return csw;
    }

i.e.  tentatively increment si_threadcount before dereferencing si_devsw
and then restoring it if we lose the race ?
It might be necessary to add a barrier in destroy_devl() between clearing
si_devsw and reading si_threadcount.

> > 
> > This should be enough to remove the contention.
> If you never calls destroy_dev() for the devfs node, you could use
> MAKEDEV_ETERNAL flag for make_dev_credf(), which indicates that there
> is no risk of race with destroy_dev() for the created node.
> 
> Usually, code which could be compiled in kernel statically but also
> loaded as module, use MAKEDEV_ETERNAL_KLD flag, which takes care of
> module needed to call destroy_dev().

that I suppose would mean that the module cannot be safely unloaded ?

cheers
luigi