Re: Panic in nfs_putpages() on 6-stable, more info.

From: Frank Mayhar <frank_at_exit.com>
Date: Sun, 15 Jan 2006 14:45:11 -0800

A bit more data and another question.

On Sun, 2006-01-15 at 12:40 -0800, Frank Mayhar wrote:
> In nfs_reclaim(), just before he calls vnode_destroy_vobject(), he
> zfrees and clears vp->v_data.  When, down in the guts of vm_object.c, he
> tries to flush the associated pages, v_data is already NULL so he goes
> boom.
> 
> Now, why does he do the zfree/clear before vnode_destroy_vobject()?  Is
> he assuming that there are no pages associated with this vnode that need
> to be flushed?  Should there be? I looked at some other file systems and
> they do the same thing.  The obvious fix is to move the zfree/clear to
> after the vnode_destroy_vobject() but if there should be no pages that
> need to be flushed on the vnode at this point, that would just hide the
> problem.

Looking further down, at vlrureclaim(), I see that the commentary for
vlrureclaim() specifically says that a flushed vnode may still have
backing store, so it appears that yes, there may be pages associated
with the vnode when he calls vgonel().  Between vgonel() and
nfs_reclaim() there's just VOP stuff, so the flushing has to be done
lower down.  The nfs_reclaim() routine itself just does some bookkeeping
and then calls vnode_destroy_vobject().  That routine can push pages
out, which means that if the backing store is on NFS, nfs_putpages() can
be called.  But that routine will fault because he'll try to use v_data
as an nfsnode.

The reason for my confusion is that of the filesystems in the tree, the
only one that doesn't zfree and clear v_data before calling
vnode_destroy_vobject() is UFS.  The commentary in ufs_reclaim() is
clear, though:

        /*
         * Destroy the vm object and flush associated pages.
         */
        vnode_destroy_vobject(vp);

Then later he takes VI_LOCK() and clears v_data.  (And [indirectly] does the
zfree only _after_ that, which is interesting but probably not
important.)

I'm going to go slightly out on a limb here and guess that the "flush
associated pages" thing came in relatively recently and the other
filesystems haven't caught up with it.  This implies that the proper fix
is to go through those other xxx_reclaim() routines and reorder the
operations.
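
To make the ordering question concrete, here is a small userspace sketch of
the two orderings described above.  It is not kernel code: every mock_* name
and both reclaim_* helpers are hypothetical stand-ins invented for
illustration.  It assumes only what is stated above, namely that destroying
the VM object may push pages out and that the putpages routine needs v_data
as its node.  Where the real kernel would dereference NULL and panic, the
mock returns -1.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace stand-ins only; these are NOT the kernel's definitions. */
struct mock_nfsnode {
	int n_dirty_pages;	/* pages still waiting to be written */
};

struct mock_vnode {
	void *v_data;		/* filesystem-private data */
	int   has_vobject;	/* nonzero while a VM object is attached */
};

/* Stand-in for nfs_putpages(): it needs v_data as its node. */
static int
mock_putpages(struct mock_vnode *vp)
{
	struct mock_nfsnode *np = vp->v_data;

	if (np == NULL)
		return (-1);	/* the real code would fault here */
	np->n_dirty_pages = 0;	/* pretend the pages were written */
	return (0);
}

/* Stand-in for vnode_destroy_vobject(): flush pages, then detach. */
static int
mock_destroy_vobject(struct mock_vnode *vp)
{
	int error = 0;

	if (vp->has_vobject) {
		error = mock_putpages(vp);
		vp->has_vobject = 0;
	}
	return (error);
}

/* Stand-in for the zfree-and-clear of v_data. */
static void
mock_free_vdata(struct mock_vnode *vp)
{
	free(vp->v_data);
	vp->v_data = NULL;
}

/* Ordering described for nfs_reclaim(): free v_data, then destroy. */
static int
reclaim_free_first(struct mock_vnode *vp)
{
	mock_free_vdata(vp);
	return (mock_destroy_vobject(vp));
}

/* Ordering described for ufs_reclaim(): destroy, then free v_data. */
static int
reclaim_flush_first(struct mock_vnode *vp)
{
	int error = mock_destroy_vobject(vp);

	mock_free_vdata(vp);
	return (error);
}

/* Build a mock vnode that still has one dirty page behind it. */
static struct mock_vnode
make_dirty_vnode(void)
{
	struct mock_vnode vn;
	struct mock_nfsnode *np = malloc(sizeof(*np));

	assert(np != NULL);
	np->n_dirty_pages = 1;
	vn.v_data = np;
	vn.has_vobject = 1;
	return (vn);
}
```

Calling reclaim_free_first() on a vnode from make_dirty_vnode() reaches the
putpages stand-in with v_data already NULL, while reclaim_flush_first()
flushes first and only then clears v_data.  Whether the real xxx_reclaim()
routines can be reordered this simply is exactly the question being asked.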

That's easy enough to do, but I would like to make sure that my
understanding of this (and my guess) is correct and that I'm not wasting
my time.

Thanks!
-- 
Frank Mayhar frank_at_exit.com     http://www.exit.com/
Exit Consulting                 http://www.gpsclock.com/
                                http://www.exit.com/blog/frank/