Re: rpc.lockd spinning; much breakage

From: Don Lewis <truckman_at_FreeBSD.org> Date: Tue, 13 May 2003 00:00:59 -0700 (PDT) · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:37:07 UTC

On 12 May, Robert Watson wrote:

> Ok, well, I seem to be running into several classes of problems with
> rpc.lockd:
> 
> (1) When O_NONBLOCK is passed to open() along with O_EXLOCK, EAGAIN is
>     normally returned if the lock is contended.  However, when using
>     rpc.lockd, I get back EACCES instead.  You can reproduce this by
>     trying to acquire the same lock twice using lockf(1) on the same
>     client, or different clients, or the server and the client (...).  The
>     local test I use is:
> 
>        server% lockf -k test sleep 20
> 
>        client% lockf -k test echo hi
>        lockf: cannot open test: Permission denied
> 
>     Normally, lockf(1) gets back EAGAIN and polls for the lock to be
>     released.  I'm not sure which case in the client rpc.lockd(8) it is.
>     Since Solaris doesn't support O_EXLOCK and lockf(1) it wasn't easy for
>     me to test, but since the server returns the same result from FreeBSD
>     and Solaris, I'm guessing it's a client-side mapping problem.
>     Presumably some instance of nlm_denied should return EAGAIN instead.

I think this gets an nlm4_denied response, which is handled by the
following code in lock_answer():

                case nlm4_denied:
                        if (pid_p == NULL)
                                ans.la_errno = EACCES;
                        else {
                                /* this is an answer to a nlm_test msg */
                                ans.la_set_getlk_pid = 1;
                                ans.la_getlk_pid = *pid_p;
                                ans.la_errno = 0;
                        }
                        break; 

It looks like this should work, because I think in the case of
O_NONBLOCK, the caller should be passing pid_p != NULL.  The only way to
be sure is to turn on client-side logging to find out which RPC is
calling lock_answer().

> (2) When O_NONBLOCK is not passed to open() along with O_EXLOCK, and a
>     lock is contended, wakeups are not properly delivered from the
>     client-side rpc.lockd to the process.  I'm not sure of the exact
>     sequence of events, but in this test I ran a locktest client on two
>     FreeBSD 5.x nodes with rpc.lockd, and a Solaris server.  There was
>     almost a 15-second lag from the release of the lock on one client and
>     notification of the other client before the process was woken up. 
>     Here's the rpc.lockd debugging log on the client showing the lag: 
> 
> <server acknowledges lock request to cboss client>
>   May 12 22:43:59 cboss rpc.lockd: nlm4_lock_res from raven.rv.nailabs.com
> 
> <server notifies client lock has been released by other client and
>  granted>
>   May 12 22:44:07 cboss rpc.lockd: nlm4_granted_msg from raven.rv.nailabs.com
> 
> <server sends another message to the client>
>   May 12 22:44:19 cboss rpc.lockd: nlm4_lock_res from raven.rv.nailabs.com
> 
>    The waiting process is woken up at 22:44:19 instead of at 22:44:07, so
>    there's a 12 second lag before the process is woken up, even though a
>    first glance suggests that the client-side rpc.lockd got a notification
>    substantially early.

Are you sure the client isn't waking up because it polled the server
again and got the lock on the second attempt?  Does the debug log on the
client show anything?

Hmn, nlm4_granted_4_svc() calls lock_answer(), but
nlm4_granted_msg_4_svc() (and nlm_granted_msg_1_svc()) don't ...

> (3) Sometimes rpc.lockd on 5.x acting as a server gets really confused
>     when you mix local and remote locks.  I haven't quite figured out the
>     circumstances, but occasionally I run into a situation where a client
>     contends against an existing lock on the server, and the client never
>     receives a notification from the server that the lock has been
>     released.  It looks like the server stores state that the lock is
>     contended, but perhaps never properly re-polls the kernel to see if
>     the lock has been locally re-released:
> 
> May 12 22:53:35 cboss rpc.lockd: nlm_lock_msg from crash1.cboss.tislabs.com
> May 12 22:53:35 cboss rpc.lockd: Entering getlock... 
> May 12 22:53:35 cboss rpc.lockd: recieved fhandle size 32, local size 28
> May 12 22:53:35 cboss rpc.lockd: Pointer to new lock is 0x8090800 
> May 12 22:53:35 cboss rpc.lockd: Entering do_lock... 
> May 12 22:53:35 cboss rpc.lockd: Entering lock_partialfilelock 
> May 12 22:53:35 cboss rpc.lockd: Entering lock_nfslock... 
> May 12 22:53:35 cboss rpc.lockd: Entering test_nfslock 
> May 12 22:53:35 cboss rpc.lockd: Entering lock search loop 
> May 12 22:53:35 cboss rpc.lockd: *********************************** 
> May 12 22:53:35 cboss rpc.lockd: Dumping match filelock 
> May 12 22:53:35 cboss rpc.lockd: *********************************** 
> May 12 22:53:35 cboss rpc.lockd: *********************************** 
> May 12 22:53:35 cboss rpc.lockd: Top of lock loop 
> May 12 22:53:35 cboss rpc.lockd: Pointer to file lock: 0x8072800 
> May 12 22:53:35 cboss rpc.lockd: *********************************** 
> May 12 22:53:35 cboss rpc.lockd: Dumping test filelock 
> May 12 22:53:35 cboss rpc.lockd: *********************************** 
> May 12 22:53:35 cboss rpc.lockd: *********************************** 
> May 12 22:53:35 cboss rpc.lockd: test_nfslock: filehandle match found 
> May 12 22:53:35 cboss rpc.lockd: Entering region overlap with vals:
> 0:0--0:0 
> May 12 22:53:35 cboss rpc.lockd: Exiting region overlap with val: 4 
> May 12 22:53:35 cboss rpc.lockd: test_nfslock: Region overlap found 0 : 0
> -- 0 : 0 
> May 12 22:53:35 cboss rpc.lockd: test_nfslock: Exclusivity failure: 1 1 
> May 12 22:53:35 cboss rpc.lockd: Checking filelock identity 
> May 12 22:53:35 cboss rpc.lockd: Exiting checking filelock identity:
> retval: 0 
> May 12 22:53:35 cboss rpc.lockd: test_nfslock: Lock attempt failed 
> May 12 22:53:35 cboss rpc.lockd: Desired lock 
> May 12 22:53:35 cboss rpc.lockd: Conflicting lock 
> May 12 22:53:35 cboss rpc.lockd: Dumping file locks 
> May 12 22:53:35 cboss rpc.lockd: Exiting test_nfslock 
> May 12 22:53:35 cboss rpc.lockd: Exiting lock_nfslock... 
> May 12 22:53:35 cboss rpc.lockd: BLOCKING LOCK RECEIVED 
> May 12 22:53:35 cboss rpc.lockd: Entering add_blockingfilelock 
> May 12 22:53:35 cboss rpc.lockd: Exiting add_blockingfilelock 
> May 12 22:53:35 cboss rpc.lockd: Exiting lock_partialfilelock 
> May 12 22:53:35 cboss rpc.lockd: PFL_NFS blocking lock denied.  Queued. 
> May 12 22:53:35 cboss rpc.lockd: Exiting do_lock... 
> May 12 22:53:35 cboss rpc.lockd: Pointer to new lock is 0x8090800 
> May 12 22:53:35 cboss rpc.lockd: Exiting getlock... 
> 
>    As a result, the client is never granted the lock by the server, and
>    waits indefinitely.  Once in a while, once in this state, rpc.lockd
>    starts to spin on the server, but I'm still trying to figure out how to
>    reproduce that.

It looks like this isn't implemented ...

/*
 * lock_partialfilelock:
 *
 * Argument fl gets modified as its list housekeeping entries get modified
 * upon insertion into the NFS lock list
 *
 * This routine makes several assumptions:
 * 1) It (will) pass locks through to flock to lock the entire underlying file
 *     and then parcel out NFS locks if it gets control of the file.
 *         This matches the old rpc.lockd file semantics (except where it
 *         is now more correct).  It is the safe solution, but will cause
 *         overly restrictive blocking if someone is trying to use the
 *         underlying files without using NFS.  This appears to be an
 *         acceptable tradeoff since most people use standalone NFS servers.
 * XXX: The right solution is probably kevent combined with fcntl
 *
 *    2) Nothing modifies the lock lists between testing and granting
 *           I have no idea whether this is a useful assumption or not
 */

and

        siglock();
        debuglog("Pointer to new lock is %p\n",newfl);

        retval = do_lock(newfl);

        debuglog("Pointer to new lock is %p\n",newfl);
        sigunlock();

        switch (retval)
                {
                case nlm4_granted:
                        /* case nlm_granted: is the same as nlm4_granted */
                        /* do_mon(lckarg->alock.caller_name); */
                        break;
                case nlm4_blocked:
                        /* case nlm_blocked: is the same as nlm4_blocked */
                        /* do_mon(lckarg->alock.caller_name); */
                        break;
                default:
                        deallocate_file_lock(newfl);
                        break;
                }

        debuglog("Exiting getlock...\n");