Re: rpc.lockd spinning; much breakage

From: Don Lewis <truckman_at_FreeBSD.org> Date: Tue, 13 May 2003 14:11:36 -0700 (PDT)

On 13 May, Robert Watson wrote:
> 
> On Tue, 13 May 2003, Robert Watson wrote:
> 
>> So the client isn't retrying, or mapping errors right after this patch,
>> but the failure modes are more consistent and I seem not to be getting
>> any interminable hangs anymore on the client. 
> 
> I should clarify this statement: I no longer get the odd hangs when it
> comes to client and server interactions when contending a lock established
> on the server and now tested by the client.  I still bump into the "client
> isn't woken up in a timely manner after a lock is released by the same or
> another client".  Here's the demonstration case with a bit more detail
> from what I presented earlier.  The server runs on host cboss, the client
> runs twice on host crash1 on different pty's.  In this scenario, each
> client attempts to grab an exclusive lock, potentially blocking, and then
> sleep for 10 seconds (this is with one of the earlier posted patches):

Try adding the lock_answer() calls I suggested in an earlier message ...

> crash1:/tmp> ./locktest nocreate openlock block noflock test 10
> 933  open(test, 32, 0666)               Tue May 13 14:31:31 2003
> 933  open() returns                     Tue May 13 14:31:31 2003
> 933  sleep(10)                          Tue May 13 14:31:31 2003
> 933  sleep() returns                    Tue May 13 14:31:41 2003
> 
> crash1:/tmp> ./locktest nocreate openlock block noflock test 0
> 934  open(test, 32, 0666)               Tue May 13 14:31:33 2003
> 934  open() returns                     Tue May 13 14:31:53 2003
> 
> rpc.lockd results on crash1:
> 
> May 13 14:31:31 crash1 rpc.lockd: nlm_lock_res from 192.168.50.1
> May 13 14:31:33 crash1 rpc.lockd: nlm_lock_res from 192.168.50.1
> May 13 14:31:42 crash1 rpc.lockd: nlm_granted_msg from 192.168.50.1
> May 13 14:31:42 crash1 rpc.lockd: nlm_unlock_res from 192.168.50.1
> May 13 14:31:42 crash1 rpc.lockd: process 933: No such process
> May 13 14:31:53 crash1 rpc.lockd: nlm_lock_res from 192.168.50.1
> 
> In this example, pid 934 requests the lock on the object at 14:31:33 --
> pid 933 released that lock at 14:31:41, but the pid 934 isn't notified
> until 14:31:53.  It looks like it should have been notified at 14:31:42
> when a granted message is received, but instead it is notified when the
> client rpc.lockd polls again 10 seconds from lock inception.  I almost
> wonder if that ESRCH shouldn't have been the notification for 934 and it
> was using the wrong pid. 

Just looking at the order of the messages, I don't think so.  The nlm_*
messages appear to be printed at the beginning of the RPC handler.  If
the lock is being released because the process exited and closed the
file descriptor, then by the time the server is notified and the client
rpc.lockd gets the response from the server, the process that originally
grabbed the lock is gone.  I don't know why rpc.lockd wants to tell the
process that it successfully dropped the lock, though ...
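
To illustrate the "No such process" line: that text is the standard message
for ESRCH, which is what you get when you try to signal a pid that has
already exited and been reaped.  Below is a minimal sketch of that behaviour,
assuming the log entry comes from a failed attempt to notify pid 933.  It is
written in Python for brevity, not taken from rpc.lockd, and the function
name signal_exited_process is made up for the example.

```python
import errno
import os
import subprocess
import sys


def signal_exited_process():
    """Return the errno seen when signalling a child that has already exited.

    Hypothetical helper for illustration only; it mimics a daemon trying to
    notify a process that went away before the notification arrived.
    """
    # Start a short-lived child and wait for it, so its pid is fully reaped.
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()
    try:
        # Signal 0 delivers nothing; it only checks that the pid exists.
        os.kill(child.pid, 0)
    except OSError as exc:
        return exc.errno
    return 0


if __name__ == "__main__":
    err = signal_exited_process()
    print(errno.errorcode.get(err, err), os.strerror(err))
```

If the unlock response for 933 only arrives after 933 has exited, a
notification aimed at 933 would fail the same way, which fits the ordering
of the log messages above.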