NFS over TCP patch testing/review, please!!

From: Rick Macklem <rmacklem_at_uoguelph.ca> Date: Thu, 29 Oct 2009 16:10:52 -0400 (EDT)

I think the following patch fixes the problem reported by O. Seibert
w.r.t. NFS over TCP taking 5 minutes to reconnect to a server after a
period of inactivity. (I think others have been bitten by this too,
but their reports of trouble with NFS over TCP were vague.) I didn't
see the problem myself, because I was mainly testing against a FreeBSD
server and/or using NFSv4. (NFSv4 does a Renew every 30 seconds, so the
TCP connection isn't inactive for long enough for a Solaris server to
disconnect it.)

clnt_vc_call() in sys/rpc/clnt_vc.c checks for the server closing
down the connection while the RPC is in progress, but doesn't check
to see if it has already happened. If it has already happened, there
would be no upcall to prompt a wakeup of the msleep() waiting for a
reply, etc. This patch adds a check for the connection being closed
by the server, just before queuing the request and sending it.
(I think this fixes the problem.)
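
For anyone who wants to see the shape of the check without reading the
kernel source, here is a minimal user-space sketch in C. It is not the
kernel code: the model_* names, the struct layouts and the pending
counter are all made up for illustration. Only the order of operations
(look for an already recorded close, copy the error out, skip the
queueing) follows the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes, standing in for the kernel's enum clnt_stat. */
enum model_stat { MODEL_SUCCESS, MODEL_CANTRECV };

/* Hypothetical error record, standing in for struct rpc_err. */
struct model_err {
	enum model_stat re_status;
	int re_errno;
};

/* Hypothetical connection state; only the fields the check needs. */
struct model_conn {
	struct model_err ct_error;	/* filled in when the peer closes */
	size_t pending;			/* number of queued requests */
};

/* Models the socket upcall recording that the peer closed the connection. */
static void
model_peer_closed(struct model_conn *ct, int err)
{
	ct->ct_error.re_status = MODEL_CANTRECV;
	ct->ct_error.re_errno = err;
}

/*
 * Models the start of a call. If the close was recorded before the
 * request is queued, fail right away instead of queueing a request
 * that nothing would ever wake up. Otherwise queue it as before.
 */
static enum model_stat
model_call(struct model_conn *ct, struct model_err *errp)
{
	assert(ct != NULL && errp != NULL);
	if (ct->ct_error.re_status == MODEL_CANTRECV) {
		/* Copy the error out unless the caller passed ct_error itself. */
		if (errp != &ct->ct_error) {
			errp->re_errno = ct->ct_error.re_errno;
			errp->re_status = MODEL_CANTRECV;
		}
		return (MODEL_CANTRECV);
	}
	ct->pending++;
	return (MODEL_SUCCESS);
}
```

In this model, a call made after model_peer_closed() returns
MODEL_CANTRECV and leaves the pending count alone, so the caller can
go and reconnect instead of sleeping on a reply that will never come.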

What I really need is some people to test NFS over TCP with the
patch applied to their kernel. It doesn't matter if you aren't
seeing the problem (i.e. you are using a FreeBSD server), since I am
more concerned with the patch breaking something else than with it
fixing the problem. (This seems serious enough that I'd like to try
to get a fix into 8.0, which is why I'm hoping some folks can test
this quickly.)

Thanks in advance for help with this, rick
--- patch for sys/rpc/clnt_vc.c ---
--- rpc/clnt_vc.c.sav	2009-10-28 15:44:20.000000000 -0400
+++ rpc/clnt_vc.c	2009-10-29 15:40:37.000000000 -0400
@@ -413,6 +413,22 @@

  	cr->cr_xid = xid;
  	mtx_lock(&ct->ct_lock);
+	/*
+	 * Check to see if the other end has already started to close down
+	 * the connection. The upcall will have set ct_error.re_status
+	 * to RPC_CANTRECV if this is the case.
+	 * If the other end starts to close down the connection after this
+	 * point, it will be detected later when cr_error is checked,
+	 * since the request is in the ct_pending queue.
+	 */
+	if (ct->ct_error.re_status == RPC_CANTRECV) {
+		if (errp != &ct->ct_error) {
+			errp->re_errno = ct->ct_error.re_errno;
+			errp->re_status = RPC_CANTRECV;
+		}
+		stat = RPC_CANTRECV;
+		goto out;
+	}
  	TAILQ_INSERT_TAIL(&ct->ct_pending, cr, cr_link);
  	mtx_unlock(&ct->ct_lock);