Hi Folks,

I have a 9-prerelease system where I've been testing NFS over ZFS. The system had been working quite well until I moved the server to a multihomed configuration. Given the following:

    nfsd: master (nfsd)
    nfsd: server (nfsd)
    /usr/sbin/rpcbind -h 10.24.6.38 -h 172.1.1.2 -h 172.21.201.1 -h 172.21.202.1 -h 172.21.203.1 -h 172.21.204.1 -h 172.21.205.1 -h 10.24.6.34 -h 10.24.6.33
    /usr/sbin/mountd -r -l -h 10.24.6.38 -h 172.1.1.2 -h 172.21.201.1 -h 172.21.202.1 -h 172.21.203.1 -h 172.21.204.1 -h 172.21.205.1 -h 10.24.6.34 -h 10.24.6.33
    /usr/sbin/rpc.statd -h 10.24.6.38 -h 172.1.1.2 -h 172.21.201.1 -h 172.21.202.1 -h 172.21.203.1 -h 172.21.204.1 -h 172.21.205.1 -h 10.24.6.34 -h 10.24.6.33
    /usr/sbin/rpc.lockd -h 10.24.6.38 -h 172.1.1.2 -h 172.21.201.1 -h 172.21.202.1 -h 172.21.203.1 -h 172.21.204.1 -h 172.21.205.1 -h 10.24.6.34 -h 10.24.6.33

10.24.6.38 is the default interface on 1G. The 172 nets are 10G links connected to compute systems.

    ifconfig_bce0='inet 10.24.6.38 netmask 255.255.0.0 -rxcsum -txcsum'   _c='physical addr which never changes'
    ifconfig_bce1='inet 172.1.1.2 netmask 255.255.255.0'                  _c='physical addr on crossover cable'
    ifconfig_cxgb2='inet 172.21.21.129 netmask 255.255.255.0'             _c='physical backside 10g compute net'
    ifconfig_cxgb3='inet 172.21.201.1 netmask 255.255.255.0 mtu 9000'     _c='physical backside 10g compute net'
    ifconfig_cxgb6='inet 172.21.202.1 netmask 255.255.255.0 mtu 9000'     _c='physical backside 10g compute net'
    ifconfig_cxgb8='inet 172.21.203.1 netmask 255.255.255.0 mtu 9000'     _c='physical backside 10g compute net'
    ifconfig_cxgb4='inet 172.21.204.1 netmask 255.255.255.0 mtu 9000'     _c='physical backside 10g compute net'
    ifconfig_cxgb0='inet 172.21.205.1 netmask 255.255.255.0 mtu 9000'     _c='physical backside 10g compute net'

10.24.6.34 and 10.24.6.33 are alias addresses for the system.

    Destination        Gateway            Flags    Refs      Use  Netif Expire
    default            10.24.0.1          UGS         0     1049   bce0

The server works correctly (and quite well) for both UDP & TCP mounts.
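One multihoming detail that may be worth checking against the captures: the source address a multihomed host uses for outbound traffic is chosen by the routing table, not necessarily the address a request arrived on, so a daemon can reply from an address the client doesn't expect. This little Python sketch (the helper name is mine, not anything from the system above) asks the kernel which local address it would pick for a given destination, without sending any packets:

```python
import socket

def local_addr_for(dest, port=111):
    """Return the local address the kernel would use to reach dest.

    connect() on a UDP socket transmits nothing; it only binds the
    socket to the route's preferred source address, which
    getsockname() then reveals.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((dest, port))
        return s.getsockname()[0]
    finally:
        s.close()

# e.g. run this on the server with a client's 172.21.x address as dest
# and compare the result with the address seen in the pcap.
print(local_addr_for("127.0.0.1"))
```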
Basically, all NFS traffic is great! However, locking only works for clients connected to the 10.24.6.38 interface. Tcpdump captures from good & bad runs:

    http://www.freebsd.org/~jwd/lockgood.pcap
    http://www.freebsd.org/~jwd/lockbad.pcap

Basically, the clients (both FreeBSD & Linux) query the server's rpcbind for the address of the NLM, which is returned correctly. In the good run, the NLM is then called. In the bad run, it is not.

I've started digging through the code, but I do not claim to be an RPC expert. If anyone has suggestions, I would appreciate any pointers.

Thanks!
John

Received on Tue Dec 13 2011 - 01:46:36 UTC
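For anyone comparing the two pcaps byte-for-byte: the lookup the clients perform is an ordinary ONC RPC v2 PMAPPROC_GETPORT call against rpcbind (RFC 1833, version 2 protocol). A minimal sketch of that request for nlockmgr, built from the well-known constants in the spec (the function name and XID value are illustrative, not taken from the captures):

```python
import struct

# Well-known ONC RPC program numbers (RFC 1833 / rpc(5)).
PMAP_PROG, PMAP_VERS = 100000, 2   # the portmapper itself
NLM_PROG = 100021                  # nlockmgr, the NLM
PMAPPROC_GETPORT = 3
IPPROTO_UDP = 17

def build_getport_call(xid, prog, vers, proto):
    """Serialize a PMAPPROC_GETPORT request in XDR.

    Layout: RPC call header (xid, CALL=0, rpcvers=2, prog, vers, proc),
    AUTH_NULL credential and verifier (flavor 0, length 0 each), then
    the pmap arguments: prog, vers, proto, port (port is ignored in a call).
    """
    header = struct.pack(">6I", xid, 0, 2, PMAP_PROG, PMAP_VERS, PMAPPROC_GETPORT)
    auth   = struct.pack(">4I", 0, 0, 0, 0)
    args   = struct.pack(">4I", prog, vers, proto, 0)
    return header + auth + args

# NLM version 4 is what an NFSv3 client would ask for.
msg = build_getport_call(0x20111213, NLM_PROG, 4, IPPROTO_UDP)
assert len(msg) == 56  # fixed-size request: 14 big-endian 32-bit words
```

Matching these fields against the GETPORT frames in lockgood.pcap and lockbad.pcap should confirm whether the two requests really are identical up to the point where the bad client gives up.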
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:21 UTC