Re: post ino64: lockd no runs?

From: Rodney W. Grimes <freebsd-rwg_at_pdx.rh.CN85.dnsmgr.net>
Date: Mon, 12 Jun 2017 17:45:12 -0700 (PDT)

> On Mon, Jun 12, 2017 at 10:14 AM, John Baldwin <jhb_at_freebsd.org> wrote:
> > On Sunday, June 11, 2017 11:12:25 AM David Wolfskill wrote:
> >> On Sun, Jun 04, 2017 at 08:57:44AM -0400, Michael Butler wrote:
> >> > It seems that {rpc.}lockd no longer runs after the ino64 changes on any
> >> > of my systems after a full rebuild of src and ports. No log entries
> >> > offer any insight as to why :-(
> >> >
> >> >     imb
> >>
> >> I don't tend to use NFS on my systems that are running head, so I
> >> haven't had occasion to test this as stated.
> >>
> >> However, I just completed my weekly update of the "production" systems
> >> here at home, running stable/11.  And I find that lockd seems to be ...
> >> claiming that all is well, but declining to run (for long).
> >>
> >> To the best of my knowledge, that was not the case until this last
> >> update, which was from:
> >>
> >> FreeBSD albert.catwhisker.org 11.1-PRERELEASE FreeBSD 11.1-PRERELEASE #316  r319566M/319569:1100514: Sun Jun  4 03:54:41 PDT 2017     root_at_freebeast.catwhisker.org:/common/S1/obj/usr/src/sys/ALBERT  amd64
> >>
> >> to
> >>
> >> FreeBSD albert.catwhisker.org 11.1-BETA1 FreeBSD 11.1-BETA1 #322  r319823M/319823:1100514: Sun Jun 11 03:56:10 PDT 2017     root_at_freebeast.catwhisker.org:/common/S1/obj/usr/src/sys/ALBERT  amd64
> >>
> >> The "glaringly obvious" symptom in my case is that I am now unable
> >> to (directly) save an email message from within mutt(1) by appending
> >> it to an NFS-resident file.  (Saving it to a local file, then using
> >> cat(1) to append that to the NFS-resident file & removing the local
> >> copy works....)
> >>
> >> After a few variations on a theme of:
> >>
> >> albert(11.1)[5] sudo service lockd restart
> >> lockd not running?
> >> Starting lockd.
> >> albert(11.1)[6] echo $?
> >> 0
> >> albert(11.1)[7] service lockd status
> >> lockd is not running.
> >>
> >> I finally(!) thought to ask ktrace what's going on (as tailing
> >> /var/log/messages was completely unproductive, even after enabling
> >> rc_debug).
> >>
> >> So I tried: "sudo ktrace -di service lockd restart"; upon examination of
> >> the output of kdump(1), I see that the trace ends with:
> >>
> >>   ...
> >>   2811 rpc.lockd NAMI  "/var/run/logpriv"
> >>   2786 sh       CALL  read(0xa,0x627fc0,0x400)
> >>   2786 sh       GIO   fd 10 read 0 bytes
> >>        ""
> >>   2811 rpc.lockd RET   connect 0
> >>   2786 sh       RET   read 0
> >>   2811 rpc.lockd CALL  sendto(0x3,0x7fffffffe2c0,0x27,0,0,0)
> >>   2786 sh       CALL  exit(0)
> >>   2811 rpc.lockd GIO   fd 3 wrote 39 bytes
> >>        "<30>Jun 11 15:43:10 rpc.lockd: Starting"
> >>   2811 rpc.lockd RET   sendto 39/0x27
> >>   2811 rpc.lockd CALL  sigaction(SIGALRM,0x7fffffffec20,0)
> >>   2811 rpc.lockd RET   sigaction 0
> >>   2811 rpc.lockd CALL  nlm_syscall(0,0x1e,0x4,0x801015040)
> >>   2811 rpc.lockd RET   nlm_syscall -1 errno 14 Bad address
> >
> > This is a really good clue.  nlm_syscall is dying with EFAULT.  The last
> > argument is a pointer to an array of char * pointers, and the only way
> > I can see it dying is if it fails to copyin() one of the strings pointed
> > to by those pointers.  You could try running rpc.lockd under gdb from
> > ports and setting a breakpoint on 'nlm_syscall' and then printing out
> > 'addr_count' and 'p addrs@(addr_count * 2)'.
> 
> Yes, I found that the kernel was trying to copyin() from NULL, and
> then found that the NULL corresponds to 'uaddr'.  After some tracing I
> found that the tightened check in taddr2uaddr now (correctly) enforces
> the buffer length passed by the caller.  That length has been set
> incorrectly for ~9 years (since r177633, which sets the size to
> sizeof(pointer)), but nobody noticed because nothing checked it.  The
> solution seems to be to set the length values to the allocated size,
> and that has fixed the issue for me.
> 
> The code could use some cleanups and I plan to do it at some later time.
> 
> > Unfortunately I'm not able to reproduce the failure on a test machine
> > I have running head post-ino64.
> 
> This should have been fixed by r319852 in -HEAD (
> https://svnweb.freebsd.org/base?view=revision&revision=319852 ), and
> I'll MFC the change after a 3-day settle period, assuming there are no
> objections, as this is a regression.
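
The length bug described in the quoted text (a buffer length set to
sizeof(pointer) instead of the allocated size, which only fails once the
callee starts checking it) can be sketched roughly as follows.  This is
an illustrative stand-alone example, not the rpc.lockd or libc code:
format_addr(), ADDR_BUF_SIZE and the demo_* functions are hypothetical
names, and the sample address is made up.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical allocation size, chosen only for this sketch. */
enum { ADDR_BUF_SIZE = 64 };

/*
 * Hypothetical stand-in for a formatter that enforces the length the
 * caller declares.  Returns 0 on success, -1 if buflen is too small.
 */
static int
format_addr(char *buf, size_t buflen, const char *host, int port)
{
	/* Measure first, then refuse to write if it would not fit. */
	int n = snprintf(NULL, 0, "%s.%d.%d", host, port >> 8, port & 0xff);

	if (n < 0 || (size_t)n >= buflen)
		return (-1);
	snprintf(buf, buflen, "%s.%d.%d", host, port >> 8, port & 0xff);
	return (0);
}

/* Buggy pattern: sizeof(buf) is the size of the pointer, not the buffer. */
int
demo_wrong_length(void)
{
	char *buf = malloc(ADDR_BUF_SIZE);
	int rc;

	if (buf == NULL)
		return (-2);
	rc = format_addr(buf, sizeof(buf), "192.168.1.10", 2049);
	free(buf);
	return (rc);
}

/* Fixed pattern: pass the size that was actually allocated. */
int
demo_correct_length(void)
{
	char *buf = malloc(ADDR_BUF_SIZE);
	int rc;

	if (buf == NULL)
		return (-2);
	rc = format_addr(buf, ADDR_BUF_SIZE, "192.168.1.10", 2049);
	free(buf);
	return (rc);
}
```

With a callee that ignores the length, both variants appear to work,
which is consistent with the bug going unnoticed for years; once the
callee checks, only the second variant succeeds.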

(RE hat on)
The next 11.1 release builds start on the 16th.  Please try to send
your RFA to RE and complete the merge before that date; I would really
hate to have 11.1 go out without this fixed.

-- 
Rod Grimes                                                 rgrimes_at_freebsd.org