Re: NFS issues since upgrading to 13-RELEASE

From: Scheffenegger, Richard <Richard.Scheffenegger_at_netapp.com>
Date: Thu, 15 Apr 2021 21:14:32 +0000

FWIW:

r367492 fixes an issue around "premature" transmission of an ACK, caused by the incoming segment having been only partially processed at that point. This is related to in-kernel TCP consumers that use socket upcalls.

Rick mentioned that the NFS server (one in-kernel TCP user) has stringent requirements on the state of the socket during the upcall. D29690 therefore retains the lock on the socket buffer until TCP processing is finalized, so that the upcall can be done without any risk of transmitting outdated information back to the other end.

However, I have no proper way to verify/validate this interaction.

My ask would be to test the behavior with D29690 first. If similar hangs keep recurring, then revert r367492 (which will also mean more severe surgery on the TCP processing flow).

Thanks.

Richard Scheffenegger

-----Original Message-----
From: Rick Macklem <rmacklem_at_uoguelph.ca>
Sent: Thursday, April 15, 2021 23:05
To: Allan Jude <allanjude_at_freebsd.org>; freebsd-current_at_freebsd.org
Cc: Richard Scheffenegger <rscheff_at_FreeBSD.org>; Juraj Lutter <otis_at_FreeBSD.org>
Subject: Re: NFS issues since upgrading to 13-RELEASE

I wrote:
[stuff snipped]
>- Alternately you can try rscheff_at_'s alternate proposed patch that is 
>at
>  https://reviews.freebsd.og/D29690.
Oops, that's
    https://reviews.freebsd.org/D29690

rick

  I have not yet had time to test this one, but since I cannot reproduce the hang, I can
  only do testing of it to see that it is "no worse" than reverting r367492 for my
  setup.

Please let us know which you choose and whether or not it fixes your problem.

>> Any pointers for troubleshooting this? I've been looking through vmstat, gstat, top, etc. when the problem occurs, but I haven't been able to pinpoint the issue. I can get pcap, but it would be from the hosts, because I don't have a 10G tap or managed switch.
>>
>
>run `nfsstat -d 1` and try to capture a few lines from before, during, 
>and after the stall, and that may provide some insight.
>
>Specifically, does the queue length grow, suggesting it is waiting on 
>the I/O subsystem, or does it just stop getting traffic altogether?

If the revert of r367492 does not fix the problem, monitor the TCP connection(s) via `netstat -a` and, if possible, capture packets on the server with
    tcpdump -s 0 -w hang.pcap host <nfs-client>
or similar.

Ideally the tcpdump would be started before the "hang" occurs, but running one while the hang is occurring (until after it recovers) could also be useful.
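
For anyone collecting the data requested above, here is a minimal sketch (not a tested tool) of how the capture and the connection monitoring could be run together on the server. It reuses the tcpdump invocation from this message. The function name watch_nfs_tcp, the netstat.log file name, the 5 second interval, and the extra -n flag on netstat are assumptions added for illustration.

```sh
#!/bin/sh
# Sketch only; run as root on the NFS server.
# watch_nfs_tcp, netstat.log and the 5 s interval are illustrative
# assumptions, not part of the original advice.

watch_nfs_tcp() {
    client=$1                 # NFS client host name or address
    pcap=${2:-hang.pcap}      # capture file name, as in the command above
    log=${3:-netstat.log}     # hypothetical log file for netstat snapshots

    # Full-size packet capture of traffic to/from the client, in the background.
    tcpdump -s 0 -w "$pcap" host "$client" &
    tcpdump_pid=$!

    # On Ctrl-C or SIGTERM, stop tcpdump so it can flush the capture file.
    trap 'kill "$tcpdump_pid" 2>/dev/null; exit 0' INT TERM

    # Append a timestamped snapshot of socket state every 5 seconds
    # (-n, numeric addresses and ports, is an assumption added here).
    while :; do
        date >> "$log"
        netstat -an >> "$log"
        sleep 5
    done
}
```

Calling `watch_nfs_tcp <nfs-client>` before the hang and interrupting it after the connection recovers should leave a pcap plus a netstat log that cover the whole stall.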

Thanks for reporting this, rick

--
Allan Jude