Konstantin Belousov wrote:
On Mon, Jun 25, 2018 at 02:04:32AM +0000, Rick Macklem wrote:
> Konstantin Belousov wrote:
> >On Sat, Jun 23, 2018 at 09:03:02PM +0000, Rick Macklem wrote:
> >> During testing of the pNFS server I have been frequently killing/restarting the nfsd.
> >> Once in a while, the "slave" nfsd process doesn't terminate and a "ps axHl" shows:
> >> 0 48889 1 0 20 0 5884 812 svcexit D - 0:00.01 nfsd: server
> >> 0 48889 1 0 40 0 5884 812 rpcsvc I - 0:00.00 nfsd: server
> >> ... more of the same
> >> 0 48889 1 0 40 0 5884 812 rpcsvc I - 0:00.00 nfsd: server
> >> 0 48889 1 0 -8 0 5884 812 rpcsvc I - 1:51.78 nfsd: server
> >> 0 48889 1 0 -8 0 5884 812 rpcsvc I - 2:27.75 nfsd: server
> >>
> >> You can see that the top thread (the one that was created with the process) is
> >> stuck in "D" on "svcexit".
> >> The rest of the threads are still servicing NFS RPCs.
[lots of stuff snipped]
>Signals are put onto a signal queue between a time where the signal is
>generated until the thread actually consumes it. I.e. the signal queue
>is a container for the signals which are not yet acted upon. There is
>one signal queue per process, and one signal queue for each thread
>belonging to the process. When you signal the process, the signal is
>put into some thread' signal queue, where the only criteria for the
>selection of the thread is that the signal is not blocked. Since
>SIGKILL is never blocked, it is put anywhere.
>
>Until signal is delivered by cursig()/postsig() loop, typically at the
>AST handler, the only consequence of its presence are the EINTR/ERESTART
>errors returned from the PCATCH-enabled sleeps.
Ok, now I think I understand how this works.
Thanks a lot for the explanation.

> >Your description at the start of the message of the behaviour after
> >SIGKILL, where other threads continued to serve RPCs, exactly matches
> >above explanation.
> >You need to add some global 'stop' flag, if it is not
I looked at the code and there is already basically a "global stop flag".
It's done by setting the sg_state variable to CLOSING for all thread groups
in a function called svc_exit(). (I missed this when I looked before, so I
didn't understand how all the threads normally terminate.)

So, when I looked at svc_run_internal(), there is a loop
while (state != closing) that calls cv_wait_sig()/cv_timedwait_sig(), and
when these return EINTR/ERESTART, svc_exit() is called to make all the
threads return from the function.
--> The only way it can get into the broken situation I see sometimes is if
    the top thread (called "ismaster" by the code) somehow returns from
    svc_run_internal() without calling svc_exit(), so that the state isn't
    set to "closing".

Turns out there is only one place this can happen. It's this line:
        if (grp->sg_threadcount > grp->sg_maxthreads)
                break;

I wouldn't have thought that sg_threadcount would become greater than
sg_maxthreads, but when I looked at the output of "ps" that I pasted into
the first message, there are 33 threads. (When I started the nfsd, I
specified 32 threads, so I think it did the "break;" at this place to get
out of the loop and return from svc_run_internal() without calling
svc_exit().)

I think changing the above line to:
        if (!ismaster && grp->sg_threadcount > grp->sg_maxthreads)
will fix it.

I'll test this and see if I can get it to fail.

Thanks again for your help, rick

Received on Tue Jun 26 2018 - 23:05:11 UTC
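
For anyone reading this in the archive, here is a rough user-space sketch of the exit path described in the message above. It is not the kernel code: struct group, group_exit() and run_loop() are made-up stand-ins for the thread group, svc_exit() and the loop in svc_run_internal(), and the model simply assumes that the excess-thread check runs before the sleep and that the sleep is always interrupted by the pending signal. The guard_master argument selects between the original check and the proposed "!ismaster &&" version.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real kernel types. */
enum grp_state { GRP_ACTIVE, GRP_CLOSING };

struct group {
	enum grp_state state;	/* models sg_state */
	int threadcount;	/* models sg_threadcount */
	int maxthreads;		/* models sg_maxthreads */
};

/* Models svc_exit(): sets the "global stop flag". */
static void
group_exit(struct group *g)
{
	g->state = GRP_CLOSING;
}

/*
 * Models one thread leaving the service loop while a signal is pending.
 * Returns true if the group was marked CLOSING by the time the thread
 * left the loop, false if the thread got out without that happening.
 */
static bool
run_loop(struct group *g, bool ismaster, bool guard_master)
{
	assert(g != NULL);
	while (g->state != GRP_CLOSING) {
		bool excess = g->threadcount > g->maxthreads;

		/*
		 * Excess-thread check. Without the guard, the master
		 * thread can also take this exit and skip group_exit().
		 */
		if (excess && !(guard_master && ismaster))
			break;

		/*
		 * Assumption of this model: the interruptible sleep
		 * returns EINTR/ERESTART, so the stop flag gets set.
		 */
		group_exit(g);
	}
	return (g->state == GRP_CLOSING);
}
```

With threadcount = 33 and maxthreads = 32 (the numbers from the "ps" output), run_loop(&g, true, false) leaves the state untouched, which matches the stuck case, while run_loop(&g, true, true) sets it to CLOSING. A non-master thread still takes the early exit either way, which seems to be the intent of the check.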