[A new kind of experiment and partial results.]

Given the zeroed memory page(s), which for some of the example contexts include a page that should not be changing after initialization in my context (jemalloc global variables), I have attempted the following for such examples:

A) Run gdb.
B) Attach to one of the live example processes.
C) Check that the page is not zeroed yet:
   (print/x __je_sz_size2index_tab)
D) Protect the page containing the start of __je_sz_size2index_tab, using 0x1 as the PROT_READ mask:
   (print (int)mprotect(ADDRESS,1,0x1))
E) Detach.

The hope was to discover which of the following was involved:

A) User-space code trying to write the page should get a SIGSEGV. In this case I'd likely be able to see what code was attempting the write.

B) Kernel code doing something odd to the content or mapping of memory would not (or need not) lead to SIGSEGV. In this case I'd be unlikely to see what code led to the zeros on the page.

So far I've gotten only one failure example: nfsd during its handling of a SIGUSR1. Previous nfs mounts and dismounts worked fine, not asserting, indicating that at the time the page was not zeroed.

I got no evidence of SIGSEGV from an attempted user-space write to the page. But the nfsd.core shows the page as zeroed and the assert having caused abort(). That suggests the kernel side of things for what leads to the zeros.

It turns out that just before the "unregistration()" activity is "killchildren()" activity:

(gdb) list
971
972	static void
973	nfsd_exit(int status)
974	{
975		killchildren();
976		unregistration();
977		exit(status);
978	}

(frame #12) used via:

(gdb) list cleanup
954	/*
955	 * Cleanup master after SIGUSR1.
956	 */
957	static void
958	cleanup(__unused int signo)
959	{
960		nfsd_exit(0);
961	}
. . .

and (for master):

	(void)signal(SIGUSR1, cleanup);

This suggests the possibility that the zeroed pages could be associated with killing the child processes.

(I've had a past aarch64 context where forking had problems with pages that were initially common to parent and child processes. In that context having the processes swap out [not just mostly paged out] and then swap back in was involved in showing the problem. The issue was fixed and was aarch64 specific. But it leaves me willing to consider fork-related memory management as possibly odd in some way for 32-bit powerpc.)

Notes . . .

Another possible kind of evidence: I've gone far longer with the machine doing just normal background processing with nothing failing on its own. This suggests that the (int)mprotect(ADDRESS,1,0x1) might be changing the context --or just doing the attach and detach in gdb does. I've nothing solid in this area so I'll ignore it, other than this note.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)

Received on Mon May 11 2020 - 00:53:29 UTC