Re: svn commit: r360233 - in head: contrib/jemalloc . . . : This partially breaks a 2-socket 32-bit powerpc (old PowerMac G4) based on head -r360311

From: Mark Millard <marklmi_at_yahoo.com>
Date: Sun, 10 May 2020 19:53:19 -0700

[A new kind of experiment and partial results.]

Some of the example contexts have zeroed memory
page(s) that include a page that should not be
changing after initialization in my context
(jemalloc global variables). For such examples
I have attempted the following:

A) Run gdb
B) Attach to one of the live example processes.
C) Check that the page is not zeroed yet.
   (print/x __je_sz_size2index_tab)
D) protect the page containing the start
   of __je_sz_size2index_tab, using 0x1
   as the PROT_READ mask.
   (print (int)mprotect(ADDRESS,1,0x1))
E) detach.

The hope was to discover which of the following
was involved:

A) user-space code trying to write the page should
   get a SIGSEGV. In this case I'd likely be able
   to see what code was attempting the write.

B) kernel-code doing something odd to the content
   or mapping of memory would not (or need not)
   lead to SIGSEGV. In this case I'd be unlikely
   to see what code led to the zeros on the page.

So far I've gotten only one failure example, nfsd
during its handling of a SIGUSR1. Previous nfs
mounts and dismounts worked fine, not asserting,
indicating that at the time the page was not
zeroed.

I got no evidence of SIGSEGV from an attempted user
space write to the page. But the nfsd.core shows the
page as zeroed and the assert having caused abort().
That suggests the kernel side of things is what
leads to the zeros.

It turns out that just before the "unregistration()"
activity there is "killchildren()" activity:

(gdb) list
971	
972	static void
973	nfsd_exit(int status)
974	{
975		killchildren();
976		unregistration();
977		exit(status);
978	}

(frame #12) used via:

(gdb) list cleanup
954	/*
955	 * Cleanup master after SIGUSR1.
956	 */
957	static void
958	cleanup(__unused int signo)
959	{
960		nfsd_exit(0);
961	}
. . .

and (for master):

        (void)signal(SIGUSR1, cleanup);

This suggests the possibility that the zeroed
pages could be associated with killing the
child processes. (I've had a past aarch64
context where forking had problems with
pages that were initially common to parent
and child processes. In that context having
the processes swap out [not just mostly
paged out] and then swap back in was
involved in showing the problem. The issue
was fixed and was aarch64 specific. But it
leaves me willing to consider fork-related
memory management as possibly odd in some
way for 32-bit powerpc.)
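
One way to probe the fork-related hypothesis in
isolation might look like the C sketch below. It is
hypothetical and has not been run in the failing
context: the parent fills a table with nonzero bytes,
forks children that only wait, kills them, and then
checks that its own copy of the table is unchanged.
The names table and table_survives_fork_kill(), the
use of SIGKILL, and the limit of 16 children are
assumptions of the sketch. It does not force any
swapping, so a clean result would not rule much out.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define TABLE_SIZE	4096
#define MAX_CHILDREN	16

/* Stand-in for data that should not change after initialization. */
static unsigned char table[TABLE_SIZE];

/*
 * Fill the table, fork up to nchildren idle children, kill and reap
 * them, then verify the parent's table.  Returns 1 if the table is
 * intact and 0 if any byte changed.
 */
static int
table_survives_fork_kill(int nchildren)
{
	pid_t pids[MAX_CHILDREN];
	int i, n = 0, intact = 1;

	if (nchildren > MAX_CHILDREN)
		nchildren = MAX_CHILDREN;

	/* Values are in 1..251, so an all-zero page is easy to detect. */
	for (i = 0; i < TABLE_SIZE; i++)
		table[i] = (unsigned char)(i % 251 + 1);

	for (i = 0; i < nchildren; i++) {
		pid_t pid = fork();

		if (pid == 0) {
			/* Child: share the parent's pages and just wait. */
			for (;;)
				pause();
		}
		if (pid > 0)
			pids[n++] = pid;
	}

	for (i = 0; i < n; i++)
		kill(pids[i], SIGKILL);
	for (i = 0; i < n; i++)
		waitpid(pids[i], NULL, 0);

	for (i = 0; i < TABLE_SIZE; i++) {
		if (table[i] != (unsigned char)(i % 251 + 1))
			intact = 0;
	}
	return intact;
}
```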

Notes . . .

Another possible kind of evidence: I've gone far
longer with the machine doing just normal background
processing with nothing failing on its own. This
suggests that the (int)mprotect(ADDRESS,1,0x1) might
be changing the context, or that just doing the attach
and detach in gdb does. I've nothing solid in this
area so I'll ignore it, other than this note.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)