Re: svn commit: r360233 - in head: contrib/jemalloc . . . : This partially breaks a 2-socket 32-bit powerpc (old PowerMac G4) based on head -r360311

From: Mark Millard <marklmi_at_yahoo.com> Date: Wed, 13 May 2020 00:29:07 -0700 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:24 UTC

[stress alone is sufficient to have jemalloc asserts fail
in stress, no need for a multi-socket G4 either. No need
to involve nfsd, mountd, rpcbind or the like. This is not
a claim that I know all the problems to be the same, just
that a jemalloc reported failure in this simpler context
happens and zeroed pages are involved.]

Reminder: head -r360311 based context.

First I show a single CPU/core PowerMac G4 context failing
in stress. (I actually did this later, but it is the
simpler context.) I simply moved the media from the
2-socket G4 to this slower, single-cpu/core one.

cpu0: Motorola PowerPC 7400 revision 2.9, 466.42 MHz
cpu0: Features 9c000000<PPC32,ALTIVEC,FPU,MMU>
cpu0: HID0 8094c0a4<EMCP,DOZE,DPM,EIEC,ICE,DCE,SGE,BTIC,BHT>
real memory  = 1577857024 (1504 MB)
avail memory = 1527508992 (1456 MB)

# stress -m 1 --vm-bytes 1792M
stress: info: [1024] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/arena_inlines_b.h:258: Failed assertion: "slab == extent_slab_get(extent)"
stress: FAIL: [1024] (415) <-- worker 1025 got signal 6
stress: WARN: [1024] (417) now reaping child worker processes
stress: FAIL: [1024] (451) failed run completed in 243s

(Note: 1792 is the biggest it allowed with M.)

The following still pages in and out and fails:

# stress -m 1 --vm-bytes 1290M
stress: info: [1163] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/arena_inlines_b.h:258: Failed assertion: "slab == extent_slab_get(extent)"
. . .

By contrast, the following had no problem for as
long as I let it run --and did not page in or out:

# stress -m 1 --vm-bytes 1280M
stress: info: [1181] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd

The 2 socket PowerMac G4 context with 2048 MiByte of RAM . . .

stress -m 1 --vm-bytes 1792M

did not (quickly?) fail or page. 1792
is as large as it would allow with M.

The following also did not (quickly?) fail
(and were not paging):

stress -m 2 --vm-bytes 896M
stress -m 4 --vm-bytes 448M
stress -m 8 --vm-bytes 224M

(Only 1 example was run at a time.)

But the following all did quickly fail (and were
paging):

stress -m 8 --vm-bytes 225M
stress -m 4 --vm-bytes 449M
stress -m 2 --vm-bytes 897M

(Only 1 example was run at a time.)

I'll note that when I exited an su process
I ended up with a:

<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/sz.h:200: Failed assertion: "ret == sz_index2size_compute(index)"
Abort trap (core dumped)

and a matching su.core file. It appears
that stress's activity leads to other
processes also seeing examples of the
zeroed-page(s) problem (probably su had
paged some or had been fully swapped
out):

(gdb) bt
#0  thr_kill () at thr_kill.S:4
#1  0x503821d0 in __raise (s=6) at /usr/src/lib/libc/gen/raise.c:52
#2  0x502e1d20 in abort () at /usr/src/lib/libc/stdlib/abort.c:67
#3  0x502d6144 in sz_index2size_lookup (index=<optimized out>) at /usr/src/contrib/jemalloc/include/jemalloc/internal/sz.h:200
#4  sz_index2size (index=<optimized out>) at /usr/src/contrib/jemalloc/include/jemalloc/internal/sz.h:207
#5  ifree (tsd=0x5008b018, ptr=0x50041460, tcache=0x5008b138, slow_path=<optimized out>) at jemalloc_jemalloc.c:2583
#6  0x502d5cec in __je_free_default (ptr=0x50041460) at jemalloc_jemalloc.c:2784
#7  0x502d62d4 in __free (ptr=0x50041460) at jemalloc_jemalloc.c:2852
#8  0x501050cc in openpam_destroy_chain (chain=0x50041480) at /usr/src/contrib/openpam/lib/libpam/openpam_load.c:113
#9  0x50105094 in openpam_destroy_chain (chain=0x500413c0) at /usr/src/contrib/openpam/lib/libpam/openpam_load.c:111
#10 0x50105094 in openpam_destroy_chain (chain=0x50041320) at /usr/src/contrib/openpam/lib/libpam/openpam_load.c:111
#11 0x50105094 in openpam_destroy_chain (chain=0x50041220) at /usr/src/contrib/openpam/lib/libpam/openpam_load.c:111
#12 0x50105094 in openpam_destroy_chain (chain=0x50041120) at /usr/src/contrib/openpam/lib/libpam/openpam_load.c:111
#13 0x50105094 in openpam_destroy_chain (chain=0x50041100) at /usr/src/contrib/openpam/lib/libpam/openpam_load.c:111
#14 0x50105014 in openpam_clear_chains (policy=0x50600004) at /usr/src/contrib/openpam/lib/libpam/openpam_load.c:130
#15 0x50101230 in pam_end (pamh=0x50600000, status=<optimized out>) at /usr/src/contrib/openpam/lib/libpam/pam_end.c:83
#16 0x1001225c in main (argc=<optimized out>, argv=0x0) at /usr/src/usr.bin/su/su.c:477

(gdb) print/x __je_sz_size2index_tab
$1 = {0x0 <repeats 513 times>}

Notes:

Given that the original problem did not involve
paging to the swap partition, may be just making
it to the Laundry list or some such is sufficient,
something that is also involved when the swap
space is partially in use (according to top). Or
sitting in the inactive list for a long time, if
that has some special status.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)