HEADSUP: UMA not reentrant / possible memory leak

From: Poul-Henning Kamp <phk@phk.freebsd.dk>
Date: Tue, 29 Jul 2003 23:11:30 +0200
[I'm CC'ing -current because this seems to have a significant negative
impact on -current kernel stability, and we could use more data,
in particular from non-i386 SMP machines.]

Thanks to Lukas Ertl and Bosko, we have found a clear indication that
UMA is in fact not reentrant (enough).

The indication is that the USED count of the g_bio zone does not
return to zero as it should.

The attached patch adds an atomic counter to GEOM that tracks the
number of g_bio items actually in use and exports it as the read-only
sysctl variable debug.ngbio.

Here is typical output from my SMP box:

bang# sh a.sh
g_bio:           144,        0,     35,     77,     4281
debug.ngbio: 0
10:58PM  up 36 secs, 1 user, load averages: 0.65, 0.20, 0.07
g_bio:           144,        0,     66,    102,     5917
debug.ngbio: 0
10:58PM  up 56 secs, 3 users, load averages: 0.46, 0.18, 0.07
g_bio:           144,        0,     69,     99,    12352
debug.ngbio: 0
10:59PM  up 1 min, 3 users, load averages: 0.56, 0.22, 0.09
g_bio:           144,        0,    185,    123,    20023
debug.ngbio: 0
10:59PM  up 2 mins, 3 users, load averages: 0.62, 0.25, 0.10
g_bio:           144,        0,    227,     81,    28259
debug.ngbio: 0
10:59PM  up 2 mins, 3 users, load averages: 0.64, 0.28, 0.11
g_bio:           144,        0,    222,     86,    32256
debug.ngbio: 0
11:00PM  up 2 mins, 3 users, load averages: 0.74, 0.33, 0.13

Notice that debug.ngbio reads 0 in every sample, while the USED column
for g_bio drifts away from zero and moves both up and down.  Other
machines have reproduced negative USED counts.
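To compare the zone's USED figure with debug.ngbio mechanically rather than by eye, a small parser for the g_bio line can help. This is a sketch under stated assumptions: it assumes the five numeric columns are SIZE, LIMIT, USED, FREE and REQUESTS (the usual `vmstat -z` header of that era, which the captured output does not include), and `parse_zone_line()`, `zone_discrepancy()` and `struct zone_stats` are hypothetical names, not part of GEOM or UMA.

```c
#include <assert.h>
#include <stdio.h>

/*
 * One row of zone statistics, e.g.
 *   "g_bio:           144,        0,     35,     77,     4281"
 * ASSUMPTION: columns are SIZE, LIMIT, USED, FREE, REQUESTS.
 */
struct zone_stats {
	char name[32];
	unsigned size, limit, used, nfree;
	unsigned long long requests;
};

/* Returns 1 if the line matched the expected layout, 0 otherwise. */
int
parse_zone_line(const char *line, struct zone_stats *zs)
{
	int n;

	assert(line != NULL && zs != NULL);
	/* name up to ':', then five comma-separated numbers */
	n = sscanf(line, " %31[^:]: %u, %u, %u, %u, %llu",
	    zs->name, &zs->size, &zs->limit, &zs->used, &zs->nfree,
	    &zs->requests);
	return (n == 6);
}

/*
 * How far the zone's own USED count is from the independent counter
 * (debug.ngbio).  Zero means the two agree.
 */
long
zone_discrepancy(const struct zone_stats *zs, int ngbio)
{
	return ((long)zs->used - (long)ngbio);
}
```

For the first sample above, USED is 35 while debug.ngbio is 0, so the discrepancy is 35. A single nonzero reading can be a timing artifact if I/O is in flight between the two samples; a value that persists while the counter sits at 0 is the symptom described here.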

As you can see in the patch, I have added a mutex around the zone
operations to see whether that solves the issue, and it does not
seem to make any difference at all.
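The instrumentation in the patch (one lock around every allocate and free, plus an independent atomic count of live items) can be approximated in userland to show its shape. This is a hedged sketch, not the kernel code: it assumes C11 `<stdatomic.h>` and POSIX threads, `calloc()`/`free()` stand in for `uma_zalloc(..., M_ZERO)`/`uma_zfree()`, and `tracked_alloc()`, `tracked_free()`, `live_count()` and `live_items` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

/* One lock serializes every call into the (stand-in) allocator. */
static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;
/* Independent count of live items, like debug.ngbio. */
static atomic_int live_items;

void *
tracked_alloc(size_t size)
{
	void *p;

	pthread_mutex_lock(&zone_lock);
	p = calloc(1, size);		/* zero-filled, like M_ZERO */
	pthread_mutex_unlock(&zone_lock);
	if (p != NULL)			/* count only successful allocations */
		atomic_fetch_add(&live_items, 1);
	return (p);
}

void
tracked_free(void *p)
{
	if (p == NULL) {		/* mirrors the patch's NULL check */
		fprintf(stderr, "tracked_free(NULL)\n");
		return;
	}
	pthread_mutex_lock(&zone_lock);
	free(p);
	pthread_mutex_unlock(&zone_lock);
	atomic_fetch_sub(&live_items, 1);
}

int
live_count(void)
{
	int n = atomic_load(&live_items);

	assert(n >= 0);			/* negative means an unmatched free */
	return (n);
}
```

As in the patch, the counter is updated outside the lock; that is safe because the update itself is atomic. Note that such a lock only serializes callers that go through the wrapper.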

I am unable to tell whether it is just the UMA zone statistics which
are f**ked up, or whether the "important" data structures in UMA are
also victims of this.  The machines Lukas and Bosko work on seem to
die after a short period of time, which suggests that this is more
than just statistics being b0rked.
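The difference between "only the statistics are wrong" and "the allocator itself is damaged" can be illustrated with the classic lost-update race. This is an illustration only, not a diagnosis of the UMA bug: it assumes a plain, non-atomic `int` counter updated by load/modify/store, and it hard-codes one bad interleaving instead of running real threads. `replay_lost_update()` and `replay_serialized()` are hypothetical names.

```c
#include <assert.h>

/*
 * "CPU A" allocates (used++) while "CPU B" frees (used--).  Both load
 * the old value before either stores, so whichever store lands last
 * wipes out the other update.
 */
int
replay_lost_update(int used, int b_stores_last)
{
	int a_reg, b_reg;

	a_reg = used;		/* A: load */
	b_reg = used;		/* B: load, before A has stored */
	a_reg = a_reg + 1;	/* A: modify (allocation) */
	b_reg = b_reg - 1;	/* B: modify (free) */
	if (b_stores_last) {
		used = a_reg;	/* A: store */
		used = b_reg;	/* B: store, A's increment is lost */
	} else {
		used = b_reg;	/* B: store */
		used = a_reg;	/* A: store, B's decrement is lost */
	}
	assert(used == a_reg || used == b_reg);	/* last store wins */
	return (used);
}

/* The same two operations performed one after the other. */
int
replay_serialized(int used)
{
	used = used + 1;
	used = used - 1;
	return (used);
}
```

Starting from a true count of 0, the racy replay yields -1 or +1 depending on store order, while the serialized version yields 0: the same kind of up-and-down drift and negative USED count reported above. If the same race hit a free-list pointer rather than a counter, two callers could be handed the same item, which would damage real allocator state.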

We also see this problem on machines with GCC 3.2.2.

HELP!

Poul-Henning

Index: geom_io.c
===================================================================
RCS file: /home/ncvs/src/sys/geom/geom_io.c,v
retrieving revision 1.44
diff -u -r1.44 geom_io.c
--- geom_io.c	18 Jun 2003 10:33:09 -0000	1.44
+++ geom_io.c	29 Jul 2003 20:51:55 -0000
@@ -39,6 +39,7 @@
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/kernel.h>
+#include <sys/sysctl.h>
 #include <sys/malloc.h>
 #include <sys/bio.h>
 
@@ -55,6 +56,12 @@
 static u_int pace;
 static uma_zone_t	biozone;
 
+struct mtx gbiomutex;
+static int ngbio;
+SYSCTL_INT(_debug, OID_AUTO, ngbio, CTLFLAG_RD,
+    &ngbio, 0, "");
+
+
 #include <machine/atomic.h>
 
 static void
@@ -116,15 +123,26 @@
 {
 	struct bio *bp;
 
+	mtx_lock(&gbiomutex);
 	bp = uma_zalloc(biozone, M_NOWAIT | M_ZERO);
+	mtx_unlock(&gbiomutex);
+	if (bp != NULL)
+		atomic_add_int(&ngbio, 1);
 	return (bp);
 }
 
 void
 g_destroy_bio(struct bio *bp)
 {
-
+	if (bp == NULL) {
+		printf("g_destroy_bio(NULL)\n");
+		Debugger("foo");
+		return;
+	}
+	mtx_lock(&gbiomutex);
 	uma_zfree(biozone, bp);
+	mtx_unlock(&gbiomutex);
+	atomic_add_int(&ngbio, -1);
 }
 
 struct bio *
@@ -132,8 +150,11 @@
 {
 	struct bio *bp2;
 
+	mtx_lock(&gbiomutex);
 	bp2 = uma_zalloc(biozone, M_NOWAIT | M_ZERO);
+	mtx_unlock(&gbiomutex);
 	if (bp2 != NULL) {
+		atomic_add_int(&ngbio, 1);
 		bp2->bio_parent = bp;
 		bp2->bio_cmd = bp->bio_cmd;
 		bp2->bio_length = bp->bio_length;
@@ -304,6 +325,7 @@
  
 	bzero(&mymutex, sizeof mymutex);
 	mtx_init(&mymutex, "g_xdown", MTX_DEF, 0);
+	mtx_init(&gbiomutex, "gbio", MTX_DEF, 0);
 
 	for(;;) {
 		g_bioq_lock(&g_bio_run_down);


-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Received on Tue Jul 29 2003 - 12:11:34 UTC