On Tue, Jan 03, 2012 at 12:02:22AM -0800, Don Lewis wrote:
> On 2 Jan, Don Lewis wrote:
> > On 2 Jan, Don Lewis wrote:
> >> On 2 Jan, Florian Smeets wrote:
> >
> >>> This does not make a difference. I tried on 32K/4K with/without
> >>> journal and on 16K/2K; all exhibit the same problem. At some point
> >>> during the cvs2svn conversion the syncer starts to use 100% CPU.
> >>> The whole process hangs at that point, sometimes for hours; from
> >>> time to time it does continue doing some work, but really, really
> >>> slowly. It's usually between revision 210000 and 220000, when the
> >>> resulting svn file gets bigger than about 11-12 GB. At that point
> >>> an ls in the target dir hangs in state ufs.
> >>>
> >>> I broke into ddb and ran all commands which I thought could be
> >>> useful. The output is at http://tb.smeets.im/~flo/giant-ape_syncer.txt
> >>
> >> Tracing command syncer pid 9 tid 100183 td 0xfffffe00120e9000
> >> cpustop_handler() at cpustop_handler+0x2b
> >> ipi_nmi_handler() at ipi_nmi_handler+0x50
> >> trap() at trap+0x1a8
> >> nmi_calltrap() at nmi_calltrap+0x8
> >> --- trap 0x13, rip = 0xffffffff8082ba43, rsp = 0xffffff8000270fe0, rbp = 0xffffff88c97829a0 ---
> >> _mtx_assert() at _mtx_assert+0x13
> >> pmap_remove_write() at pmap_remove_write+0x38
> >> vm_object_page_remove_write() at vm_object_page_remove_write+0x1f
> >> vm_object_page_clean() at vm_object_page_clean+0x14d
> >> vfs_msync() at vfs_msync+0xf1
> >> sync_fsync() at sync_fsync+0x12a
> >> sync_vnode() at sync_vnode+0x157
> >> sched_sync() at sched_sync+0x1d1
> >> fork_exit() at fork_exit+0x135
> >> fork_trampoline() at fork_trampoline+0xe
> >> --- trap 0, rip = 0, rsp = 0xffffff88c9782d00, rbp = 0 ---
> >>
> >> I think this explains why the r228838 patch seems to help the
> >> problem. Instead of an application call to msync(), you're getting
> >> bitten by the syncer doing the equivalent. I don't know why the
> >> syncer is CPU bound, though.
> >> From my understanding of the patch, it only optimizes the I/O.
> >> Without the patch, I would expect the syncer to just spend a lot of
> >> time waiting on I/O. My guess is that this is actually a VM problem.
> >> There are nested loops in vm_object_page_clean() and
> >> vm_object_page_remove_write(), so you could be doing something
> >> that's causing lots of looping in that code.
> >
> > Does the machine recover if you suspend cvs2svn? I think what is
> > happening is that cvs2svn is continuing to dirty pages while the
> > syncer is trying to sync the file. From my limited understanding of
> > this code, it looks to me like every time cvs2svn dirties a page, it
> > will trigger a call to vm_object_set_writeable_dirty(), which will
> > increment object->generation. Whenever vm_object_page_clean() detects
> > a change in the generation count, it restarts its scan of the pages
> > associated with the object. This is probably not optimal ...
>
> Since the syncer is only trying to flush out pages that have been
> dirty for the last 30 seconds, I think that vm_object_page_clean()
> should just make one pass through the object, ignoring the generation
> count, and then return when it is called from the syncer. That should
> keep vm_object_page_clean() from looping over the object again and
> again if another process is actively dirtying the object.

This sounds very plausible. I think that there is no sense in restarting
the scan if it is requested in async mode at all. See below. Would be
thrilled if this finally solves the cvs2svn issues.

commit 41aaafe5e3be5387949f303b8766da64ee4a521f
Author: Kostik Belousov <kostik@sirion>
Date:   Tue Jan 3 11:16:30 2012 +0200

    Do not restart the scan in vm_object_page_clean() if requested mode
    is async.
    Proposed by:	truckman

diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
index 716916f..52fc08b 100644
--- a/sys/vm/vm_object.c
+++ b/sys/vm/vm_object.c
@@ -841,7 +841,8 @@ rescan:
 		if (p->valid == 0)
 			continue;
 		if (vm_page_sleep_if_busy(p, TRUE, "vpcwai")) {
-			if (object->generation != curgeneration)
+			if ((flags & OBJPC_SYNC) != 0 &&
+			    object->generation != curgeneration)
 				goto rescan;
 			np = vm_page_find_least(object, pi);
 			continue;
@@ -851,7 +852,8 @@ rescan:
 		n = vm_object_page_collect_flush(object, p, pagerflags,
 		    flags, &clearobjflags);
-		if (object->generation != curgeneration)
+		if ((flags & OBJPC_SYNC) != 0 &&
+		    object->generation != curgeneration)
 			goto rescan;
 		/*
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:22 UTC