Re: dogfooding over in clusteradm land

From: Florian Smeets <flo_at_freebsd.org>
Date: Tue, 03 Jan 2012 13:46:30 +0100

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 03.01.2012 10:18, Kostik Belousov wrote:
> On Tue, Jan 03, 2012 at 12:02:22AM -0800, Don Lewis wrote:
>> On  2 Jan, Don Lewis wrote:
>>> On  2 Jan, Don Lewis wrote:
>>>> On  2 Jan, Florian Smeets wrote:
>>> 
>>>>> This does not make a difference. I tried on 32K/4K 
>>>>> with/without journal and on 16K/2K all exhibit the same 
>>>>> problem. At some point during the cvs2svn conversion the 
>>>>> syncer starts to use 100% CPU. The whole process hangs at 
>>>>> that point, sometimes for hours; from time to time it does 
>>>>> continue doing some work, but really, really slowly. It's 
>>>>> usually between revision 210000 and 220000, when the 
>>>>> resulting svn file gets bigger than about 11-12 GB. At that 
>>>>> point an ls in the target dir hangs in state ufs.
>>>>> 
>>>>> I broke into ddb and ran all commands which I thought
>>>>> could be useful. The output is at 
>>>>> http://tb.smeets.im/~flo/giant-ape_syncer.txt
>>>> 
>>>> Tracing command syncer pid 9 tid 100183 td 0xfffffe00120e9000
>>>> cpustop_handler() at cpustop_handler+0x2b
>>>> ipi_nmi_handler() at ipi_nmi_handler+0x50
>>>> trap() at trap+0x1a8
>>>> nmi_calltrap() at nmi_calltrap+0x8
>>>> --- trap 0x13, rip = 0xffffffff8082ba43, rsp = 0xffffff8000270fe0, rbp = 0xffffff88c97829a0 ---
>>>> _mtx_assert() at _mtx_assert+0x13
>>>> pmap_remove_write() at pmap_remove_write+0x38
>>>> vm_object_page_remove_write() at vm_object_page_remove_write+0x1f
>>>> vm_object_page_clean() at vm_object_page_clean+0x14d
>>>> vfs_msync() at vfs_msync+0xf1
>>>> sync_fsync() at sync_fsync+0x12a
>>>> sync_vnode() at sync_vnode+0x157
>>>> sched_sync() at sched_sync+0x1d1
>>>> fork_exit() at fork_exit+0x135
>>>> fork_trampoline() at fork_trampoline+0xe
>>>> --- trap 0, rip = 0, rsp = 0xffffff88c9782d00, rbp = 0 ---
>>>> 
>>>> I think this explains why the r228838 patch seems to help 
>>>> the problem. Instead of an application call to msync(), 
>>>> you're getting bitten by the syncer doing the equivalent.  I 
>>>> don't know why the syncer is CPU bound, though.  From my 
>>>> understanding of the patch it only optimizes the I/O.
>>>> Without the patch, I would expect that the syncer would just
>>>> spend a lot of time waiting on I/O.  My guess is that this
>>>> is actually a vm problem. There are nested loops in 
>>>> vm_object_page_clean() and vm_object_page_remove_write(), so 
>>>> you could be doing something that's causing lots of looping 
>>>> in that code.
>>> 
>>> Does the machine recover if you suspend cvs2svn?  I think what 
>>> is happening is that cvs2svn is continuing to dirty pages
>>> while the syncer is trying to sync the file.  From my limited 
>>> understanding of this code, it looks to me like every time 
>>> cvs2svn dirties a page, it will trigger a call to 
>>> vm_object_set_writeable_dirty(), which will increment 
>>> object->generation.  Whenever vm_object_page_clean() detects a 
>>> change in the generation count, it restarts its scan of the 
>>> pages associated with the object.  This is probably not
>>> optimal ...
>> 
>> Since the syncer is only trying to flush out pages that have
>> been dirty for the last 30 seconds, I think that 
>> vm_object_page_clean() should just make one pass
>> through the object, ignoring generation, and then return when it
>> is called from the syncer.  That should keep 
>> vm_object_page_clean() from looping over the object 
>> again and again if another process is actively dirtying the 
>> object.
>> 
> This sounds very plausible. I think that there is no sense in 
> restarting the scan if it is requested in async mode at all. See 
> below.
> 
> Would be thrilled if this finally solves the cvs2svn issues.
> 
> commit 41aaafe5e3be5387949f303b8766da64ee4a521f
> Author: Kostik Belousov <kostik_at_sirion>
> Date:   Tue Jan 3 11:16:30 2012 +0200
> 
> Do not restart the scan in vm_object_page_clean() if requested
> mode is async.
> 
> Proposed by:	truckman
> 
> diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
> index 716916f..52fc08b 100644
> --- a/sys/vm/vm_object.c
> +++ b/sys/vm/vm_object.c
> @@ -841,7 +841,8 @@ rescan:
>  		if (p->valid == 0)
>  			continue;
>  		if (vm_page_sleep_if_busy(p, TRUE, "vpcwai")) {
> -			if (object->generation != curgeneration)
> +			if ((flags & OBJPC_SYNC) != 0 &&
> +			    object->generation != curgeneration)
>  				goto rescan;
>  			np = vm_page_find_least(object, pi);
>  			continue;
> @@ -851,7 +852,8 @@ rescan:
>  
>  		n = vm_object_page_collect_flush(object, p, pagerflags,
>  		    flags, &clearobjflags);
> -		if (object->generation != curgeneration)
> +		if ((flags & OBJPC_SYNC) != 0 &&
> +		    object->generation != curgeneration)
>  			goto rescan;
>  
>  		/*

Yes, the patch fixes the problem. The cvs2svn run completed this time.

     9132.25 real      8387.05 user       403.86 sys

I did not see any significant syncer activity in top -S anymore.

Thanks a lot.
Florian
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (FreeBSD)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk8C+KYACgkQapo8P8lCvwkc+QCeLY8+OkEQo1/wB3J2TyjfXyc0
b0IAn1OJo1XUlBYPZRoU5NFSO5dnNbne
=IGEW
-----END PGP SIGNATURE-----