The attached patch modifies UMA so that it uses critical sections, rather than mutexes, to protect access to the per-CPU caches. With the recent commit of critical section optimizations by John Baldwin, this patch should offer a small but measurable performance improvement for paths that allocate and free memory as a significant part of their workload, for example high performance networking and IPC. (A distilled sketch of the resulting per-CPU cache access pattern follows the patch below.)

John's change was an important dependency: on uniprocessor systems, atomic operations are basically regular integer operations, and so very cheap. As such, I didn't want to replace mutexes with a more expensive critical section operation on UP, even if doing so sped up SMP, where critical sections are cheaper than mutexes.

The change relies on critical sections costing less than mutexes, but the extent to which that holds varies by CPU type. The performance difference on Xeon P4 hardware is quite noticeable, due to the extortionate cost of disabling interrupts and of atomic operations on that system. On PIII hardware it should be much less noticeable, due to the (relatively) low cost of those operations. In a UDP packet send test on a dual-Xeon P4 box, sending 0-byte payload packets with netblast, I saw around a 2%-3% performance improvement on SMP, but only around a 1% improvement on UP. While this isn't a big change, these things add up.

Before committing the patch, I'd like to see additional performance and stability measurements. Specific notes:

- I'd like to make sure we don't see performance regressions on other workloads or hardware architectures. Because the relative costs of the important instructions here vary quite a bit by architecture, as well as with how the kernel is compiled, there is a risk that some configurations might see regressions.

- An earlier version of this patch introduced a race with INVARIANTS, wherein INVARIANTS checking of freed memory occurred after the memory was returned to the cache, resulting in false positive panics. This bug has been corrected, but testing both with and without INVARIANTS would be helpful.

- Because of the ordering between mutexes and critical sections, there are now some cases, when running with INVARIANTS, where the zone mutex is acquired twice instead of once on memory free. As a result, execution with INVARIANTS may show a performance loss. UMA running without INVARIANTS doesn't acquire the zone mutex the additional time.

Robert N M Watson

--- //depot/vendor/freebsd/src/sys/vm/uma_core.c	2005/02/24 06:30:36
+++ //depot/user/rwatson/percpu/sys/vm/uma_core.c	2005/04/06 10:33:02
@@ -1,4 +1,5 @@
 /*-
+ * Copyright (c) 2004-2005 Robert N. M. Watson
  * Copyright (c) 2004, 2005,
  *     Bosko Milekic <bmilekic@FreeBSD.org>.  All rights reserved.
  * Copyright (c) 2002, 2003, 2004, 2005,
@@ -119,9 +120,6 @@
 /* This mutex protects the keg list */
 static struct mtx uma_mtx;

-/* These are the pcpu cache locks */
-static struct mtx uma_pcpu_mtx[MAXCPU];
-
 /* Linked list of boot time pages */
 static LIST_HEAD(,uma_slab) uma_boot_pages =
     LIST_HEAD_INITIALIZER(&uma_boot_pages);
@@ -384,48 +382,19 @@
 zone_timeout(uma_zone_t zone)
 {
     uma_keg_t keg;
-    uma_cache_t cache;
     u_int64_t alloc;
-    int cpu;

     keg = zone->uz_keg;
     alloc = 0;

     /*
-     * Aggregate per cpu cache statistics back to the zone.
-     *
-     * XXX This should be done in the sysctl handler.
-     *
-     * I may rewrite this to set a flag in the per cpu cache instead of
-     * locking.  If the flag is not cleared on the next round I will have
-     * to lock and do it here instead so that the statistics don't get too
-     * far out of sync.
-     */
-    if (!(keg->uk_flags & UMA_ZFLAG_INTERNAL)) {
-        for (cpu = 0; cpu <= mp_maxid; cpu++) {
-            if (CPU_ABSENT(cpu))
-                continue;
-            CPU_LOCK(cpu);
-            cache = &zone->uz_cpu[cpu];
-            /* Add them up, and reset */
-            alloc += cache->uc_allocs;
-            cache->uc_allocs = 0;
-            CPU_UNLOCK(cpu);
-        }
-    }
-
-    /* Now push these stats back into the zone.. */
-    ZONE_LOCK(zone);
-    zone->uz_allocs += alloc;
-
-    /*
      * Expand the zone hash table.
      *
      * This is done if the number of slabs is larger than the hash size.
      * What I'm trying to do here is completely reduce collisions.  This
      * may be a little aggressive.  Should I allow for two collisions max?
      */
-
+    ZONE_LOCK(zone);
     if (keg->uk_flags & UMA_ZONE_HASH &&
         keg->uk_pages / keg->uk_ppera >= keg->uk_hash.uh_hashsize) {
         struct uma_hash newhash;
@@ -613,6 +582,10 @@
 /*
  * Drains the per cpu caches for a zone.
  *
+ * NOTE: This may only be called while the zone is being turn down, and not
+ * during normal operation.  This is necessary in order that we do not have
+ * to migrate CPUs to drain the per-CPU caches.
+ *
  * Arguments:
  *	zone     The zone to drain, must be unlocked.
  *
@@ -626,12 +599,20 @@
     int cpu;

     /*
-     * We have to lock each cpu cache before locking the zone
+     * XXX: It is safe to not lock the per-CPU caches, because we're
+     * tearing down the zone anyway.  I.e., there will be no further use
+     * of the caches at this point.
+     *
+     * XXX: It would good to be able to assert that the zone is being
+     * torn down to prevent improper use of cache_drain().
+     *
+     * XXX: We lock the zone before passing into bucket_cache_drain() as
+     * it is used elsewhere.  Should the tear-down path be made special
+     * there in some form?
      */
     for (cpu = 0; cpu <= mp_maxid; cpu++) {
         if (CPU_ABSENT(cpu))
             continue;
-        CPU_LOCK(cpu);
         cache = &zone->uz_cpu[cpu];
         bucket_drain(zone, cache->uc_allocbucket);
         bucket_drain(zone, cache->uc_freebucket);
@@ -644,11 +625,6 @@
     ZONE_LOCK(zone);
     bucket_cache_drain(zone);
     ZONE_UNLOCK(zone);
-    for (cpu = 0; cpu <= mp_maxid; cpu++) {
-        if (CPU_ABSENT(cpu))
-            continue;
-        CPU_UNLOCK(cpu);
-    }
 }

 /*
@@ -828,7 +804,8 @@
         &flags, wait);
     if (mem == NULL) {
         if (keg->uk_flags & UMA_ZONE_OFFPAGE)
-            uma_zfree_internal(keg->uk_slabzone, slab, NULL, 0);
+            uma_zfree_internal(keg->uk_slabzone, slab, NULL,
+                SKIP_NONE);
         ZONE_LOCK(zone);
         return (NULL);
     }
@@ -1643,10 +1620,6 @@
 #ifdef UMA_DEBUG
     printf("Initializing pcpu cache locks.\n");
 #endif
-    /* Initialize the pcpu cache lock set once and for all */
-    for (i = 0; i <= mp_maxid; i++)
-        CPU_LOCK_INIT(i);
-
 #ifdef UMA_DEBUG
     printf("Creating slab and hash zones.\n");
 #endif
@@ -1793,6 +1766,9 @@
     uma_cache_t cache;
     uma_bucket_t bucket;
     int cpu;
+#ifdef INVARIANTS
+    int count;
+#endif
     int badness;

     /* This is the fast path allocation */
@@ -1827,12 +1803,33 @@
         }
     }

+    /*
+     * If possible, allocate from the per-CPU cache.  There are two
+     * requirements for safe access to the per-CPU cache: (1) the thread
+     * accessing the cache must not be preempted or yield during access,
+     * and (2) the thread must not migrate CPUs without switching which
+     * cache it accesses.  We rely on a critical section to prevent
+     * preemption and migration.  We release the critical section in
+     * order to acquire the zone mutex if we are unable to allocate from
+     * the current cache; when we re-acquire the critical section, we
+     * must detect and handle migration if it has occurred.
+     */
+#ifdef INVARIANTS
+    count = 0;
+#endif
 zalloc_restart:
+    critical_enter();
     cpu = PCPU_GET(cpuid);
-    CPU_LOCK(cpu);
     cache = &zone->uz_cpu[cpu];

 zalloc_start:
+#ifdef INVARIANTS
+    count++;
+    KASSERT(count < 10, ("uma_zalloc_arg: count == 10"));
+#endif
+#if 0
+    critical_assert();
+#endif
     bucket = cache->uc_allocbucket;

     if (bucket) {
@@ -1845,12 +1842,12 @@
             KASSERT(item != NULL,
                 ("uma_zalloc: Bucket pointer mangled."));
             cache->uc_allocs++;
+            critical_exit();
 #ifdef INVARIANTS
             ZONE_LOCK(zone);
             uma_dbg_alloc(zone, NULL, item);
             ZONE_UNLOCK(zone);
 #endif
-            CPU_UNLOCK(cpu);
             if (zone->uz_ctor != NULL) {
                 if (zone->uz_ctor(item, zone->uz_keg->uk_size,
                     udata, flags) != 0) {
@@ -1880,7 +1877,33 @@
             }
         }
     }
+    /*
+     * Attempt to retrieve the item from the per-CPU cache has failed, so
+     * we must go back to the zone.  This requires the zone lock, so we
+     * must drop the critical section, then re-acquire it when we go back
+     * to the cache.  Since the critical section is released, we may be
+     * preempted or migrate.  As such, make sure not to maintain any
+     * thread-local state specific to the cache from prior to releasing
+     * the critical section.
+     */
+    critical_exit();
     ZONE_LOCK(zone);
+    critical_enter();
+    cpu = PCPU_GET(cpuid);
+    cache = &zone->uz_cpu[cpu];
+    bucket = cache->uc_allocbucket;
+    if (bucket != NULL) {
+        if (bucket != NULL && bucket->ub_cnt > 0) {
+            ZONE_UNLOCK(zone);
+            goto zalloc_start;
+        }
+        bucket = cache->uc_freebucket;
+        if (bucket != NULL && bucket->ub_cnt > 0) {
+            ZONE_UNLOCK(zone);
+            goto zalloc_start;
+        }
+    }
+
     /* Since we have locked the zone we may as well send back our stats */
     zone->uz_allocs += cache->uc_allocs;
     cache->uc_allocs = 0;
@@ -1904,8 +1927,8 @@
         ZONE_UNLOCK(zone);
         goto zalloc_start;
     }
-    /* We are no longer associated with this cpu!!! */
-    CPU_UNLOCK(cpu);
+    /* We are no longer associated with this CPU. */
+    critical_exit();

     /* Bump up our uz_count so we get here less */
     if (zone->uz_count < BUCKET_MAX)
@@ -2228,10 +2251,10 @@
     uma_bucket_t bucket;
     int bflags;
     int cpu;
-    enum zfreeskip skip;
+#ifdef INVARIANTS
+    int count;
+#endif

-    /* This is the fast path free */
-    skip = SKIP_NONE;
     keg = zone->uz_keg;

 #ifdef UMA_DEBUG_ALLOC_1
@@ -2240,25 +2263,50 @@
     CTR2(KTR_UMA, "uma_zfree_arg thread %x zone %s", curthread,
         zone->uz_name);

+    if (zone->uz_dtor)
+        zone->uz_dtor(item, keg->uk_size, udata);
+#ifdef INVARIANTS
+    ZONE_LOCK(zone);
+    if (keg->uk_flags & UMA_ZONE_MALLOC)
+        uma_dbg_free(zone, udata, item);
+    else
+        uma_dbg_free(zone, NULL, item);
+    ZONE_UNLOCK(zone);
+#endif
     /*
      * The race here is acceptable.  If we miss it we'll just have to wait
      * a little longer for the limits to be reset.
      */
-
     if (keg->uk_flags & UMA_ZFLAG_FULL)
         goto zfree_internal;

-    if (zone->uz_dtor) {
-        zone->uz_dtor(item, keg->uk_size, udata);
-        skip = SKIP_DTOR;
-    }
-
+#ifdef INVARIANTS
+    count = 0;
+#endif
+    /*
+     * If possible, free to the per-CPU cache.  There are two
+     * requirements for safe access to the per-CPU cache: (1) the thread
+     * accessing the cache must not be preempted or yield during access,
+     * and (2) the thread must not migrate CPUs without switching which
+     * cache it accesses.  We rely on a critical section to prevent
+     * preemption and migration.  We release the critical section in
+     * order to acquire the zone mutex if we are unable to free to the
+     * current cache; when we re-acquire the critical section, we must
+     * detect and handle migration if it has occurred.
+     */
 zfree_restart:
+    critical_enter();
     cpu = PCPU_GET(cpuid);
-    CPU_LOCK(cpu);
     cache = &zone->uz_cpu[cpu];

 zfree_start:
+#ifdef INVARIANTS
+    count++;
+    KASSERT(count < 10, ("uma_zfree_arg: count == 10"));
+#endif
+#if 0
+    critical_assert();
+#endif
     bucket = cache->uc_freebucket;

     if (bucket) {
@@ -2272,15 +2320,7 @@
                 ("uma_zfree: Freeing to non free bucket index."));
             bucket->ub_bucket[bucket->ub_cnt] = item;
             bucket->ub_cnt++;
-#ifdef INVARIANTS
-            ZONE_LOCK(zone);
-            if (keg->uk_flags & UMA_ZONE_MALLOC)
-                uma_dbg_free(zone, udata, item);
-            else
-                uma_dbg_free(zone, NULL, item);
-            ZONE_UNLOCK(zone);
-#endif
-            CPU_UNLOCK(cpu);
+            critical_exit();
             return;
         } else if (cache->uc_allocbucket) {
 #ifdef UMA_DEBUG_ALLOC
@@ -2304,9 +2344,32 @@
      *
      * 1) The buckets are NULL
      * 2) The alloc and free buckets are both somewhat full.
+     *
+     * We must go back the zone, which requires acquiring the zone lock,
+     * which in turn means we must release and re-acquire the critical
+     * section.  Since the critical section is released, we may be
+     * preempted or migrate.  As such, make sure not to maintain any
+     * thread-local state specific to the cache from prior to releasing
+     * the critical section.
      */
-
+    critical_exit();
     ZONE_LOCK(zone);
+    critical_enter();
+    cpu = PCPU_GET(cpuid);
+    cache = &zone->uz_cpu[cpu];
+    if (cache->uc_freebucket != NULL) {
+        if (cache->uc_freebucket->ub_cnt <
+            cache->uc_freebucket->ub_entries) {
+            ZONE_UNLOCK(zone);
+            goto zfree_start;
+        }
+        if (cache->uc_allocbucket != NULL &&
+            (cache->uc_allocbucket->ub_cnt <
+            cache->uc_freebucket->ub_cnt)) {
+            ZONE_UNLOCK(zone);
+            goto zfree_start;
+        }
+    }

     bucket = cache->uc_freebucket;
     cache->uc_freebucket = NULL;
@@ -2328,8 +2391,8 @@
         cache->uc_freebucket = bucket;
         goto zfree_start;
     }
-    /* We're done with this CPU now */
-    CPU_UNLOCK(cpu);
+    /* We are no longer associated with this CPU. */
+    critical_exit();

     /* And the zone.. */
     ZONE_UNLOCK(zone);
@@ -2353,27 +2416,9 @@
     /*
      * If nothing else caught this, we'll just do an internal free.
      */
-
 zfree_internal:
+    uma_zfree_internal(zone, item, udata, SKIP_DTOR);

-#ifdef INVARIANTS
-    /*
-     * If we need to skip the dtor and the uma_dbg_free in
-     * uma_zfree_internal because we've already called the dtor
-     * above, but we ended up here, then we need to make sure
-     * that we take care of the uma_dbg_free immediately.
-     */
-    if (skip) {
-        ZONE_LOCK(zone);
-        if (keg->uk_flags & UMA_ZONE_MALLOC)
-            uma_dbg_free(zone, udata, item);
-        else
-            uma_dbg_free(zone, NULL, item);
-        ZONE_UNLOCK(zone);
-    }
-#endif
-    uma_zfree_internal(zone, item, udata, skip);
-
     return;
 }
@@ -2655,7 +2700,7 @@
         slab->us_flags = flags | UMA_SLAB_MALLOC;
         slab->us_size = size;
     } else {
-        uma_zfree_internal(slabzone, slab, NULL, 0);
+        uma_zfree_internal(slabzone, slab, NULL, SKIP_NONE);
     }

     return (mem);
@@ -2666,7 +2711,7 @@
 {
     vsetobj((vm_offset_t)slab->us_data, kmem_object);
     page_free(slab->us_data, slab->us_size, slab->us_flags);
-    uma_zfree_internal(slabzone, slab, NULL, 0);
+    uma_zfree_internal(slabzone, slab, NULL, SKIP_NONE);
 }

 void
@@ -2743,6 +2788,7 @@
     int cachefree;
     uma_bucket_t bucket;
     uma_cache_t cache;
+    u_int64_t alloc;

     cnt = 0;
     mtx_lock(&uma_mtx);
@@ -2766,15 +2812,9 @@
         LIST_FOREACH(z, &zk->uk_zones, uz_link) {
             if (cnt == 0)	/* list may have changed size */
                 break;
-            if (!(zk->uk_flags & UMA_ZFLAG_INTERNAL)) {
-                for (cpu = 0; cpu <= mp_maxid; cpu++) {
-                    if (CPU_ABSENT(cpu))
-                        continue;
-                    CPU_LOCK(cpu);
-                }
-            }
             ZONE_LOCK(z);
             cachefree = 0;
+            alloc = 0;
             if (!(zk->uk_flags & UMA_ZFLAG_INTERNAL)) {
                 for (cpu = 0; cpu <= mp_maxid; cpu++) {
                     if (CPU_ABSENT(cpu))
@@ -2784,9 +2824,12 @@
                         cachefree += cache->uc_allocbucket->ub_cnt;
                     if (cache->uc_freebucket != NULL)
                         cachefree += cache->uc_freebucket->ub_cnt;
-                    CPU_UNLOCK(cpu);
+                    alloc += cache->uc_allocs;
+                    cache->uc_allocs = 0;
                 }
             }
+            alloc += z->uz_allocs;
+
             LIST_FOREACH(bucket, &z->uz_full_bucket, ub_link) {
                 cachefree += bucket->ub_cnt;
             }
@@ -2797,7 +2840,7 @@
                 zk->uk_maxpages * zk->uk_ipers,
                 (zk->uk_ipers * (zk->uk_pages / zk->uk_ppera)) - totalfree,
                 totalfree,
-                (unsigned long long)z->uz_allocs);
+                (unsigned long long)alloc);
             ZONE_UNLOCK(z);
             for (p = offset + 12; p > offset && *p == ' '; --p)
                 /* nothing */ ;
--- //depot/vendor/freebsd/src/sys/vm/uma_int.h	2005/02/16 21:50:29
+++ //depot/user/rwatson/percpu/sys/vm/uma_int.h	2005/03/15 19:57:24
@@ -342,16 +342,6 @@
 #define ZONE_LOCK(z)	mtx_lock((z)->uz_lock)
 #define ZONE_UNLOCK(z)	mtx_unlock((z)->uz_lock)

-#define CPU_LOCK_INIT(cpu)					\
-	mtx_init(&uma_pcpu_mtx[(cpu)], "UMA pcpu", "UMA pcpu",	\
-	    MTX_DEF | MTX_DUPOK)
-
-#define CPU_LOCK(cpu)						\
-	mtx_lock(&uma_pcpu_mtx[(cpu)])
-
-#define CPU_UNLOCK(cpu)						\
-	mtx_unlock(&uma_pcpu_mtx[(cpu)])
-
 /*
  * Find a slab within a hash table.  This is used for OFFPAGE zones to lookup
  * the slab structure.
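For readers who want the shape of the change without working through the full diff, here is a minimal kernel-style sketch of the allocation-side pattern the patch moves UMA to. It is a simplified illustration, not the actual UMA code: the pcpu_cache structure, the pcache[] array, and zone_alloc_global() are hypothetical stand-ins, and the free path in the patch follows the same structure. The two key points are that the per-CPU cache is only touched while inside a critical section, and that after the critical section is dropped to take the zone mutex, the CPU id and cache pointer must be re-read because the thread may have been preempted and migrated in the meantime.

/*
 * Sketch only: simplified stand-ins for UMA's real structures.  The zone
 * mutex protects a hypothetical global pool behind zone_alloc_global().
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/pcpu.h>

struct pcpu_cache {
    void *pc_items[32];
    int   pc_count;
};

static struct pcpu_cache pcache[MAXCPU];
static struct mtx zone_mtx;
MTX_SYSINIT(sketch_zone_mtx, &zone_mtx, "sketch zone", MTX_DEF);

void *zone_alloc_global(void);    /* hypothetical slow path; zone_mtx held */

void *
sketch_alloc(void)
{
    struct pcpu_cache *cache;
    void *item;

restart:
    /*
     * The critical section prevents both preemption and migration, so the
     * per-CPU cache can be used without a per-CPU mutex.
     */
    critical_enter();
    cache = &pcache[PCPU_GET(cpuid)];
    if (cache->pc_count > 0) {
        item = cache->pc_items[--cache->pc_count];
        critical_exit();
        return (item);
    }

    /*
     * The fast path failed.  A mutex may not be acquired while in a
     * critical section, so drop the critical section before taking the
     * zone mutex.  Once it is dropped we may be preempted or migrate, so
     * no pointer into the old CPU's cache may be carried past this point.
     */
    critical_exit();
    mtx_lock(&zone_mtx);
    critical_enter();
    cache = &pcache[PCPU_GET(cpuid)];    /* may be a different CPU now */
    if (cache->pc_count > 0) {
        /* This CPU's cache was refilled while we waited; retry it. */
        critical_exit();
        mtx_unlock(&zone_mtx);
        goto restart;
    }
    critical_exit();
    item = zone_alloc_global();
    mtx_unlock(&zone_mtx);
    return (item);
}

The re-check after re-entering the critical section is what replaces the old CPU_LOCK()/CPU_UNLOCK() pair in the patch; on the optimized path a critical section is just a per-thread nesting-count adjustment with no atomic operation or interrupt disable, which is where the SMP win over the per-CPU mutexes is expected to come from.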