On Mon, 26 Jan 2004, Peter Jeremy wrote:

> > This last point is the clincher. The chip does NOT have enough "umphf". I
> > actually managed to boot a -current (from back then) on a 80386SX and it
> > was torturously slow. An ls(1) on my empty home directory took 15 seconds.
> > My VAX is faster.
>
> This is a bug in FreeBSD 5.x - the performance in general has degraded
> since 4.x. Performance degradation is often more obvious in lower end
> machines.

There are some areas where performance has improved, and several important
areas where it has gotten worse. I'd encourage all FreeBSD developers to
look at the areas where it's worse and fix things :-). That said, I know
there's a fair amount of work going on relating to performance optimization,
and hopefully we'll start to see some of those results in the near future.
FWIW, I actually measure a pretty dramatic improvement in network benchmarks
on 5.x relative to 4.x in the SMP case, through increased parallelism and
asynchrony.

The areas I'm aware of that require particular attention at this point
include:

- Improving interrupt latency. We've moved to ithreads, but haven't spent
  enough time optimizing the performance of our ithread implementation.
  Bosko did a sample i386 implementation of lightweight context switches
  last year, but at that time we didn't have enough device driver locking
  to take advantage of it. We're now in much better shape locking-wise,
  with a lot more just around the corner, so we need to focus on interrupt
  latency. We held a conference call a few days ago to get some of the
  interested parties together (Bosko, Jeff, et al), and it looks like Peter
  Wemm has foolishly signed up to update/re-implement it on a recent 5.x.
  Use of the IO APIC is necessary for SMP systems, but also adds a fair
  amount of overhead: in some recent uniprocessor benchmarking, I saw an
  observable overhead from using 'device apic' -- it could be that we want
  to back off the use of device apic on these systems.

- General optimization of locking. We've put in a fair number of locks, and
  pushed Giant off some of the interesting paths (e.g., pipe locking). We
  now need to look at lock granularity. I recently committed some changes
  to our mutex profiling code to measure lock contention. I suspect we're
  not seeing a lot of contention, with the exception of Giant, and so we
  might actually want to look at reducing the number of locks using mutex
  pools (where possible) to lower memory overhead; there's a sketch of the
  pool idea at the end of this message. We have a number of tools here that
  can help us, and now that locking is maturing, we should use them. We are
  also likely pretty close to pushing Giant further off a number of pieces
  of process-related code, which should help quite a bit with things like
  large builds.

- Get the socket locking into the tree. Large parts of the network stack
  can now run Giant-free, and there are substantial outstanding patches for
  a lot more. Cleanup is required, but hopefully we'll see some patches
  posted for testing soon. There are some areas of the network stack that
  require substantial further attention -- for example, the KAME code
  requires additional locking work to run Giant-free.

- Reduce the overhead of in-kernel thread context switching. We do more
  context switching than we used to, not just because of ithreads, but also
  because we have used threads to increase asynchrony and serialize work
  queues.
- Reduce the cost of lock operations. There have been some suggestions that
  our current mutexes consume more memory than necessary in the
  non-debugging case, and are also more expensive than they need to be on
  some paths.

- Explore additional use of the UMA slab allocator. In particular, see
  whether using it can help improve performance for System V IPC, where the
  implementation currently does its own memory caching and handling. There
  have also been some proposals to increase the use of UMA in the network
  stack, use it further for sockets, etc. I know there has also been some
  experimentation with using UMA to replace the current mbuf allocator. (A
  sketch of the basic zone usage appears below.)

- Trim unneeded fields from a number of kernel structures. As KSE went in,
  struct proc was broken out into a number of pieces. In some cases,
  variables lived on in multiple structures and can now be cleaned out;
  likewise in other kernel data structures.

- Take better advantage of CPU class optimizations. There has been some
  discussion of providing HAL modules for the kernel, and libraries for
  userspace, based on the CPU type to improve performance: optimized mutex
  operations, memory zeroing, context switching, et al. Right now we do a
  fairly poor job of picking up these optimizations, and carry around a lot
  of memory overhead to support a large set. We need to do a better job
  where possible -- we should really see the results if we're able to
  optimize code such as the crypto code for specific CPUs. (A sketch of one
  way to structure the dispatch follows.)
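To make the mutex pool idea concrete, here is a minimal userland sketch:
instead of embedding a mutex in every object, hash the object's address
into a small shared array of locks. It uses pthreads rather than the
kernel's mtx(9), and all of the names are made up for illustration -- this
is the shape of the idea, not the existing mtx_pool(9) interface.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define POOL_SIZE       32      /* number of shared locks in the pool */

static pthread_mutex_t pool[POOL_SIZE];

/* Initialize the shared pool once at startup. */
static void
pool_setup(void)
{
        int i;

        for (i = 0; i < POOL_SIZE; i++)
                pthread_mutex_init(&pool[i], NULL);
}

/* Hash an object's address to one of the pooled locks. */
static pthread_mutex_t *
pool_find(const void *obj)
{
        uintptr_t addr = (uintptr_t)obj;

        /* Skip the low bits, which carry little entropy due to alignment. */
        return (&pool[(addr >> 8) % POOL_SIZE]);
}

struct object {
        int     refs;           /* protected by the pooled lock */
};

static void
object_ref(struct object *obj)
{
        pthread_mutex_t *mtx = pool_find(obj);

        pthread_mutex_lock(mtx);
        obj->refs++;
        pthread_mutex_unlock(mtx);
}

int
main(void)
{
        struct object obj = { 0 };

        pool_setup();
        object_ref(&obj);
        printf("refs: %d\n", obj.refs);
        return (0);
}

The trade-off is that unrelated objects can contend for the same pooled
lock, but when contention is rare -- which is what the profiling above is
meant to confirm -- the memory savings come essentially for free.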
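And a rough sketch of what moving a subsystem's private object cache onto
UMA looks like, using System V message queues as the imagined example. The
struct and the zone name here are hypothetical; the uma_zcreate() family is
the real interface, but take this as a sketch rather than a patch.

/* Kernel-only fragment: a UMA zone in place of hand-rolled caching. */
#include <sys/param.h>
#include <sys/malloc.h>
#include <vm/uma.h>

struct my_msg {                         /* hypothetical cached object */
        long    msg_type;
        size_t  msg_len;
};

static uma_zone_t msg_zone;

static void
msg_zone_setup(void)
{
        /* UMA gives us per-CPU caching of items without custom code. */
        msg_zone = uma_zcreate("my_msgs", sizeof(struct my_msg),
            NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
}

static struct my_msg *
msg_alloc(void)
{
        /* M_WAITOK: sleep for memory rather than failing. */
        return (uma_zalloc(msg_zone, M_WAITOK));
}

static void
msg_free(struct my_msg *msg)
{
        uma_zfree(msg_zone, msg);
}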
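Finally, one way the CPU class dispatch could be structured: probe the CPU
once at boot and bind hot routines through a function pointer, so the
per-call cost is an indirect jump rather than a runtime CPU check. All of
the names and the i686 test below are illustrative, not existing kernel
interfaces.

#include <stddef.h>
#include <string.h>

/* Generic fallback: portable, tuned for nothing in particular. */
static void
zero_region_generic(void *buf, size_t len)
{
        memset(buf, 0, len);
}

/*
 * Stand-in for a variant tuned for a newer CPU class; a real one
 * might use wider or non-temporal stores on i686.
 */
static void
zero_region_i686(void *buf, size_t len)
{
        memset(buf, 0, len);
}

/* All callers go through this pointer; no per-call CPU tests. */
static void (*zero_region)(void *, size_t) = zero_region_generic;

/* Hypothetical probe, run once early in boot. */
static void
zero_region_init(int cpu_class)
{
        if (cpu_class >= 6)             /* i686 or better */
                zero_region = zero_region_i686;
}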
Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert_at_fledge.watson.org      Senior Research Scientist, McAfee Research