On Fri, 19 Jan 2007, Peter Jeremy wrote:

> On Thu, 2007-Jan-18 18:03:20 +1100, Bruce Evans wrote:
>> On Wed, 17 Jan 2007, Matthew Dillon wrote:
>>>    Alignment is critical.  If the data is not aligned, don't bother.  128
>>>    bits means 16 byte alignment.
>>
>> The above benchmark output is for aligned data :-).  I don't try hard to
>> optimize or benchmark misaligned cases.
>
> How realistic is this?  Has anyone collected statistics on the size and
> alignment of bzero/bcopy calls?  How much of the time is the size known
> at compile time?

I think perfect alignment is very realistic.  If not, it is an application
bug :), just like for misaligned integer accesses on arches that allow
this.  In the kernel, other parts of the kernel are the application and it
is reasonable to require perfect alignment.  I recently did a dynamic
search for misaligned (but only 32-bit non-aligned) bxx's (maybe only
bzeros) in low-level network code and found only a couple.

For the original i586 FPU optimizations, I gathered statistics for
bcopy/bzero.  IIRC, alignment (64-bit?) was normal, at least for the large
copies of interest, and large bcopys were so uncommon that it was a
complete waste of time to optimize them (at least for my applications).
Large bzeros/copyins/copyouts are more common.

FreeBSD has some optimizations in low-level networking code for bcopys
with a small size that is known at compile time (just use gcc's
__builtin_memcpy).  These were lost to -ffreestanding and/or gcc's
aggressive optimization of things like printf using the builtin printf.
(-ffreestanding implies -fno-builtin, and no one cared enough about the
loss to turn builtins back on.  If you turn them back on, then they should
be turned on individually as recommended in gcc.info to avoid conflicts.
This is easy enough for the memcpy builtin but messy if you want all the
old builtins starting with strlen.)
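
To illustrate turning a single builtin back on, here is a minimal sketch
along the lines of what gcc.info recommends for -fno-builtin and
-ffreestanding builds.  The struct pkt_hdr type and the hdr_copy() and
is_aligned() helpers are hypothetical names made up for the example, not
FreeBSD code, and assert() stands in for whatever a kernel would really
use (KASSERT, say).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>	/* only so that the sketch also builds hosted */

/* Re-enable just the memcpy builtin, as gcc.info suggests for -fno-builtin. */
#undef memcpy
#define	memcpy(d, s, n)	__builtin_memcpy((d), (s), (n))

/* Hypothetical small header whose size is known at compile time. */
struct pkt_hdr {
	uint32_t src;
	uint32_t dst;
	uint16_t sport;
	uint16_t dport;
};

/* Nonzero if p is aligned to 'align' bytes (align must be a power of 2). */
static inline int
is_aligned(const void *p, size_t align)
{
	return (((uintptr_t)p & (align - 1)) == 0);
}

static inline void
hdr_copy(struct pkt_hdr *dst, const struct pkt_hdr *src)
{
	/* Misalignment is treated as a caller bug, not a case to optimize. */
	assert(is_aligned(dst, sizeof(uint32_t)));
	assert(is_aligned(src, sizeof(uint32_t)));
	memcpy(dst, src, sizeof(*dst));
}
```

Since sizeof(*dst) is a compile-time constant, the compiler is free to
expand the copy inline instead of calling the library memcpy.  How good
the expansion is for a given -march is a separate question.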

I looked at these lost optimizations again while trying to optimize the
low-level networking code for packets-per-second.  The difficulty of
implementing memcpy/bcopy perfectly is shown by gcc's builtin not being
very close to getting it right for small fixed sizes even with -march=...
I lost interest in this for now when I found that optimizations were
impossible to measure because the packet rate depends mysteriously on the
layout of the text section.  My changes may have given +10%, but unrelated
changes gave +-30%.  The most mysterious one was -20% when a cvs update
added ~500 bytes of object code that is never executed.  Using builtin
memcpy didn't have a noticeable effect here.

Bruce

Received on Fri Jan 19 2007 - 06:14:26 UTC