On Sat, 24 Dec 2011, Alexander Best wrote:

> On Sat Dec 24 11, Bruce Evans wrote:
>> On Sat, 24 Dec 2011, Alexander Best wrote:
>>
>>> On Sat Dec 24 11, Bruce Evans wrote:
>>>> On Fri, 23 Dec 2011, Alexander Best wrote:
>>> ...
>>>>> the gcc(1) man page states the following:
>>>>>
>>>>> "
>>>>> This extra alignment does consume extra stack space, and generally
>>>>> increases code size.  Code that is sensitive to stack space usage,
>>>>> such as embedded systems and operating system kernels, may want to
>>>>> reduce the preferred alignment to -mpreferred-stack-boundary=2.
>>>>> "
>>>>>
>>>>> the comment in sys/conf/kern.mk however sorta suggests that the
>>>>> default alignment of 4 bytes might improve performance.
>>>>
>>>> The default stack alignment is 16 bytes, which unimproves performance.
>>>
>>> maybe the part of the comment in sys/conf/kern.mk, which mentions that
>>> a stack alignment of 16 bytes might improve micro benchmark results,
>>> should be removed.  this would prevent people (like me) from thinking
>>> that using a stack alignment of 4 bytes is a compromise between size
>>> and efficiency.  it isn't!  currently a stack alignment of 16 bytes has
>>> no advantages over one of 4 bytes on i386.
>>
>> I think the comment is clear enough.  It mentions all the tradeoffs.
>> It is only slightly cryptic in saying that these are tradeoffs and that
>> the configuration is our best guess at the best tradeoff -- it just says
>> "while" for both.  It goes without saying that we don't use our worst
>> guess.  Anyone wanting to change this should run benchmarks and beware
>> that micro-benchmarks are especially useless.  The changed comment is
>> not so good since it no longer mentions micro-benchmarks or says
>> "while".
>
> if micro benchmark results aren't of any use, why should the claim that
> the default stack alignment of 16 bytes might produce better outcomes
> stay?
Because:
- the actual claim is the opposite of that (it is that the default 16-byte
  alignment is probably a loss overall)
- the claim that the default 16-byte alignment may benefit
  micro-benchmarks is true, even without the weaselish miswording of
  "might" in it.  There is always at least 1 micro-benchmark that will
  benefit from almost any change, and here we expect a benefit in many
  micro-benchmarks that don't bust the caches.  Except, 16-byte alignment
  isn't supported (*) in the kernel, so we actually expect a loss from
  many micro-benchmarks that don't bust the caches.
- the second claim warns inexperienced benchmarkers not to claim that the
  default is better because it is better in micro-benchmarks.

> it doesn't seem as if anybody has micro benchmarked 16 bytes vs. 4 bytes
> stack alignment, until now.  so the micro benchmark statement in the
> comment seems to be pure speculation.

No, it is obviously true.

> even worse...it indicates that by removing the
> -mpreferred-stack-boundary=2 flag, one can gain a performance boost by
> sacrificing a few more bytes of kernel (and module) size.

No, it is part of the sentence explaining why removing the
-mpreferred-stack-boundary=2 flag will probably regain the "overall loss"
that is avoided by using the flag.

> this suggests that the behavior of -mpreferred-stack-boundary=2 vs. not
> specifying it loosely equals the semantics of -Os vs. -O2.

No, -Os guarantees slower execution by forcing optimization to prefer
space savings over time savings in more ways.  Except, -Os is completely
broken in -current (in the kernel), and gives very large negative space
savings (about 50%).  It last worked with gcc-3.  Its brokenness with
gcc-4 is related to kern.pre.mk still specifying -finline-limit flags that
are more suitable for gcc-3 (gcc has _many_ flags for giving more delicate
control over inlining, and better defaults for them) and excessive
inlining in gcc-4 given by -funit-at-a-time
-finline-functions-called-once.
These apparently cause gcc's inliner to go insane with -Os.  When I tried
to fix this by reducing inlining, I couldn't find any threshold that fixed
-Os without breaking inlining of functions that are declared inline.

(*) A primary part of the lack of support for 16-byte stack alignment in
the kernel is that there is no special stack alignment for the main kernel
entry point, namely syscall().  From i386/exception.s:

% 	SUPERALIGN_TEXT
% IDTVEC(int0x80_syscall)

At this point, the stack has 5 words on it (it was 16-byte aligned before
that).

% 	pushl	$2			/* sizeof "int 0x80" */
% 	subl	$4,%esp			/* skip over tf_trapno */
% 	pushal
% 	pushl	%ds
% 	pushl	%es
% 	pushl	%fs
% 	SET_KERNEL_SREGS
% 	cld
% 	FAKE_MCOUNT(TF_EIP(%esp))
% 	pushl	%esp

We "push" 14 more words.  This gives perfect misalignment to the worst odd
word boundary (perfect if only word boundaries are allowed).  gcc wants
the stack to be aligned to a 4*n word boundary before function calls, but
here we have a 4*n+3 word boundary.  (4*n+3 is worse than 4*n+1 since 2
more words instead of 4 will cross the next 16-byte boundary.)

% 	call	syscall

Using the default -mpreferred-stack-boundary will preserve the perfect
misalignment across all C functions called by syscall().

% 	add	$4, %esp
% 	MEXITCOUNT
% 	jmp	doreti

Old versions didn't have the pessimization of pushing the frame pointer.
This is a minor pessimization, except it uses more stack, unless you use
the default -mpreferred-stack-boundary.  Without this, only 18 words were
pushed, so the misalignment was imperfect (to a 4*n+2 word boundary).  If
the default stack alignment is any use at all (in the kernel), then it is
mainly to prevent 64-bit data types being laid out across cache line
boundaries.  Alignment to a 4*n+2 word boundary gives that just as well as
alignment to a 4*n+0 word boundary.

I tested using the default -mpreferred-stack-boundary in FreeBSD-~5.2,
which doesn't push the frame pointer.
This gave the expected results, except the optimization for a
microbenchmark was surprisingly large.  For a macro-benchmark, I built
some kernels.  This seemed to take a little longer (about 0.2%, and not
statistically significant).  But the time for a clock_gettime()
microbenchmark was reduced from 271 ns per call to 263.5 ns per call.
That's with the stack for clock_gettime() imperfectly misaligned to a
4*n+2 word boundary.  But changing the stack alignment by subtracting more
from the stack in syscall() made little difference, unless it was changed
to an odd byte boundary (then clock_gettime() took about 324 ns).

amd64 is of course more careful about this (since its ABI requires 16-byte
alignment).  According to log messages, the initial %rsp (before anything
is pushed onto it in the above) is offset by 8 bytes or so, as necessary
to make the final %rsp come out aligned.  Pushing the frame pointer would
have broken this.  However, on amd64, the first arg is passed in %rdi, so
there is no push to pass the frame pointer and the stack remains aligned.
When the frame pointer was passed "by reference", adjusting the stack
after the pushes would have broken the reference, so the offset method was
essential.  Now it is not needed (unless we want or need the frame to be
aligned), since %rdi can pass the frame pointer wherever the frame is, and
the offset method becomes a minor optimization.  If you remove the
-mpreferred-stack-boundary=2 optimization, be sure to remove this one too,
since it is tinier.

Bruce

Received on Sun Dec 25 2011 - 14:09:59 UTC
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:22 UTC