Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?

From: Bruce Evans <brde_at_optusnet.com.au>
Date: Mon, 26 Dec 2011 02:09:54 +1100 (EST)
On Sat, 24 Dec 2011, Alexander Best wrote:

> On Sat Dec 24 11, Bruce Evans wrote:
>> On Sat, 24 Dec 2011, Alexander Best wrote:
>>
>>> On Sat Dec 24 11, Bruce Evans wrote:
>>>> On Fri, 23 Dec 2011, Alexander Best wrote:
>>> ...
>>>>> the gcc(1) man page states the following:
>>>>>
>>>>> "
>>>>> This extra alignment does consume extra stack space, and generally
>>>>> increases code size.  Code that is sensitive to stack space usage,
>>>>> such as embedded systems and operating system kernels, may want to
>>>>> reduce the preferred alignment to -mpreferred-stack-boundary=2.
>>>>> "
>>>>>
>>>>> the comment in sys/conf/kern.mk however sorta suggests that the default
>>>>> alignment of 4 bytes might improve performance.
>>>>
>>>> The default stack alignment is 16 bytes, which unimproves performance.
>>>
>>> maybe the part of the comment in sys/conf/kern.mk, which mentions that a
>>> stack
>>> alignment of 16 bytes might improve micro benchmark results should be
>>> removed.
>>> this would prevent people (like me) from thinking, using a stack alignment
>>> of
>>> 4 bytes is a compromise between size and efficiency. it isn't! currently a
>>> stack alignment of 16 bytes has no advantages over one with 4 bytes on
>>> i386.
>>
>> I think the comment is clear enough.  It mentions all the tradeoffs.
>> It is only slightly cryptic in saying that these are tradeoffs and that
>> the configuration is our best guess at the best tradeoff -- it just says
>> "while" for both.  It goes without saying that we don't use our worst
>> guess.  Anyone wanting to change this should run benchmarks and beware
>> that micro-benchmarks are especially useless.  The changed comment is not
>> so good since it no longer mentions micro-benchmarks or says "while".
>
> if micro benchmark results aren't of any use, why should the claim that the
> default stack alignment of 16 bytes might produce a better outcome stay?

Because:
- the actual claim is the opposite of that (it is that the default 16-byte
   alignment is probably a loss overall)
- the claim that the default 16-byte alignment may benefit micro-benchmarks
   is true, even without the weaselish miswording of "might" in it.  There
   is always at least 1 micro-benchmark that will benefit from almost any
   change, and here we expect a benefit in many microbenchmarks that don't
   bust the caches.  Except, 16-byte alignment isn't supported (*) in the
   kernel, so we actually expect a loss from many microbenchmarks that
   don't bust the caches.
- the second claim warns inexperienced benchmarkers not to claim that the
   default is better because it is better in microbenchmarks.

> it doesn't seem as if anybody has micro benchmarked 16 bytes vs. 4 bytes stack
> alignment, until now. so the micro benchmark statement in the comment seems to
> be pure speculation.

No, it is obviously true.

> even worse...it indicates that by removing the
> -mpreferred-stack-boundary=2 flag, one can gain a performance boost by
> sacrificing a few more bytes of kernel (and module) size.

No, it is part of the sentence explaining why removing the
-mpreferred-stack-boundary=2 flag will probably regain the "overall loss"
that is avoided by using the flag.

> this suggests that the behavior of -mpreferred-stack-boundary=2 vs. not
> specifying it loosely equals the semantics of -Os vs. -O2.

No, -Os guarantees slower execution by forcing optimization to prefer
space savings over time savings in more ways.  Except, -Os is completely
broken in -current (in the kernel), and gives very large negative space
savings (about 50%).  It last worked with gcc-3.  Its brokenness with
gcc-4 is related to kern.pre.mk still specifying -finline-limit flags
that are more suitable for gcc-3 (gcc has _many_ flags for giving more
delicate control over inlining, and better defaults for them) and
excessive inlining in gcc-4 given by -funit-at-a-time
-finline-functions-called-once.  These apparently cause gcc's inliner
to go insane with -Os.  When I tried to fix this by reducing inlining,
I couldn't find any threshold that fixed -Os without breaking inlining
of functions that are declared inline.

(*) A primary part of the lack of support for 16-byte stack alignment in
the kernel is that there is no special stack alignment for the main kernel
entry point, namely syscall().  From i386/exception.s:

% 	SUPERALIGN_TEXT
% IDTVEC(int0x80_syscall)

At this point, the stack has 5 words on it (it was 16-byte aligned before
that).

% 	pushl	$2			/* sizeof "int 0x80" */
% 	subl	$4,%esp			/* skip over tf_trapno */
% 	pushal
% 	pushl	%ds
% 	pushl	%es
% 	pushl	%fs
% 	SET_KERNEL_SREGS
% 	cld
% 	FAKE_MCOUNT(TF_EIP(%esp))
% 	pushl	%esp

We "push" 14 more words.  This gives perfect misalignment to the worst odd
word boundary (perfect if only word boundaries are allowed).  gcc wants
the stack to be aligned to a 4*n word boundary before function calls,
but here we have a 4*n+3 word boundary.  (4*n+3 is worse than 4*n+1
since 2 more words instead of 4 will cross the next 16-byte boundary).

% 	call	syscall

Using the default -mpreferred-stack-boundary will preserve the perfect
misalignment across all C functions called by syscall().

% 	add	$4, %esp
% 	MEXITCOUNT
% 	jmp	doreti

Old versions didn't have the pessimization of pushing the frame pointer.
This is a minor pessimization, except it uses more stack, unless you
use the default -mpreferred-stack-boundary.  Without this, only 18 words
were pushed, so the misalignment was imperfect (to a 4*n+2 word
boundary).  If the default stack alignment is any use at all (in the
kernel), then it is mainly to prevent 64-bit data types being laid out
across cache line boundaries.  Alignment to a 4*n+2 word boundary gives
that just as well as alignment to a 4*n+0 word boundary.

I tested using the default -mpreferred-stack-boundary in FreeBSD-~5.2,
which doesn't push the frame pointer.  This gave the expected results,
except the optimization for a microbenchmark was surprisingly large.
For a macro-benchmark, I built some kernels.  This seemed to take a
little longer (about 0.2%, and not statistically significant).  But
the time for a clock_gettime() microbenchmark was reduced from 271 ns
per call to 263.5 ns per call.  That's with the stack for clock_gettime()
imperfectly misaligned to a 4*n+2 word boundary.  But changing the
stack alignment by subtracting more from the stack in syscall() made
little difference, unless it was changed to an odd byte boundary (then
clock_gettime() took about 324 ns).

amd64 is of course more careful about this (since its ABI requires
16-byte alignment).  According to log messages, the initial %rsp
(before anything is pushed onto it in the above) is offset by 8
bytes or so, as necessary to make the final %rsp come out aligned.
Pushing the frame pointer would have broken this.  However, on
amd64, the first arg is passed in %rdi, so there is no push to
pass the frame pointer and the stack remains aligned.  When the
frame pointer was passed "by reference", adjusting the stack
after the pushes would have broken the reference, so the offset
method was essential.  Now it is not needed (unless we want or
need the frame to be aligned), since %rdi can pass the frame pointer
wherever the frame is, and the offset method becomes a minor
optimization.  If you remove the -mpreferred-stack-boundary=2
optimization, be sure to remove this one too, since it is tinier.

Bruce
Received on Sun Dec 25 2011 - 14:09:59 UTC
