Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?

From: Bruce Evans <brde_at_optusnet.com.au>
Date: Sat, 24 Dec 2011 17:16:33 +1100 (EST)

On Fri, 23 Dec 2011, Alexander Best wrote:

> is -mpreferred-stack-boundary=2 really necessary for i386 builds any longer?
> i built GENERIC (including modules) with and without that flag. the results
> are:

The same as it has always been.  It avoids some bloat.

> 1654496	bytes with the flag set
> vs.
> 1654952	bytes with the flag unset

I don't believe this.  GENERIC is enormously bloated, so its size is
more like 16MB than 1.6MB.  Even a savings of 4K instead of 456 bytes
is hard to believe.  I get a savings of 9K (text) in a 5MB kernel.
Changing the default target arch from i386 to pentium-undocumented has
reduced the text space savings a little.  The default for passing args
is now to preallocate stack space for them and store into it, instead
of pushing them.  This preallocation results in more functions needing
to allocate some stack space explicitly, and once some is allocated
explicitly, the text space cost doesn't depend on the size of the
allocation.

Anyway, the savings are mostly from avoiding cache misses from
sparse allocation on stacks.
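
To put rough numbers on "sparse allocation", here is a minimal sketch,
not taken from the kernel.  It assumes the simplification that every
frame is padded up to a multiple of 2^N bytes for
-mpreferred-stack-boundary=N, and it ignores return addresses and saved
registers.  The helper names and the 20-byte frame are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes implied by -mpreferred-stack-boundary=n, i.e. 2^n. */
static size_t boundary_bytes(unsigned n)
{
    assert(n < 8 * sizeof(size_t));
    return (size_t)1 << n;
}

/* Round a frame size up to a multiple of align (must be a power of 2). */
static size_t padded_frame(size_t frame, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (frame + align - 1) & ~(align - 1);
}

/*
 * Simplified model: stack used by `depth` nested calls that each need
 * `frame` bytes, with every frame padded to the 2^n boundary.
 */
static size_t chain_usage(size_t frame, unsigned depth, unsigned n)
{
    return depth * padded_frame(frame, boundary_bytes(n));
}
```

Under this model, chain_usage(20, 10, 2) is 200 bytes and
chain_usage(20, 10, 4) is 320 bytes, so the same call chain is spread
over more of the stack and is likely to touch more cache lines.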

Also, FreeBSD-i386 hasn't been programmed to support aligned stacks:
- KSTACK_PAGES on i386 is 2, while on amd64 it is 4.  Using more
   stack might push something over the edge.
- not much care is taken to align the initial stack or to keep the
   stack aligned in calls from asm code.  E.g., any alignment for
   mi_startup() (and thus proc0?) is accidental.  This may result
   in perfect alignment or perfect misalignment.  Hopefully, more
   care is taken with thread startup.  For gcc, the alignment is
   done bogusly in main() in userland, but there is no main() in
   the kernel.  The alignment doesn't matter much (provided the
   perfect misalignment is still to a multiple of 4), but when it
   matters, the random misalignment that results from not trying to
   do it at all is better than perfect misalignment from getting it
   wrong.  With 4-byte alignment, the only cases that it helps are
   with 64-bit variables.
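
The "perfect misalignment" case can be sketched as follows.  This is a
simplified model, not real compiler output: it assumes the compiler
trusts the stack pointer to be aligned on entry and only ever subtracts
multiples of the boundary, so whatever offset the initial stack had is
carried into every frame.  The helper names and the initial stack
address are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Distance of addr above the previous multiple of align (power of 2). */
static uintptr_t misalignment(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & (align - 1);
}

/*
 * Model of entering a function: the compiler assumes sp is already
 * aligned and just subtracts a frame size that is a multiple of the
 * boundary.  Any initial misalignment is preserved, never corrected.
 */
static uintptr_t enter_frame(uintptr_t sp, uintptr_t padded_frame_size)
{
    return sp - padded_frame_size;
}
```

With a hypothetical initial stack pointer of 0xc0001ff4, the offset
from a 16-byte boundary is 4 and stays 4 after enter_frame(sp, 32).
The address is still a multiple of 4, but a 64-bit variable placed
there is not 8-byte aligned.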

> the gcc(1) man page states the following:
>
> "
> This extra alignment does consume extra stack space, and generally
> increases code size.  Code that is sensitive to stack space usage,
> such as embedded systems and operating system kernels, may want to
> reduce the preferred alignment to -mpreferred-stack-boundary=2.
> "
>
> the comment in sys/conf/kern.mk however sorta suggests that the default
> alignment of 4 bytes might improve performance.

The default stack alignment is 16 bytes, which unimproves performance.

clang handles stack alignment correctly (only does it when it is needed)
so it doesn't need a -mpreferred-stack-boundary option and doesn't
always break without alignment in main().  Well, at least it used to
handle this correctly, IIRC.  Testing it now shows that it does the
necessary andl of the
stack pointer for __aligned(32), but for __aligned(16) it now assumes
that the stack is aligned by the caller.  So it now needs
-mpreferred-stack-boundary=2, but doesn't have it.  OTOH, clang doesn't
do the andl in main() like gcc does (unless you put a dummy __aligned(32)
there), but requires crt to pass an aligned stack.
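
For reference, the andl in question just rounds the stack pointer down
to the requested boundary.  The sketch below shows that arithmetic in
C, plus a local declared with the attribute that FreeBSD's
__aligned(32) macro expands to.  It is only an illustration: the
function names are hypothetical, and whether the compiler realigns the
stack or trusts the caller for such a local is exactly the
compiler-dependent behaviour described above.

```c
#include <assert.h>
#include <stdint.h>

/* What "andl $-align, %esp" computes: round sp down to a multiple of align. */
static uintptr_t realign_down(uintptr_t sp, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return sp & ~(align - 1);
}

/*
 * Returns 1 if a local requesting 32-byte alignment is really placed on
 * a 32-byte boundary.  __attribute__((aligned(32))) is the GNU C
 * spelling behind __aligned(32).
 */
static int aligned_local_ok(void)
{
    char buf[32] __attribute__((aligned(32)));

    buf[0] = 0;
    return ((uintptr_t)buf % 32) == 0;
}
```

For example, realign_down(0xbfffe9f4, 32) gives 0xbfffe9e0, while an
already aligned value such as 0x1000 is left unchanged.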

Bruce