On Fri, Mar 27, 2015 at 10:40:57PM +0100, Jilles Tjoelker wrote:
> On Fri, Mar 27, 2015 at 03:26:17PM -0400, Eric van Gyzen wrote:
> > In a nutshell:
>
> > Clang emits SSE instructions on amd64 in the common path of
> > pthread_mutex_unlock. This reduces performance by a non-trivial
> > amount. I'd like to disable SSE in libthr.
>
> How about saving and restoring the FPU/SSE state eagerly instead of the
> current CR0.TS-based lazy method? There is overhead associated with #NM
> exception handling (fpudna) which is not worth it if FPU/SSE are used
> often. This would apply to userland threads only; kernel threads
> normally do not use FPU/SSE and handle the FPU/SSE state manually if
> they do.

First, we have no choice but to save the FPU context when a thread is
switched out. It is not practical to try to keep the state in the
hardware, since fetching it to another core is too troublesome.

Second, the biggest overhead of #NM is the reading of the FPU context
from memory (or cache), not the handler itself. The save area for
SSE-capable machines, i.e. all amd64, is ~400 bytes, and XSAVEOPT does
not help much for reading of the legacy FPU + XMM state. It does help
for YMM.

That said, your proposal would force all threads to pay a higher cost
at context switch time, increasing latency.

>
> There is performance improvement potential in using SSE for optimizing
> string functions, for example. Even a simple SSE2 strlen easily
> outperforms the already optimized lib/libc/string/strlen.c in a
> microbenchmark, and many other string functions are slow byte-at-a-time
> implementations.

If the program does a lot of work with the FPU between switches, the
cost is obviously mitigated. Note that even for the worst case of the
reported microbenchmark, the measured overhead is ~10-15%. So if string
ops indeed take a significant share of the program time, the FPU #NM
handling cost should be very low even with the current scheme.

Received on Sat Mar 28 2015 - 06:34:17 UTC
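For readers unfamiliar with the "simple SSE2 strlen" mentioned in the quoted text, the following is a minimal sketch of what such a routine might look like. It is not the code that was benchmarked in the thread and not FreeBSD's libc implementation. The name `sse2_strlen` is hypothetical, and `__builtin_ctz` assumes GCC or Clang. The routine relies on the fact that an aligned 16-byte load never crosses a page boundary, so reading a few bytes past the terminator cannot fault, although strict C and memory sanitizers do not sanction such over-reads.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration only: scan 16 bytes per iteration for NUL. */
size_t
sse2_strlen(const char *s)
{
	const char *p = s;

	/* Go byte-at-a-time until p is 16-byte aligned. */
	while (((uintptr_t)p & 15) != 0) {
		if (*p == '\0')
			return ((size_t)(p - s));
		p++;
	}
	assert(((uintptr_t)p & 15) == 0);

	const __m128i zero = _mm_setzero_si128();
	for (;;) {
		/* Aligned load: stays within one page, so it cannot fault. */
		__m128i chunk = _mm_load_si128((const __m128i *)p);
		/* Bit i of mask is set iff byte i of the chunk is NUL. */
		int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
		if (mask != 0) {
			/* Lowest set bit gives the offset of the first NUL. */
			return ((size_t)(p - s) +
			    (size_t)__builtin_ctz((unsigned)mask));
		}
		p += 16;
	}
}
```

Any use of the XMM registers like this in a thread that has not yet touched the FPU since its last context switch is what triggers the #NM path discussed above under the lazy scheme.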