On 3/28/15 5:44 AM, Konstantin Belousov wrote:
> On Fri, Mar 27, 2015 at 01:49:03PM -0700, Rui Paulo wrote:
>> On Mar 27, 2015, at 12:26, Eric van Gyzen <vangyzen@FreeBSD.org> wrote:
>>> In a nutshell:
>>>
>>> Clang emits SSE instructions on amd64 in the common path of
>>> pthread_mutex_unlock.  This reduces performance by a non-trivial
>>> amount.  I'd like to disable SSE in libthr.
>>>
>>> In more detail:
>>>
>>> In libthr/thread/thr_mutex.c, we find the following:
>>>
>>>     #define MUTEX_INIT_LINK(m)       do {    \
>>>         (m)->m_qe.tqe_prev = NULL;           \
>>>         (m)->m_qe.tqe_next = NULL;           \
>>>     } while (0)
>>>
>>> In 9.1, clang 3.1 emits two ordinary mov instructions:
>>>
>>>     movq   $0x0,0x8(%rax)
>>>     movq   $0x0,(%rax)
>>>
>>> Since 10.0 and clang 3.3, clang emits these SSE instructions:
>>>
>>>     xorps  %xmm0,%xmm0
>>>     movups %xmm0,(%rax)
>>>
>>> Although these look harmless enough, using the FPU can reduce
>>> performance by incurring extra overhead due to context-switching
>>> the FPU state.
>>>
>>> As I mentioned, this code is used in the common path of
>>> pthread_mutex_unlock.  I have a simple test program that creates
>>> four threads, all contending for a single mutex, and measures the
>>> total number of lock acquisitions over several seconds.  When
>>> libthr is built with SSE, as is current, I get around 53 million
>>> locks in 5 seconds.  Without SSE, I get around 60 million (13%
>>> more).  DTrace shows around 790,000 calls to fpudna versus 10
>>> calls.  There could be other factors involved, but I presume that
>>> the FPU context switches account for most of the change in
>>> performance.
>>>
>>> Even when I add some SSE usage in the application--incidentally,
>>> these same instructions--building libthr without SSE improves
>>> performance from 53.5 million to 55.8 million (4.3%).
>>>
>>> In the real-world application where I first noticed this,
>>> performance improves by 3-5%.
>>>
>>> I would appreciate your thoughts and feedback.  The proposed
>>> patch is below.
>>>
>>> Eric
>>>
>>>
>>> Index: base/head/lib/libthr/arch/amd64/Makefile.inc
>>> ===================================================================
>>> --- base/head/lib/libthr/arch/amd64/Makefile.inc	(revision 280703)
>>> +++ base/head/lib/libthr/arch/amd64/Makefile.inc	(working copy)
>>> @@ -1,3 +1,8 @@
>>>  #$FreeBSD$
>>>
>>>  SRCS+=	_umtx_op_err.S
>>> +
>>> +# Using SSE incurs extra overhead per context switch,
>>> +# which measurably impacts performance when the application
>>> +# does not otherwise use FP/SSE.
>>> +CFLAGS+=-mno-sse
>>
>> Good catch!
>>
>> Regarding your patch, I think we should disable even more, if
>> possible.  How about:
>>
>> CFLAGS+= -mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3
>
> I think so.
>
> Also, this should be done for libc as well, both on i386 and amd64.
> I am not sure, should compiler-rt be included into the set ?

The point is that clang will do this anywhere it can, because it
isn't taking the side effects into account, just the speed of the
instructions themselves.

Received on Sat Mar 28 2015 - 12:54:21 UTC
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:56 UTC