On Fri, Mar 27, 2015 at 01:49:03PM -0700, Rui Paulo wrote:
> On Mar 27, 2015, at 12:26, Eric van Gyzen <vangyzen_at_FreeBSD.org> wrote:
> >
> > In a nutshell:
> >
> > Clang emits SSE instructions on amd64 in the common path of
> > pthread_mutex_unlock. This reduces performance by a non-trivial amount. I'd
> > like to disable SSE in libthr.
> >
> > In more detail:
> >
> > In libthr/thread/thr_mutex.c, we find the following:
> >
> >     #define MUTEX_INIT_LINK(m) do { \
> >         (m)->m_qe.tqe_prev = NULL; \
> >         (m)->m_qe.tqe_next = NULL; \
> >     } while (0)
> >
> > In 9.1, clang 3.1 emits two ordinary mov instructions:
> >
> >     movq $0x0,0x8(%rax)
> >     movq $0x0,(%rax)
> >
> > Since 10.0 and clang 3.3, clang emits these SSE instructions:
> >
> >     xorps %xmm0,%xmm0
> >     movups %xmm0,(%rax)
> >
> > Although these look harmless enough, using the FPU can reduce performance by
> > incurring extra overhead due to context-switching the FPU state.
> >
> > As I mentioned, this code is used in the common path of pthread_mutex_unlock. I
> > have a simple test program that creates four threads, all contending for a
> > single mutex, and measures the total number of lock acquisitions over several
> > seconds. When libthr is built with SSE, as is current, I get around 53 million
> > locks in 5 seconds. Without SSE, I get around 60 million (13% more). DTrace
> > shows around 790,000 calls to fpudna versus 10 calls. There could be other
> > factors involved, but I presume that the FPU context switches account for most
> > of the change in performance.
> >
> > Even when I add some SSE usage in the application--incidentally, these same
> > instructions--building libthr without SSE improves performance from 53.5 million
> > to 55.8 million (4.3%).
> >
> > In the real-world application where I first noticed this, performance improves
> > by 3-5%.
> >
> > I would appreciate your thoughts and feedback. The proposed patch is below.
> >
> > Eric
> >
> >
> >
> > Index: base/head/lib/libthr/arch/amd64/Makefile.inc
> > ===================================================================
> > --- base/head/lib/libthr/arch/amd64/Makefile.inc (revision 280703)
> > +++ base/head/lib/libthr/arch/amd64/Makefile.inc (working copy)
> > @@ -1,3 +1,8 @@
> >  #$FreeBSD$
> >
> >  SRCS+= _umtx_op_err.S
> > +
> > +# Using SSE incurs extra overhead per context switch,
> > +# which measurably impacts performance when the application
> > +# does not otherwise use FP/SSE.
> > +CFLAGS+=-mno-sse
>
> Good catch!
>
> Regarding your patch, I think we should disable even more, if possible. How about:
>
> CFLAGS+= -mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3

I think so. Also, this should be done for libc as well, both on i386 and
amd64. I am not sure, should compiler-rt be included into the set?

Received on Fri Mar 27 2015 - 20:44:58 UTC
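
The test program described in the quoted message is not included in the thread. For readers who want to try a similar measurement, the following is a minimal sketch of that kind of benchmark, assuming C11 atomics and POSIX threads. The names `worker`, `run_contention`, and `MAX_THREADS` are hypothetical, and this is not Eric's actual program, so absolute numbers will not match the figures quoted above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

#define MAX_THREADS 64          /* hypothetical upper bound for this sketch */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool stop;        /* set by the controlling thread to end the run */
static uint64_t total;          /* lock acquisitions; protected by `lock` */

/* Each worker repeatedly takes and releases the single shared mutex. */
static void *
worker(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop)) {
        pthread_mutex_lock(&lock);
        total++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/*
 * Run `nthreads` workers contending for one mutex for about `seconds`
 * seconds and return the total number of lock acquisitions.
 */
static uint64_t
run_contention(int nthreads, unsigned int seconds)
{
    pthread_t tids[MAX_THREADS];
    int i, started = 0;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    total = 0;
    atomic_store(&stop, false);

    for (i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[i], NULL, worker, NULL) != 0)
            break;              /* continue with the threads that did start */
        started++;
    }

    sleep(seconds);
    atomic_store(&stop, true);

    for (i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    /* All workers have been joined, so reading `total` unlocked is safe. */
    return total;
}
```

Calling `run_contention(4, 5)` from a small `main` and printing the result gives a count comparable in kind to the "locks in 5 seconds" figure above. Running the same binary against a libthr built with and without `-mno-sse` is the comparison the message describes.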