On 3/28/15 5:44 AM, Konstantin Belousov wrote:
> On Fri, Mar 27, 2015 at 01:49:03PM -0700, Rui Paulo wrote:
>> On Mar 27, 2015, at 12:26, Eric van Gyzen <vangyzen@FreeBSD.org> wrote:
>>> In a nutshell:
>>>
>>> Clang emits SSE instructions on amd64 in the common path of
>>> pthread_mutex_unlock.  This reduces performance by a non-trivial
>>> amount.  I'd like to disable SSE in libthr.
>>>
>>> In more detail:
>>>
>>> In libthr/thread/thr_mutex.c, we find the following:
>>>
>>>     #define MUTEX_INIT_LINK(m)       do {    \
>>>         (m)->m_qe.tqe_prev = NULL;           \
>>>         (m)->m_qe.tqe_next = NULL;           \
>>>     } while (0)
>>>
>>> In 9.1, clang 3.1 emits two ordinary mov instructions:
>>>
>>>     movq   $0x0,0x8(%rax)
>>>     movq   $0x0,(%rax)
>>>
>>> Since 10.0 and clang 3.3, clang emits these SSE instructions:
>>>
>>>     xorps  %xmm0,%xmm0
>>>     movups %xmm0,(%rax)
>>>
>>> Although these look harmless enough, using the FPU can reduce
>>> performance by incurring extra overhead due to context-switching
>>> the FPU state.
>>>
>>> As I mentioned, this code is used in the common path of
>>> pthread_mutex_unlock.  I have a simple test program that creates
>>> four threads, all contending for a single mutex, and measures the
>>> total number of lock acquisitions over several seconds.  When
>>> libthr is built with SSE, as is current, I get around 53 million
>>> locks in 5 seconds.  Without SSE, I get around 60 million (13%
>>> more).  DTrace shows around 790,000 calls to fpudna versus 10
>>> calls.  There could be other factors involved, but I presume that
>>> the FPU context switches account for most of the change in
>>> performance.
>>>
>>> Even when I add some SSE usage in the application--incidentally,
>>> these same instructions--building libthr without SSE improves
>>> performance from 53.5 million to 55.8 million (4.3%).
>>>
>>> In the real-world application where I first noticed this,
>>> performance improves by 3-5%.
>>>
>>> I would appreciate your thoughts and feedback.  The proposed
>>> patch is below.
>>>
>>> Eric
>>>
>>>
>>> Index: base/head/lib/libthr/arch/amd64/Makefile.inc
>>> ===================================================================
>>> --- base/head/lib/libthr/arch/amd64/Makefile.inc	(revision 280703)
>>> +++ base/head/lib/libthr/arch/amd64/Makefile.inc	(working copy)
>>> @@ -1,3 +1,8 @@
>>>  #$FreeBSD$
>>>
>>>  SRCS+=	_umtx_op_err.S
>>> +
>>> +# Using SSE incurs extra overhead per context switch,
>>> +# which measurably impacts performance when the application
>>> +# does not otherwise use FP/SSE.
>>> +CFLAGS+=-mno-sse
>>
>> Good catch!
>>
>> Regarding your patch, I think we should disable even more, if
>> possible.  How about:
>>
>> CFLAGS+= -mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3
>
> I think so.
>
> Also, this should be done for libc as well, both on i386 and amd64.
> I am not sure, should compiler-rt be included into the set ?

The point is that clang will do this anywhere it can, because it
isn't taking the side effects into account, just the speed of the
instructions themselves.

Received on Sat Mar 28 2015 - 12:54:21 UTC
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:56 UTC