Re: SSE in libthr

From: Adrian Chadd <adrian_at_freebsd.org> Date: Fri, 27 Mar 2015 17:43:14 -0700

On 27 March 2015 at 16:03, Alan Somers <asomers_at_freebsd.org> wrote:
> On Fri, Mar 27, 2015 at 4:36 PM, Adrian Chadd <adrian_at_freebsd.org> wrote:
>> hi,
>>
>> please don't try to microoptimise crap like strlen().
>>
>> The TL;DR for performant high-throughput code is: if strlen() or
>> memcpy() is the thing that's costing you the most, you're doing it
>> wrong.
>>
>>
>>
>> -adrian
>
> I respectfully disagree.  A well-optimized libc will benefit
> _every_single_program_ that uses strlen.  That includes Apache, Samba,
> Memcached, Quake, and basically every single program that every single
> FreeBSD user uses.  There's no reason that 3rd party software
> maintainers should have to rewrite basic libc functions in order to
> get decent performance on FreeBSD.  And the downsides are so small!
> In 2015, we should assume by default that most userland software is
> using SIMD instructions.  As Eric noticed, Clang emits them freely.
> What's the point of lazily saving the SSE registers on context
> switches if essentially all programs compiled from Ports will be using
> those registers anyway?  I agree with Jilles; I think we should always
> save the SSE registers for userland programs.

That's fine, but those benchmarks and improvements also have to take
into account the environment that these programs are running in, and
all of the other things that are going on with it.

Fixing strlen() to use SSE2 is great, but if the gains are offset by
fpu save/restore when doing fine-grained locking that's blocking under
real-world workloads, what's the benefit? What about if the system is
context switching over a million times a second? These are real life
things I see servers running all of the above software /do/.

One only knows with benchmarking, not microbenchmarking.

Microbenchmarks are great. They serve a purpose, which is "how the
heck does the current silicon I'm running on run some code that I've
cleverly crafted to hopefully run well."

I'm totally for saving/restoring SSE registers for userland programs.
But that's not where that kind of "make stuff fast" work should stop.
If it does, and that's where your benchmarking for the real world
stops, then you're doing it wrong.

Everything is a toss-up. For this userland based netmap packet pushing
app, SSE may be nice for some instructions, but know what else screws
things? The fact that the default scheduler policy is terrible and
crap gets scheduled /everywhere/ under any appreciable amount of load.
That the context switch rate is high, the interrupt rate is also high,
and with a little locking going on, I see fpu save/restore occur for a
not-insignificant fraction of CPU time. Optimising strlen() or memcpy()
is great, but when my system context switches a million times a second,
we're never going to reach the steady state at which these CPUs can
really crank out real work.
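
For anyone who wants a rough feel for the context switch side of this,
here is a minimal sketch, not taken from any real workload. It uses
getrusage() counters (via Python's resource module) to compare a
CPU-bound loop against two threads handing a token back and forth.
The names measure, busy_loop and lock_handoff are made up for
illustration, and the counters are process-wide, so treat the numbers
as indicative only.

```python
import resource
import threading
import time


def measure(workload):
    """Run workload() and return (seconds, voluntary, involuntary).

    The two counts are deltas of the process-wide context switch
    counters reported by getrusage(RUSAGE_SELF).
    """
    before = resource.getrusage(resource.RUSAGE_SELF)
    start = time.perf_counter()
    workload()
    elapsed = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_SELF)
    return (elapsed,
            after.ru_nvcsw - before.ru_nvcsw,
            after.ru_nivcsw - before.ru_nivcsw)


def busy_loop(n=200_000):
    # Hypothetical CPU-bound workload: it never blocks on anything.
    total = 0
    for i in range(n):
        total += i
    return total


def lock_handoff(rounds=2_000):
    # Hypothetical contended workload: two threads pass a token back
    # and forth, so each acquire() may block until the other side runs.
    ping = threading.Semaphore(0)
    pong = threading.Semaphore(0)

    def partner():
        for _ in range(rounds):
            ping.acquire()
            pong.release()

    t = threading.Thread(target=partner)
    t.start()
    for _ in range(rounds):
        ping.release()
        pong.acquire()
    t.join()


if __name__ == "__main__":
    for name, fn in (("busy_loop", busy_loop),
                     ("lock_handoff", lock_handoff)):
        secs, vol, invol = measure(fn)
        print(f"{name}: {secs:.4f}s voluntary={vol} involuntary={invol}")
```

The point of the comparison is only that the second workload spends
its time blocking and waking rather than computing, which is the kind
of behaviour a strlen() microbenchmark never exercises.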

So, cool. Please keep poking at that stuff. But if you stop short of
making the system actually /be able to take advantage of them under
load/, I respectfully ask for a nice knob I can use to turn them off.
:)

-adrian

(Know where the slowdowns for memcached are? Hint - not strlen or
memcpy. Yes, I've been down that rabbit hole recently. Know what /I/
have? 1 million UDP transactions a second working on 16-core
Sandy Bridge systems. Know what I didn't optimise? memcpy or strlen.
The network stack locking and pthreads overhead are what suck.)