Re: libthr and 1:1 threading.

From: Terry Lambert <tlambert2_at_mindspring.com>
Date: Wed, 02 Apr 2003 06:05:18 -0800
Alexander Leidinger wrote:
> On Tue, 01 Apr 2003 23:28:01 -0800
> Terry Lambert <tlambert2_at_mindspring.com> wrote:
> > The primary performance reasoning behind a 1:1 kernel threading
> > implementation, relative to the user space single kernel entry
> > scheduler in the libc_r implementation is SMP scalability for
> > threaded applications.
> 
> I think Jeff (or someone else?) said, that some web browsers gain
> "something" too (serialization issues with libc_r)? I had the impression
> that this also applies to UP systems.
> 
> Do I misremember this? If not, does it not apply to UP systems as well?

FWIW: the libc_r reentrancy issues aren't fixed by a 1:1 model
for anything but calls for which there are no non-blocking
alternative kernel APIs.  That means you are saved on things
like the System V IPC interfaces (which you were saved on before,
if you used my kqueue patch for System V IPC to turn it into
events), but the file I/O, which is mostly what a browser is
doing, is not significantly improved.  And for calls whose entry
must be serialized, like the resolver, you are stuck serializing
in user space.

It's true that there is potential for some additional interleaved
I/O when the kernel is multiply entrant, rather than taking the
EWOULDBLOCK plus a call conversion using a non-blocking file
descriptor.  I'm not sure that anything is gained by this in the
UP case, however.
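
For the curious, the call conversion amounts to roughly the
following (a sketch only, not libc_r's actual internals; the
wrapper name is made up):

/*
 * Convert a blocking read() into a non-blocking read() plus a
 * wait for readability.  A user space scheduler would run other
 * threads instead of sitting in poll(); the bare poll() here
 * only stands in for that wait.
 */
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

ssize_t
converted_read(int fd, void *buf, size_t nbytes)
{
	struct pollfd pfd;
	ssize_t n;
	int flags;

	flags = fcntl(fd, F_GETFL, 0);
	fcntl(fd, F_SETFL, flags | O_NONBLOCK);

	for (;;) {
		n = read(fd, buf, nbytes);
		if (n >= 0)
			return (n);
		if (errno != EWOULDBLOCK && errno != EAGAIN)
			return (-1);
		/* Would block: wait until the descriptor is readable. */
		pfd.fd = fd;
		pfd.events = POLLIN;
		(void)poll(&pfd, 1, -1);
	}
}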

Oh, and your process gets to compete for quantum as if it
were "number of threads" processes.  This is actually wrong,
if you are PTHREAD_SCOPE_PROCESS, and setting the other
option, PTHREAD_SCOPE_SYSTEM, requires root privileges (but
is on by default in the libthr implementation, because a 1:1
model requires more sophisticated scheduler support).
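
From the application side, asking for system scope is just the
standard attribute call; whether it succeeds for an unprivileged
process is up to the implementation (illustrative fragment only):

#include <pthread.h>
#include <stdio.h>

static void *
worker(void *arg)
{
	return (arg);
}

int
main(void)
{
	pthread_attr_t attr;
	pthread_t tid;
	int error;

	pthread_attr_init(&attr);
	error = pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
	if (error != 0)
		fprintf(stderr, "system scope refused: %d\n", error);

	pthread_create(&tid, &attr, worker, NULL);
	pthread_join(tid, NULL);
	pthread_attr_destroy(&attr);
	return (0);
}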

You can get this same "cheat for extra quantum" effect with
libc_r by using the same root privilege to "nice" your
process to a higher priority relative to the default.
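
In other words, something like this (the -5 is only an example
value; the negative values are the part that needs root):

#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>

int
main(void)
{
	/* Raise the priority of the current process. */
	if (setpriority(PRIO_PROCESS, 0, -5) == -1)
		perror("setpriority");
	return (0);
}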

The other issue with going from the user space model to a
kernel 1:1 model is that you trade quantum utilization for
additional context switch overhead when you go into the kernel
to block instead of doing the call conversion: when you go to
sleep, you hand the remainder of your quantum back to the system
in 1:1, whereas with libc_r you hand it to another of your own
threads.  This is also what the N:M model addresses.
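
As a cartoon of the difference (an illustration with ucontext
only, not how libc_r is actually written): when "thread" A would
block, a user space scheduler simply swaps to "thread" B, and the
rest of the quantum never leaves the process:

#include <stdio.h>
#include <ucontext.h>

static ucontext_t ctx_main, ctx_a, ctx_b;
static char stack_a[64 * 1024], stack_b[64 * 1024];

static void
thread_a(void)
{
	printf("A: would block here; handing quantum to B\n");
	swapcontext(&ctx_a, &ctx_b);	/* stay inside the process */
	printf("A: resumed\n");
}

static void
thread_b(void)
{
	printf("B: running on A's leftover quantum\n");
	swapcontext(&ctx_b, &ctx_a);
}

int
main(void)
{
	getcontext(&ctx_a);
	ctx_a.uc_stack.ss_sp = stack_a;
	ctx_a.uc_stack.ss_size = sizeof(stack_a);
	ctx_a.uc_link = &ctx_main;
	makecontext(&ctx_a, thread_a, 0);

	getcontext(&ctx_b);
	ctx_b.uc_stack.ss_sp = stack_b;
	ctx_b.uc_stack.ss_size = sizeof(stack_b);
	ctx_b.uc_link = &ctx_main;
	makecontext(&ctx_b, thread_b, 0);

	swapcontext(&ctx_main, &ctx_a);
	return (0);
}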

This also makes you less efficient when it comes to context
switch overhead, since without thread-in-group affinity, when
you give up the quantum, you are likely to take a full context
switch (%cr3 reload for a different process; for example, cron
runs once a second, and there are other "standard" system
processes which are far from idle).  In other words, you only
see a benefit for a lone threaded application on a relatively
quiescent system, which is OK, I guess, for a microbenchmark,
but pretty poor when it comes to real world systems, which
tend to be more heavily loaded.

The last time this difference in overhead between the models
was measured in any meaningful way (AFAIK) was in an article
in (November?) 2001 in:

	Parallel and Distributed Systems, IEEE Trans. on (T-PDS)

Which is where most of the academic work on scheduling and
load balancing for parallel and distributed systems takes
place.

Unfortunately, these papers are not available online unless
you are an IEEE member who also subscribes to T-PDS.  Maybe
you can find them, if you work at IBM and have access to a
good technical library (e.g. Almaden), or if you have access
to the Berkeley or Stanford libraries, or some other good
technical library.

There could have been a more recent paper in the last 8 months
or so, which I don't know about (I need to take a trip to the
library myself ;^)).


The bottom line is that the 1:1 model is primarily useful for
improving SMP scalability at this time, and that the overhead
tradeoffs that are there right now don't really favor it, IMO,
in a UP system.

FreeBSD's libc_r has been in development for nearly a decade;
it is a *very good* pthreads implementation, and the only
places it falls down, really, are in SMP scalability, and some
blocking system calls that aren't amenable to conversion.

You'll note that the System V IPC interfaces are not supported
in a threaded environment by the POSIX standard; most of these
blocking cases aren't really a problem.  User space serialization
in things like the resolver, which only opens one socket to do
lookups and can only deal with a single response context, is a
cross everyone bears.  Mostly this bothers no one, because all
browsers cache responses, there is response locality, and it's
not a problem unless you compile in support for multiple protocol
families (IPv4 + IPv6 support in the same browser usually means
waiting for an IPv6 lookup timeout, if the remote host's DNS
server is not RFC-compliant and fails to return an immediate
reject response, like it's supposed to do).  If you "fix" that for
libthr, you also "fix" it for libc_r.
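
The serialization people end up doing looks roughly like this (a
made-up wrapper, only to show the shape of it; getaddrinfo(3) is
the thread safe way out where you have it):

#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
#include <pthread.h>
#include <string.h>

static pthread_mutex_t resolver_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Hypothetical wrapper: serialize the non-reentrant lookup and
 * copy the first IPv4 address out while the lock is still held.
 */
int
locked_lookup(const char *name, struct in_addr *out)
{
	struct hostent *hp;
	int found = 0;

	pthread_mutex_lock(&resolver_lock);
	hp = gethostbyname(name);
	if (hp != NULL && hp->h_addrtype == AF_INET &&
	    hp->h_addr_list[0] != NULL) {
		memcpy(out, hp->h_addr_list[0], sizeof(*out));
		found = 1;
	}
	pthread_mutex_unlock(&resolver_lock);
	return (found);
}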

-- Terry
Received on Wed Apr 02 2003 - 04:07:40 UTC
