Alexander Leidinger wrote:

> On Tue, 01 Apr 2003 23:28:01 -0800
> Terry Lambert <tlambert2_at_mindspring.com> wrote:
>
> > The primary performance reasoning behind a 1:1 kernel threading
> > implementation, relative to the user space single kernel entry
> > scheduler in the libc_r implementation is SMP scalability for
> > threaded applications.
>
> I think Jeff (or someone else?) said, that some web browsers gain
> "something" too (serialization issues with libc_r)? I had the
> impression that this also applies to UP systems.
>
> Do I misremember this? If not, does it not apply to UP systems as well?

FWIW: the libc_r reentrancy problem isn't fixed by a 1:1 model for anything but calls for which there is no non-blocking alternative kernel API. That means you are saved on things like the System V IPC interfaces (which you were saved on before, if you used my kqueue patch for System V IPC to turn them into events), but file I/O, which is mostly what a browser is doing, is not significantly improved. And for calls whose entry must be serialized, like the resolver, you are stuck serializing in user space.

There is potential for some additional interleaved I/O when the kernel is multiply entrant, rather than getting EWOULDBLOCK plus a call conversion on a non-blocking file descriptor, that's true. I'm not sure that anything is gained by this in the UP case, however.
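A minimal sketch of what that call conversion amounts to is below; it is an illustration only, not libc_r's actual code, converted_read() is a made-up name, and a real user-space threads library would switch to another runnable thread instead of sitting in poll():

    #include <errno.h>
    #include <fcntl.h>
    #include <poll.h>
    #include <unistd.h>

    ssize_t
    converted_read(int fd, void *buf, size_t len)
    {
            int flags;
            ssize_t n;
            struct pollfd pfd;

            /*
             * Make sure a read that cannot complete immediately returns
             * EWOULDBLOCK instead of putting the whole process to sleep
             * in the kernel.
             */
            flags = fcntl(fd, F_GETFL, 0);
            if (flags != -1)
                    (void)fcntl(fd, F_SETFL, flags | O_NONBLOCK);

            for (;;) {
                    n = read(fd, buf, len);
                    if (n >= 0 || errno != EWOULDBLOCK)
                            return (n);
                    /*
                     * A user-space threads library would run another of
                     * the process's threads here, keeping the rest of
                     * the quantum; this sketch just waits for the
                     * descriptor to become readable.
                     */
                    pfd.fd = fd;
                    pfd.events = POLLIN;
                    (void)poll(&pfd, 1, -1);
            }
    }

The point of the conversion is that the process as a whole never blocks in the kernel on behalf of one thread; the remainder of the quantum stays with the process.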
Oh, and your process gets to compete for quantum as if it were "number of threads" processes. This is actually wrong if you are PTHREAD_SCOPE_PROCESS, and setting the other option, PTHREAD_SCOPE_SYSTEM, requires root privileges (but it is on by default in the libthr implementation, because a 1:1 model requires more sophisticated scheduler support). You can get this same "cheat for extra quantum" effect with libc_r by using the same root privilege to "nice" your process to a higher priority relative to the default.

The other issue with going from the user space model to a kernel 1:1 model is that you take on additional context switch overhead, by way of not fully utilizing your quantum: when you go into the kernel to block, instead of doing the call conversion, you hand the remainder of your quantum back to the system, whereas with libc_r you hand it to another of your own threads. This is also what the N:M model addresses.

This also makes you less efficient when it comes to context switch overhead, since without thread-in-group affinity, when you give up the quantum you are likely to take a full context switch (a %cr3 reload for a different process; for example, cron runs once a second, and there are other "standard" system processes which are far from idle). In other words, you only see a benefit for a lone threaded application on a relatively quiescent system, which is OK, I guess, for a microbenchmark, but pretty poor when it comes to real world systems, which tend to be more heavily loaded.

The last time this difference in overhead between the models was measured in any meaningful way (AFAIK) was in an article in (November?) 2001 in:

	IEEE Transactions on Parallel and Distributed Systems (T-PDS)

which is where most of the academic work on scheduling and load balancing for parallel and distributed systems takes place. Unfortunately, these papers are not available online unless you are an IEEE member who also subscribes to T-PDS. Maybe you can find them if you work at IBM and have access to a good technical library (e.g. Almaden), or if you have access to the Berkeley or Stanford libraries, or some other good technical library. There could be a more recent paper, from the last 8 months or so, which I don't know about (I need to take a trip to the library myself ;^)).

The bottom line is that the 1:1 model is primarily useful for improving SMP scalability at this time, and the overhead tradeoffs that exist right now don't really favor it, IMO, on a UP system. FreeBSD's libc_r has been in development for nearly a decade; it is a *very good* pthreads implementation, and the only places it really falls down are SMP scalability and some blocking system calls that aren't amenable to conversion.

You'll note that the System V IPC interfaces are not supported in a threaded environment by the POSIX standard, so most of these blocking cases aren't really a problem. User space serialization in things like the resolver, which only opens one socket to do lookups and can only deal with a single response context, is a cross everyone bears. Mostly this bothers no one, because all browsers cache responses, there is response locality, and it's not a problem unless you compile in support for multiple protocol families (IPv4 + IPv6 support in the same browser usually means waiting for an IPv6 lookup timeout if the remote host's DNS server is not RFC compliant and fails to return an immediate rejection, as it is supposed to do).

If you "fix" that for libthr, you also "fix" it for libc_r.
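That resolver serialization looks, in spirit, something like the following; this is a minimal sketch only, and resolve_locked() is a hypothetical wrapper name, not an existing API:

    #include <netdb.h>
    #include <pthread.h>

    /*
     * gethostbyname() keeps its result in static storage and uses a
     * single query context, so concurrent callers have to take turns.
     */
    static pthread_mutex_t resolver_lock = PTHREAD_MUTEX_INITIALIZER;

    int
    resolve_locked(const char *name, struct hostent *out)
    {
            struct hostent *hp;
            int error = 0;

            pthread_mutex_lock(&resolver_lock);
            hp = gethostbyname(name);
            if (hp != NULL) {
                    /*
                     * Shallow copy: the pointers inside the hostent still
                     * reference the resolver's static buffers, so a real
                     * wrapper would deep-copy them before dropping the
                     * lock.
                     */
                    *out = *hp;
            } else {
                    error = h_errno;
            }
            pthread_mutex_unlock(&resolver_lock);

            return (error);
    }

Whether the threads underneath are 1:1 or user space, every caller waits its turn on that lock, which is why fixing it for one library fixes it for the other.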
-- 
Terry

Received on Wed Apr 02 2003 - 04:07:40 UTC