Re: libthr and 1:1 threading.

From: Terry Lambert <tlambert2_at_mindspring.com>
Date: Wed, 02 Apr 2003 14:39:36 -0800
Robert Watson wrote:
> On Wed, 2 Apr 2003, Terry Lambert wrote:
> > Is the disk I/O really that big of an issue?  All writes will be on
> > underlying non-blocking descriptors; I guess you are saying that the
> > interleaved I/O is more important, further down the system call
> > interface than the top, and this becomes an issue?
> 
> The I/O issue is a big deal for things like mysql, yes.

I'm not interested in the MySQL code; I've been in that threads
code, deep (made it work on AIX).  There are a lot better ways
to deal with MySQL's latency issues.  For example, its two-phase
commit stall is not necessary on a Soft Updates FS, but, like
qmail, it does it anyway, introducing latency.

But Mozilla... the only issue I could see is interleaved network
I/O, but that should not be an issue; HTTP is request/response,
with the lion's share of the data coming as a result of the
response.  In other words, the rendering engine's going to have
to wait for data back from the remote server, and that's got to
be the primary latency.

The only way I see for disk I/O to be involved in Mozilla is in
local cache?  You can turn that off.


> > It seems to me that maybe the correct fix for this is to use AIO
> > instead of non-blocking I/O, then?
> 
> Well, they're both fixes.  Another issue for applications that are
> threaded and may be bumping up against the system memory limits is whether
> or not the whole process stalls on a page fault or memory mapping fault,
> or whether it's just the thread.

This is what I meant by "deeper in the system call layer".  IMO,
if you are stalled on something like this on an async fd, then it
should queue the fault anyway, and return to user space for the
next request.  This may just be a bug in the kernel processing of
demand faults on vnodes associated with async fd's (FWIW, System V
and Solaris both queue the fault for kernel processing, and then
return to user space).


> If you have an application that is accessing a large memory mapped
> file, there may be some long kernel sleeps as you pull in the pages.

Again, this stall, if the fd in question is async, should be taken
as system time, not as an application stall.  That's a little harder,
in this case, because you want to fail the pagein with the moral
equivalent of "EAGAIN" plus a forced threads context switch after
queueing the fault request.  It's a lot easier if we are talking
about an explicit read() request, and not a fault (obviously 8-)),
since the caller should be expecting the possibility of an EAGAIN
in the read() case.
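
Something like this, for the explicit read() case (an untested
sketch, just to show the shape; nb_read() is my name, not anything
in the tree):

    #include <errno.h>
    #include <sched.h>
    #include <unistd.h>

    /*
     * Retry a read() on a non-blocking fd: treat EAGAIN as "data
     * not there yet", and yield to another thread instead of
     * stalling the whole process.  Error handling abbreviated.
     */
    ssize_t
    nb_read(int fd, void *buf, size_t len)
    {
            ssize_t n;

            for (;;) {
                    n = read(fd, buf, len);
                    if (n >= 0)
                            return (n);
                    if (errno != EAGAIN)
                            return (-1);
                    sched_yield();  /* let another thread run */
            }
    }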

I guess, in this case, without an N:M upcall to indicate the forced
context switch, you would just stall.

This, though, doesn't seem to be an issue for Mozilla, to me.


> Certainly, you can argue that the application should be structured
> to make all I/O explicit and asynchronous, but for various reasons,
> that's not the case :-).

The mmap'ed file case is obviously not something that can be
handled without an explicit contract between user and kernel
for notification of the pagein temporary failure (I would use
a signal for that, probably, as a gross first approximation,
but per-process signal handling is currently not happy...).
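
The closest user space approximation I can think of today is to
poll residency with mincore(2) before touching the page; it's racy
(the page can be evicted between the check and the access), but it
shows the shape of the contract:

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <stdint.h>
    #include <unistd.h>

    /*
     * Ask the VM whether the page backing 'addr' is resident, so
     * the caller can choose not to touch it (and fault) yet.
     * Racy, and purely illustrative.
     */
    int
    page_resident(const void *addr)
    {
            char vec;
            size_t pgsz = (size_t)getpagesize();
            uintptr_t base = (uintptr_t)addr & ~((uintptr_t)pgsz - 1);

            if (mincore((const void *)base, pgsz, &vec) == -1)
                    return (-1);
            return ((vec & MINCORE_INCORE) != 0);
    }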


> Our VM and VFS subsystems may have limited concurrency from
> an SMPng perspective, but probably have enough that a marked
> benefit should be seen there too (you might have to wait for
> another thread to block in the subsystem, but that will be a
> short period of time compared to how long it takes to service
> the page from disk).

I would argue that this was in error, at least for explicit
handling of non-blocking I/O.  SVR4 and Solaris don't suffer
from a stall, in the face of a missing page in a demand page-in
on an async fd.  The process of servicing the page-in is really
independent of the process of retrying the call to see if the
page is there yet.

I think this is probably classifiable as a serious deficiency in
the FreeBSD VM page handling, and in the decoupling of the fd and
the VM backing object.  I'm not sure how I'd go about fixing it;
you'd have to pass the fd struct down as well, to get the flags,
I think.

Actually, people have wanted this for a while, in order to properly
support per open instance data on cloned devices (e.g. this is
needed in order to support multiple instances of VMWare on a single
FreeBSD host).


> > The GUI thread issues are something I hadn't considered; I don't
> > generally think of user space CPU intensive operations like that,
> > but I guess it has to be rendered some time.  8-).
> 
> One of the problems I've run into is where you lose interactivity during
> file saves and other disk-intensive operations in OpenOffice.  Other
> windows could in theory still be processing UI events, such as menu
> clicks, etc, but since you're dumping several megabytes of data to disk or
> doing interactive file operations that require waiting on disk latency,
> you end up with a fairly nasty user experience.  One way to explore this
> effect is to do a side-by-side comparison of the behavior of OpenOffice
> and Mozilla linked against libc_r and linuxthreads.

I don't think this is enough, actually.  I think you will find
X11 over-the-wire stall barriers here.  This is one of the reasons
I asked about X11 itself.  There are a lot of places in a lot of
code where people call "XSync".  I don't think there's a reasonable
way to avoid this, given the usual reason it's being called (e.g.
to render individual widgets atomically, in order to provide for
the appearance of speed from "draw individual widgets fast, rather
than all widgets at the same time, slow").
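
I.e. the pattern is roughly this (struct widget and draw_widget()
are stand-ins for whatever the toolkit actually does):

    #include <X11/Xlib.h>

    struct widget;
    extern void draw_widget(Display *, struct widget *);

    /*
     * Per-widget synchronous rendering: each XSync() is a full
     * client/server round trip, so every widget paint eats a
     * wire-latency stall no matter what the threads library does.
     */
    void
    render_all(Display *dpy, struct widget **widgets, int nwidgets)
    {
            int i;

            for (i = 0; i < nwidgets; i++) {
                    draw_widget(dpy, widgets[i]); /* queues requests  */
                    XSync(dpy, False);            /* round-trip stall */
            }
    }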


> I haven't actually instrumented the kernel, but it might be quite
> interesting to do so--attempt to estimate the total impact of disk
> stalls on libc_r.

Yes, it would be very interesting.  The only approach I could come
up with for this, though, is to force the libthr 1:1 code through
the fd locking code from libc_r (per my previous suggestion on
benchmarking) to separate out the stall domains.

I think this would be a lot of work; I'm still considering whether
it really needs to be done enough for me to jump in and do it,
given that the world is going to change out from under us yet
again on N:M.  I don't want to provide an excuse for people to
complain about "lack of benchmarks showing the value of N:M over
1:1, when there are `obviously' benchmarks available" (give them
two columns, and they will insist on three later).

However, if we are talking disk I/O, looking again at the paging
path in the vnode_pager as it demand-pages not-present pages as a
result of reads on async fd's, it's clear that there is some "low
hanging fruit" that libthr gets by virtue of multiple blocking
contexts that libc_r cannot (at present) get.

8-(.


> From a purely qualitative perspective, there is quite a noticeable
> difference.

I understand that.  I've noticed it myself.  I just can't be as
sure as everyone else seems to think they are about where it is
coming from.  8-) 8-).


> > Has anyone tried compiling X11 to use libthr?
> 
> Not sure.

If it won a significant speedup, it would be overwhelming evidence,
I think... (hint hint to some reader 8-)).


> > Also, any ETA on the per process signal mask handling bug in libthr?
> > Might not be safe to convert everything up front, in a rush of eager
> > enthusiasm...
> 
> Can't speculate on that, except one thing that is useful to note is that
> many serious threaded applications are already being linked against
> linuxthreads on FreeBSD, which arguably has poorer semantics when it comes
> to signals, credentials, etc, than libthr already :-).  For example, most
> sites I've talked to that deploy mysql do it with linuxthreads rather than
> libc_r to avoid the I/O issues, as well as avoid deadlocks.  There are
> enough bits of the kernel (for example, POSIX fifos) that don't handle
> non-blocking operation that libc_r can stall or get into I/O buffer
> deadlocks.  I seem to recall someone mentioning (but can't confirm easily)
> that Netscape at one point relied on using pipes to handle some sorts of
> asynchronous events and wakeups within the same process.  If that pipe
> filled, the process would block on a pipe write for a pipe that would
> never drain.

I understand that.  I guess I view these deadlocks as a failure
to adhere to the defined API, and so would call them application
bugs.  And then ignore them as irrelevant.  8-).
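
For the pipe wakeup case, adhering to the API is about a two line
fix (sketch only; the fd is whatever pipe the application uses, and
in real code the fcntl would be done once at setup):

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    /*
     * Post a wakeup on a self-pipe without ever blocking: mark
     * the write side non-blocking, and treat a full pipe as "a
     * wakeup is already queued" instead of blocking on a write
     * that can never drain.
     */
    void
    post_wakeup(int wfd)
    {
            int flags = fcntl(wfd, F_GETFL);

            fcntl(wfd, F_SETFL, flags | O_NONBLOCK);
            if (write(wfd, "x", 1) == -1 && errno == EAGAIN)
                    ;       /* pipe full: wakeup already pending */
    }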


> I can think of a couple of other interesting exercises to explore the
> problem -- implementing AIO "better" using the KSE primitives mixed
> between userspace and kernel, reimplementing libc_r to attempt to use AIO
> rather than a select loop where possible, etc.  It might be quite
> interesting to see whether (for a bounded number of threads, due to our
> AIO implementation), a libc_r that used AIO rather than select
> demonstrated some of the performance improvements we see with 1:1 via
> linuxthreads (and now libthr).

I would expect that if this were a paging stall issue, and you
were right about where the speedup is actually coming from, you
would see the same improvements.
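
The core of the aio-instead-of-select version would look something
like this (a sketch under POSIX AIO, not our libc_r internals;
error paths omitted):

    #include <sys/types.h>
    #include <aio.h>
    #include <string.h>

    /*
     * Queue the read, let other user threads run, and only wait
     * when the result is actually needed.
     */
    ssize_t
    aio_read_wait(int fd, void *buf, size_t len, off_t off)
    {
            struct aiocb cb;
            const struct aiocb *list[1] = { &cb };

            memset(&cb, 0, sizeof(cb));
            cb.aio_fildes = fd;
            cb.aio_buf = buf;
            cb.aio_nbytes = len;
            cb.aio_offset = off;

            aio_read(&cb);              /* queue it; doesn't block    */
            /* ... scheduler runs other user threads here ... */
            aio_suspend(list, 1, NULL); /* block only this wait point */
            return (aio_return(&cb));
    }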

It could be that there is just a terrible case in Mozilla, where
it makes a blocking system call on a FIFO or something, alarms out
of it, and continues on its way, vastly exaggerating the library
differences.  That's the problem with subjective differences in
behaviour, rather than objective differences.  I think the only
thing that would help there is if someone were to profile Mozilla
operations that "feel faster" under libthr with both libc_r and
libthr, and see where the time is actually being spent, instead
of merely noting "it feels faster".  I guess that's "an exercise
for the student"... maybe someone else can do it, or maybe DSL
availability will fly out of SBC's rear-end into my area.  8-|.


> I'm not sure if there are any open source
> tools available to easily track process and thread scheduling and
> blocking, but there have been several pretty useful visual analysis and
> tracing tools for realtime.  Some basic tools for thread tracing and
> visualization exist for Mac OS X, and presumably other COTS platforms.
> ktrace on FreeBSD makes some attempt to track context switches, but
> without enough context (har har) to be useful for this kind of analysis.

Profiling should bear this out, once you prune the epicycles from
the scheduling.  I don't think it will work at all for libthr,
at this point, though.  8-(.


> I've been thinking about tweaking the local scheduler to put a bit more
> information into ktr and alq about blocking circumstances as well as some
> way to constrain the tracing to a particular bit of the process hierarchy
> with an inheritance flag of some sort.  It might be quite helpful for
> understanding some of the nasty threading blocking/timing issues that we
> already run into with libc_r, and will continue to run into as our
> threading evolves.

Yes.  I'd really like to see *where* the difference comes from;
as I said, short of building my own libthr that incorporates parts
of libc_r, I can't see how to do this on the current FreeBSD.  Just
knowing there's a qualitative difference "from somewhere" is
useless, IMO.

Maybe someone who needs a Master's degree will get the approval
of their Thesis adviser, and step up and do the work we are both
talking about but not doing... ;^).

-- Terry
Received on Wed Apr 02 2003 - 12:41:04 UTC
