Robert Watson wrote:
> On Wed, 2 Apr 2003, Terry Lambert wrote:
> > Is the disk I/O really that big of an issue? All writes will be on
> > underlying non-blocking descriptors; I guess you are saying that
> > the interleaved I/O is more important, further down the system call
> > interface than the top, and this becomes an issue?
>
> The I/O issue is a big deal for things like mysql, yes.

I'm not interested in the MySQL code; I've been in that threads code,
deep (made it work on AIX). There are a lot better ways to deal with
MySQL's latency issues. For example, the two-phase commit stall that
it has is not necessary on a Soft Updates FS, but, like qmail, it does
it anyway, introducing latency.

But Mozilla... the only issue I could see is interleaved network I/O,
but that should not be an issue; HTTP is request/response, with the
lion's share of the data coming as a result of the response. In other
words, the rendering engine is going to have to wait for data back
from the remote server, and that has to be the primary latency. The
only way I see for disk I/O to be involved in Mozilla is in the local
cache? You can turn that off.

> > It seems to me that maybe the correct fix for this is to use AIO
> > instead of non-blocking I/O, then?
>
> Well, they're both fixes. Another issue for applications that are
> threaded and may be bumping up against the system memory limits is
> whether or not the whole process stalls on a page fault or memory
> mapping fault, or whether it's just the thread.

This is what I meant by "deeper in the system call layer". IMO, if
you are stalled on something like this on an async fd, then it should
queue the fault anyway, and return to user space for the next request.
This may just be a bug in the kernel processing of demand faults on
vnodes associated with async fd's (FWIW, System V and Solaris both
queue the fault for kernel processing, and then return to user space).

> If you have an application that is accessing a large memory mapped
> file, there may be some long kernel sleeps as you pull in the pages.

Again, this stall, if the fd in question is async, should be taken as
system time, not as an application stall. That's a little harder, in
this case, because you want to fail the pagein with the moral
equivalent of "EAGAIN", plus a forced threads context switch after
queuing the fault request. It's a lot easier if we are talking about
an explicit read() request, and not a fault (obviously 8-)), since the
caller should already be expecting the possibility of an EAGAIN in the
read() case (see the sketch below). I guess, in this case, without an
N:M upcall to indicate the forced context switch, you would just
stall. This, though, doesn't seem to be an issue for Mozilla, to me.

> Certainly, you can argue that the application should be structured
> to make all I/O explicit and asynchronous, but for various reasons,
> that's not the case :-).

The mmap'ed file case is obviously not something that can be handled
without an explicit contract between user and kernel for notification
of the pagein temporary failure (I would use a signal for that,
probably, as a gross first approximation, but per-process signal
handling is currently not happy...).
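Just to be concrete about the read() case, a minimal sketch of the
EAGAIN dance follows. It only buys you anything on pipes and sockets,
where O_NONBLOCK actually means something; the point of my complaint
is that a read() which has to demand-fault a vnode page never returns
the EAGAIN, and the whole process just stalls in the fault path. The
sched_yield() stands in for what a threads library would really do,
namely switch to another runnable user thread:

#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

ssize_t
nb_read(int fd, void *buf, size_t nbytes)
{
    int flags;
    ssize_t n;

    /* Mark the descriptor non-blocking, as libc_r does under the covers. */
    if ((flags = fcntl(fd, F_GETFL, 0)) == -1 ||
        fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return (-1);

    for (;;) {
        n = read(fd, buf, nbytes);
        if (n != -1 || errno != EAGAIN)
            return (n);
        /*
         * EAGAIN: nothing there yet; run something else instead
         * of stalling. A threads library would context switch to
         * another runnable thread here.
         */
        sched_yield();
    }
}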
> Our VM and VFS subsystems may have limited concurrency from an SMPng
> perspective, but probably have enough that a marked benefit should be
> seen there too (you might have to wait for another thread to block in
> the subsystem, but that will be a short period of time compared to
> how long it takes to service the page from disk).

I would argue that this was in error, at least for explicit handling
of non-blocking I/O. SVR4 and Solaris don't suffer from a stall in
the face of a missing page in a demand page-in on an async fd. The
process of servicing the page-in is really independent of the process
of retrying the call to see if the page is there yet.

I think this is probably classifiable as a serious deficiency in the
FreeBSD VM page handling, and in the decoupling of the fd and the VM
backing object. I'm not sure how I'd go about fixing it; you'd have
to pass the fd struct down as well, to get the flags, I think.
Actually, people have wanted this for a while, in order to properly
support per-open-instance data on cloned devices (e.g. this is needed
in order to support multiple instances of VMware on a single FreeBSD
host).

> > The GUI thread issues are something I hadn't considered; I don't
> > generally think of user space CPU intensive operations like that,
> > but I guess it has to be rendered some time. 8-).
>
> One of the problems I've run into is where you lose interactivity
> during file saves and other disk-intensive operations in OpenOffice.
> Other windows could in theory still be processing UI events, such as
> menu clicks, etc., but since you're dumping several megabytes of data
> to disk or doing interactive file operations that require waiting on
> disk latency, you end up with a fairly nasty user experience. One
> way to explore this effect is to do a side-by-side comparison of the
> behavior of OpenOffice and Mozilla linked against libc_r and
> linuxthreads.

I don't think this is enough, actually. I think you will find X11
over-the-wire stall barriers here. This is one of the reasons I asked
about X11 itself. There are a lot of places in a lot of code where
people call XSync() (see the sketch below). I don't think there's a
reasonable way to avoid this, given the usual reason it's being called
(e.g. to render individual widgets atomically, in order to provide for
the appearance of speed from "draw individual widgets fast, rather
than all widgets at the same time, slow").

> I haven't actually instrumented the kernel, but it might be quite
> interesting to do so--attempt to estimate the total impact of disk
> stalls on libc_r.

Yes, it would be very interesting. The only approach I could come up
with for this, though, is to force the libthr 1:1 code through the fd
locking code from libc_r (per my previous suggestion on benchmarking)
to separate out the stall domains. I think this would be a lot of
work; I'm still considering whether it really needs doing badly enough
for me to jump in and do it, given that the world is going to change
out from under us yet again on N:M. I don't want to provide an excuse
for people to complain about "lack of benchmarks showing the value of
N:M over 1:1, when there are `obviously' benchmarks available" (give
them two columns, and they will insist on three later).

However, if we are talking disk I/O, looking again at the paging path
for the vnode_pager as it demand-pages not-present pages as a result
of reads on async fd's, it's clear that there is some "low hanging
fruit" that libthr gets by virtue of multiple blocking contexts that
libc_r cannot (at present) get. 8-(.

> From a purely qualitative perspective, there is quite a noticeable
> difference.

I understand that. I've noticed it myself. I just can't be as sure
as everyone else seems to think they are about where it is coming
from. 8-) 8-).
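To make the stall barrier concrete, here is a minimal sketch of the
usual per-widget pattern; the widget and the drawing call are
invented, but the XFlush()/XSync() distinction is the real Xlib one:

#include <X11/Xlib.h>

/* Hypothetical per-widget draw; the stall barrier is the XSync(). */
void
draw_widget(Display *dpy, Window win, GC gc)
{
    XDrawRectangle(dpy, win, gc, 10, 10, 100, 20);

    /* XFlush() just pushes the buffered requests onto the wire. */
    XFlush(dpy);

    /*
     * XSync() additionally waits for the server to process
     * everything -- a full over-the-wire round trip that the caller
     * sits through. Do this once per widget, and rendering
     * serializes on network latency, no matter which threads
     * library is underneath.
     */
    XSync(dpy, False);
}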
> > Has anyone tried compiling X11 to use libthr?
>
> Not sure.

If it won a significant speedup, it would be overwhelming evidence, I
think... (hint hint to some reader 8-)).

> > Also, any ETA on the per-process signal mask handling bug in
> > libthr? Might not be safe to convert everything up front, in a
> > rush of eager enthusiasm...
>
> Can't speculate on that, except one thing that is useful to note:
> many serious threaded applications are already being linked against
> linuxthreads on FreeBSD, which arguably has poorer semantics when it
> comes to signals, credentials, etc., than libthr already :-). For
> example, most sites I've talked to that deploy mysql do it with
> linuxthreads rather than libc_r to avoid the I/O issues, as well as
> to avoid deadlocks. There are enough bits of the kernel (for
> example, POSIX fifos) that don't handle non-blocking operation that
> libc_r can stall or get into I/O buffer deadlocks. I seem to recall
> someone mentioning (but can't confirm easily) that Netscape at one
> point relied on using pipes to handle some sorts of asynchronous
> events and wakeups within the same process. If that pipe filled,
> the process would block on a pipe write for a pipe that would never
> drain.

I understand that. I guess I view these deadlocks as a failure to
adhere to the defined API, and so would call them application bugs.
And then ignore them as irrelevant. 8-).

> I can think of a couple of other interesting exercises to explore
> the problem -- implementing AIO "better" using the KSE primitives
> mixed between userspace and kernel, reimplementing libc_r to attempt
> to use AIO rather than a select loop where possible, etc. It might
> be quite interesting to see whether (for a bounded number of
> threads, due to our AIO implementation), a libc_r that used AIO
> rather than select demonstrated some of the performance improvements
> we see with 1:1 via linuxthreads (and now libthr).

I would expect that if this were a paging stall issue, and you were
right about where the speedup is actually coming from, that you would
see the same improvements (the shape of such a wrapper is sketched
below).

It could be that there is just a terrible case in Mozilla, where it
makes a blocking system call on a FIFO or something, alarms out of it,
and continues on its way, vastly exaggerating the library differences.
That's the problem with subjective differences in behaviour, rather
than objective differences. I think the only thing that would help
there is if someone were to profile Mozilla operations that "feel
faster" under libthr with both libc_r and libthr, and see where the
time is actually being spent, instead of merely noting "it feels
faster". I guess that's "an exercise for the student"... maybe
someone else can do it, or maybe DSL availability will fly out of
SBC's rear-end into my area. 8-|.
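The shape of such an AIO-based read path, as a minimal sketch (error
handling is trimmed, and run_other_threads() is a made-up stand-in for
the userland scheduler):

#include <sys/types.h>
#include <aio.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

static void
run_other_threads(void)
{
    /* Hypothetical: hand the CPU to another runnable user thread. */
}

ssize_t
threaded_read(int fd, void *buf, size_t nbytes, off_t off)
{
    struct aiocb cb;

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = nbytes;
    cb.aio_offset = off;

    if (aio_read(&cb) == -1)
        return (-1);

    /*
     * The disk stall now happens in the kernel; this context stays
     * runnable, so other threads make progress even while the page
     * is coming off the disk.
     */
    while (aio_error(&cb) == EINPROGRESS)
        run_other_threads();

    return (aio_return(&cb));
}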
> I'm not sure if there are any open source tools available to easily
> track process and thread scheduling and blocking, but there have
> been several pretty useful visual analysis and tracing tools for
> realtime. Some basic tools for thread tracing and visualization
> exist for Mac OS X, and presumably other COTS platforms. ktrace on
> FreeBSD makes some attempt to track context switches, but without
> enough context (har har) to be useful for this kind of analysis.

Profiling should show this out, once you prune epicycles from the
scheduling. I don't think it will work at all for libthr, at this
point, though. 8-(.

> I've been thinking about tweaking the local scheduler to put a bit
> more information into ktr and alq about blocking circumstances as
> well as some way to constrain the tracing to a particular bit of the
> process hierarchy with an inheritance flag of some sort. It might
> be quite helpful for understanding some of the nasty threading
> blocking/timing issues that we already run into with libc_r, and
> will continue to run into as our threading evolves.

Yes. I'd really like to see *where* the difference comes from; as I
said, short of building my own libthr that incorporates parts of
libc_r, I can't see how to do this on the current FreeBSD. Just
knowing there's a qualitative difference "from somewhere" is useless,
IMO. Maybe someone who needs a Master's degree will get the approval
of their thesis adviser, and step up and do the work we are both
talking about but not doing... ;^).
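FWIW, the hierarchy-constraint piece already exists in the ktrace(2)
interface: KTRFAC_CSW records context switch events, and
KTRFAC_INHERIT passes the trace points on to children, which is the
inheritance flag being described. What's missing is the "why we
blocked" context in the CSW records. A minimal sketch of self-tracing
a region of interest (the workload function is hypothetical; decode
the result with "kdump -f ktrace.out"):

#include <sys/param.h>
#include <sys/time.h>
#include <sys/uio.h>
#include <sys/ktrace.h>
#include <fcntl.h>
#include <unistd.h>

static void
do_the_interesting_work(void)
{
    /* Hypothetical workload whose stalls we want to see. */
    sleep(1);
}

int
trace_my_subtree(void)
{
    int fd;

    /* ktrace(2) wants an existing, writable regular file. */
    if ((fd = open("ktrace.out", O_WRONLY | O_CREAT | O_TRUNC,
        0600)) == -1)
        return (-1);
    close(fd);

    /*
     * Record context switches for this process, and have the trace
     * points inherited by any children forked from here on.
     */
    if (ktrace("ktrace.out", KTROP_SET,
        KTRFAC_CSW | KTRFAC_INHERIT, getpid()) == -1)
        return (-1);

    do_the_interesting_work();

    /* Stop tracing. */
    return (ktrace("ktrace.out", KTROP_CLEAR,
        KTRFAC_CSW | KTRFAC_INHERIT, getpid()));
}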
-- Terry