The biggest issue we've had (with DragonFly) on things like database benchmarks, and MySQL in particular, is with the large number of fsync() calls MySQL makes, and less with SMP. SMP only really matters when one is operating out of the cache. Read-heavy from-cache operations on an open descriptor can run without the BGL on DragonFly with the flip of a sysctl, but that doesn't have nearly the same effect as, say, disabling fsync has in tests which blow out the system caches. Disk I/O is a huge bottleneck, so anything disk-bound tends to be less reliant on cpu parallelism.

Anything with NCQ, such as AHCI, will greatly improve random disk reads and mixed reads and writes, though it also has a tendency to give writes priority over reads because write I/Os return nearly instantly (until the disk's own cache fills up, anyway, which is another issue entirely). For example, if you have 32 tags, dedicate 1 tag to writes, and then load up all 32 tags (31 parallel reads and 1 parallel write), the write bandwidth will wind up being far more than 1/32 of the available disk bandwidth: the single write tag completes and recycles almost immediately while each read tag waits out a seek, so it turns over many times per read.

fsync() is an area where UFS can operate quite efficiently, at least insofar as block-replacement write()s which do not have to extend the file's size are concerned. I don't know about ZFS, but for something like HAMMER an efficient fsync() requires implementing a forward (REDO) log to remove all seeks and degenerate into only linear writes for the fsync() operation itself (a rough sketch of the idea follows below). I've made some progress there but I still have a ways to go. SSD vs HD will skew the effect different subsystems have on performance, of course, though even an SSD would benefit from a forward log capable of devolving the entire fsync() to a single device write I/O. SMP becomes more important as I/O subsystems get faster.

One area where locking seems to matter more than SMP is when one is mixing read() and write() operations on the same vnode. Here the issue tends to be either:

(1) Holding an exclusive vnode lock during a write() while blocked on the buffer cache, thus interfering with read()s. Moving to an offset-range lock for read/write, both to ensure read/write atomicity and to deal with inode updates, solves this issue (sketched below). I have the offset-range locks in DFly but I haven't turned off the exclusive vnode lock for write()s yet. I don't quite recall, but I think Linux has given up on guaranteeing read/write atomicity. Unlocking the vnode while blocked on the buffer cache would also work, as long as the read-vs-write atomicity mechanics can be maintained for the duration. Pre-caching/pre-creating the buffer cache buffers with the vnode unlocked also helps, but increases cpu overhead since you have to look up each buffer twice.

or (2) A large number of buffer cache buffers undergoing physical write I/O at once, and thus in a locked state for a long period of time, causing read()s of the same buffers to block for similarly long periods. Limiting the number of buffer cache buffers you queue to the underlying device at any given moment (via bawrite()) mostly solves this problem (also sketched below). Note that I am not talking about the disk device's queue here (NCQ vs not-NCQ doesn't matter for disk writes)... you have to actually NOT issue the bawrite() in the first place, so the buffer remains unlocked until the very last moment. I implemented this on DFly along with pipelining fixes in the buffer flusher thread and got very interesting blogbench results during the pre-cache-blowout phase of the test.
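To make the forward-log idea concrete, here is a minimal userland sketch. This is not HAMMER's implementation or on-disk format; redo_rec, redo_write(), and the recovery step are my own stand-ins. The point is only that the fsync() path appends to the log, so the device sees one linear write instead of a series of seeks; replaying the log against the data file after a crash is left out.

    #include <stdint.h>
    #include <unistd.h>

    /* Hypothetical log record for a block-replacement write. */
    struct redo_rec {
        uint64_t off;       /* data-file offset being overwritten */
        uint32_t len;       /* byte count; no file extension allowed */
    };

    /*
     * Record the overwrite in the append-only log and flush only the
     * log.  The seek-heavy update of the real data blocks can happen
     * lazily later; crash recovery replays the log (not shown).
     */
    int
    redo_write(int logfd, uint64_t off, const void *buf, uint32_t len)
    {
        struct redo_rec r = { .off = off, .len = len };

        if (write(logfd, &r, sizeof(r)) != (ssize_t)sizeof(r))
            return (-1);
        if (write(logfd, buf, len) != (ssize_t)len)
            return (-1);
        return (fsync(logfd));  /* one linear write + cache flush */
    }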
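For issue (1), the shape of the offset-range lock is roughly the following. This is a hedged sketch, not DragonFly's actual code: the rangelock_*() API, the v_rangelock field, and do_write() are assumptions (loosely inspired by the range locks found in some BSD kernels). The point is that write() locks only the bytes it touches instead of taking an exclusive lock on the whole vnode.

    /* Assumed per-vnode range-lock API; not DragonFly's real interface. */
    struct rangelock;

    void *rangelock_wlock(struct rangelock *rl, off_t beg, off_t end);
    void  rangelock_unlock(struct rangelock *rl, void *cookie);

    /* struct vnode and struct uio as in the BSD kernels; do_write()
       stands in for the filesystem's actual write path. */
    int
    my_vop_write(struct vnode *vp, struct uio *uio)
    {
        off_t beg = uio->uio_offset;
        off_t end = uio->uio_offset + uio->uio_resid;
        void *cookie;
        int error;

        /*
         * Lock only [beg, end).  A write() that blocks waiting for
         * buffer cache buffers no longer stalls read()s on
         * non-overlapping ranges of the same vnode, while overlapping
         * read()s still see the write atomically.
         */
        cookie = rangelock_wlock(vp->v_rangelock, beg, end);
        error = do_write(vp, uio);      /* may block on buffer cache */
        rangelock_unlock(vp->v_rangelock, cookie);
        return (error);
    }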
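And for issue (2), the fix is essentially to not hand the buffer to the device until there is room for it. A hedged sketch follows; bawrite() and bdwrite() are the classic BSD buffer-cache interfaces, but the counter, its limit, and flush_one_buffer() are assumptions, not the actual DragonFly flusher code.

    /* Kernel context assumed: struct buf, bawrite(), bdwrite()
       from <sys/buf.h>. */

    #define RUNNING_MAX 64      /* assumed cap on buffers queued to disk */

    static int running;         /* buffers currently queued to the device */

    /*
     * Flush one dirty buffer.  If the device-side queue is full, mark
     * the buffer dirty and RELEASE it with bdwrite() instead of
     * queueing it; the buffer then stays unlocked (and readable) until
     * the flusher picks it up again, rather than sitting locked behind
     * a long device queue.
     */
    void
    flush_one_buffer(struct buf *bp)
    {
        if (running < RUNNING_MAX) {
            ++running;          /* decremented in the I/O-done callback */
            bawrite(bp);        /* async write; bp stays locked until done */
        } else {
            bdwrite(bp);        /* defer; bp is unlocked immediately */
        }
    }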
Basically blogbench was able to write() at full speed (with disk I/O saturated 100% with writes) without any detrimental effect on read()s (which were being satisfied at full speed from the VM/buffer cache) during that phase. Before the fixes the two would interfere with each other quite a bit.

In fact, reducing the amount of time a buffer cache buffer undergoing write I/O remains locked is a really difficult problem, because you have to tune the data rate to match the disk drive's actual write pipeline (which changes depending on the simultaneous read load) so the drive's own caches don't get saturated with dirty data and stall out the write I/Os that were queued to it (leaving the related dirty buffer cache buffers locked even longer). I haven't been able to automate it (there's no way to query the disk), but I have been able to tune things manually; a sketch of what that knob looks like follows below.

In any case, I think the key takeaway here is that there are at least four (and probably more) different subsystems in the codebase which must be addressed to get good benchmark results. Many of these benchmarks do simultaneous reads and writes, which tends to tickle (and require) that all of the bottlenecks be addressed. SMP becomes more important when the system caches are well-utilized. Disk scheduling and buffer cache management become more important as the disk gets more saturated.
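As a footnote, the manual tuning mentioned above amounts to giving the flusher a bandwidth budget per tick. Another hedged sketch, in the same spirit as the one above: the knob value, the tick interval, and next_dirty_buffer() are all assumptions, hand-tuned rather than queried from the drive.

    static long flush_bw = 20L * 1024 * 1024;  /* bytes/sec, tuned by hand */

    /*
     * Flusher main loop: in each 1/10-second tick, queue at most
     * flush_bw / 10 bytes of dirty buffers to the device, so the
     * drive's internal write cache never saturates and stalls the
     * I/Os (and thus the locked buffers) queued behind it.
     */
    void
    flusher_thread(void)
    {
        struct buf *bp;
        long budget;

        for (;;) {
            budget = flush_bw / 10;
            while (budget > 0 && (bp = next_dirty_buffer()) != NULL) {
                budget -= bp->b_bcount;
                bawrite(bp);
            }
            tsleep(&flush_bw, 0, "flushtk", hz / 10);
        }
    }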
-Matt