CC'd current_at_ because it might be of more general interest; Mark and I had a phone call a few days ago to discuss how to improve the performance of entropy harvesting, which occurs along several critical paths for network performance. This was a follow-up to previous e-mail and other online discussion. Some of the ideas here are specific to entropy gathering, but others apply more generally. We discussed a few things, including:

- Currently, the harvesting code uses several spin mutexes, including a per-source fifo mutex and a free list mutex. The result is four mutex operations per entropy event: two for the free list and two for the source fifo. I suggested exploring whether coalescing some of these mutexes would help lower locking overhead without negatively impacting contention/parallelism. In particular, we might use per-fifo free lists, or just a single spin mutex over all harvesting.

- Currently, we gather all the entropy we can find in the harvesting code in the entropy source thread, subject to fairly high fifo bounds, and then "let the yarrow thread sort it out". Via e-mail and on the phone, we discussed ways to expose back pressure to the harvesting code to avoid paying the cost of harvesting if we have "adequate entropy" or if there's little demand for randomness. We discussed lockless approaches to exposing this, so that a lockless read could be used by interrupt handlers, etc., to avoid the cost of locking when entropy isn't required. I pointed at my modification to the harvest code to do a lockless check for a full fifo as an example of safe use of lockless reads for performance purposes (there's a sketch of the idiom after this list). I don't have a useful opinion on how we decide whether we have "enough" entropy, other than to note that on a busy system we're probably awash in it and doing a lot of work for little incremental benefit, especially on high-load network systems.

- We discussed moving from a floating buffer model to a ring buffer that requires less synchronization, or even permits very weak consistency. We didn't go into a lot of detail here, other than to observe that there are a variety of techniques that could be used to lower cost, lower synchronization, increase locality, etc.

- Currently, the yarrow thread performs a lot of sequential lock operations as it iterates over the entropy fifos and releases buffers. While this happens only intermittently, it adds up over time, and the current code layout results in possibly excessive mutex use. I suggested moving to a model where entropy records are moved "en masse" from the global fifo to a thread-local entropy queue or fifo within a single lock/unlock pair, to avoid grabbing and releasing the locks many times during processing; likewise for the free list (also sketched after this list). Note that every spin mutex acquire disables interrupts... Our current queue(9) macros don't all lend themselves to an O(1) move of the contents of some list types from one head to another, but even O(n) would probably be an improvement. I pointed at some similar use of queues in the network code to avoid thrashing driver locks when processing multiple packets, amortizing the locking cost over the batch.

- We discussed more generally when it is safe not to use locking, such as when we use pointer/integer-sized reads or writes and some staleness or inconsistency is acceptable. We also discussed that this might mean using a lockless read to decide whether something might happen (subject to a race), and then re-checking once locks are acquired in order to perform a safe read-modify-write. I.e., a lockless check on the fifo size to make sure there might be room, then acquire the mutex and re-read the value to make sure there really is room before adding the record. This is an optimistic concurrency/synchronization approach, since it optimizes for the assumption that we won't need to do the work, or won't need to do it often; the extra check is cheap compared to the occasional lock it saves.

- And to summarize a point previously discussed in e-mail: right now, mbuf harvesting in the ethernet input path is incorrectly harvesting from a free'd mbuf pointer. This means both that we're harvesting kernel pointers instead of packet data and that we may harvest from an unmapped page because the object has been free'd, etc. We discussed in the phone call whether there was benefit to reaping there; you observed that the timing information was useful, and that the ethernet header information might still be useful if correctly reaped. It wasn't clear to me that the ethernet header information would be useful, although the timing might well be (subject to cost). It's worth observing, though, that if you've already gathered entropy information from the interrupt, there is probably little incremental benefit to gathering in the ethernet input routines, especially since the time to perform incremental processing on each additional packet handled by an ethernet interrupt is likely fairly constant.

- I provided some general guidance on locking: it's expensive. Right now we're far more concerned about the cost of lock operations than the cost of contention, so the general guidance has been "err on the side of fewer locks for now; it's easier to diagnose too few than too many". We have some reasonable tools for diagnosing lock use: I pointed at MUTEX_PROFILING (while observing that it's expensive) and KTR. I have information on using KTR at http://www.watson.org/~robert/freebsd/netperf/ktr/, and the MUTEX_PROFILING man page is pretty decent, with instructive results. I noted that the reason I had chosen to focus some time on the entropy gathering code was that it showed up fairly clearly in interrupt handling traces from KTR; KTR is very helpful (if you're willing to invest the time to use it well) in getting a picture of how the system is behaving, and my recent modifications to KTR provide a lot more useful context information than we used to have.
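To make the lockless check concrete, here is a rough, untested sketch of the check/re-check idiom from the back-pressure and "when is it safe not to lock" points above. All of the names (harvest_fifo, harvest_event, harvest_enqueue, HARVEST_FIFO_MAX) are invented for illustration and are not the actual entropy code:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/queue.h>

    #define	HARVEST_FIFO_MAX	256		/* illustrative bound */

    struct harvest_event {
    	STAILQ_ENTRY(harvest_event)	he_next;
    	/* ... timestamp, source, payload ... */
    };

    struct harvest_fifo {
    	struct mtx			hf_mtx;		/* spin mutex */
    	volatile u_int			hf_count;	/* read locklessly */
    	STAILQ_HEAD(, harvest_event)	hf_events;
    };

    static int
    harvest_enqueue(struct harvest_fifo *hf, struct harvest_event *he)
    {

    	/*
    	 * Optimistic, lockless check: a single aligned integer read is
    	 * atomic, and staleness is fine because we re-check under the
    	 * lock below.  If the fifo already looks full, drop the event
    	 * without taking the spin mutex (and so without disabling
    	 * interrupts).
    	 */
    	if (hf->hf_count >= HARVEST_FIFO_MAX)
    		return (0);

    	mtx_lock_spin(&hf->hf_mtx);
    	/* Re-check now that the value can't change under us. */
    	if (hf->hf_count >= HARVEST_FIFO_MAX) {
    		mtx_unlock_spin(&hf->hf_mtx);
    		return (0);
    	}
    	STAILQ_INSERT_TAIL(&hf->hf_events, he, he_next);
    	hf->hf_count++;
    	mtx_unlock_spin(&hf->hf_mtx);
    	return (1);
    }

The property we care about is that the common "fifo is full" case never touches the spin mutex, and so never disables interrupts; correctness still comes entirely from the re-check performed while holding the lock.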
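And a similarly hypothetical sketch of the "en masse" move for the yarrow thread point above, reusing the made-up harvest_fifo/harvest_event types from the enqueue sketch. With an STAILQ the splice is O(1) via STAILQ_CONCAT; for list types without a concat macro, an O(n) hand-rolled move outside the lock would still beat a lock round trip per record:

    static struct mtx harvest_free_mtx;		/* hypothetical free list lock */
    static STAILQ_HEAD(, harvest_event) harvest_freelist =
        STAILQ_HEAD_INITIALIZER(harvest_freelist);

    static void yarrow_process_event(struct harvest_event *);	/* hypothetical consumer */

    static void
    yarrow_drain_fifo(struct harvest_fifo *hf)
    {
    	STAILQ_HEAD(, harvest_event) local;
    	struct harvest_event *he;

    	STAILQ_INIT(&local);

    	/* One lock round trip grabs everything queued so far. */
    	mtx_lock_spin(&hf->hf_mtx);
    	STAILQ_CONCAT(&local, &hf->hf_events);	/* splices and empties the source */
    	hf->hf_count = 0;
    	mtx_unlock_spin(&hf->hf_mtx);

    	/* Feed the records to the yarrow reseed logic with no locks held. */
    	STAILQ_FOREACH(he, &local, he_next)
    		yarrow_process_event(he);

    	/* A second lock round trip returns all of the buffers at once. */
    	mtx_lock_spin(&harvest_free_mtx);
    	STAILQ_CONCAT(&harvest_freelist, &local);
    	mtx_unlock_spin(&harvest_free_mtx);
    }

However the details shake out, the shape is two lock/unlock pairs per pass rather than two per record.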
As a follow-up, I re-read the harvesting code after our conversation, in particular to learn about the cost of harvesting time information. I observed that get_cyclecounter() is very cheap on modern hardware, but that on older hardware without a TSC it's extraordinarily expensive. In particular, the #ifdef code on i386 suggests that i486 systems may not have a TSC, and instead read the system clock (ouch!). We may want to investigate what approaches we can use to mitigate this, especially if systems like Soekris boxes don't have a TSC. Right now, the API for retrieving cycle counts does not allow the caller to easily distinguish those two cases, and we may need to teach it to do that so that the caller can decide it doesn't want to pay the higher cost.
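Purely as a hypothetical sketch of what "teaching the API" might look like (the flag and wrapper names below are invented, and nothing like them exists today; only the per-architecture cycle count read itself, spelled get_cyclecount() in the tree, is real), the caller-visible side could be quite small, assuming a flag set once at boot from the CPU feature probe:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <machine/cpu.h>		/* per-architecture get_cyclecount() */

    /*
     * Hypothetical: true only when the cycle count read is backed by cheap
     * hardware (e.g. a TSC) rather than a fall-back to reading the clock.
     */
    extern int	cyclecount_is_cheap;

    static __inline uint64_t
    harvest_timestamp(void)
    {

    	/*
    	 * Only pay for fine-grained timing when it is cheap; callers
    	 * treat a zero return as "no timing entropy available".
    	 */
    	if (cyclecount_is_cheap)
    		return (get_cyclecount());
    	return (0);
    }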
Finally, a question I raised previously by e-mail, and that I think is worth thinking about, is how we decide what impact we're having on entropy quality. In particular, are there scores or other easily understood numerical or statistical results that can be exposed to a user or developer to let them measure the impact on entropy quality and availability of a specific configuration, optimization, etc.? This would allow us to more easily say "X is or is not worth the performance cost", or at least allow our users to make that choice, and would give them more information about when to make it. This also paves the way for better auto-tuning.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert_at_fledge.watson.org   Principal Research Scientist, McAfee Research