On Sunday, 7 November 2004 at 12:06:26 +1100, freebsd_at_newmillennium.net.au wrote:
> On Sunday, 7 November 2004 10:23 AM, Greg 'groggy' Lehey wrote:
>>
>> 1. Too small a stripe size.  If you (our anonymous user, who was
>>    using a single dd process) have to perform multiple transfers for
>>    a single request, the results will be slower.
>
> I'm using the recommended 279kb from the man page.

That's fine from the performance point of view.  However, the
recommendation is suboptimal.  The order of magnitude is OK, but it
should preferably be a multiple of the UFS block size, by default
32 kB; 288 kB (9 x 32 kB) would be better.  But this won't affect the
results of your testing with dd.

>> 2. There may be some overhead in GEOM that slows things down.  If
>>    this is the case, something should be done about it.
>
> (Disclaimer: I have only looked at the code, not put in any debugging
> to verify the situation.  Also, my understanding is that the term
> "stripe" refers to the data in a plex which, when read sequentially,
> results in all disks being accessed exactly once, i.e. "A(n) B(n)
> C(n) P(n)", rather than blocks from a single subdisk, i.e. "A(n)",
> where (n) represents a group of contiguous blocks.  Please correct me
> if I am wrong.)

I haven't looked at the code at all :-(

> I can see a potential place for slowdown here . . .

If gvinum is reading entire stripes, yes.  I'll leave Lukas to comment
on whether this is really what's happening.

> In geom_vinum_plex.c, line 575
>
> ...
>
> It seems we are holding back all requests to a currently active
> stripe, even if it is just a read and would never write anything
> back.
> I think the following conditions should apply:
>
> - If the current transactions on the stripe are reads, and we want to
>   issue another read, let it through.
> - If the current transactions on the stripe are reads, and we want to
>   issue a write, queue it.
> - If the current transactions on the stripe are writes, and we want
>   to issue another write, queue it (but see below).
> - If the current transactions on the stripe are writes, and we want
>   to issue a read, queue it if it overlaps the data being written, or
>   if the plex is degraded and the request requires the parity to be
>   read; otherwise, let it through.

If this is correct, this is a very different strategy from Vinum.  If
we're talking about the corresponding code, Vinum locks the stripe and
serializes access to that stripe only.  For normal-sized volumes, this
means almost no clashes.

> We could also optimize writing a bit by doing the following:
>
> 1. To calculate parity, we could simply read the old data (that was
>    about to be overwritten) and the old parity, and recalculate the
>    parity based on that information, rather than reading in all the
>    stripes (based on the assumption that the original parity was
>    correct).

Yes.  This is what Vinum does.

> This would still take approximately the same amount of time, but
> would leave the other disks in the stripe available for other I/O.

In fact, the locking there is quite complicated.  I think the method I
described above gets round the issues.

> 2. If there are two or more writes pending for the same stripe (that
>    is, up to the point that the data|parity has been written), they
>    should be condensed into a single operation so that there is a
>    single write to the parity, rather than one write for each
>    operation.  This way, we should be able to get close to (N - 1) *
>    disk throughput for large sequential writes, without compromising
>    the integrity of the parity on disk.

That's one possible optimization.  It would certainly help optimize
the situation.
If the multiple requests coalesce to a whole-stripe write, you no
longer need to read the old parity.

> 3. When calculating parity as per (2), we should operate on whole
>    blocks (as defined by the underlying device).  This provides the
>    benefit of being able to write a complete block to the subdisk, so
>    the underlying mechanism does not have to do a read/update/write
>    operation to write a partial block.

I'm not sure what you're saying here.  If it's a repeat of my last
sentence, yes, but only sometimes.  With a stripe size in the order of
300 kB, you're talking about 1 or 2 MB per "block" (i.e. a stripe
across all disks).  That kind of write doesn't happen very often.  At
the other end, all disks support a "block" (or "sector") of 512 B, and
that's the granularity of the system.

Take a look at http://www.vinumvm.org/vinum/intro.html for more
details.

None of this implies anything in gvinum.  I haven't had time to look
at it.  I had assumed that it would copy the Vinum optimizations.

Greg
--
See complete headers for address and phone numbers.
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:38:21 UTC