Re: Gvinum RAID5 performance

From: Greg 'groggy' Lehey <grog_at_FreeBSD.org> Date: Sun, 7 Nov 2004 13:10:14 +1030

On Sunday,  7 November 2004 at 12:06:26 +1100, freebsd_at_newmillennium.net.au wrote:
> On  Sunday, 7 November 2004 10:23 AM, Greg 'groggy' Lehey wrote:
>>
>> 1.  Too small a stripe size.  If you (our anonymous user, who was
>>     using a single dd process) have to perform multiple transfers for
>>     a single request, the results will be slower.
>
> I'm using the recommended 279kb from the man page.

That's fine from the performance point of view.  However, the
recommendation is suboptimal.  The order of magnitude is OK, but it
should preferably be a multiple of the UFS block size, by default 32
kB.  As a result, 288 kB would be better.  But this won't affect the
results of your testing with dd.
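
As a quick check of that arithmetic, here is a minimal sketch that
rounds the recommended size up to the next multiple of the block size.
It assumes the 32 kB block size stated above; the helper name is made
up for illustration.

```python
def round_up_to_multiple(size_kb: int, block_kb: int) -> int:
    """Return the smallest multiple of block_kb that is >= size_kb."""
    # Ceiling division using integer arithmetic only.
    return -(-size_kb // block_kb) * block_kb


# 279 kB is not a multiple of 32 kB; the next multiple up is 9 * 32.
stripe_kb = round_up_to_multiple(279, 32)
print(stripe_kb)  # 288
```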

>> 2.  There may be some overhead in GEOM that slows things down.  If
>>     this is the case, something should be done about it.
>
> (Disclaimer: I have only looked at the code, not put in any debugging to
> verify the situation. Also, my understanding is that the term "stripe"
> refers to the data in a plex which when read sequentially results in all
> disks being accessed exactly once, i.e. "A(n) B(n) C(n) P(n)" rather
> than blocks from a single subdisk, i.e. "A(n)", where (n) represents a
> group of contiguous blocks. Please correct me if I am wrong)

I haven't looked at the code at all :-(

> I can see a potential place for slowdown here . . .

If gvinum is reading entire stripes, yes.  I'll leave Lukas to comment
on whether this is really what's happening.

> In geom_vinum_plex.c, line 575
>
> ...
>
> It seems we are holding back all requests to a currently active stripe,
> even if it is just a read and would never write anything back. I think
> the following conditions should apply:
>
> - If the current transactions on the stripe are reads, and we want to
> issue another read, let it through
> - If the current transactions on the stripe are reads, and we want to
> issue a write, queue it
> - If the current transactions on the stripe are writes, and we want to
> issue another write, queue it (but see below)
> - If the current transactions on the stripe are writes, and we want to
> issue a read, queue it if it overlaps the data being written, or if the
> plex is degraded and the request requires the parity to be read,
> otherwise, let it through

If this is correct, this is a very different strategy from Vinum.  If
we're talking about the corresponding code, Vinum locks the stripe and
serializes access to that stripe only.  For normal sized volumes, this
means almost no clashes.
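
For reference, the four conditions proposed in the quoted text can be
sketched as a single admission check.  This is only an illustration of
the proposal as I read it, not code from gvinum or Vinum: the Request
type, the function names and the degraded_needs_parity flag are all
hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    is_write: bool
    offset: int  # byte offset within the stripe (hypothetical layout)
    length: int  # length in bytes


def overlaps(a: Request, b: Request) -> bool:
    """True if the two byte ranges share at least one byte."""
    return a.offset < b.offset + b.length and b.offset < a.offset + a.length


def may_proceed(new: Request, active: List[Request],
                degraded_needs_parity: bool = False) -> bool:
    """Return True to let `new` through, False to queue it."""
    if not active:
        return True
    if new.is_write:
        # A write queues behind anything active on the stripe.
        return False
    active_writes = [r for r in active if r.is_write]
    if not active_writes:
        # Read while only reads are active: let it through.
        return True
    if degraded_needs_parity:
        # Degraded plex and the read needs parity: wait for the writes.
        return False
    # Read while writes are active: queue only if it overlaps one.
    return not any(overlaps(new, w) for w in active_writes)
```

By contrast, serializing all access to a locked stripe amounts to
returning False whenever `active` is non-empty.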

> We could also optimize writing a bit by doing the following:
>
> 1. To calculate parity, we could simply read the old data (that was
> about to be overwritten), and the old parity, and recalculate the parity
> based on that information, rather than reading in all the stripes (based
> on the assumption that the original parity was correct).

Yes.  This is what Vinum does.
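
The reason this works is that RAID-5 parity is a plain XOR, so the old
data can be cancelled out of the old parity and the new data folded
in.  A minimal sketch follows, assuming equal-sized blocks; the
function names are made up for illustration and are not taken from
the Vinum source.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)


def updated_parity(old_parity: bytes, old_data: bytes,
                   new_data: bytes) -> bytes:
    """New parity from the old parity plus the one block being changed."""
    # XORing in old_data removes its contribution; new_data adds its own.
    return xor_blocks(old_parity, old_data, new_data)


# Three data blocks and their parity.
a, b, c = b"\x01\x02", b"\x10\x20", b"\xff\x00"
parity = xor_blocks(a, b, c)

# Overwrite block b: two reads (old data, old parity), nothing from a or c.
new_b = b"\x33\x44"
assert updated_parity(parity, b, new_b) == xor_blocks(a, new_b, c)
```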

> This would still take approximately the same amount of time, but
> would leave the other disks in the stripe available for other I/O.

In fact, the locking there is quite complicated.  I think the method I
described above gets round the issues.

> 2. If there are two or more writes pending for the same stripe (that
> is, up to the point that the data|parity has been written), they
> should be condensed into a single operation so that there is a
> single write to the parity, rather than one write for each
> operation. This way, we should be able to get close to (N -1) * disk
> throughput for large sequential writes, without compromising the
> integrity of the parity on disk.

That's one possible optimization, and it would certainly help.  If
the multiple requests coalesce to a whole-stripe
write, you no longer need to read the old parity.
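
To illustrate the whole-stripe case: once the pending writes cover
every data subdisk in the stripe, the parity follows from the new data
alone.  This is a sketch under assumptions, not gvinum or Vinum code:
the `pending` mapping (data subdisk index to new block) and the
function names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)


def plan_parity(pending: Dict[int, bytes],
                n_data: int) -> Tuple[str, Optional[bytes]]:
    """Decide how to produce parity for the coalesced writes to one stripe."""
    if set(pending) == set(range(n_data)):
        # Every data block is being replaced: no reads needed at all.
        return "full-stripe", xor_blocks(*(pending[i] for i in range(n_data)))
    # Partial stripe: old data and old parity must still be read first.
    return "read-modify-write", None


print(plan_parity({0: b"\x01", 1: b"\x02", 2: b"\x04"}, 3)[0])  # full-stripe
print(plan_parity({1: b"\x02"}, 3)[0])  # read-modify-write
```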

> 3. When calculating parity as per (2), we should operate on whole
> blocks (as defined by the underlying device). This provides the
> benefit of being able to write a complete block to the subdisk, so
> the underlying mechanism does not have to do a read/update/write
> operation to write a partial block.

I'm not sure what you're saying here.  If it's a repeat of my last
sentence, yes, but only sometimes.  With a stripe size in the order of
300 kB, you're talking 1 or 2 MB per "block" (i.e. stripe across all
disks).  That kind of write doesn't happen very often.  At the other
end, all disks support a "block" (or "sector") of 512 B, and that's
the granularity of the system.  

Take a look at http://www.vinumvm.org/vinum/intro.html for more
details.

None of this implies anything in gvinum.  I haven't had time to look
at it.  I had assumed that it would copy the Vinum optimizations.

Greg
--
See complete headers for address and phone numbers.