RE: Gvinum RAID5 performance

From: <freebsd_at_newmillennium.net.au>
Date: Sun, 7 Nov 2004 12:06:26 +1100
> -----Original Message-----
> From: Greg 'groggy' Lehey [mailto:grog_at_FreeBSD.org] 
> Sent: Sunday, 7 November 2004 10:23 AM
> To: Lukas Ertl
> Cc: freebsd_at_newmillennium.net.au; freebsd-current_at_FreeBSD.org
> Subject: Re: Gvinum RAID5 performance
> 
> 1.  Too small a stripe size.  If you (our anonymous user, who was
>     using a single dd process) have to perform multiple transfers for
>     a single request, the results will be slower.

I'm using the 279 kB stripe size recommended in the man page.

> 2.  There may be some overhead in GEOM that slows things down.  If
>     this is the case, something should be done about it.

(Disclaimer: I have only looked at the code, not put in any debugging
to verify the situation. Also, my understanding is that the term
"stripe" refers to the data in a plex which, when read sequentially,
results in all disks being accessed exactly once, i.e. "A(n) B(n)
C(n) P(n)", rather than the blocks from a single subdisk, i.e.
"A(n)", where (n) represents a group of contiguous blocks. Please
correct me if I am wrong.)

I can see a potential place for slowdown here . . .

In geom_vinum_plex.c, line 575

/*
 * RAID5 sub-requests need to come in correct order, otherwise
 * we trip over the parity, as it might be overwritten by
 * another sub-request.
 */
if (pbp->bio_driver1 != NULL &&
    gv_stripe_active(p, pbp)) {
	/* Park the bio on the waiting queue. */
	pbp->bio_cflags |= GV_BIO_ONHOLD;
	bq = g_malloc(sizeof(*bq), M_WAITOK | M_ZERO);
	bq->bp = pbp;
	mtx_lock(&p->bqueue_mtx);
	TAILQ_INSERT_TAIL(&p->wqueue, bq, queue);
	mtx_unlock(&p->bqueue_mtx);
}

It seems we are holding back all requests to a currently active
stripe, even if the new request is just a read and would never write
anything back. I think the following conditions should apply (a rough
sketch of this logic follows the list):

- If the current transactions on the stripe are reads, and we want to
issue another read, let it through
- If the current transactions on the stripe are reads, and we want to
issue a write, queue it
- If the current transactions on the stripe are writes, and we want to
issue another write, queue it (but see below)
- If the current transactions on the stripe are writes, and we want to
issue a read, queue it if it overlaps the data being written, or if
the plex is degraded and the request requires the parity to be read;
otherwise let it through
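
Something like the following could capture those rules. This is a
rough sketch only: the function name, its arguments, and the overlap
test are hypothetical names for illustration, not the actual gvinum
API, and it simplifies by treating the active requests as either all
reads or all writes.

/*
 * Hypothetical sketch, not actual gvinum code: decide whether a new
 * request conflicts with requests already active on the same stripe.
 * Returns non-zero if the new bio should be parked on the wait queue.
 */
static int
gv_stripe_conflict(int active_writes, int new_is_write,
    int overlaps_active, int plex_degraded, int needs_parity)
{
	/* Reads against active reads never conflict. */
	if (!active_writes && !new_is_write)
		return (0);
	/* Any new write against active requests must wait. */
	if (new_is_write)
		return (1);
	/* Read against active writes: wait only on a real dependency. */
	if (overlaps_active)
		return (1);
	if (plex_degraded && needs_parity)
		return (1);
	return (0);
}

The dispatch path would then only park the bio on the wait queue when
this returns non-zero, instead of parking it whenever the stripe has
any activity at all.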


We could also optimize writing a bit by doing the following:

1. To calculate parity, we could simply read the old data (that is
about to be overwritten) and the old parity, and recalculate the
parity from those, rather than reading in the rest of the stripe
(this assumes the original parity was correct). Since the new parity
is just the old parity XOR the old data XOR the new data, this would
still take approximately the same amount of time, but would leave the
other disks in the stripe available for other I/O (a sketch follows
this list).

2. If there are two or more writes pending for the same stripe (that
is, up to the point where the data and parity have been written), they
should be condensed into a single operation, so that there is one
write to the parity rather than one write for each operation. This
way, we should be able to get close to (N - 1) * single-disk
throughput for large sequential writes (e.g. roughly 3x a single disk
with four disks), without compromising the integrity of the parity on
disk.

3. When calculating parity as per (2), we should operate on whole
blocks (as defined by the underlying device). This provides the
benefit of being able to write a complete block to the subdisk, so
the underlying mechanism does not have to do a read/modify/write
cycle to update a partial block.
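
To make (1) concrete: since RAID5 parity is a plain XOR across the
stripe, XOR-ing the old data back out of the parity removes its
contribution, so new parity = old parity XOR old data XOR new data.
A minimal sketch, assuming the old data and old parity have already
been read in and the buffers are whole blocks as per (3); the name
and signature are mine, not gvinum's:

#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not actual gvinum code: recompute RAID5
 * parity for a small write without touching the other subdisks.
 * All three buffers cover the same len bytes of the stripe.
 */
static void
gv_parity_update(uint8_t *parity, const uint8_t *old_data,
    const uint8_t *new_data, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		parity[i] ^= old_data[i] ^ new_data[i];
}

That is two disk reads and two disk writes per small write,
independent of the number of disks in the stripe; coalescing as per
(2) would fold several pending writes into this loop before the
single parity write goes out.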


Comments?

-- 
Alastair D'Silva           mob: 0413 485 733
Networking Consultant      fax: 0413 181 661
New Millennium Networking  web: http://www.newmillennium.net.au