RE: Gvinum RAID5 performance

From: Lukas Ertl <le_at_FreeBSD.org>
Date: Sun, 7 Nov 2004 11:40:59 +0100 (CET)
On Sun, 7 Nov 2004 freebsd_at_newmillennium.net.au wrote:

> In geom_vinum_plex.c, line 575
>
> /*
> * RAID5 sub-requests need to come in correct order, otherwise
> * we trip over the parity, as it might be overwritten by
> * another sub-request.
> */
> if (pbp->bio_driver1 != NULL &&
>    gv_stripe_active(p, pbp)) {
> 	/* Park the bio on the waiting queue. */
> 	pbp->bio_cflags |= GV_BIO_ONHOLD;
> 	bq = g_malloc(sizeof(*bq), M_WAITOK | M_ZERO);
> 	bq->bp = pbp;
> 	mtx_lock(&p->bqueue_mtx);
> 	TAILQ_INSERT_TAIL(&p->wqueue, bq, queue);
> 	mtx_unlock(&p->bqueue_mtx);
> }
>
> It seems we are holding back all requests to a currently active stripe,
> even if it is just a read and would never write anything back.

No, only writes are held back.  pbp->bio_driver1 is NULL for a normal
read, so the first half of that condition already lets reads through.
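
As a rough illustration only (a userland model, not the gvinum source):
the check boils down to the predicate below.  struct model_bio and
model_should_park() are made-up names; driver1 stands in for
bio_driver1, and the stripe_active flag stands in for the result of
gv_stripe_active().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the one struct bio field that matters here. */
struct model_bio {
	void *driver1;	/* NULL for a normal read, as described above */
};

/*
 * Return true if the request should be parked on the waiting queue.
 * A request is held back only when it carries driver1 state AND the
 * stripe it targets is currently active.
 */
static bool
model_should_park(const struct model_bio *bp, bool stripe_active)
{
	assert(bp != NULL);
	return (bp->driver1 != NULL && stripe_active);
}
```

With this model, a request whose driver1 is NULL is never parked, no
matter what the stripe is doing.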

> 1. To calculate parity, we could simply read the old data (that was
> about to be overwritten), and the old parity, and recalculate the parity
> based on that information, rather than reading in all the stripes (based
> on the assumption that the original parity was correct). This would
> still take approximately the same amount of time, but would leave the
> other disks in the stripe available for other I/O.

That's how it's already done: the old parity and the old data are read,
then the new parity and the new data are written.
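
For reference, the arithmetic behind this is new parity = old parity
XOR old data XOR new data.  A minimal userland sketch, not the actual
gvinum code (raid5_update_parity() is a name made up for this example):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Read-modify-write parity update for one data block.
 * The caller has already read the old data and the old parity;
 * the parity buffer is updated in place and can then be written
 * back together with new_data.  The other data disks in the
 * stripe are never touched.
 */
static void
raid5_update_parity(uint8_t *parity, const uint8_t *old_data,
    const uint8_t *new_data, size_t len)
{
	size_t i;

	assert(parity != NULL && old_data != NULL && new_data != NULL);
	for (i = 0; i < len; i++)
		parity[i] ^= old_data[i] ^ new_data[i];
}
```

XORing the old data out and the new data in gives the same parity as
recomputing it over the whole stripe, provided the old parity was
correct to begin with.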

> 2. If there are two or more writes pending for the same stripe (that is,
> up to the point that the data|parity has been written), they should be
> condensed into a single operation so that there is a single write to the
> parity, rather than one write for each operation. This way, we should be
> able to get close to (N -1) * disk throughput for large sequential
> writes, without compromising the integrity of the parity on disk.
>
> 3. When calculating parity as per (2), we should operate on whole blocks
> (as defined by the underlying device). This provides the benefit of
> being able to write a complete block to the subdisk, so the underlying
> mechanism does not have to do a read/update/write operation to write a
> partial block.

These are interesting ideas and I'm going to think about them.
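
To make idea (2) concrete, here is one possible shape for it, purely as
a sketch and not something gvinum does today: accumulate the XOR delta
of every pending write to a stripe, then apply it to the parity once.
BLKSZ, struct stripe_delta and the delta_*() functions are all
hypothetical names, and the tiny block size is only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLKSZ 8		/* hypothetical block size, kept tiny on purpose */

/* Accumulated parity change for one stripe. */
struct stripe_delta {
	uint8_t delta[BLKSZ];	/* XOR of (old ^ new) over all queued writes */
	unsigned nwrites;	/* number of writes folded in so far */
};

static void
delta_init(struct stripe_delta *sd)
{
	memset(sd, 0, sizeof(*sd));
}

/* Fold one whole-block write into the pending delta. */
static void
delta_add(struct stripe_delta *sd, const uint8_t *old_data,
    const uint8_t *new_data)
{
	size_t i;

	assert(sd != NULL && old_data != NULL && new_data != NULL);
	for (i = 0; i < BLKSZ; i++)
		sd->delta[i] ^= old_data[i] ^ new_data[i];
	sd->nwrites++;
}

/* Apply all queued changes to the parity block in a single update. */
static void
delta_commit(struct stripe_delta *sd, uint8_t *parity)
{
	size_t i;

	assert(sd != NULL && parity != NULL);
	for (i = 0; i < BLKSZ; i++)
		parity[i] ^= sd->delta[i];
	delta_init(sd);
}
```

Because XOR is associative and commutative, the order in which writes
are folded in does not matter; the parity is read and written once per
batch instead of once per write.  The sketch ignores locking and what
happens if the machine crashes between the data writes and the commit.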

thanks,
le

-- 
Lukas Ertl                         http://homepage.univie.ac.at/l.ertl/
le_at_FreeBSD.org                     http://people.freebsd.org/~le/
Received on Sun Nov 07 2004 - 09:41:04 UTC
