> -----Original Message-----
> From: Brad Knowles [mailto:brad_at_stop.mail-abuse.org]
> Sent: Monday, 1 November 2004 9:48 PM
> To: Alastair D'Silva
> Cc: current_at_freebsd.org
> Subject: Re: Gvinum RAID5 performance
>
> Keep in mind that if you've got a five disk RAID-5 array, then
> for any given block, four of those disks are data and would have to
> be accessed on every read operation anyway, and only one disk would
> be parity. The more disks you have in your RAID array, the lower the
> parity to data ratio, and the less benefit you would get from
> checking parity in background.

Not quite true. The general expectation I have is that a RAID5 setup
would be used in a situation where the array will encounter a very high
ratio of reads to writes, so optimising for the common case by avoiding
reading every disk in the stripe makes sense.

By avoiding reading the whole stripe every time a read request is issued
(at least, without caching the results), the expected throughput of the
array would be a little less than (N - 1) * (drive throughput), whereas
the current implementation gives us an expected throughput of less than
(drive throughput).

A brief look at geom_vinum_raid5.c indicates that this was the original
intention, with data for undegraded reads coming from a single subdisk.
I'm guessing that there is some serialisation deeper down - I don't have
time to look into it tonight, but maybe tomorrow. If someone could point
me to whatever processes the queue that GV_ENQUEUE adds to, it would
save me some time :)

If my guess about serialisation is correct, it would explain why my
drives are flickering instead of being locked on solid, since they
should be read in parallel (the block size specified in dd was
significantly larger than the stripe size, so the read requests *should*
have been issued in parallel).

> Most disks do now have track caches, and they do read and write
> entire tracks at once. However, given the multitudes of
> permutations that go on with data addressing (including bad sector
> mapping, etc...), what the disk thinks of as a "track" may have
> absolutely no relationship whatsoever to what the OS or driver sees
> as related or contiguous data.
>
> Therefore, the track cache may not contribute in any meaningful
> way to what the RAID-5 implementation needs in terms of a stripe
> cache. Moreover, the RAID-5 implementation already knows that it
> needs to do a full read/write of the entire stripe every time it
> accesses or writes data to that stripe, and this could easily have
> destructive interference with the on-disk track cache.

Ok, this makes sense - my understanding of how the on-disk cache
operates is somewhat lacking.

> going on as the data is being accessed. Fundamentally, RAID-5 is not
> going to be as fast as directly reading the underlying disk.

Well, my point is that RAID5 should have a greater throughput than a
single drive when reading an undegraded volume, since consecutive (or
random non-conflicting) data can be pulled from different drives, the
same way it can in a conventional stripe or mirror.

Verifying the parity on every request is pointless: not only does it
hinder performance, but a simple XOR parity check does not tell you
where the error was, only that there was an error.
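To make that last point concrete, here is a tiny stand-alone C sketch
(purely my own illustration, not gvinum code; the 8-word "stripe" and
all names are made up): plain XOR parity only reports *that* a bit has
flipped somewhere, while keeping a second parity along the other axis
pins the flip down to an exact word and bit - which is roughly where
the next paragraph is headed.

/*
 * Toy demo: XOR parity detects a flipped bit but cannot locate it;
 * row parity + column parity together can.  Not gvinum code.
 */
#include <stdio.h>
#include <stdint.h>

#define NWORDS  8                       /* 8 words of 8 bits = one toy "stripe" */

int
main(void)
{
        uint8_t data[NWORDS] = { 0x3c, 0xa5, 0x0f, 0xf0, 0x81, 0x7e, 0x55, 0xaa };
        uint8_t row_par[NWORDS];        /* one parity bit per word (row) */
        uint8_t col_par = 0;            /* per-bit-position (column) parity */
        int i, b;

        for (i = 0; i < NWORDS; i++) {
                uint8_t w = data[i], p = 0;
                for (b = 0; b < 8; b++)
                        p ^= (w >> b) & 1;
                row_par[i] = p;         /* row parity for this word */
                col_par ^= data[i];     /* running XOR gives column parity */
        }

        data[3] ^= 0x10;                /* flip bit 4 of word 3 ("media error") */

        /* Plain XOR parity: we only learn that *some* column is bad. */
        uint8_t col_check = 0;
        for (i = 0; i < NWORDS; i++)
                col_check ^= data[i];
        printf("column parity mismatch mask: 0x%02x\n",
            (unsigned)(col_check ^ col_par));

        /* Row + column parity together locate (and can repair) the bad bit. */
        for (i = 0; i < NWORDS; i++) {
                uint8_t w = data[i], p = 0;
                for (b = 0; b < 8; b++)
                        p ^= (w >> b) & 1;
                if (p != row_par[i])
                        printf("bad bit is in word %d, bit mask 0x%02x\n",
                            i, (unsigned)(col_check ^ col_par));
        }
        return (0);
}

The extra cost is basically one more XOR pass over the data, which fits
the "maybe twice as much" guess below.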
Hmm, now there's an interesting idea - implement an ECC-style algorithm
for the parity calculation to protect against flipped bits. It would
probably not be significantly more computationally intensive than the
simple parity (maybe twice as much, on the assumption that the parity
for each word is calculated once for each row and once for each column),
and it would provide the software with enough information to regenerate
the faulty data, and provide the user with advance notice of a failing
drive.

> > I think both approaches have the ability to increase overall
> > reliability as well as improve performance since the drives will
> > not be worked as hard.
>
> A "lazy read parity" RAID-5 implementation might have slightly
> increased performance over normal RAID-5 on the same

I would say an increase of (N - 2) times the single-drive throughput is
significant, rather than slight, and from a quick glance at the code,
this is the way the author intended it to operate. Of course, I would
really love it if Lukas could share his knowledge on this, since he
wrote the code :)

BTW, Lukas, I don't buy into the offset calculation justification for
poor performance - the overhead is minimal compared to drive access
times.

On a side note, I was thinking of the following for implementing
growable RAID5 (a rough sketch of the loop is in the P.S. below my sig):

First, have a few bytes in the vinum header for that
subdisk/plex/volume/whatever (there is a header somewhere describing the
plexes, right?) which record how much of the volume has been converted
to the new (larger) layout.

Now, for every new stripe, read the appropriate data from the old
stripe, write it out in the new layout, and update the header. If the
power fails at any point, the header won't have been updated, and the
original stripe will still be intact, so we can resume as needed.

The only problem that I can see is that if the power fails (or another
disaster occurs) during the first few stripes processed, there is
uncertainty in the data, as what is on disk may be from either layout.
To combat this, maybe the first few stripes should be moved on a
block-by-block basis, rather than a whole stripe at a time.

--
Alastair D'Silva              mob: 0413 485 733
Networking Consultant         fax: 0413 181 661
New Millennium Networking     web: http://www.newmillennium.net.au
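P.S. Here is roughly what I mean by the grow loop, as C-flavoured
pseudocode. None of these names or structures exist in gvinum - the
three helpers are placeholders for the real subdisk I/O - it's only
meant to show where the watermark in the header fits in.

/*
 * Rough sketch only - hypothetical names, not gvinum code.
 * "migrated" is the on-disk watermark: every stripe below it is
 * already in the new (wider) layout, everything above it is still
 * in the old one.
 */
#include <errno.h>
#include <stdint.h>

#define STRIPESIZE      (256 * 1024)

struct grow_hdr {
        uint64_t migrated;              /* stripes already rewritten */
        uint64_t total;                 /* total stripes in the plex */
};

/* Placeholder I/O helpers - hypothetical, not part of gvinum. */
int read_stripe_old_layout(uint64_t stripe, void *buf);
int write_stripe_new_layout(uint64_t stripe, const void *buf);
int flush_grow_hdr(const struct grow_hdr *hdr);

static int
raid5_grow(struct grow_hdr *hdr)
{
        static char buf[STRIPESIZE];

        while (hdr->migrated < hdr->total) {
                uint64_t s = hdr->migrated;

                /* 1. Read stripe 's' using the old geometry. */
                if (read_stripe_old_layout(s, buf) != 0)
                        return (EIO);

                /* 2. Rewrite it (data + parity) in the new geometry. */
                if (write_stripe_new_layout(s, buf) != 0)
                        return (EIO);

                /*
                 * 3. Advance and persist the watermark only after the
                 *    write completes, so a power failure costs us at
                 *    most the stripe that was in flight.
                 */
                hdr->migrated = s + 1;
                if (flush_grow_hdr(hdr) != 0)
                        return (EIO);
        }
        return (0);
}

The block-by-block handling of the first few stripes would slot in
before the main loop; I've left it out to keep the sketch short.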