> If this is correct, this is a very different strategy from
> Vinum. If we're talking about the corresponding code, Vinum
> locks the stripe and serializes access to that stripe only.
> For normal sized volumes, this means almost no clashes.

My point is that access to the stripe should only be serialized under
certain conditions. In my case, the ability to stream large files
to/from the RAID5 volume is hampered by this situation.

This is my understanding of what is happening now:

1. The userland app requests a read of a big chunk of data.
2. Somewhere in the OS, the request is broken into smaller chunks,
   which are issued to gvinum.
3. gvinum queues all requests issued to it.
4. gvinum's worker walks the queue and starts processing the first
   request (blocking subsequent requests, since they are likely to
   fall in the same stripe).
5. The first request is retired, and the next request is processed.

If we follow the logic I outlined in the previous mail, we should
instead have something like this:

1. The userland app requests a read of a big chunk of data.
2. Somewhere in the OS, the request is broken into smaller chunks,
   which are issued to gvinum.
3. gvinum queues all requests issued to it.
4. gvinum's worker walks the queue and starts processing the first
   request.
5. gvinum checks the next request, realizes that it is a read and
   that only other reads are pending for that stripe, and issues it
   as well (repeat).
6. The read requests are retired (in no particular order, but the
   code that split the original request into smaller ones should
   handle that).

In the first scenario, a large sequential read is broken into smaller
chunks, and each chunk is processed sequentially. In the second
scenario, the smaller chunks are processed in parallel, so all the
drives in the array are worked simultaneously. (A rough sketch of the
kind of per-stripe locking I have in mind is at the end of this mail.)

> > 3. When calculating parity as per (2), we should operate on whole
> > blocks (as defined by the underlying device). This provides the
> > benefit of being able to write a complete block to the subdisk,
> > so the underlying mechanism does not have to do a
> > read/update/write operation to write a partial block.
>
> I'm not sure what you're saying here. If it's a repeat of my
> last sentence, yes, but only sometimes. With a stripe size
> in the order of 300 kB, you're talking 1 or 2 MB per "block"
> (i.e. stripe across all disks). That kind of write doesn't
> happen very often. At the other end, all disks support a
> "block" (or "sector") of 512 B, and that's the granularity of
> the system.

I'm referring here to the underlying blocks of the block device
itself. My understanding of block devices is that they cannot operate
on part of a block - they must read or write the whole block in one
operation. If we were to write only the data to be updated, the
low-level driver (or perhaps the drive itself, depending on the
hardware and implementation) must first read the block into a buffer,
update the relevant part of the buffer, then write the result back
out. Since we had to read the whole block for the parity calculation
to begin with, we could use that data to construct the whole block to
be written, so the underlying driver would not have to do the read
operation. (A small sketch of this follows below as well.)
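To make the second scenario concrete, here is a rough userland sketch
of the per-stripe locking I mean, using one reader-writer lock per
stripe so that any number of reads can be in flight against a stripe
while a write (which must update parity) still gets exclusive access.
This is illustrative only - the names (stripe_lock_for, raid5_read,
raid5_write) and the lock table are made up, not gvinum's actual
structures:

/*
 * Sketch only: per-stripe reader-writer locking, not gvinum code.
 */
#include <pthread.h>
#include <stdint.h>

#define NSTRIPES 64 /* lock table size; hash the stripe number into it */

static pthread_rwlock_t stripe_locks[NSTRIPES];

static void
stripe_locks_init(void)
{
	for (int i = 0; i < NSTRIPES; i++)
		pthread_rwlock_init(&stripe_locks[i], NULL);
}

static pthread_rwlock_t *
stripe_lock_for(uint64_t stripeno)
{
	return (&stripe_locks[stripeno % NSTRIPES]);
}

/* A read request only takes the stripe lock shared... */
static void
raid5_read(uint64_t stripeno /* , buffer, offset, length, ... */)
{
	pthread_rwlock_t *lk = stripe_lock_for(stripeno);

	pthread_rwlock_rdlock(lk);
	/*
	 * Issue reads to the relevant subdisks; other reads on this
	 * stripe proceed concurrently, so all drives stay busy.
	 */
	pthread_rwlock_unlock(lk);
}

/* ...while a write (read/modify/write of parity) takes it exclusive. */
static void
raid5_write(uint64_t stripeno /* , buffer, offset, length, ... */)
{
	pthread_rwlock_t *lk = stripe_lock_for(stripeno);

	pthread_rwlock_wrlock(lk);
	/* Read old data and parity, compute new parity, write back. */
	pthread_rwlock_unlock(lk);
}

With this, the worker never has to stall a read behind another read on
the same stripe; only writes serialize.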
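And a small sketch of the whole-block idea from point (3). Assuming
512-byte device blocks, and that the old data and parity blocks have
already been read for the parity calculation (newparity = oldparity ^
olddata ^ newdata), the partial update can be merged into the buffer
we already hold, so only complete blocks are written back and the
driver or drive never has to do its own read/update/write cycle.
Again, all names here are illustrative:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLKSIZE 512

/*
 * Apply a partial update (off/len within one block, off + len <=
 * BLKSIZE) to a data block and its parity block, both already read.
 */
static void
update_block(uint8_t data[BLKSIZE],   /* old data block, already read */
    uint8_t parity[BLKSIZE],          /* old parity block, already read */
    const uint8_t *newdata, size_t off, size_t len)
{
	/* Recompute parity for the affected bytes: P ^= old ^ new. */
	for (size_t i = 0; i < len; i++)
		parity[off + i] ^= data[off + i] ^ newdata[i];

	/* Merge the update into the full data block. */
	memcpy(data + off, newdata, len);

	/*
	 * Now write 'data' and 'parity' out as complete BLKSIZE
	 * writes; no partial-block write ever reaches the device.
	 */
}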
--
Alastair D'Silva              mob: 0413 485 733
Networking Consultant         fax: 0413 181 661
New Millennium Networking     web: http://www.newmillennium.net.au