Re: My experiences with gvirstor

From: Ivan Voras <ivoras_at_fer.hr>
Date: Mon, 30 Apr 2007 00:22:38 +0200
Patrick Tracanelli wrote:
> Here is what I got as my experiences with gvirstor so far.
> 
> With kern.geom.virstor.debug=15 things get really slow. While
> accessing (newfs'ing) /dev/virstor/home the system starts using 98% of
> CPU cycles. No problem after all, just mentioning it in case this
> shouldn't happen.

I don't see how to avoid it, since this mode generates a real mountain
of messages :)

..
>  3542977888, 3543354240, 3543730592, 3544106944, 3544483296, 3544859648,
> 3545236000, 3545612352, 3545988704, 3546365056,
>  3546741408, 3547117760, 3547494112,
> 
> and it STOPS. Checking the debug output I can see that BIO delaying
> is working, because I get:
> 
> GEOM_VIRSTOR[1]: All physical space allocated for home
> GEOM_VIRSTOR[2]: Delaying BIO (size=65536) until free physical space can
> be found on virstor/home

Ok, this is because UFS creates cylinder groups across the whole drive,
and though they are basically small, each of them allocates an entire
4 MB chunk. Hence the problem.
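
If you want to watch this happening, one way (assuming the status verb
from your session below) is to poll the free-space figure from another
terminal while newfs runs; the physical free percentage should steadily
drop towards zero as the cylinder groups are written out:

# rough sketch; run in a second terminal while newfs is working
while sleep 1; do ./gvirstor status home; done

gstat will show the same thing from the I/O side.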

> If I run ./gvirstor add home ad4s1, things start to work again. But
> ad4s1 is way too small and is not enough. But at least with a 1G device
> I can create it:
> 
> /dev/virstor/home    946G    4.0K    870G     0%    /usr/home4

Nice.

> So my question is: how can I do the math to find out how much real
> space I will need to create a gvirstor device of size N?
> 
> # ./gvirstor status home
>         Name             Status  Components
> virstor/home  43% physical free  ad2s1

> Since it is a 40GB device, something close to 34GB was used to store
> the structure of a 1TB device. Is this usage related to the chunk size?

The idea here is that (chunk_size * number_of_cgs) is the smallest
physical space required for newfs to finish. I'll have to find out the
formula by which newfs calculates how many cylinder groups it wants to
create to give you a precise answer.
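
In the meantime, a rough sketch with made-up numbers: newfs -N prints
the parameters it would use, including the number of cylinder groups,
without actually creating the file system, so you can do the estimate
up front. If it were to report, say, about 5300 cylinder groups for the
1 TB virtual device, then with the default 4 MB chunks the minimum
physical space would be on the order of:

# hypothetical figure taken from a "newfs -N /dev/virstor/home" run
echo $((5300 * 4)) MB     # ~21 GB, plus virstor's own metadata

The real number can come out somewhat higher, for example when a
cylinder group straddles a chunk boundary and so touches two chunks.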

It would be very interesting if you tried a chunk size different from
the default 4 MB. For example, try using chunks of 512 KB (gvirstor
create -m 512 ...).
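
To spell it out (a sketch only: the "create" verb and -m flag are taken
from the command line above, while the -s flag and its unit for the
virtual size are my assumption here and may be spelled differently in
the tool):

# recreate the test device with 512 KB chunks and a 1 TB virtual size
# (-s assumed to take megabytes; check the tool's usage output)
./gvirstor create -s 1048576 -m 512 home ad2s1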

Also, if you're in the mood for it, benchmarking the two chunk sizes
against each other (both newfs time and bonnie++) would be interesting.
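
A minimal benchmarking sketch (the bonnie++ flags below are the common
-d/-s/-u ones; adjust the size to be larger than your RAM so the test
isn't served from cache):

# time the newfs run itself
time newfs -U /dev/virstor/home

# then mount and run bonnie++ against it; -s is the file size in MB
mount /dev/virstor/home /usr/home4
bonnie++ -d /usr/home4 -s 4096 -u root

Doing that once with 4 MB chunks and once with 512 KB chunks would give
a nice comparison.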

> however, if I export the gvirstor device, the other side (ggate
> client) can only import it if it is unmounted on the local machine
> (the one where gvirstor resides):
> 
> /dev/ggate0    946G    4.0K    870G     0%    /mnt
> 
> If I try to mount it, I get:
> 
> # mount /dev/virstor/home /usr/home4
> mount: /dev/virstor/home: Operation not permitted

This is a limitation of the UFS file system, not of GEOM and its
classes. You can search the archives for many lamentations about how
people are missing a real distributed & concurrent file system in
FreeBSD.

> That's bad fun :( I thought I could do more lego play. This seems like
> the same problem I had in the past, trying to export a mounted gmirror
> device.

Yes, it's the same problem.

> iostat -w1 ad0 ad2
> 
> I can see there is no performance difference between writes to the ad
> provider and writes to a gvirstor provider. I can also see that the
> disk usage is one provider at a time. I only get activity on ad0 when
> ad2 has used up its space. gstat shows me the same thing.

Yes, it will fill up the virstor device one drive at a time, in the
order in which they have been added. If you want multiple devices to be
used at the same time, you'll have to add a gstripe "lego brick" to the
setup :)
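
Something along these lines (a sketch; gstripe is in the base system,
and the gvirstor verb is assumed to be the same as in your session
above):

# stripe the two disks together first (64 KB stripe size)
gstripe label -v -s 65536 data ad2s1 ad4s1

# then build the virstor device on top of the striped provider
./gvirstor create home stripe/data

so that every chunk virstor allocates is itself spread across both
drives.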

> However, let me ask something. Is metadata information updated
> synchronously?

Yes, for the virstor: its metadata only needs to be updated when a new
physical chunk is allocated.

For example, let's assume a virstor device that has 5 chunks of 1 MB:

[ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ]

So, the above line represents 5 MB of virtual storage. Let's say an
application (and, for this argument, this includes the file system)
writes one byte at position "2 MB". Now the second chunk gets its
physical backing and virstor metadata is written to reflect this. The
new situation is:

[ 1 ] [.2.] [ 3 ] [ 4 ] [ 5 ]

(the dots represent a virtual chunk with physical backing). Now, when
the application writes a single byte at position "2 MB + 1", that byte
gets written to the same already allocated second chunk, so there's no
need to allocate another chunk, and there's no need to write virstor
metadata (AKA "the allocation table"). But when an application writes to
position "4 MB", another chunk gets physical backing:

[ 1 ] [.2.] [ 3 ] [.4.] [ 5 ]

etc., etc. The unallocated chunks remain as "holes" - reading from them
produces bytes out of thin air (AKA "the infinite supply of zeroes"),
and writing to them allocates chunks (if they aren't already allocated)
and writes to physical storage.
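
You can see this with a throwaway virstor device (hypothetical name
"test" below; don't do this on one that holds a file system, since it
writes straight to the device):

# write a single 512-byte sector at the 2 MB mark
# (4096 sectors * 512 bytes = 2 MB); this should allocate exactly one chunk
dd if=/dev/zero of=/dev/virstor/test bs=512 count=1 oseek=4096

# the physical-free figure drops by one chunk's worth
./gvirstor status test

A second write anywhere inside that same chunk won't change the figure,
since no new allocation (and no metadata write) is needed.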

> I ask it because removing /usr/home4/40G.bin (rm /usr/home4/40G.bin)
> takes about a minute and a half to finish (newfs was done with the -U
> flag).

Ok, now things get interesting. Let's see how the allocation goes
from the physical device side. Let's assume we have a drive that can
hold 2 MB. That is, two chunks of 1 MB:

{ 1 } { 2 }

When the first allocation in the above example happens, virtual chunk
[2] is mapped to physical chunk {1}, and now we have:

{.1.} { 2 }

When the second allocation happens, virtual chunk [4] is mapped to
physical chunk {2}:

{.1.} {.2.}

And the mapping table contains something like:

[1]->? [2]->{1} [3]->? [4]->{2}

The reason why newfs creates cylinder groups is speed. Cylinder groups
group "nearby" files, where "nearby" is determined by some heuristics,
including "belonging to the same directory".

One cylinder group is usually somewhere around 200 MB in size. This
means that it can hold 200 MB of files in a small area of the hard drive
platter, so that jumping from one file to the next involves very little
seeking. When using virstor, the space occupied by a single cylinder
group suddenly becomes scattered around the hard drive platter,
defeating the purpose of grouping and introducing many more seeks.
(This is a bit simplified, but correct in principle.)
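
You can check both numbers for yourself: dumpfs prints the superblock,
which includes the cylinder group count (ncg) and the group geometry,
so it's easy to see how many 4 MB chunks a single cg ends up touching:

# show the cylinder group count and related geometry of the new fs
dumpfs /dev/virstor/home | head -n 10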

There are three ways to "fix" this:

1. Use huge chunk sizes, like 200 MB (but cg size also cannot be
reliably calculated in advance, and huge chunk sizes will badly
influence the "savings" in storage a virstor device can provide).
2. Use a medium that doesn't have seek penalties, such as solid state
memory (flash drives).
3. Use gjournal, and put the journal on a non-virstor device. This way,
most writes (and unlink() calls involve lots of writes) will go to the
journal device first (gjournal is available only in 7-CURRENT; a rough
sketch follows below).
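
For option 3, a rough sketch (assuming 7-CURRENT with geom_journal
loaded; ad6s1 stands in for some spare non-virstor partition and is
purely hypothetical):

# attach a journal to the virstor device, keeping the journal
# itself on a plain partition
gjournal label /dev/virstor/home /dev/ad6s1

# newfs the journalled provider and mount it async
# (gjournal provides the consistency guarantees)
newfs -J /dev/virstor/home.journal
mount -o async /dev/virstor/home.journal /usr/home4

With that, the burst of metadata writes from an unlink() lands on the
journal partition first and only later gets flushed to the virstor
device.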



Received on Sun Apr 29 2007 - 20:55:18 UTC
