File system block alignment

From: Alexander Motin <mav_at_FreeBSD.org>
Date: Fri, 25 Dec 2009 12:58:07 +0200
Hi.

Recently WD released the first series of ATA disks with an increased (4K)
physical sector size. On those disks, writes that are not aligned to the 4K
blocks become inefficient. So I propose to get back to the question of
optimal FS block alignment. This topic is also important for most RAIDs of a
striped nature, such as RAID0/3/5/..., and for flash drives with simple
controllers (such as MMC/SD cards).
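
To illustrate (just a back-of-the-envelope check, assuming 512-byte logical
sectors on a 4K-physical-sector disk): a write starting at the traditional
LBA 63 is not 4K-aligned, so the drive has to read-modify-write the partial
physical sectors at the edges of the request:

%expr 63 \* 512 % 4096
3584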

As I don't have one of those WD disks yet, I have made a series of tests
with a RAID0 built with geom_stripe, to check the general idea. I've tested
the most illustrative case: a 2-disk RAID0 with a 16K stripe, a 16K FS block
and many random 16K I/Os (reads in this test, to avoid FS locking). I have
seen the same load pattern, but with writes, on my busy disk-bound MySQL
servers, so it is quite realistic.

Test one, default partitioning.

%gstripe label -s 16384 data /dev/ada1 /dev/ada2
%fdisk -I /dev/stripe/data
%disklabel -w /dev/stripe/datas1
%disklabel /dev/stripe/datas1

# /dev/stripe/datas1:
8 partitions:
#        size   offset    fstype   [fsize bsize bps/cpg]
  a: 1250274611       16    unused        0     0
  c: 1250274627        0    unused        0     0         # "raw" part, don't edit
%diskinfo -v /dev/stripe/datas1a
/dev/stripe/datas1a
        512             # sectorsize
        640140600832    # mediasize in bytes (596G)
        1250274611      # mediasize in sectors
        16384           # stripesize
        7680            # stripeoffset
        77825           # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.

As you can see, fdisk aligned the slice to the "track length" of 63
sectors, and disklabel added an offset of 16 sectors. As a result, the file
system starts at quite an odd place within the RAID stripe.
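
That odd place is exactly the reported stripeoffset of 7680 bytes (a quick
check, assuming the slice starts at the fdisk default of sector 63):

%expr \( 63 + 16 \) \* 512 % 16384
7680

So every 16K FS block straddles a 16K stripe boundary.
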
I've created a UFS file system, pre-wrote a 4GB file and ran the tests
(raidtest was patched to generate only 16K requests):
%raidtest test -d /mnt/qqq -n 1
Requests per second: 112
%raidtest test -d /mnt/qqq -n 64
Requests per second: 314
Before each test the FS was unmounted to flush caches.

Test two, FS manually aligned with disklabel.
%disklabel /dev/stripe/datas1
# /dev/stripe/datas1:
8 partitions:
#        size   offset    fstype   [fsize bsize bps/cpg]
  a: 1250274578       33    unused        0     0
  c: 1250274627        0    unused        0     0         # "raw" part, don't edit
%diskinfo -v /dev/stripe/datas1a
/dev/stripe/datas1a
        512             # sectorsize
        640140583936    # mediasize in bytes (596G)
        1250274578      # mediasize in sectors
        16384           # stripesize
        0               # stripeoffset
        77825           # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
The file system is now aligned with the stripe.
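
For reference (assuming the slice still starts at the fdisk default of
sector 63), the label offset of 33 sectors leaves more than the default 16
sectors for the label at the start of the slice and lands the partition
exactly on a 16K (32-sector) boundary:

%expr \( 63 + 33 \) \* 512 % 16384
0
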
%raidtest test -d /mnt/qqq -n 1
Requests per second: 133
%raidtest test -d /mnt/qqq -n 64
Requests per second: 594


The difference is quite significant. An unaligned RAID0 access gets two
disks involved in handling it, while an aligned one leaves the other disk
free for another request, almost doubling performance under concurrent load.
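
A quick check with the stripeoffset values reported above: an aligned 16K
request stays within a single 16K stripe unit, while one starting 7680 bytes
into a stripe spills over into the next unit, i.e. onto the other disk:

%expr 7680 / 16384
0
%expr \( 7680 + 16383 \) / 16384
1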

As we now have a mechanism for reporting the stripe size and offset of any
partition to user level, it should be easy to make the disk partitioning and
file system creation tools use it automatically.
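
Until the tools do it themselves, a script can fetch the values; just a
sketch, parsing the diskinfo -v output shown above:

%diskinfo -v /dev/stripe/datas1a | awk '$3 == "stripesize" {print $1}'
16384
%diskinfo -v /dev/stripe/datas1a | awk '$3 == "stripeoffset" {print $1}'
0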

Stripe size/offset reporting is now supported by the ada and mmcsd disk
drivers and by most GEOM modules. It would be nice to fetch that information
from hardware RAIDs as well, where possible.

-- 
Alexander Motin