NCQ vs UFS/ZFS benchmark [Was: Re: FreeBSD 8.0 Performance (at Phoronix)]

From: Alexander Motin <mav_at_FreeBSD.org> Date: Thu, 03 Dec 2009 03:09:31 +0200

Ivan Voras wrote:
> If you have a drive to play with, could you also check UFS vs ZFS on
> both ATA & AHCI? To try and see if the IO scheduling of ZFS plays nicely.
> 
> For benchmarks I suggest blogbench and bonnie++ (in ports) and if you
> want to bother, randomio, http://arctic.org/~dean/randomio .

I have looked at randomio and found that it is also tuned to test a
physical drive, and that it does almost the same thing as raidtest. The main
difference is that raidtest uses pre-generated test patterns, so its
results are much more repeatable. What bonnie++ does is another
question; I prefer to trust results that I can explain.

So I have spent several hours quickly comparing UFS and ZFS in several
scenarios, using the ata(4) and ahci(4) drivers. This is not rigorous
research, but I have checked every number at least twice, and the unexpected
or deviating ones even more.

I pre-wrote a 20GB file on each empty file system and used raidtest to
generate a random mix of 10000 read/write requests of random size (512B -
128KB) to those files. Every single run took about a minute, and the total
transfer size per run was about 600MB. I used the same request
pattern in all tests.
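
A minimal sketch of what such a pre-generated pattern could look like, written
in Python rather than in raidtest's own format. The seed, the 50/50 read/write
split and the 512-byte granularity are assumptions made for illustration, and
make_pattern is a hypothetical helper, not part of raidtest. The point is only
that a fixed seed gives the same request list on every run, and that 10000
requests of 512B - 128KB add up to a total in the same range as the figure
quoted above.

```python
import random

SECTOR = 512                  # assumed request granularity
MIN_SIZE = 512                # smallest request, from the text
MAX_SIZE = 128 * 1024         # largest request, from the text
FILE_SIZE = 20 * 1024 ** 3    # 20GB test file, from the text


def make_pattern(count=10000, seed=1, write_ratio=0.5):
    """Return a list of (op, offset, length) tuples.

    seed and write_ratio are illustrative assumptions, not raidtest defaults.
    """
    rng = random.Random(seed)  # private RNG: the seed fully determines the list
    pattern = []
    for _ in range(count):
        # length is a whole number of sectors between MIN_SIZE and MAX_SIZE
        length = rng.randint(MIN_SIZE // SECTOR, MAX_SIZE // SECTOR) * SECTOR
        # sector-aligned offset that keeps the whole request inside the file
        offset = rng.randrange(0, (FILE_SIZE - length) // SECTOR + 1) * SECTOR
        op = "write" if rng.random() < write_ratio else "read"
        pattern.append((op, offset, length))
    return pattern


if __name__ == "__main__":
    pattern = make_pattern()
    total = sum(length for _, _, length in pattern)
    print(f"{len(pattern)} requests, {total / 1e6:.0f} MB total")
```

Because the list is generated once and then replayed, every driver and file
system combination sees exactly the same offsets and sizes.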

Test 1: raidtest with O_DIRECT flag (default) on UFS file system:
ata(4), 1 process               tps: 70
ata(4), 32 processes            tps: 71
ahci(4), 1 process              tps: 72
ahci(4), 32 processes           tps: 81

gstat showed that most of the time only one request at a time was running on
the disk. It looks like read or read-modify-write operations (due to many short
writes in the test pattern) are heavily serialized in UFS, even when several
processes are working with the same file. This almost eliminated the effect of
NCQ in this test.
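
For readers who want to reproduce the shape of this measurement, here is a
rough Python sketch of replaying one fixed request list against a single file
with 1 and with 32 concurrent workers and reporting requests per second. It is
not raidtest: it uses threads instead of processes, issues reads only, does not
set O_DIRECT, and runs against a small scratch file that will sit in the page
cache, so the numbers it prints say nothing about a disk. The replay function
and all sizes in it are assumptions for illustration.

```python
import os
import random
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def replay(path, requests, workers):
    """Issue every (offset, length) read against path; return requests/second."""
    fd = os.open(path, os.O_RDONLY)
    try:
        def do_read(req):
            offset, length = req
            # pread takes an explicit offset, so workers can share one descriptor
            return len(os.pread(fd, length, offset))

        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            done = sum(1 for _ in pool.map(do_read, requests))
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return done / elapsed


if __name__ == "__main__":
    # Small scratch file so the sketch runs anywhere; the real test used 20GB.
    size = 8 * 1024 * 1024
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(os.urandom(size))
        path = f.name
    rng = random.Random(1)  # fixed seed: same request list for every run
    requests = []
    for _ in range(1000):
        length = rng.randint(1, 256) * 512
        requests.append((rng.randrange(0, size - length), length))
    try:
        for workers in (1, 32):
            print(f"{workers} workers: {replay(path, requests, workers):.0f} tps")
    finally:
        os.unlink(path)
```

If the file system serializes accesses to one file, the 32-worker figure stays
close to the 1-worker figure no matter how many requests the drive could queue.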

Test 2: Same as before, but without O_DIRECT flag:
ata(4), 1 process, first        tps: 78
ata(4), 1 process, second       tps: 469
ata(4), 32 processes, first     tps: 83
ata(4), 32 processes, second    tps: 475
ahci(4), 1 process, first       tps: 79
ahci(4), 1 process, second      tps: 476
ahci(4), 32 processes, first    tps: 93
ahci(4), 32 processes, second   tps: 488

Without the O_DIRECT flag, UFS was able to fit all accessed data into the
buffer cache by the second run. The second run uses the buffer cache for all
reads and writes are not serialized, but the NCQ effect is minimal in this
situation. The first run is still mostly serialized.

Test 3: Same as 2, but with ZFS (i386 without tuning):
ata(4), 1 process, first        tps: 75
ata(4), 1 process, second       tps: 73
ata(4), 32 processes, first     tps: 98
ata(4), 32 processes, second    tps: 97
ahci(4), 1 process, first       tps: 77
ahci(4), 1 process, second      tps: 80
ahci(4), 32 processes, first    tps: 139
ahci(4), 32 processes, second   tps: 142

The data doesn't fit into the cache. Multiple parallel requests help somewhat
even with the legacy driver, but with NCQ enabled they help much more, almost
doubling performance!

Test 4: Same as 3, but with kmem_size=1900M and arc_max=1700M:
ata(4), 1 process, first        tps: 90
ata(4), 1 process, second       tps: ~160-300
ata(4), 32 processes, first     tps: 112
ata(4), 32 processes, second    tps: ~190-322
ahci(4), 1 process, first       tps: 90
ahci(4), 1 process, second      tps: ~140-300
ahci(4), 32 processes, first    tps: 180
ahci(4), 32 processes, second   tps: ~280-550

Data is slightly cached on the first run and heavily cached on the second. But
even this amount of memory (the maximum I can dedicate on my i386) is not
enough to cache all the data. The second run gives a different device access
pattern each time and highly variable results.

Test 5: Same as 3, but with 2 disks:
ata(4), 1 process, first        tps: 80
ata(4), 1 process, second       tps: 79
ata(4), 32 processes, first     tps: 186
ata(4), 32 processes, second    tps: 181
ahci(4), 1 process, first       tps: 79
ahci(4), 1 process, second      tps: 110
ahci(4), 32 processes, first    tps: 287
ahci(4), 32 processes, second   tps: 290

The data doesn't fit into the cache. The second disk gives almost no
improvement for serialized requests. Multiple parallel requests double the
speed even with the legacy driver, because requests are spread between the
drives. Adding NCQ support raises the speed significantly further.

In conclusion:
- In this particular test ZFS scaled well with parallel requests,
effectively using multiple disks. NCQ showed great benefits. But i386
constraints significantly limited ZFS caching abilities.
- UFS behaved very poorly in this test. Even with a parallel workload it
often serializes device accesses. Maybe the results would be different if
there were a separate file for each process, or with some other
options, but I think the pattern I used is also possible in some
applications. The only benefit UFS showed here is more effective memory
management on i386, leading to higher cache effectiveness.
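
The scaling described in these conclusions can be read off the tables with a
little arithmetic. The sketch below copies the uncached figures (Test 1, and
the first runs of Tests 3 and 5) and computes three ratios per test: the gain
from 32 processes under ata(4), the same gain under ahci(4), and the gain from
NCQ at 32 processes. The dictionary layout and the gains helper are
illustrative choices; only the tps numbers come from the measurements.

```python
# tps figures copied from the tables: (driver, processes) -> tps
RESULTS = {
    "Test 1: UFS, O_DIRECT": {
        ("ata", 1): 70, ("ata", 32): 71, ("ahci", 1): 72, ("ahci", 32): 81,
    },
    "Test 3: ZFS, 1 disk, first run": {
        ("ata", 1): 75, ("ata", 32): 98, ("ahci", 1): 77, ("ahci", 32): 139,
    },
    "Test 5: ZFS, 2 disks, first run": {
        ("ata", 1): 80, ("ata", 32): 186, ("ahci", 1): 79, ("ahci", 32): 287,
    },
}


def gains(row):
    """Return the three speedup ratios for one table of results."""
    return {
        # effect of parallel requests without NCQ
        "parallel_ata": row[("ata", 32)] / row[("ata", 1)],
        # effect of parallel requests with NCQ available
        "parallel_ahci": row[("ahci", 32)] / row[("ahci", 1)],
        # effect of NCQ itself under the parallel load
        "ncq_at_32": row[("ahci", 32)] / row[("ata", 32)],
    }


if __name__ == "__main__":
    for name, row in RESULTS.items():
        g = gains(row)
        print(f"{name}: "
              f"ata x{g['parallel_ata']:.2f}, "
              f"ahci x{g['parallel_ahci']:.2f}, "
              f"NCQ x{g['ncq_at_32']:.2f}")
```

For UFS all three ratios stay close to 1, while for ZFS the ahci(4) ratio is
about 1.8 with one disk and above 3.5 with two.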

It would be nice if somebody could explain this UFS behavior.

-- 
Alexander Motin