Re: NCQ vs UFS/ZFS benchmark [Was: Re: FreeBSD 8.0 Performance (at Phoronix)]

From: Ivan Voras <ivoras_at_freebsd.org>
Date: Thu, 03 Dec 2009 10:00:25 +0100
Alexander Motin wrote:
> Ivan Voras wrote:
>> If you have a drive to play with, could you also check UFS vs ZFS on
>> both ATA & AHCI? To try and see if the IO scheduling of ZFS plays nicely.
>>
>> For benchmarks I suggest blogbench and bonnie++ (in ports) and if you
>> want to bother, randomio, http://arctic.org/~dean/randomio .


> gstat showed that most of the time only one request at a time was running
> on the disk. It looks like read or read-modify-write operations (due to the
> many short writes in the test pattern) are heavily serialized in UFS, even
> when several processes work with the same file. This almost eliminated the
> effect of NCQ in this test.
> 
> Test 2: Same as before, but without the O_DIRECT flag:
> ata(4), 1 process, first        tps: 78
> ata(4), 1 process, second       tps: 469
> ata(4), 32 processes, first     tps: 83
> ata(4), 32 processes, second    tps: 475
> ahci(4), 1 process, first       tps: 79
> ahci(4), 1 process, second      tps: 476
> ahci(4), 32 processes, first    tps: 93
> ahci(4), 32 processes, second   tps: 488

Ok, so this is UFS, normal caching.
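
I read that pattern as roughly the following (just my sketch of what
such a test could look like: small random reads plus ~10% short writes
from 32 processes on one big file, optionally with O_DIRECT; the
constants and names are made up, this is not your actual test program):

#include <sys/types.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NPROC 32        /* parallel workers */
#define BLKSZ 512       /* short transfers, as in the test pattern */
#define NOPS  100000    /* operations per worker */

int
main(int argc, char **argv)
{
        const char *path = (argc > 1) ? argv[1] : "testfile";
        int i;

        for (i = 0; i < NPROC; i++) {
                if (fork() == 0) {
                        /* child: random short reads + occasional short writes */
                        int fd = open(path, O_RDWR /* | O_DIRECT */);
                        if (fd < 0) {
                                perror("open");
                                _exit(1);
                        }
                        off_t nblk = lseek(fd, 0, SEEK_END) / BLKSZ;
                        char *buf = calloc(1, BLKSZ);
                        srandom(getpid());
                        for (long n = 0; n < NOPS; n++) {
                                off_t off = (random() % nblk) * BLKSZ;
                                if (n % 10 == 0)
                                        pwrite(fd, buf, BLKSZ, off);
                                else
                                        pread(fd, buf, BLKSZ, off);
                        }
                        _exit(0);
                }
        }
        while (wait(NULL) > 0)  /* parent: wait for the workers */
                ;
        return (0);
}

If that is close, then with O_DIRECT each of those short writes becomes
a synchronous read-modify-write at the file system block level, which
would fit the serialization you saw.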

> The data doesn't fit into the cache. Multiple parallel requests give some
> effect even with the legacy driver, but with NCQ enabled the gain is much
> bigger, almost doubling performance!

You've seen queueing in gstat for ZFS+NCQ?
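
(The easiest way to check that I know of is to watch the L(q) column in
gstat while the test runs, something like this, assuming the disk shows
up as ada0 under ahci(4):

gstat -f 'ada0'

A queue length that stays above 1 would mean NCQ is actually being fed
multiple requests.)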

> Test 4: Same as 3, but with kmem_size=1900M and arc_max=1700M.
> ata(4), 1 process, first        tps: 90
> ata(4), 1 process, second       tps: ~160-300
> ata(4), 32 processes, first     tps: 112
> ata(4), 32 processes, second    tps: ~190-322
> ahci(4), 1 process, first       tps: 90
> ahci(4), 1 process, second      tps: ~140-300
> ahci(4), 32 processes, first    tps: 180
> ahci(4), 32 processes, second   tps: ~280-550

And this is ZFS with some tuning. I've also seen high variation in 
performance on ZFS, so this seems normal.
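
For anyone wanting to repeat that tuning, I assume it was done via the
usual loader tunables, i.e. something like this in /boot/loader.conf
(the exact names are my guess, adjust to whatever was actually set):

vm.kmem_size="1900M"
vfs.zfs.arc_max="1700M"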

> In conclusion:
> - in this particular test ZFS scaled well with parallel requests,
> effectively using multiple disks. NCQ showed great benefits. But i386
> constraints significantly limited ZFS's caching abilities.
> - UFS behaved very poorly in this test. Even with a parallel workload it
> often serializes device accesses. Maybe the results would be different if

I wouldn't say UFS behaves poorly, judging by your results. It looks like 
only the multiprocess case is bad on UFS. For single-process access the 
difference in favour of ZFS is ~10 TPS in the first case, and UFS is 
apparently much better in all cases but the last on the second run. This 
may be explained if there is a large variation between runs.

Also, did you use the whole drive for the file system? In cases like 
this it would be interesting to create a special partition (in all 
cases, on all drives) covering only a small segment of the disk 
(thinking of the drive as rotational media, made of cylinders). For 
example, a 30 GB partition covering only the outer tracks.
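
Something along these lines, assuming GPT and a disk that shows up as
ada0 under ahci(4) (device names and sizes here are only illustrative):

gpart create -s gpt ada0
gpart add -t freebsd-ufs -s 30G ada0    # first 30 GB = outer tracks
newfs -U /dev/ada0p1

and, for the ZFS run, a freebsd-zfs partition of the same size in the
same place instead, with a pool created on top of it. That way both
file systems would see the same (and fastest) part of the platter.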

> there were a separate file for each process, or with some other
> options, but I think the pattern I used is also possible in some
> applications. The only benefit UFS showed here is more effective memory
> management on i386, leading to higher cache effectiveness.
> 
> It would be nice if somebody explained that UFS behavior.

Possibly read-only access to the in-memory cache structures is protected 
by shared (read) locks, which are cheap, while the ARC is more complicated 
than it's worth here? But others probably have better guesses :)
