Re: Reserved space (WAS: How to calculate bsdlabel size)

From: Bruce Evans <bde_at_zeta.org.au>
Date: Mon, 9 Feb 2004 04:17:40 +1100 (EST)
On Fri, 6 Feb 2004, Garrett Wollman wrote:

> <<On Fri, 6 Feb 2004 09:44:58 +0100 (CET), Claus Guttesen <cguttesen_at_yahoo.dk> said:
>
> > Does the algorithm(s) rely only on percentage of free
> > space? On a five TB (net) filesystem eight percent
> > is approx. 410 GB which seems quite a lot.
>
> A good Data Structures text will prove to you that the efficiency of
> hashing algorithms of the sort that the UFS block allocator uses
> depends only on the occupancy ratio and not on the absolute number of
> free hash slots.

This seems obvious.  The allocation is at best close to being random,
at least for a nearly full disk, so when the disk is 92% full the
chance of your preferred block being free is only 8% and the chance
of the next block being free is not much better, and so on; this is independent
of the disk size because the allocator doesn't really know where free
blocks are -- it has to guess and depend on their distribution being
random enough for it to find one soon.
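
To make the size-independence concrete, here is a quick simulation
sketch (not the ffs allocator itself): it marks blocks in use uniformly
at random and probes forward from a random "preferred" block, which is
exactly the simplifying assumption above.

/*
 * Minimal sketch (not ffs code): estimate the average number of probes
 * needed to find a free block when roughly pctfull% of the disk is in
 * use and the used blocks are scattered uniformly at random.  Running
 * it for several disk sizes shows that only the fullness matters.
 */
#include <stdio.h>
#include <stdlib.h>

static double
avg_probes(size_t nblocks, int pctfull, int trials)
{
	char *used;
	size_t i, b;
	long probes;
	int t;

	used = malloc(nblocks);
	for (i = 0; i < nblocks; i++)
		used[i] = (random() % 100) < pctfull;	/* ~pctfull% in use */
	probes = 0;
	for (t = 0; t < trials; t++) {
		b = random() % nblocks;			/* "preferred" block */
		do
			probes++;
		while (used[b++ % nblocks]);		/* scan forward */
	}
	free(used);
	return ((double)probes / trials);
}

int
main(void)
{
	size_t n;

	for (n = 1000000; n <= 100000000; n *= 10)
		printf("%9zu blocks, 92%% full: %.2f probes on average\n",
		    n, avg_probes(n, 92, 100000));
	return (0);
}

Run at 92% full it should print about 12.5 probes for every disk size,
i.e., only the fullness matters, not the number of free blocks.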

OTOH, I think the efficiency of the hashing algorithm is now almost
irrelevant.  The average number of probes to find a free block, with
minfree = m (in percent), the disk full except for the minfree reserve,
and the free blocks randomly distributed, is 100 / (100 - m).  For
fixed m, this is independent of the disk size, so allocation is now
very fast relative to i/o speeds compared with when ffs was new, since
CPU speeds have increased much more than i/o speeds.  What matters
relatively more now is the average seek between the blocks.  This is
given by the above number less 1, i.e., m / (100 - m).  The -1 term is
from counting seeks to the next block as null.  A table giving some of
these numbers:

  m   avsearch avseek
  -   -------- ------
  0:    1.00    0.00
 10:    1.11    0.11
 20:    1.25    0.25
 30:    1.43    0.43
 40:    1.67    0.67
 50:    2.00    1.00
 60:    2.50    1.50
 70:    3.33    2.33
 80:    5.00    4.00
 90:   10.00    9.00
 99:  100.00   99.00
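
The table follows directly from the two expressions above; a trivial
sketch that regenerates it:

/*
 * Regenerate the table above: expected probes = 100 / (100 - m) and
 * expected extra seek (in blocks) = m / (100 - m) for minfree = m
 * percent, with the disk full up to its reserve and the free blocks
 * randomly distributed.
 */
#include <stdio.h>

int
main(void)
{
	static const int m[] = { 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 99 };
	size_t i;

	printf("  m   avsearch avseek\n");
	printf("  -   -------- ------\n");
	for (i = 0; i < sizeof(m) / sizeof(m[0]); i++)
		printf("%3d: %9.2f %7.2f\n", m[i],
		    100.0 / (100 - m[i]), m[i] / (100.0 - m[i]));
	return (0);
}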

This shows that a disk that is 90% full is very likely to be about 10
times slower than a disk that is 0% full.  Blocks will be spaced 10
apart on average even if we don't seek about a lot from file to file,
so we will have to wait 10 times longer to read the blocks unless we
are lucky enough for something to have cached the in-between blocks
_and_ to backtrack soon enough for the blocks to still be in the cache
when we want them (writes don't require so much luck to cache but might
not be cached at all).  But backtracking is often the wrong thing to
do, is not what the allocator prefers to do, and is probably not what
drive caches do.

The table also shows that disks much more than 50% full may be quite
slow.  All real disks that I've tested are even slower than the table
indicates (mainly due to large seeks to other cylinder groups, and to
large backwards seeks, I believe).  E.g., my FreeBSD src tree can be
written and read at almost half the maximum disk speed (20MB/s out of
40MB/s) when copied to a new partition, but reading it from its usual
place goes at 3-5% of the maximum disk speed (1 or 2 MB/s), despite my
having copied my usr partition back and forth to defragment it less
than a year ago.  Some analysis of seek offsets:

source tree (src dir on fragmented usr partition; files in find(1) order):
---
avgblksize = 7388.1, totnblk = 405082, avgseek = 11989.1, totseek = 673126241
fragmentation = 12.0%, or 58.8% (discounting full blocks)
ncontig = 7620, nforward = 28375, nreverse = 20149
nforward1-1 = 509, nreverse1-1 = 0
nforward2-3 = 604, nreverse2-3 = 134
nforward4-7 = 840, nreverse4-7 = 163
nforward8-15 = 891, nreverse8-15 = 412
nforward16-31 = 1003, nreverse16-31 = 456
nforward32-63 = 1127, nreverse32-63 = 492
nforward64-127 = 1324, nreverse64-127 = 664
nforward128-255 = 1571, nreverse128-255 = 798
nforward256-511 = 1840, nreverse256-511 = 1097
nforward512-1023 = 2439, nreverse512-1023 = 1573
nforward1024-2047 = 3259, nreverse1024-2047 = 2347
nforward2048-4095 = 4089, nreverse2048-4095 = 3305
nforward4096-8191 = 4147, nreverse4096-8191 = 3851
nforward8192-16383 = 2525, nreverse8192-16383 = 2620
nforward16384-32767 = 1055, nreverse16384-32767 = 1164
nforward32768-65535 = 402, nreverse32768-65535 = 371
nforward65536-131071 = 210, nreverse65536-131071 = 185
nforward131072-262143 = 264, nreverse131072-262143 = 232
nforward262144-524287 = 175, nreverse262144-524287 = 161
nforward524288-1048575 = 75, nreverse524288-1048575 = 85
nforward1048576-2097151 = 26, nreverse1048576-2097151 = 39

destination tree (src dir on new partition; files in find(1) order):
---
avgblksize = 7917.1, totnblk = 405064, avgseek = 2137.0, totseek = 111959698
fragmentation = 7.4%, or 36.4% (discounting full blocks)
ncontig = 22344, nforward = 20423, nreverse = 9623
nforward1-1 = 2249, nreverse1-1 = 0
nforward2-3 = 4343, nreverse2-3 = 41
nforward4-7 = 3457, nreverse4-7 = 325
nforward8-15 = 2116, nreverse8-15 = 1777
nforward16-31 = 2109, nreverse16-31 = 1828
nforward32-63 = 1872, nreverse32-63 = 1646
nforward64-127 = 1491, nreverse64-127 = 1350
nforward128-255 = 1005, nreverse128-255 = 965
nforward256-511 = 707, nreverse256-511 = 662
nforward512-1023 = 467, nreverse512-1023 = 441
nforward1024-2047 = 291, nreverse1024-2047 = 309
nforward2048-4095 = 66, nreverse2048-4095 = 91
nforward4096-8191 = 20, nreverse4096-8191 = 19
nforward8192-16383 = 3, nreverse8192-16383 = 2
nforward16384-32767 = 7, nreverse16384-32767 = 5
nforward32768-65535 = 95, nreverse32768-65535 = 58
nforward65536-131071 = 21, nreverse65536-131071 = 12
nforward131072-262143 = 42, nreverse131072-262143 = 28
nforward262144-524287 = 41, nreverse262144-524287 = 37
nforward524288-1048575 = 14, nreverse524288-1048575 = 18
nforward1048576-2097151 = 6, nreverse1048576-2097151 = 9
nforward8388608-16777215 = 1, nreverse8388608-16777215 = 0

Rawer data for this starts with:
---
fs_bsize = 8192
fs_fsize = 1024
24320:	lbn 0 blkno 196616
24321:	lbn 0 blkno 196617
24322:	lbn 0 blkno 196618
24323:	lbn 0 blkno 196619
26252:	lbn 0 blkno 215775
24325:	lbn 0 blkno 196621
24326:	lbn 0 blkno 196622
24327:	lbn 0 blkno 196623
24328:	lbn 0 blkno 197400
24329:	lbn 0 blkno 197401
25529:	lbn 0 blkno 202399
24331:	lbn 0 blkno 197403
24332:	lbn 0 blkno 197404
24333:	lbn 0 blkno 197405
24334:	lbn 0 blkno 197406
25930:	lbn 0 blkno 205972
24588:	lbn 0 blkno 199776-199783
	lbn 1 blkno 202285

This is for the fragmented partition.  The numbers in the first column are
inode numbers in find(1) order and the meaning of the blkno's should be
obvious.  The first few files are tiny and consist of only 1 frag (#24320
is ".").  Then there is a 9K file with 8 necessarily contiguous frags forming
a block and another 1K in a frag a long seek away.  Seeks are counted
whenever a blkno doesn't immediately follow the previous one.
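
The classification described above can be sketched roughly as follows
(this is not the script that produced the output, and the distance
metric -- the size of the gap between consecutive blknos -- is a guess
read off from the nforward1-1/nreverse1-1 buckets):

/*
 * Rough sketch of the seek classification: successive blknos in
 * find(1) order are compared; a block that immediately follows the
 * previous one is contiguous, otherwise the size of the forward or
 * backward gap is bucketed into the power-of-2 ranges shown in the
 * tables (nforward1-1, nforward2-3, ...).
 */
#include <stdio.h>
#include <stdlib.h>

#define	NBUCKETS	32

static long ncontig, nforward[NBUCKETS], nreverse[NBUCKETS];

static void
classify(long prev_blkno, long blkno)
{
	long dist, gap;
	int b;

	gap = blkno - prev_blkno - 1;	/* 0 => immediately follows */
	if (gap == 0) {
		ncontig++;
		return;
	}
	dist = labs(gap);
	for (b = 0; (1L << (b + 1)) <= dist; b++)
		;			/* bucket b covers 2^b .. 2^(b+1)-1 */
	if (gap > 0)
		nforward[b]++;
	else
		nreverse[b]++;
}

int
main(void)
{
	/* The first few blknos from the raw dump above, in find(1) order. */
	static const long blkno[] = { 196616, 196617, 196618, 196619,
	    215775, 196621, 196622, 196623, 197400, 197401 };
	size_t i;
	int b;

	for (i = 1; i < sizeof(blkno) / sizeof(blkno[0]); i++)
		classify(blkno[i - 1], blkno[i]);
	printf("ncontig = %ld\n", ncontig);
	for (b = 0; b < NBUCKETS; b++)
		if (nforward[b] != 0 || nreverse[b] != 0)
			printf("nforward%ld-%ld = %ld, nreverse%ld-%ld = %ld\n",
			    1L << b, (1L << (b + 1)) - 1, nforward[b],
			    1L << b, (1L << (b + 1)) - 1, nreverse[b]);
	return (0);
}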

The above shows that the fragmented case has to seek about 6 times as far
to read the whole src tree in find(1) order.  Actual i/o times for
"tar cf /dev/zero .":

IBM 30GB DTLA, fragmented, 70% full:  199   seconds
WD 1200JB, new, 3% full:               22.5 seconds

Most applications don't touch so many files, and then I believe that
the read times don't matter much (the working set stays cached), but
large seeks for writes are maximally harmful (there won't be enough
writes to fill in the gaps even if there is lots of write caching, so
physical seeks will be large).

Bruce