On Fri, 6 Feb 2004, Garrett Wollman wrote:

> <<On Fri, 6 Feb 2004 09:44:58 +0100 (CET), Claus Guttesen <cguttesen_at_yahoo.dk> said:
>
> > Does the algorithm(s) rely only on percentage of free space?  On a
> > five TB (netto) filesystem eight percent is approx. 410 GB which
> > seems quite a lot.
>
> A good Data Structures text will prove to you that the efficiency of
> hashing algorithms of the sort that the UFS block allocator uses
> depends only on the occupancy ratio and not on the absolute number of
> free hash slots.

This seems obvious.  The allocation is at best close to being random, at
least for a nearly full disk, so when the disk is 92% full the chance of
your preferred block being free is only 8%, and the chance of the next
block being free is not much better.  This is independent of the disk
size because the allocator doesn't really know where free blocks are --
it has to guess and depend on their distribution being random enough for
it to find one soon.

OTOH, I think the efficiency of the hashing algorithm is now almost
irrelevant.  The average number of probes to find a free block, with
minfree = m (in percent), the disk full (except for the minfree reserve)
and the free blocks randomly allocated, is 100 / (100 - m).  For fixed m
this is independent of the disk size, so allocation is now very fast
relative to i/o speeds compared with when ffs was new, since CPU speeds
have grown much faster than disk speeds.  What matters relatively more
now is the average seek between the blocks.  This is given by the above
number less 1, i.e., m / (100 - m).  The -1 term is from counting seeks
to the next block as null.

Table giving some of these numbers:

 m  avsearch  avseek
--  --------  ------
 0:     1.00    0.00
10:     1.11    0.11
20:     1.25    0.25
30:     1.43    0.43
40:     1.67    0.67
50:     2.00    1.00
60:     2.50    1.50
70:     3.33    2.33
80:     5.00    4.00
90:    10.00    9.00
99:   100.00   99.00

This shows that a disk that is 90% full is very likely to be 10 times
slower than a disk that is 0% full.  Blocks will be spaced 10 apart on
average even if we don't seek about a lot from file to file, so we will
have to wait 10 times longer to read the blocks unless we are lucky
enough for something to have cached the in-between blocks _and_ to
backtrack soon enough for the blocks to still be in the cache when we
want them (writes don't require so much luck to cache, but might not be
cached at all).  But backtracking is often the wrong thing to do, not
what the allocator prefers to do, and probably not what drive caches do.
The table also shows that disks much more than 50% full may be quite
slow.

All real disks that I've tested are even slower than the table indicates
(mainly due to large seeks to other cylinder groups, and large backwards
seeks, I believe).  E.g., my FreeBSD src tree can be written and read at
almost half the maximum disk speed (20MB/s out of 40MB/s) when copied to
a new partition, but reading it from its usual place goes at 3-5% of the
maximum disk speed (1 or 2 MB/s), despite me having copied my usr
partition back and forth to defragment it less than a year ago.
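The avsearch and avseek columns follow directly from the two formulas
above; here is a minimal sketch that just recomputes the table
(illustration only, not code from the original mail -- there is nothing
UFS-specific in it, only the arithmetic):

/*
 * Sketch only: reproduce the avsearch/avseek table from the formulas
 * above, avsearch = 100 / (100 - m) and avseek = m / (100 - m).
 */
#include <stdio.h>

int
main(void)
{
	static const int m[] = { 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 99 };
	double avsearch, avseek;
	size_t i;

	printf(" m  avsearch  avseek\n--  --------  ------\n");
	for (i = 0; i < sizeof(m) / sizeof(m[0]); i++) {
		avsearch = 100.0 / (100 - m[i]);
		avseek = (double)m[i] / (100 - m[i]);
		printf("%2d: %8.2f  %6.2f\n", m[i], avsearch, avseek);
	}
	return (0);
}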
Some analysis of seek offsets:

source tree (src dir on fragmented usr partition; files in find(1) order):
---
avgblksize = 7388.1, totnblk = 405082, avgseek = 11989.1, totseek = 673126241
fragmentation = 12.0%, or 58.8% (discounting full blocks)
ncontig = 7620, nforward = 28375, nreverse = 20149
nforward1-1 = 509, nreverse1-1 = 0
nforward2-3 = 604, nreverse2-3 = 134
nforward4-7 = 840, nreverse4-7 = 163
nforward8-15 = 891, nreverse8-15 = 412
nforward16-31 = 1003, nreverse16-31 = 456
nforward32-63 = 1127, nreverse32-63 = 492
nforward64-127 = 1324, nreverse64-127 = 664
nforward128-255 = 1571, nreverse128-255 = 798
nforward256-511 = 1840, nreverse256-511 = 1097
nforward512-1023 = 2439, nreverse512-1023 = 1573
nforward1024-2047 = 3259, nreverse1024-2047 = 2347
nforward2048-4095 = 4089, nreverse2048-4095 = 3305
nforward4096-8191 = 4147, nreverse4096-8191 = 3851
nforward8192-16383 = 2525, nreverse8192-16383 = 2620
nforward16384-32767 = 1055, nreverse16384-32767 = 1164
nforward32768-65535 = 402, nreverse32768-65535 = 371
nforward65536-131071 = 210, nreverse65536-131071 = 185
nforward131072-262143 = 264, nreverse131072-262143 = 232
nforward262144-524287 = 175, nreverse262144-524287 = 161
nforward524288-1048575 = 75, nreverse524288-1048575 = 85
nforward1048576-2097151 = 26, nreverse1048576-2097151 = 39

destination tree (src dir on new partition; files in find(1) order):
---
avgblksize = 7917.1, totnblk = 405064, avgseek = 2137.0, totseek = 111959698
fragmentation = 7.4%, or 36.4% (discounting full blocks)
ncontig = 22344, nforward = 20423, nreverse = 9623
nforward1-1 = 2249, nreverse1-1 = 0
nforward2-3 = 4343, nreverse2-3 = 41
nforward4-7 = 3457, nreverse4-7 = 325
nforward8-15 = 2116, nreverse8-15 = 1777
nforward16-31 = 2109, nreverse16-31 = 1828
nforward32-63 = 1872, nreverse32-63 = 1646
nforward64-127 = 1491, nreverse64-127 = 1350
nforward128-255 = 1005, nreverse128-255 = 965
nforward256-511 = 707, nreverse256-511 = 662
nforward512-1023 = 467, nreverse512-1023 = 441
nforward1024-2047 = 291, nreverse1024-2047 = 309
nforward2048-4095 = 66, nreverse2048-4095 = 91
nforward4096-8191 = 20, nreverse4096-8191 = 19
nforward8192-16383 = 3, nreverse8192-16383 = 2
nforward16384-32767 = 7, nreverse16384-32767 = 5
nforward32768-65535 = 95, nreverse32768-65535 = 58
nforward65536-131071 = 21, nreverse65536-131071 = 12
nforward131072-262143 = 42, nreverse131072-262143 = 28
nforward262144-524287 = 41, nreverse262144-524287 = 37
nforward524288-1048575 = 14, nreverse524288-1048575 = 18
nforward1048576-2097151 = 6, nreverse1048576-2097151 = 9
nforward8388608-16777215 = 1, nreverse8388608-16777215 = 0

Rawer data for this starts with:
---
fs_bsize = 8192
fs_fsize = 1024
24320: lbn 0 blkno 196616
24321: lbn 0 blkno 196617
24322: lbn 0 blkno 196618
24323: lbn 0 blkno 196619
26252: lbn 0 blkno 215775
24325: lbn 0 blkno 196621
24326: lbn 0 blkno 196622
24327: lbn 0 blkno 196623
24328: lbn 0 blkno 197400
24329: lbn 0 blkno 197401
25529: lbn 0 blkno 202399
24331: lbn 0 blkno 197403
24332: lbn 0 blkno 197404
24333: lbn 0 blkno 197405
24334: lbn 0 blkno 197406
25930: lbn 0 blkno 205972
24588: lbn 0 blkno 199776-199783
       lbn 1 blkno 202285

This is for the fragmented partition.  The numbers in the first column
are inode numbers in find(1) order, and the meaning of the blkno's should
be obvious.  The first few files are tiny and consist of only 1 frag
(#24320 is ".").  Then there is a 9K file (#24588) with 8
necessarily-contiguous frags forming a block and another 1K in a frag a
long seek away.  Seeks are counted whenever a blkno doesn't immediately
follow the previous one.
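The nforward/nreverse labels above are power-of-two ranges of seek
distance (in frags).  The following is a sketch of how such a histogram
could be computed -- illustration only, not the tool that produced the
numbers above; it assumes a simplified input of one blkno (frag number)
per line on stdin, in the order the frags are read, and its distance
convention is a guess:

/*
 * Sketch only: bin the gap between consecutive frag numbers into
 * power-of-two forward/reverse buckets.  The real data also carries
 * inode and lbn columns; the distance convention (forward gap =
 * delta - 1, reverse distance = |delta|) is an assumption.
 */
#include <stdio.h>

#define	NBINS	48			/* power-of-two distance bins */

int
main(void)
{
	long nforward[NBINS] = { 0 }, nreverse[NBINS] = { 0 };
	long long blkno, prev = -1, delta, dist, totseek = 0;
	long ncontig = 0, ntrans = 0;
	int bin;

	while (scanf("%lld", &blkno) == 1) {
		if (prev != -1) {
			ntrans++;
			delta = blkno - prev;
			if (delta == 1)
				ncontig++;	/* immediately follows */
			else if (delta != 0) {
				dist = delta > 1 ? delta - 1 : -delta;
				totseek += dist;
				/* bin 0 is 1-1, bin 1 is 2-3, bin 2 is 4-7, ... */
				for (bin = 0; bin < NBINS - 1 &&
				    (dist >> (bin + 1)) != 0; bin++)
					;
				if (delta > 0)
					nforward[bin]++;
				else
					nreverse[bin]++;
			}
		}
		prev = blkno;
	}
	if (ntrans != 0)
		printf("avgseek = %.1f, totseek = %lld\n",
		    (double)totseek / ntrans, totseek);
	printf("ncontig = %ld\n", ncontig);
	for (bin = 0; bin < NBINS; bin++) {
		if (nforward[bin] == 0 && nreverse[bin] == 0)
			continue;
		printf("nforward%lld-%lld = %ld, nreverse%lld-%lld = %ld\n",
		    1LL << bin, (1LL << (bin + 1)) - 1, nforward[bin],
		    1LL << bin, (1LL << (bin + 1)) - 1, nreverse[bin]);
	}
	return (0);
}

Feeding it the blknos of either tree in find(1) order should give counts
of the same shape as the tables above, give or take the distance
convention (e.g. whether a one-frag forward skip counts as distance 1 or
2).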
The avgseek/totseek numbers show that the fragmented case has to seek
about 6 times as far (totseek 673126241 vs. 111959698) to read the whole
src tree in find(1) order.  Actual i/o times for "tar cf /dev/zero .":

IBM 30MB DTLA, fragmented, 70% full:  199 seconds
WD 1200JB, new, 3% full:              22.5 seconds

Most applications don't touch so many files, and then I believe that the
read times don't matter much (the working set stays cached), but large
seeks for writes are maximally harmful (there won't be enough writes to
fill in the gaps even if there is lots of write caching, so physical
seeks will be large).

Bruce