Apparent strange disk behaviour in 6.0

From: Julian Elischer <julian_at_elischer.org> Date: Wed, 27 Jul 2005 23:54:45 -0700

I've been playing around with some RAID arrays.
I've noticed some odd things.

Firstly, on a 2+HTT (i.e. 4 virtual) CPU system with one SCSI array and an
ATA drive, copying data from the ATA drive to the SCSI array seems to be
slower than it was on 4.x.

Secondly, systat -vmstat never shows either of the drives as being 100% busy.
The most I've seen is the ATA drive being 70% busy.
For example, this is theoretically a disk I/O bound system, but:

Disks   ad0   da0 pass0 pass1 pass2
KB/t  19.40 11.68  0.00  0.00  0.00
tps     440   539     0     0     0
MB/s   8.34  6.14  0.00  0.00  0.00
% busy   48    50     0     0     0

I don't know how reliable that is, however.
I HAVE noticed, however, that the sum of the busy percentages for the two
drives always seems to be less than 110%. If one goes up then the other goes
down. Not knowing how these numbers are calculated, it's hard to know whether
that means anything.
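
As a rough cross-check of the systat sample above, the MB/s column can be compared against KB/t multiplied by tps. The sketch below is only that: it assumes 1 MB = 1024 KB (an assumption, not something stated in the post), the helper name is made up for illustration, and it says nothing about how the % busy column is derived.

```python
# Values copied from the systat -vmstat sample in the post.
samples = {
    "ad0": {"kb_per_xfer": 19.40, "tps": 440, "mb_per_s": 8.34},
    "da0": {"kb_per_xfer": 11.68, "tps": 539, "mb_per_s": 6.14},
}


def derived_mb_per_s(kb_per_xfer, tps):
    """Throughput implied by transfer size and rate.

    Assumption: 1 MB = 1024 KB. This helper is hypothetical and is not
    taken from systat itself.
    """
    return kb_per_xfer * tps / 1024.0


for name, s in samples.items():
    implied = derived_mb_per_s(s["kb_per_xfer"], s["tps"])
    print(f"{name}: reported {s['mb_per_s']:.2f} MB/s, "
          f"implied {implied:.2f} MB/s")
```

Under that assumption the reported and implied throughput agree to within rounding, so the KB/t, tps and MB/s columns are at least consistent with each other even if % busy remains unexplained.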

Physically looking at the array, the disks spend a LOT of time doing nothing.
The array controller is obviously clustering the writes and seems to be
writing them out every 2 seconds, but the disks are only busy for about 1/4
of that time.

I don't know how reliable that is as an indication, but whatever the
bottleneck is, it's not the drives.
The array controller is reporting back that it hardly ever has a queue of
more than 1 thing to do, even though tags are set to 253. (Occasionally the
controller will report it has 20 to do, but the next instant it's caught up
again.)

I plan on net booting the same machine on 4.11 again and doing the same 
tests.

If I REALLY get the disks 100% busy by doing:

dd if=/dev/zero of=/raid1/bigfile bs=128k count=1000000

then the system becomes so unresponsive that it takes about 10 seconds
for a ^C to get through to stop the dd.

A systat -vmstat running at the same time in another window slows down and
then just updates every now and then. At no stage, however, does it show
anything getting close to 100% of CPU time.
Interrupt time is at about 15% and system time at about 20%.

The odd thing is that a tip session talking to the RAID controller
continues to show responsive behavior, continuing to update the RAID stats
page, and the network seems to be bringing those to me just fine. So the com
ports and the network are at least able to function, even if everything else
seizes up.

iostat sometimes continues to run, and this is what it showed during one
section where the rest of the system seemed pretty unresponsive:
      tty             ad0              da0            pass0             cpu
 tin tout  KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s  us ni sy in id
  53   79  6.13  13  0.08  16.68 1746 28.44   0.00   0  0.00   0  0 22  5 73
 604  836  6.00   2  0.01  16.00 1749 27.32   0.00   0  0.00   0  0 28  5 67
 168  240  7.97  31  0.24  128.00  40  4.96   0.00   0  0.00   0  0 27  9 64
 173  251 11.27  11  0.12  16.00 3047 47.61   0.00   0  0.00   0  0 30  5 65
 222  299 12.93  46  0.58  21.72 2092 44.37   0.00   0  0.00   0  0 34  5 60
 225  302 13.29  34  0.44  128.00  39  4.87   0.00   0  0.00   0  0 40 16 43
 172  250  6.82  34  0.23  30.45 217  6.44   0.00   0  0.00   0  0 52 15 33
 191  268  6.22   9  0.05  16.72 1559 25.44   0.00   0  0.00   0  0 18  3 80
 200  278 10.45  31  0.32  18.78 1007 18.46   0.00   0  0.00   0  0 54 11 34
 192  270 12.00   1  0.01  16.00 2827 44.18   0.00   0  0.00   0  0 34  6 59
 213  728  8.80  40  0.34  18.68 1225 22.34   0.00   0  0.00   0  0 42 11 47
 201  250 10.29  11  0.11  128.00   3  0.41   0.00   0  0.00   0  0 22  5 74
 186  281  8.20  37  0.30  125.85  49  5.98   0.00   0  0.00   0  0 33 11 56
 225  302  4.00   3  0.01  16.00 2977 46.52   0.00   0  0.00   0  0 29  4 66
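
One way to read the iostat sample above is to split the da0 intervals by transfer size and look at the CPU idle column. The following is a minimal sketch, not a diagnosis: the 64 KB cut-off between "small" and "large" transfers is an arbitrary assumption, the function names are made up for illustration, and the column positions assume the exact 16-field layout shown in the sample.

```python
# Data rows copied from the iostat sample in the post (header lines dropped).
IOSTAT_ROWS = """\
53 79 6.13 13 0.08 16.68 1746 28.44 0.00 0 0.00 0 0 22 5 73
604 836 6.00 2 0.01 16.00 1749 27.32 0.00 0 0.00 0 0 28 5 67
168 240 7.97 31 0.24 128.00 40 4.96 0.00 0 0.00 0 0 27 9 64
173 251 11.27 11 0.12 16.00 3047 47.61 0.00 0 0.00 0 0 30 5 65
222 299 12.93 46 0.58 21.72 2092 44.37 0.00 0 0.00 0 0 34 5 60
225 302 13.29 34 0.44 128.00 39 4.87 0.00 0 0.00 0 0 40 16 43
172 250 6.82 34 0.23 30.45 217 6.44 0.00 0 0.00 0 0 52 15 33
191 268 6.22 9 0.05 16.72 1559 25.44 0.00 0 0.00 0 0 18 3 80
200 278 10.45 31 0.32 18.78 1007 18.46 0.00 0 0.00 0 0 54 11 34
192 270 12.00 1 0.01 16.00 2827 44.18 0.00 0 0.00 0 0 34 6 59
213 728 8.80 40 0.34 18.68 1225 22.34 0.00 0 0.00 0 0 42 11 47
201 250 10.29 11 0.11 128.00 3 0.41 0.00 0 0.00 0 0 22 5 74
186 281 8.20 37 0.30 125.85 49 5.98 0.00 0 0.00 0 0 33 11 56
225 302 4.00 3 0.01 16.00 2977 46.52 0.00 0 0.00 0 0 29 4 66
"""

LARGE_KB = 64.0  # assumed cut-off between "small" and "large" transfers


def parse_rows(text):
    """Extract the da0 columns and CPU idle from each 16-field row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 16:
            continue  # skip anything that is not a full data row
        rows.append({
            "da0_kbt": float(fields[5]),
            "da0_tps": int(fields[6]),
            "da0_mbs": float(fields[7]),
            "idle": int(fields[15]),
        })
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


rows = parse_rows(IOSTAT_ROWS)
large = [r for r in rows if r["da0_kbt"] >= LARGE_KB]
small = [r for r in rows if r["da0_kbt"] < LARGE_KB]

print(f"small-transfer intervals: {len(small)}, "
      f"mean {mean(r['da0_mbs'] for r in small):.2f} MB/s")
print(f"large-transfer intervals: {len(large)}, "
      f"mean {mean(r['da0_mbs'] for r in large):.2f} MB/s")
print(f"lowest CPU idle in sample: {min(r['idle'] for r in rows)}%")
```

With that cut-off, the intervals where da0 shows transfers of around 128 KB are also the ones with the lowest da0 throughput in this sample, and the CPU idle column never drops to zero. That is consistent with the observations in the post but does not by itself identify the cause.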

I'm guessing that there may be a red-hot mutex somewhere in the kernel..
not sure what though..