On Fri, 26 Mar 2021 10:37:47 +0100
Mathieu Chouquet-Stringer <me+freebsd_at_mathieu.digital> wrote:

> On Thu, Mar 25, 2021 at 08:55:12AM +0000, Matt Churchyard wrote:
> > Just as an aside, I did post a message a few weeks ago with a similar
> > problem on 13 (as well as snapshot issues). Scrub seemed ok for a
> > short while, but then ground to a halt. It would take 10+ minutes to
> > go 0.01%, with everything appearing fairly idle. I finally gave up
> > and stopped it after about 20 hours. Moving to 12.2 and rebuilding
> > the pool, the system scrubbed the same data in an hour, and I've
> > just scrubbed the same system after a month of use with about 4
> > times the data in 3 hours 20. As far as I'm aware, both should be
> > using effectively the same "new" scrub code.
> >
> > It will be interesting if you find a cause, as I didn't get any
> > response to what for me was a complete showstopper for moving to 13.
>
> Bear with me, I'm slowly resilvering now... But same thing, it's not
> even maxing out my slow drives... Looks like it'll take 2 days...
>
> I did some flame graphs using dtrace. The first one is just the output
> of this:
>
>   dtrace -x stackframes=100 -n 'profile-99 /arg0/ { @[stack()] = count(); } tick-60s { exit(0); }'
>
> Clearly my machine is not busy at all.
>
> The second is the output of pretty much the same thing, except I'm
> only capturing pid 31, which is the one that is busy:
>
>   dtrace -x stackframes=100 -n 'profile-99 /arg0 && pid == 31/ { @[stack()] = count(); } tick-60s { exit(0); }'
>
> One striking thing is how many times hpet_get_timecount is present...

Does tuning of

- vfs.zfs.scrub_delay
- vfs.zfs.resilver_min_time_ms
- vfs.zfs.resilver_delay

make a difference?

Best,
Michael

--
Michael Gmelin

Received on Fri Mar 26 2021 - 11:30:23 UTC
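For reference, these tunables can be read and adjusted at runtime with
sysctl(8). The following is only a sketch under the assumption that the
named tunables exist on the ZFS version in use; OpenZFS 2.x (which ships
with FreeBSD 13) renamed or removed several of the legacy scan tunables,
so listing the subtree first is advisable:

  # Check which scrub/resilver tunables are actually available
  sysctl vfs.zfs | grep -E 'scrub|resilver'

  # Read the suggested tunables (availability varies by version)
  sysctl vfs.zfs.scrub_delay vfs.zfs.resilver_min_time_ms vfs.zfs.resilver_delay

  # Example (as root): drop the per-I/O delays and give the resilver
  # more time per txg; values here are illustrative, not recommendations
  sysctl vfs.zfs.scrub_delay=0
  sysctl vfs.zfs.resilver_delay=0
  sysctl vfs.zfs.resilver_min_time_ms=5000

Changes made this way take effect immediately; persisting them across
reboots would go in /etc/sysctl.conf.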