Re: SCHED_ULE makes 256Mbyte i386 unusable

From: Rodney W. Grimes <freebsd-rwg_at_pdx.rh.CN85.dnsmgr.net>
Date: Sun, 22 Apr 2018 07:36:09 -0700 (PDT)
> Konstantin Belousov wrote:
> >On Sat, Apr 21, 2018 at 11:30:55PM +0000, Rick Macklem wrote:
> >> Konstantin Belousov wrote:
> >> >On Sat, Apr 21, 2018 at 07:21:58PM +0000, Rick Macklem wrote:
> >> >> I decided to start a new thread on current related to SCHED_ULE, since I see
> >> >> more than just performance degradation and on a recent current kernel.
> >> >> (I cc'd a couple of the people discussing performance problems in freebsd-stable
> >> >>  recently under the subject line "Re: kern.sched.quantum: Creepy, sadistic scheduler".)
> >> >>
> >> >> When testing a pNFS server on a single core i386 with 256Mbytes using a Dec. 2017
> >> >> current/head kernel, I would see about a 30% performance degradation (elapsed
> >> >> run time for a kernel build over NFSv4.1) when the server kernel was built with
> >> >> options SCHED_ULE
> >> >> instead of
> >> >> options SCHED_4BSD
> So, now that I have decreased the number of nfsd kernel threads to 32, it works
> with both schedulers and with essentially the same performance. (That is, the 30%
> performance degradation has disappeared.)
> 
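For anyone wanting to reproduce the mitigation: the nfsd thread count is
set with nfsd's -n flag, e.g. from rc.conf on the server (a minimal
sketch; 32 matches the count Rick settled on, and -t/-u just enable TCP
and UDP):

    nfs_server_enable="YES"
    nfs_server_flags="-u -t -n 32"
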
> >> >>
> >> >> Now, with a kernel from a couple of days ago, the
> >> >> options SCHED_ULE
> >> >> kernel becomes unusable shortly after starting testing.
> >> >> I have seen two variants of this:
> >> >> - Became essentially hung. All I could do was ping the machine from the network.
> >> >> - Reported "vm_thread_new: kstack allocation failed"
> >> >>   and then any attempt to do anything gets "No more processes".
> >> >This is strange.  It usually means that you get KVA either exhausted or
> >> >severely fragmented.
> >> Yes. I reduced the number of nfsd threads from 256->32 and the SCHED_ULE
> >> kernel is working ok now. I haven't done enough to compare performance yet.
> >> Maybe I'll post again when I have some numbers.
> >>
> >> >Enter ddb, it should be operational since pings are replied.  Try to see
> >> >where the threads are stuck.
> >> I didn't do this, since reducing the number of kernel threads seems to have fixed
> >> the problem. For the pNFS server, the nfsd threads will spawn additional kernel
> >> threads to do proxies to the mirrored DS servers.
> >>
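For reference, if the hang reappears, the stuck threads can usually be
inspected from the ddb prompt with the standard ddb(4) commands, e.g.:

    db> ps          -- list threads and their wait channels
    db> alltrace    -- stack backtrace of every thread
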
> >> >> with the only difference being a kernel built with
> >> >> options SCHED_4BSD
> >> >> everything works and performs the same as the Dec 2017 kernel.
> >> >>
> >> >> I can try rolling back through the revisions, but it would be nice if someone
> >> >> could suggest where to start, because it takes a couple of hours to build a
> >> >> kernel on this system.
> >> >>
> >> >> So, something has made things worse for a head/current kernel this winter, rick
> >> >
> >> >There are at least two potentially relevant changes.
> >> >
> >> >First is r326758 Dec 11 which bumped KSTACK_PAGES on i386 to 4.
> >> I've been running this machine with KSTACK_PAGES=4 for some time, so no change.
> W.r.t. Rodney Grimes' comments about this (which didn't end up in the messages
> in this thread):
> I didn't see any instability when using KSTACK_PAGES=4 until this problem cropped
> up and seemed to be scheduler related (but not really, it seems).
> I bumped it to KSTACK_PAGES=4 because I needed that for the pNFS Metadata
> Server code.
> 
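For anyone following along: that setting goes in the kernel config file,
in the same form as the scheduler options quoted above (I believe it is
also reachable as the loader tunable kern.kstack_pages):

    options KSTACK_PAGES=4    # 4 x 4KB pages of kstack per kernel thread
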
> Yes, NFS does use quite a bit of kernel stack. Unfortunately, it isn't one big
> item getting allocated on the stack, but many moderate-sized ones.
> (A part of it is multiple instances of "struct vattr", some buried in "struct nfsvattr",
>  that NFS needs to use. I don't think these are large enough to justify malloc/free,
>  but it has to use several of them.)
> 
> One case I did try fixing was about 6 cases where "struct nfsstate" ended up on
> the stack. I changed the code to malloc/free them and, when testing, to my
> surprise I saw a 20% performance hit, so I shelved the patch.
> Now that I know the server was running near its limit, I might try this one
> again, to see whether the performance hit still occurs when the machine has
> adequate memory. If it goes away, I could commit the patch, but it wouldn't
> have that much effect on the kstack usage. (It's interesting how this patch ended
> up being related to the issue this thread discusses.)
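
The conversion Rick describes is the usual stack-to-heap pattern with
malloc(9); a minimal sketch, not the actual pNFS patch (M_TEMP stands in
for whatever malloc type the NFS code really uses, and struct nfsstate
comes from the NFS headers):

    #include <sys/param.h>
    #include <sys/malloc.h>

    static void
    example_handler(void)
    {
            struct nfsstate *stp;

            /*
             * Heap allocation replaces "struct nfsstate st;" on the
             * kstack.  M_WAITOK cannot fail, but the malloc/free pair
             * on every call is the likely source of the 20% hit on a
             * memory-starved machine.
             */
            stp = malloc(sizeof(*stp), M_TEMP, M_WAITOK | M_ZERO);
            /* ... use *stp wherever the stack instance was used ... */
            free(stp, M_TEMP);
    }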

Anything we can do to help relieve KSTACK usage, especially on i386,
is helpful.  There is a thread from quite some time back where someone
came up with a compile-time static "this function uses X bytes of
local stack" check, and a bit of cleanup was done.  We should pursue
this issue further.
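
One way to approximate that check today is the compiler itself: gcc has
-fstack-usage (per-function stack usage written to a .su file) and
-Wframe-larger-than= (a compile-time warning), and clang understands
-Wframe-larger-than= as well, so something like this would be a start:

    # warn about any function whose frame exceeds 2KB, and emit
    # per-function usage numbers into foo.su for auditing
    cc -c -Wframe-larger-than=2048 -fstack-usage foo.c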

My experience with the i386/KSTACK issues was attempting to do installs
from snapshot .iso's; I usually had to change to a custom kernel without
INVARIANTS and WITNESS, or reduce KSTACK to 2 and suffer the small-stack
problem (i.e., don't use NFS during the install).  Neither was very pleasant.

I have found it impractical to run the 4-page KSTACK in production
i386 VMs due to the memory requirements.  I run many very lean
i386 VMs with 64MB of memory.  I suspect our user base also has
many people doing this, and it would be to our advantage to try
to reduce our kernel stack needs.


> >> >Second is r332489 Apr 13, which introduced 4/4G KVA/UVA split.
> >> Could this change have resulted in the system being able to allocate fewer
> >> kernel threads/stacks for some reason?
> >Well, it could, as anything can be buggy. But the intent of the change
> >was to give 4G KVA, and it did.
> Righto. No concern here. I suspect the Dec. 2017 kernel was close to the limit
> (see the performance issue that went away, noted above) and any change could
> have pushed it across the line.
> 
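If it helps confirm that, I believe the i386 pmap exports the KVA numbers
via sysctl, so the headroom can be watched while a test runs:

    sysctl vm.kvm_size vm.kvm_free    -- total and remaining kernel VA
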
> >>
> >> >Consequences of the first one are obvious, it is much harder to find
> >> >the place to map the stack.  Second change, on the other hand, provides
> >> >almost full 4G for KVA and should have mostly compensated for the negative
> >> >effects of the first.
> >> >
> >> >And, I cannot see how changing the scheduler would fix or even affect that
> >> >behaviour.
> >> My hunch is that the system was running near its limit for kernel threads/stacks.
> >> Then, somehow, the timing caused by SCHED_ULE resulted in the nfsd trying to
> >> reach a higher peak number of threads and hitting the limit.
> >> SCHED_4BSD happened to result in timing such that it stayed just below the
> >> limit and worked.
> >> I can think of a couple of things that might affect this:
> >> 1 - If SCHED_ULE doesn't terminate kernel threads as quickly, then
> >>       they wouldn't release their resources before more new ones
> >>       are spawned.
> >The scheduler has nothing to do with thread termination.  It might
> >select running threads in a way that causes the undesired pattern to
> >appear which might create some amount of backlog for termination, but
> >I doubt it.
> >
> >> 2 - If SCHED_ULE handles the nfsd threads in a more "bursty" way, then the burst
> >>       could try to spawn more mirror DS worker threads at about the same time.
> >>
> >> Anyhow, thanks for the help, rick
> 
> Have a good day, rick

-- 
Rod Grimes                                                 rgrimes_at_freebsd.org