Re: [PATCH] MAXCPU alterable in kernel config - needs testers

From: Paul Allen <pallen_at_ugcs.caltech.edu> Date: Sun, 8 Oct 2006 18:23:23 -0700

>From John-Mark Gurney <gurney_j_at_resnet.uoregon.edu>, Sun, Oct 08, 2006 at 05:22:00PM -0700:
> Wouldn't having a single run queue lock still serialize the cpu's when
> getting a thread to run?  Don't we really need a per cpu run queue, and
> then have a scheduler that puts threads on the cpu's run queues?
Just so.  A fine-grain mutex model is much like pipelining a processor.
You break a task up into many small tasks each with its own lock and then
run the tasks in parallel.  There are several problems, though.

1) How many pipeline stages you want depends on the number of underlying 
   CPUs.  Too few stages means idle processors.  Too many stages relative
   to the available resources means wasted time acquiring many locks.
2) The work has to be relatively matched across stages.  One slow stage
   creates a bottleneck that everyone else eventually feels.
3) Cache coherency: handing off a datum between CPUs wastes cache resources.
   But allowing one CPU to carry a given datum through a lifetime of several
   different lock stages (tasks) means the data covered by those tasks'
   locks ends up shared between CPUs.
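
A toy model makes point 2 concrete.  Assume one CPU per stage and a fixed
lock cost paid at every stage; steady-state throughput is then bounded by
the slowest stage.  The function name and the model itself are illustrative
assumptions, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical toy model: one CPU per pipeline stage, and every item pays
 * lock_secs of locking overhead at each stage.  In steady state the
 * pipeline can only complete items as fast as its slowest stage.
 */
double pipeline_items_per_sec(const double *stage_secs, size_t nstages,
                              double lock_secs)
{
    assert(stage_secs != NULL || nstages == 0);

    double slowest = 0.0;
    for (size_t i = 0; i < nstages; i++) {
        double t = stage_secs[i] + lock_secs;  /* work plus lock overhead */
        if (t > slowest)
            slowest = t;
    }
    return slowest > 0.0 ? 1.0 / slowest : 0.0;
}
```

With four stages of 0.25 s each and no lock cost this gives 4 items/s.  Slow
one stage to 0.5 s and it drops to 2 items/s, however fast the others are.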

This sort of design problem is hard to solve, and harder still for several
different developers, some of them working in their spare time, to coordinate
properly.

It's even harder to do when the end application is unknown (i.e., in a general
purpose OS).

BUT,

per-cpu algorithms are not necessarily better.  You can waste cache resources
through data replication, and you need to provide for load balancing/migration.
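
For reference, a minimal userland sketch of what a per-cpu run queue with
migration might look like.  Everything here is hypothetical (NCPU, RQ_CAP,
the runq_* names, pthread mutexes standing in for kernel mutexes); it is not
the FreeBSD or Linux scheduler.  Each CPU takes only its own queue lock in
the common case and touches another CPU's lock only when it has run dry.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NCPU   4   /* hypothetical CPU count */
#define RQ_CAP 64  /* hypothetical per-queue capacity */

struct runq {
    pthread_mutex_t lock;     /* protects only this CPU's queue */
    int    tids[RQ_CAP];      /* ring buffer of runnable thread ids (>= 0) */
    size_t head, len;
};

static struct runq runqs[NCPU];

void runq_init(void)
{
    for (int i = 0; i < NCPU; i++) {
        pthread_mutex_init(&runqs[i].lock, NULL);
        runqs[i].head = runqs[i].len = 0;
    }
}

/* Enqueue tid on cpu's queue; returns 0 on success, -1 if the queue is full. */
int runq_add(int cpu, int tid)
{
    assert(cpu >= 0 && cpu < NCPU);
    struct runq *rq = &runqs[cpu];
    int rc = -1;

    pthread_mutex_lock(&rq->lock);
    if (rq->len < RQ_CAP) {
        rq->tids[(rq->head + rq->len) % RQ_CAP] = tid;
        rq->len++;
        rc = 0;
    }
    pthread_mutex_unlock(&rq->lock);
    return rc;
}

/* Dequeue the oldest thread id from rq, or return -1 if it is empty. */
static int runq_pop(struct runq *rq)
{
    int tid = -1;

    pthread_mutex_lock(&rq->lock);
    if (rq->len > 0) {
        tid = rq->tids[rq->head];
        rq->head = (rq->head + 1) % RQ_CAP;
        rq->len--;
    }
    pthread_mutex_unlock(&rq->lock);
    return tid;
}

/*
 * Pick the next thread for cpu: local queue first, then steal from the
 * other CPUs in turn.  Only one queue lock is held at a time, so there is
 * no lock-ordering problem.  Returns -1 if every queue is empty.
 */
int runq_choose(int cpu)
{
    assert(cpu >= 0 && cpu < NCPU);
    int tid = runq_pop(&runqs[cpu]);

    for (int i = 1; tid < 0 && i < NCPU; i++)
        tid = runq_pop(&runqs[(cpu + i) % NCPU]);  /* migration path */
    return tid;
}
```

The steal loop is the load-balancing cost: an idle CPU pulls another CPU's
queue (and eventually the stolen thread's working set) into its own cache.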

I had been hoping that mdillon _at_ dragonfly would provide good experimental
evidence on this point.  Unfortunately, in my estimation he's distracted
by some heavy goals (namely his single-image clustering).  The consequence
of this is that 1) it will take a long time before enough of the dragonfly
kernel is out from under Giant, and 2) relative performance comparisons will
be dubious because it will be unclear what cost he is paying for some of his
single-image clustering support.

It's a shame because the community would greatly benefit from clear experimental
evidence contrasting some of the fine-grained locking approaches with a more
per-cpu oriented design.

AFAIK, the Linux kernel has generally favored the per-cpu approach.  In
that respect, the relative underperformance of FreeBSD vs. Linux is an
indicator that per-cpu approaches deserve more weight in the FreeBSD world.

                                          Paul