Re: SCHED_ULE should not be the default

From: Attilio Rao <attilio_at_freebsd.org>
Date: Thu, 15 Dec 2011 20:02:44 +0100
2011/12/15 Jeremy Chadwick <freebsd_at_jdc.parodius.com>:
> On Thu, Dec 15, 2011 at 05:26:27PM +0100, Attilio Rao wrote:
>> 2011/12/13 Jeremy Chadwick <freebsd_at_jdc.parodius.com>:
>> > On Mon, Dec 12, 2011 at 02:47:57PM +0100, O. Hartmann wrote:
>> >> > Not fully right, boinc defaults to run on idprio 31 so this isn't an
>> >> > issue. And yes, there are cases where SCHED_ULE shows much better
>> >> > performance than SCHED_4BSD. [...]
>> >>
>> >> Do we have any proof at hand for such cases where SCHED_ULE performs
>> >> much better than SCHED_4BSD? Whenever the subject comes up, it is
>> >> mentioned that SCHED_ULE performs better on boxes with ncpu > 2. But
>> >> in the end I see contradictory statements here. People complain about
>> >> poor performance (especially in scientific environments), and others
>> >> counter that this is not the case.
>> >>
>> >> Within our department, we developed highly scalable code for planetary
>> >> science work on imagery. It utilizes GPUs via OpenCL if present;
>> >> otherwise it grabs as many cores as it can.
>> >> By the end of this year I'll get a new desktop box based on Intel's new
>> >> Sandy Bridge-E architecture with plenty of memory. If the colleague who
>> >> developed the code is willing to perform some benchmarks on the same
>> >> hardware platform, we'll benchmark both FreeBSD 9.0/10.0 and the most
>> >> recent Suse. For FreeBSD I also intend to look at performance with both
>> >> of the available schedulers.
>> >
>> > This is in no way shape or form the same kind of benchmark as what
>> > you're planning to do, but I thought I'd throw it out there for folks to
>> > take in as they see fit.
>> >
>> > I know folks were focused mainly on buildworld.
>> >
>> > I personally would find it interesting if someone with a higher-end
>> > system (e.g. 2 physical CPUs, with 6 or 8 cores per CPU) was to do the
>> > same test (changing -jX to -j{numofcores} of course).
>> >
>> > --
>> > | Jeremy Chadwick                                jdc at parodius.com |
>> > | Parodius Networking                       http://www.parodius.com/ |
>> > | UNIX Systems Administrator                   Mountain View, CA, US |
>> > | Making life hard for others since 1977.               PGP 4BD6C0CB |
>> >
>> >
>> > sched_ule
>> > ===========
>> > - time make -j2 buildworld
>> > 1689.831u 229.328s 18:46.20 170.4% 6566+2051k 432+4264io 4565pf+0w
>> > - time make -j2 buildkernel
>> > 640.542u 87.737s 9:01.38 134.5% 6490+1920k 134+5968io 0pf+0w
>> >
>> >
>> > sched_4bsd
>> > ============
>> > - time make -j2 buildworld
>> > 1662.793u 206.908s 17:12.02 181.1% 6578+2054k 23750+4271io 6451pf+0w
>> > - time make -j2 buildkernel
>> > 638.717u 76.146s 8:34.90 138.8% 6530+1927k 6415+5903io 0pf+0w
>> >
>> >
>> > software
>> > ==========
>> > * sched_ule test:  FreeBSD 8.2-STABLE, Thu Dec  1 04:37:29 PST 2011
>> > * sched_4bsd test: FreeBSD 8.2-STABLE, Mon Dec 12 22:42:54 PST 2011
>>
>> Hi Jeremy,
>> thanks for the time you spent on this.
>>
>> However, I wanted to ask about/point out 3 things:
>> 1) Did you use 2 different code bases for the test? (one updated on
>> December 1 and another one on December 12)
>
> No; src-all (/usr/src on this system) was not updated between December
> 1st and December 12th PST.  I do believe I updated it today (15th PST).
> I can/will obviously hold off so that we have a consistent code base for
> comparing numbers between schedulers during buildworld and/or
> buildkernel.
>
>> 2) Please note that you should have repeated this test several times
>> (basically until you get a standard deviation that is acceptable
>> according to ministat) and reported the ministat output
>
> This is the first time I have heard of ministat(1).  I'm pretty sure I
> see what it's for and how it applies to this situation, but boy that man
> page could use some clarification (I have 3 people looking at this thing
> right now trying to figure out what means what in the graph :-) ).
> Anyway, graph or not, I see the point.
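
For what it's worth, the invocation itself is simple. A minimal sketch,
assuming each file holds one buildworld wall-clock time (in seconds) per
line, one line per run; the file names are made up:

    # ministat prints summary statistics (median, average, stddev) for
    # each dataset, plus an ASCII plot, and then states whether the two
    # datasets differ at the requested confidence level
    ministat -c 95 ule.times 4bsd.times

The line to look for in the output is the one saying either "Difference
at 95.0% confidence" or "No difference proven at 95.0% confidence"; the
plot is just a visual aid.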
>
> Regarding multiple tests: yup, you're absolutely right, the only way to
> do it would be to run a sequence of tests repeatedly (probably 10 per
> scheduler).  Reboots and rm -fr /usr/obj/* would be required after each
> test too, to guarantee empty kernel caches (of all types) consistently
> every time.
>
> What I posted was supposed to give people just a "general idea" if there
> was any gigantic difference between the two, and there really isn't.
> But, as others have stated (and you below), buildworld may not be an
> effective way to "benchmark" what we're trying to test.
>
> Hence me wondering exactly what would make for a good test.  Example:
>
> 1. Run + background some program that "beats on things" (I really don't
> know what; creation/deletion of threads?  CPU benchmark?  bonnie++?),
> with output going to /dev/null.
> 2. Run + background "time make -j2 buildworld" with output going to /dev/null
> 3. Record/save output from "time".
> 4. rm -fr /usr/obj && shutdown -r now
> 5. Repeat all steps ~10 times
> 6. Adjust kernel configuration file to use other scheduler
> 7. Repeat steps 1-5.
>
> What I'm trying to figure out is what #1 and #2 should be in the above
> example.
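
To make the shape of that loop concrete, a rough sh sketch (the
background load is a placeholder until we pick a real workload for #1,
and it assumes something like an rc script re-launches it after each
reboot; steps 6-7, the scheduler switch, stay manual):

    #!/bin/sh
    # Sketch only: one timed buildworld per boot, up to $RUNS boots.
    RUNS=10
    LOG=/var/log/sched-bench.log
    COUNT=/var/db/sched-bench.count

    i=$(cat "$COUNT" 2>/dev/null || echo 0)
    [ "$i" -ge "$RUNS" ] && exit 0

    # step 1: background load -- placeholder command
    some_load_generator > /dev/null 2>&1 &
    loadpid=$!

    # steps 2-3: timed buildworld; time(1) appends its stats to the log
    cd /usr/src
    /usr/bin/time -a -o "$LOG" make -j2 buildworld > /dev/null 2>&1

    kill "$loadpid"

    # steps 4-5: clean /usr/obj and reboot for the next iteration
    echo $((i + 1)) > "$COUNT"
    rm -rf /usr/obj/*
    shutdown -r now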
>
>> 3) The difference is less than 2%, which I suspect is statistically
>> insignificant/effectively the same
>
> Understood.
>
>> I'm not really even surprised ULE is not faster than 4BSD in this case,
>> because buildworld/buildkernel tests are usually dominated by I/O
>> overhead rather than scheduler behavior. It would be more interesting
>> to analyze how buildworld does while another type of workload is going
>> on.
>
> Yup, agreed/understood, hence me trying to find out what would classify
> as a good stress test for all of this.
>
> I have a testbed system in my garage which I could set up to literally
> do all of this in a loop, meaning automate the entire above process and
> just let it go, writing stderr from time to a file (which wouldn't skew
> the results at all).
>
> Let me know what #1 and #2 above, re: "the workloads", should be and
> I'll be happy to set it up.

My idea, in order to gather meaningful data for both ULE and 4BSD,
would be to see how well they behave in the following situations:
- 2 concurrent interactive workloads
- 2 concurrent cpu-intensive workloads
- mixed

and to have the number of threads for each vary as: N/2, N, N +
small_amount (1 or 2 or 3, etc.), N*2 (where N is the number of
available CPUs), which automatically translates into:

- 2 concurrent workloads, A and B (interactive or intensive):
  * A N/2 threads, B N/2 threads
  * A N threads, B N/2 threads
  * A N + small_amount, B N/2 threads
  * A N*2 threads, B N/2 threads
  * A N threads, B N threads
  * A N + small_amount, B N threads
  * A N*2 threads, B N threads
  * A N + small_amount, B N + small_amount threads
  * A N*2 threads, B N + small_amount threads
  * A N*2 threads, B N*2 threads

For the mixed case, instead, we should try all 16 combinations if
possible (the two workloads differ, so the ordered pairs matter), and
it is likely the most interesting case, to be honest.
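
To make the matrix concrete, a tiny sh sketch of how the thread counts
and pairs could be generated on the test box (hw.ncpu is a real sysctl;
the rest is illustrative):

    #!/bin/sh
    # Enumerate the A/B thread-count pairs for one workload combination.
    N=$(sysctl -n hw.ncpu)
    SMALL=2                      # "small_amount": 1 or 2 or 3, etc.
    COUNTS="$((N / 2)) $N $((N + SMALL)) $((N * 2))"

    for a in $COUNTS; do
        for b in $COUNTS; do
            echo "A: $a threads, B: $b threads"
        done
    done

This prints the 16 ordered pairs for the mixed case; when A and B are
the same kind of workload, only the 10 unordered pairs listed above are
needed.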

About the workloads, we could use:
interactive: buildworld and bonnie++ (I'm not totally sure whether
bonnie++ lets you decide how many threads to run, but I'm sure we can
replace it with something that does)
cpu-intensive: dnetc and SOMETHINGELSE (please propose something that
can be set up very easily!)
mixed case: buildworld and dnetc

About the environment I'd suggest the following things (a concrete
sketch follows this list):
- Try to boot with a maximum of 16 CPUs. I'm sure that past that point
TLB shootdown overhead becomes overwhelming, make doesn't really scale
well, and there could also be too much contention on
vm_page_lock_queue for interactive threads.
- Try to reduce the I/O effect by using tmpfs as the storage for input
and output data when running the benchmark.
- Use 10.0 with both kernel and userland totally debug-free (please
remember to set MALLOC_PRODUCTION for jemalloc) and always at the same
svn revision, with the only changes between runs being the scheduler
switch and the number of threads.
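
In practical terms that could look something like the following (the
mount point is arbitrary, and the MALLOC_PRODUCTION and CPU-limiting
details are assumptions to double-check against the target revision):

    # keep build input/output on tmpfs so disk I/O stays out of the
    # measurement
    mount -t tmpfs tmpfs /usr/obj

    # /etc/make.conf: build a non-debug jemalloc into world
    MALLOC_PRODUCTION=yes

    # /boot/loader.conf: one possible way to cap the box at 16 CPUs is
    # to disable the extra logical CPUs via loader hints, e.g.:
    hint.lapic.16.disabled="1"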

About the test itself I'd suggest the following things (a sketch of the
per-combination protocol follows this list):
- After every test combination, please reboot the machine (e.g., after
you have tested the A N/2 threads and B N/2 threads case on
sched_4bsd, reboot the machine before doing A N threads and B N/2
threads)
- For every test combination I suggest running the workloads 4 times,
discarding the first run (but keep the value!) and feeding the other
three to ministat. Showing the "uncached" case against the average of
the cached ones will be more informative than you might expect.
- Only treat a result as valuable when ministat reports a difference
at 95% confidence (or beyond)
- For every difference in performance we find, we should likely start
worrying once it is 3% or bigger, and be very concerned from 5%
upwards
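
Stitched together, the per-combination protocol could be sketched like
this (run_combo and $SCHED are hypothetical; run_combo is assumed to
append one wall-clock number per invocation):

    # four runs of one A/B combination; run 1 is the "uncached" reference
    for i in 1 2 3 4; do
        run_combo >> combo.raw
    done
    head -n 1 combo.raw > combo.$SCHED.uncached
    tail -n 3 combo.raw > combo.$SCHED.cached

    # once both schedulers have been measured on the same combination:
    ministat -c 95 combo.4bsd.cached combo.ule.cached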

I think we already have some data showing ULE being broken in some
cases (like George's and Steven's cases), but we really need to
characterize it more, I think.

Now, I understand this seems like a gigantic amount of work, but I
think many people are interested in working on this, and we could
scatter these tests across different testers to gather meaningful
data.

If it were me, I would start with the comparisons involving the N and
N + small_amount cases, which should be the most interesting.

Do you have questions?

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein