Thanks for your explanations, they helped greatly. Using ktrdump and
schedgraph.py, and after modifying our test program to set and unset
debug.ktr.mask automatically, I've been able to get useful information
(a minimal sketch of the mask-toggling code is at the end of this mail).

First, it made me realize that, with default settings and 2 active
processes, task switching only occurs every 100ms. Knowing that,
expecting a latency of around 100ms was kind of silly :)

Next, it seems most of the latency spikes are due to a process starting
or waking up. For instance, it usually happens when the openssl speed
test is started
( http://jflesch.kwain.net/~jflesch/sys_latence/sched/sched_graph_openssl_start.png )
or when pagedaemon wakes up (I forgot to disable swap, and my test
program used too much memory to store the result values...). I'm not
sure why, but when we start openssl, it is often allowed to run for
>= 300ms, even with our test program set to real-time priority. My
intuition is that, at first, it's considered an interactive process,
until the scheduler realizes it's not. But then, does anyone know why
it would take the scheduler more than 300ms to realize that?

Anyway, by setting kern.sched.interact=5 (so openssl isn't considered
an interactive process), kern.sched.slice=3 (to get a high enough
scheduling resolution), and setting our program to real-time priority,
we got rid of both problems. I'm just a little bit worried about
kern.sched.slice=3. Is there any known side effect of reducing the
slice size?

Also, another issue remains: we were hoping to keep our program at a
normal priority. However, when we set our test program to a normal
priority (but still a higher priority than openssl), both get 50% of
the CPU (I guess this is to be expected), and from time to time we have
a "hiccup" in the scheduling:
http://jflesch.kwain.net/~jflesch/sys_latence/sched/sched_graph_hicups.png .
Is there any way to avoid these hiccups? In other words, is it possible
to make sure that the low-priority process never gets more CPU time
than the high-priority one?

On 23.02.2011 02:08, Ryan Stone wrote:
> To debug weird scheduling issues I find it helpful to start by looking
> at a schedgraph. schedgraph is a tool that can display a graphical
> representation of what the scheduler was doing over a small slice of
> time. The one downside is that you have to recompile your kernel to
> get the hooks that collect the necessary data, but it sounds like your
> problem is pretty easy to reproduce and so that shouldn't be a big
> problem.
>
> To enable the hooks, put the following in your kernel conf:
>
> options KTR
> options KTR_COMPILE=KTR_SCHED
> options KTR_MASK=KTR_SCHED
> options KTR_ENTRIES=(128*1024)
>
> Then rebuild and install the new kernel. Next, run your test. The
> instant that your test has detected the long delay, set the sysctl
> debug.ktr.mask to 0. The scheduler hooks record data into a ring
> buffer, so the interesting data can be flushed out within seconds.
> Clearing that sysctl will stop any further events from being logged,
> which means that the interesting data will stay there until you go and
> collect it.
>
> You can get the raw data by redirecting the output of "ktrdump -ct"
> into a file. Then from any machine with a graphical interface (FreeBSD
> with X installed, Windows, Mac, whatever) and python installed, run:
> $ python schedgraph.py /path/to/ktrdump/output
>
> You can get schedgraph.py from /usr/src/tools/sched.
>
> If you want to collect the data again, set the sysctl debug.ktr.mask
> back to 0x20000000 to restart gathering scheduler data.
>
> Ryan
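
P.S.: For reference, here is roughly how the test program toggles
debug.ktr.mask and switches itself to real-time priority. This is only
a minimal sketch: measure_latency_us() and LATENCY_THRESHOLD_US are
placeholders standing in for our actual measurement loop, and the
0x20000000 value is the KTR_SCHED mask mentioned in Ryan's mail.

    #include <sys/types.h>
    #include <sys/rtprio.h>
    #include <sys/sysctl.h>
    #include <err.h>

    #define KTR_SCHED_MASK          0x20000000U  /* value from Ryan's mail */
    #define LATENCY_THRESHOLD_US    10000        /* placeholder threshold */

    /* Start or freeze KTR scheduler tracing by writing debug.ktr.mask. */
    static void
    set_ktr_mask(unsigned int mask)
    {
            if (sysctlbyname("debug.ktr.mask", NULL, NULL,
                &mask, sizeof(mask)) != 0)
                    err(1, "sysctlbyname(debug.ktr.mask)");
    }

    /* Placeholder: the real program measures the wakeup latency here. */
    static long
    measure_latency_us(void)
    {
            return (0);
    }

    int
    main(void)
    {
            struct rtprio rtp = { RTP_PRIO_REALTIME, 0 };

            /* Switch the process to real-time priority (needs root). */
            if (rtprio(RTP_SET, 0, &rtp) != 0)
                    err(1, "rtprio");

            set_ktr_mask(KTR_SCHED_MASK);   /* start logging KTR_SCHED events */

            if (measure_latency_us() > LATENCY_THRESHOLD_US) {
                    /*
                     * Freeze the ring buffer so "ktrdump -ct" still has
                     * the interesting events when we come to collect them.
                     */
                    set_ktr_mask(0);
            }
            return (0);
    }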