Improving the kernel/i386 timecounter performance (GSoC proposal)

From: Prashant Vaibhav <prashant.vaibhav_at_gmail.com>
Date: Thu, 26 Mar 2009 18:21:54 +0530

Hi everyone,
I'm a potential Google Summer of Code applicant, proposing to work on
improving the timecounter performance in the FreeBSD kernel (suggestion
from Timecounter Performance
Improvements<http://www.freebsd.org/projects/ideas/index.html#p-timecounter-perf>).
My qualifications are mentioned at the end of this email, for those
interested. After some initial discussion in #freebsd-soc, I'm posting this
to the mailing lists (and CC'ing it to specific people) for further
discussion before I finalize and submit my application.

The primary idea is to improve the performance and resolution of
gettimeofday() and friends by creating an efficient userspace implementation
of these functions, along with some supporting modifications to the kernel.
According to my understanding, currently the gettimeofday() function calls
into the kernel to retrieve the timing information to pass on to user apps.
I propose to improve it as follows: Export the relevant timing information
to a shared page in memory, which will be mapped into every user app's
address space. The gettimeofday() function's implementation will then be
changed to read the timestamp counter (TSC) from the processor, and use the
reading in conjunction with the timing info exported by the kernel to
calculate and return the time info in proper format. The TSC can be read
very efficiently from userspace (currently this is the fastest and highest
resolution timer available, beating HPET, PIT, RTC etc.). This will allow
applications to have a very fast and, more importantly, a higher-resolution
timer available to them. This will also pave the way for optionally making the
FreeBSD kernel tickless, which would help with efficiency and power
consumption (the processor will be able to sleep for longer durations
without having to service timer interrupts several hundred times a second).
Other operating systems (like OS X) already do this to varying extents.

There are several issues with this approach, however, and I plan to tackle
each of them so that there is no loss of functionality or accuracy, and
certainly no loss of performance. The project will be completed in stages,
one for each of these issues:

   - Implement the exporting of shared system-wide pages to be mapped into
   each process. (There has been some work done in this area: Avoiding
   syscall overhead<http://www.freebsd.org/projects/ideas/index.html#p-setproctitle>).
   This page will contain timing info.
   - Have the kernel read and export the information related to TSC during
   boot-up. This is heavily processor dependent and each processor (those from
   Intel/AMD) has its own peculiarities. The kernel should provide at least the
   TSC frequency by which the TSC read from userspace can be scaled to get
   nanosecond time. Wall time offset at boot-time should also be exported so
   TSC can be converted to wall time.
   - The TSC frequency might change on certain processors with non-constant
   TSC rate (because of SpeedStep, dynamic frequency scaling etc.). The only
   way to handle this is for the kernel to be notified every time the
   processor frequency changes. Every cpufreq driver will need to be updated
   to notify the kernel before and after a CPU frequency change. The TSC
   frequency will then need to be adjusted in the exported info. This does
   not apply to modern processors (Intel Core or higher and recent AMD
   processors, both of which have a constant TSC rate).
   - On multiprocessor systems, threads might bounce between different
   processors. There are two problems here: the TSCs of different cores could
   be offset relative to each other, and the TSC of each core could have a
   drifting frequency. The first issue is found on most multicore CPUs, and
   will be solved by measuring the offset at boot-time and exporting this info
   so that the TSC read by the user app can be corrected based on the core it's
   running on. The second issue only applies to AMD Athlon X2 during C1 state.
   This is solved by following AMD's recommendation: disable C1 clock ramping
   during bootup and suspend/resume by updating relevant info in the
   northbridge configuration.
   - In case we have some time left before completion of GSoC, one more
   thing can be added. Scaling the processor frequency up and down takes a
   finite amount of time (tens to hundreds of microseconds). During this time,
   the TSC frequency is undefined. Since we will be notified both before and
   after such a change (by the cpufreq drivers), an alternate source (like HPET
   or RTC) can be used to measure this duration and correct the TSC offset
   after the switch.
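
The scaling, per-core offset and frequency-change steps in the list above can
be sketched roughly as follows. This is only an illustration under my own
assumptions: the struct tsc_calib layout and the names ticks_to_ns(),
tsc_to_ns() and tsc_rebase() are hypothetical and do not exist in FreeBSD or
XNU. The raw TSC value is passed in as a parameter instead of being read with
the rdtsc instruction, and the window during which the frequency is undefined
is ignored.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical per-core calibration data the kernel would export. */
struct tsc_calib {
    uint64_t freq_hz;    /* current TSC frequency in Hz */
    int64_t  cpu_offset; /* ticks to add so this core matches the reference */
    uint64_t base_tsc;   /* corrected TSC value at the last rebase */
    uint64_t base_ns;    /* nanoseconds since boot at the last rebase */
};

/*
 * Convert a tick count to nanoseconds. Splitting into whole seconds and a
 * remainder avoids 64-bit overflow for frequencies below roughly 18 GHz.
 */
static uint64_t
ticks_to_ns(uint64_t ticks, uint64_t freq_hz)
{
    assert(freq_hz != 0);
    uint64_t secs = ticks / freq_hz;
    uint64_t rem = ticks % freq_hz;
    return secs * NSEC_PER_SEC + rem * NSEC_PER_SEC / freq_hz;
}

/* Nanoseconds since boot for a raw TSC reading taken on this core. */
static uint64_t
tsc_to_ns(const struct tsc_calib *c, uint64_t raw_tsc)
{
    /* Unsigned wrap-around makes a negative offset subtract correctly. */
    uint64_t corrected = raw_tsc + (uint64_t)c->cpu_offset;
    return c->base_ns + ticks_to_ns(corrected - c->base_tsc, c->freq_hz);
}

/*
 * After a frequency change: fold the time elapsed at the old rate into the
 * base, then switch to the new rate.
 */
static void
tsc_rebase(struct tsc_calib *c, uint64_t raw_tsc, uint64_t new_freq_hz)
{
    uint64_t corrected = raw_tsc + (uint64_t)c->cpu_offset;
    c->base_ns += ticks_to_ns(corrected - c->base_tsc, c->freq_hz);
    c->base_tsc = corrected;
    c->freq_hz = new_freq_hz;
}
```

Rebasing at each frequency change keeps the returned time continuous across
the change, since all ticks counted at the old rate are converted before the
new rate takes effect.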

Given all this is handled carefully, we will be able to use the TSC read-out
in one of two ways: (1) as an offset from the last-updated timestamp (updated
HZ times every second, on each timer interrupt), or (2) exclusively for
timing, with the timer interrupt disabled.

For now, the first approach will be used. This will avoid having to call
into the kernel to get the timing info, as well as provide finer resolution
timing. The second approach is an extension to allow for a tickless kernel
(not part of my proposal, but do-able in the future).

To summarize:

The kernel exports a shared page that is mapped read-only into each process.
This page is updated on each clock tick to contain the time. It also contains
the TSC frequency and other information, which is updated whenever this info
changes. The userspace implementation of gettimeofday() reads the timestamp
counter from the processor, and the scale, offset etc. from the shared page,
to convert the TSC delta to nanoseconds. This delta is then added to the
last-updated nanosecond time (also present in the shared page) and returned
to the application. The various peculiarities of each processor's TSC
implementation will be accounted for.
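
To make the summary concrete, here is a minimal sketch of how the userspace
side could read such a page. Everything in it is an assumption made for
illustration: struct shared_timeinfo, page_update() and user_gettimeofday()
are hypothetical names, not existing FreeBSD interfaces, and the generation
counter used to detect a concurrent kernel update is my own addition (the
text above does not specify how readers and the kernel would synchronize).
The TSC value is passed in as a parameter where real code would execute
rdtsc.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/time.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical layout of the exported page. */
struct shared_timeinfo {
    atomic_uint      gen;         /* odd while an update is in progress */
    _Atomic uint64_t tsc_freq_hz; /* current TSC frequency */
    _Atomic uint64_t last_tsc;    /* TSC value at the last clock tick */
    _Atomic uint64_t last_ns;     /* ns since the epoch at the last tick */
};

/* Stand-in for the kernel side: publish a new snapshot on each tick. */
static void
page_update(struct shared_timeinfo *p, uint64_t freq, uint64_t tsc,
    uint64_t ns)
{
    unsigned g = atomic_load_explicit(&p->gen, memory_order_relaxed);
    atomic_store_explicit(&p->gen, g + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&p->tsc_freq_hz, freq, memory_order_relaxed);
    atomic_store_explicit(&p->last_tsc, tsc, memory_order_relaxed);
    atomic_store_explicit(&p->last_ns, ns, memory_order_relaxed);
    atomic_store_explicit(&p->gen, g + 2, memory_order_release);
}

/* Userspace side: no system call, only reads of the shared page. */
static void
user_gettimeofday(struct shared_timeinfo *p, uint64_t tsc_now,
    struct timeval *tv)
{
    unsigned g0, g1;
    uint64_t freq, tsc0, ns0;

    /* Retry until a consistent snapshot is read (generation unchanged). */
    do {
        g0 = atomic_load_explicit(&p->gen, memory_order_acquire);
        freq = atomic_load_explicit(&p->tsc_freq_hz, memory_order_relaxed);
        tsc0 = atomic_load_explicit(&p->last_tsc, memory_order_relaxed);
        ns0 = atomic_load_explicit(&p->last_ns, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        g1 = atomic_load_explicit(&p->gen, memory_order_relaxed);
    } while ((g0 & 1u) != 0 || g0 != g1);

    assert(freq != 0);
    /* Scale the ticks since the last update and add them to the base. */
    uint64_t delta = tsc_now - tsc0;
    uint64_t ns = ns0 + (delta / freq) * NSEC_PER_SEC +
        (delta % freq) * NSEC_PER_SEC / freq;
    tv->tv_sec = (time_t)(ns / NSEC_PER_SEC);
    tv->tv_usec = (suseconds_t)(ns % NSEC_PER_SEC / 1000);
}
```

With a 2 GHz TSC, a reading taken 3,000,000,000 ticks after the last update
would come out 1.5 seconds later than the timestamp stored in the page.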

We will also need to make comprehensive benchmarks and tests to verify the
validity and performance benefits. I am not well versed in rigorous
benchmarking, so this part of the project would need additional thought.
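
As a possible starting point for the benchmarks, a simple micro-benchmark
could measure the average cost of one gettimeofday() call before and after
the change. The sketch below is only a rough idea: bench_gettimeofday() is a
hypothetical helper name, and a rigorous benchmark would also need warm-up
runs, repeated trials and real application workloads.

```c
#define _XOPEN_SOURCE 700

#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* Average cost of one gettimeofday() call in nanoseconds, over iters calls. */
static double
bench_gettimeofday(long iters)
{
    struct timespec start, end;
    struct timeval tv;

    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++)
        gettimeofday(&tv, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9 +
        (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iters;
}
```

Calling bench_gettimeofday(1000000) on the same machine with the old and the
new implementation would give a first per-call comparison.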

My qualifications / personal details:
I'm a 22 year old Indian male. I'm an undergrad in Electrical Engineering &
Computer Science at Jacobs University Bremen, Germany. I have years of
experience in C/C++ and varying job experiences ranging from web development
to human-computer interaction devices. I've taken courses in computer
architecture and operating systems. More details will be listed on my
application; for now I'll mention the experience most relevant to the task
at hand:

Since August 2008, I have started and completed a port of the Darwin XNU
kernel (used by OS X) for generic x86 PCs. (Webpage:
http://code.google.com/p/xnu-dev) Among other things, I added lots of
rtc/tsc improvements to Apple's implementation that deal with exactly the
same problems I have described above. All issues were solved, and the kernel
is being used in production on thousands of computers worldwide (including
the computer I'm typing this on!). Most of the code was written by me, with
support from a few other people, so I have a fair idea of the challenges and
their solutions. The tsc multicore synchronization was written independently
by two other people, so this is the part with which I'm least familiar. The
code is already implemented for XNU and it works well, so most of the work
would be porting it to BSD. Since I'm the author of most of it, and have
good contact with the other 3-4 people who contributed other parts, there
should be no licensing issues. I've also written a SpeedStep driver for OS X
(http://code.google.com/p/xnu-speedstep), which sends clock recalibration
signals to the kernel (I also made the relevant modifications in the kernel
for this to work).

What I still need to learn/plan
My experience with FreeBSD is somewhat limited. I have a DragonFly BSD based
home server (because FreeBSD didn't have drivers for its cheap ethernet
card). My kernel programming experience is also limited to the XNU kernel
(since about July last year), and I've helped fix a minor bug (typo in
ethernet driver PCI ID) in the DragonFly BSD kernel. But I'm a fast learner,
and given the very well commented and clear code in the FreeBSD kernel, I
should be up to speed pretty soon. Right now I've installed FreeBSD in a
virtual machine and am playing around with it. I will shortly try building
the kernel, maybe make small modifications, and figure out exactly which
parts of the kernel will need modifications. I've also been reading the
FreeBSD handbook, the "arch" book and the dev handbook.

Another big problem for me would be making the modifications to export the
shared page and map it into each process. My experience is mostly in
handling the tsc/rtc code, not in memory management, so this is something I
need to learn. Lastly, I'm not very well versed in making rigorous
benchmarks. I've done simple benchmarking during the xnu kernel
development, but these were limited to measuring clock ticks. A more
comprehensive test plan would include MySQL benchmarks and similar.

Thanks everyone for reading through this humongous email! :-) Discussion
commenceth —

Best,
Prashant Vaibhav

PS: I am out of town with limited connectivity so responses could be
somewhat slow. My aim however is to finalize and submit the application by
the end of the month.