Re: Process accounting/timing has broken recently

From: John Baldwin <jhb_at_freebsd.org>
Date: Tue, 7 Dec 2010 09:13:17 -0500
On Monday, December 06, 2010 7:11:28 pm David Xu wrote:
> John Baldwin wrote:
> > On Sunday, December 05, 2010 6:18:29 pm Steve Kargl wrote:
> >   
> >> Sometime in the last 7-10 days, some one made a
> >> change that has broken process accounting/timing.
> >>
> >> laptop:kargl[42] foreach i ( 0 1 2 3 4 5 6 7 8 9 )
> >> foreach? time ./testf
> >> foreach? end
> >> Max ULP: 0.501607 for x in [-18.000000:88.709999] with dx = 1.067100e-04
> >>        69.55 real        38.39 user        30.94 sys
> >> Max ULP: 0.501607 for x in [-18.000000:88.709999] with dx = 1.067100e-04
> >>        68.82 real        40.95 user        27.60 sys
> >> Max ULP: 0.501607 for x in [-18.000000:88.709999] with dx = 1.067100e-04
> >>        69.14 real        38.90 user        30.02 sys
> >> Max ULP: 0.501607 for x in [-18.000000:88.709999] with dx = 1.067100e-04
> >>        68.79 real        40.59 user        27.99 sys
> >> Max ULP: 0.501607 for x in [-18.000000:88.709999] with dx = 1.067100e-04
> >>        68.93 real        39.76 user        28.96 sys
> >> Max ULP: 0.501607 for x in [-18.000000:88.709999] with dx = 1.067100e-04
> >>        68.71 real        41.21 user        27.29 sys
> >> Max ULP: 0.501607 for x in [-18.000000:88.709999] with dx = 1.067100e-04
> >>        69.05 real        39.68 user        29.15 sys
> >> Max ULP: 0.501607 for x in [-18.000000:88.709999] with dx = 1.067100e-04
> >>        68.99 real        39.98 user        28.80 sys
> >> Max ULP: 0.501607 for x in [-18.000000:88.709999] with dx = 1.067100e-04
> >>        69.02 real        39.64 user        29.16 sys
> >> Max ULP: 0.501607 for x in [-18.000000:88.709999] with dx = 1.067100e-04
> >>        69.38 real        37.49 user        31.67 sys
> >>
> >> testf is a numerically intensive program that tests the
> >> accuracy of expf() in a tight loop.  User time varies
> >> by ~3 seconds on my lightly loaded 2 GHz core2 duo processor.
> >> I'm fairly certain that the code does not suddenly grow/loose
> >> 6 GFLOP of operations.
> >>     
> >
> > The user/sys thing is a hack (and has been).  We sample the PC at stathz (~128 
> > hz) to figure out a user vs sys split and use that to divide up the total 
> > runtime (which actually is fairly accurate).  All you need is for the clock 
> > ticks to fire just a bit differently between runs to get a swing in user vs 
> > system time.
> >
> > What I would like is to keep separate raw bintime's for user vs system time in 
> > the raw data instead, but that would involve checking the CPU ticker more 
> > often (e.g. twice for each syscall, interrupt, and trap in addition to the 
> > current once per context switch).  So far folks seem to be more worried about 
> > the extra overhead rather than the loss of accuracy.
> >
> >   
> Adding any instruction into global syscall path should be cautioned, it
> is worse then before, thinking about a threaded application, a userland
> thread may have locked a mutex and calls a system call, the overhead
> added to system call path can directly affect a threaded application's
> performance now, because the time window the mutex is held
> is longer than before, I have seen some people likes to fiddle with
> system call path, it should be cautioned.

OTOH, the current getrusage(2) stats cannot be trusted.  The only meaningful
thing you can do is to sum them since the total is known to be accurate at
least.

If it wouldn't make things so messy I'd consider a new kernel option
'ACCURATE_RUSAGE' or some such.

-- 
John Baldwin
Received on Tue Dec 07 2010 - 13:39:56 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:10 UTC