Re: calcru: negative time ... followed by freeze

From: Don Lewis <truckman_at_FreeBSD.org> Date: Mon, 21 Jun 2004 01:01:20 -0700 (PDT) · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:37:58 UTC

On 21 Jun, Bruce Evans wrote:
> On Sun, 20 Jun 2004, Don Lewis wrote:
> 
>> On 20 Jun, Robert Watson wrote:
>> > FYI, this is a Xeon box with two physical processors, each with two
>> > logical processors, and the problem could well be SMP-related.
>>
>> The hang might be SMP-related, but I just started getting the calcru
>> messages on my UP Athlon XP box.  I just upgraded to today's
>> -CURRENT from a week-old version, and my console got spammed by a bunch
>> of these messages while I was running portupgrade.
>>
>> calcru: negative time of 0 usec for pid 21261 (sh)
>> calcru: negative time of 0 usec for pid 21261 (sh)
>> ...
>> calcru: negative time of 3917 usec for pid 49504 (sh)
>> calcru: negative time of 3917 usec for pid 49504 (sh)
> 
> Hmm, these are nonnegative times.
> 
> The extra messages in -current might be caused by fill_kinfo() now calling
> calcru().
> 
> Misreporting of negative times is by the following too-simple dianostic:
> 
> % 	tu = (u_int64_t)tv.tv_sec * 1000000 + tv.tv_usec;
> % 	ptu = p->p_uu + p->p_su + p->p_iu;
> % 	if (tu < ptu || (int64_t)tu < 0) {
> % 		printf("calcru: negative time of %jd usec for pid %d (%s)\n",
> % 		    (intmax_t)tu, p->p_pid, p->p_comm);
> % 		tu = ptu;
> % 	}
> 
> The message is also printed for the tu < ptu case, which is what you are
> getting.
> 
> I fixed the messages when I got a lot of them due to a local bug.  The local
> bug was from double rounding for calcru() on child times (which -current
> doesn't do).  I can't see how this could be the problem in -current, since
> the components of ptu are rounded down and there is a KASSERT that they
> added up to less than tu in the previous call.
> 
> Ah, here is a likely cause of the bug in -current:
> 
> % 	if (p == curthread->td_proc) {
> % 		/*
> % 		 * Adjust for the current time slice.  This is actually fairly
> % 		 * important since the error here is on the order of a time
> % 		 * quantum, which is much greater than the sampling error.
> % 		 * XXXKSE use a different test due to threads on other
> % 		 * processors also being 'current'.
> % 		 */
> % 		binuptime(&bt);
> % 		bintime_sub(&bt, PCPU_PTR(switchtime));
> % 		bintime_add(&bt, &p->p_runtime);
> % 	} else
> % 		bt = p->p_runtime;
> 
> The XXXKSE comment is correct that this might be broken.  If the (p
> != curthread->td_proc) case happens at all for a running process, then
> it gives a wrong (out of date) timestamp in bt.  This wrongness will
> be detected if calcru() is was called called earlier in the current
> timeslice and took the other path here.
> 
> The recent change to fill_kinfo() is quite likely to trigger detection
> of this bug.  fill_kinfo() is often used to iterate over all processes
> for ps, so it will call calcru() with (p != curthread->td_proc) for
> all processes other than the one running it, and give a bt that is out
> of date for all such processes that are actually running.  Since there
> can be at most one running process per CPU, this bug only affects SMP.
> 
> The call to calcru() from ttyinfo() may be the only other trigger.
> ttyinfo() picks a process and should rarely or never pick the ithread
> running it, so it will almost always take the (p != curthread->td_proc)
> path.  Again, this is only a problem for the SMP case since in the !SMP
> case the picked process must have been switched away from to run the
> ithread, so it cannot be running.

There must be some !SMP trigger for this as well.  I just checked and I
was able to trigger this on my Pentium-M laptop as well by leaning on
the ^T key while I was logged on via ssh and running 'portupgrade -aP'.

Jun 21 00:41:31 hairball kernel: calcru: negative time of 23169 usec for pid 44653 (sh)
Jun 21 00:41:32 hairball kernel: calcru: negative time of 21990 usec for pid 44665 (sh)

I didn't use ^T on my Athlon box.  I might have had top running, though.

It's interesting that this bug only seems to get triggered on /bin/sh.
Maybe it is fork()/exit()/wait() related?