Re: My problems with stability on -current

From: Doug Barton <dougb_at_FreeBSD.org>
Date: Sat, 07 May 2011 02:21:59 -0700
On 05/05/2011 13:55, Alexander Motin wrote:
> Doug Barton wrote:
>> Alexander suggested some knobs to twist for the timers, and I'll be glad
>> to do that once he gets back to me with more concrete suggestions now
>> that he knows more about my specific problems.
>
> OK, I am all here. While this post is indeed larger than the previous
> one, it is not much more informative. Sorry. :(

I understand.

> I see several possibly unrelated problems there:
>   - crashes are always crashes. They should be debugged.
>   - calcru going backwards could have the same roots as lost wall clock
> time.

I think you're right about that. What usually happens when the load
maxes out is that the system visibly freezes for a minute or two, and
when it comes back to life the log is flooded with calcru messages. If
it stays up long enough after that, the wall clock drift becomes
noticeable. This is in spite of running ntpd.
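
To put a rough number on how a freeze of that length compares with the
wrap period of a hardware counter, here is a minimal sketch. It assumes
an ACPI PM timer (3.579545 MHz, 24 or 32 bits wide) as the timecounter.
Which timecounter this machine actually uses is not stated here, and
the function names are made up for illustration.

```python
# Sketch: how often a free-running hardware counter wraps, and how many
# wraps would fit into a stall during which nobody samples it.
# Assumption: ACPI PM timer frequency; the machine in question may use
# a different timecounter entirely.

ACPI_PM_HZ = 3_579_545  # nominal ACPI PM timer frequency in Hz


def wrap_period_seconds(counter_bits, frequency_hz):
    """Seconds until a counter of the given width rolls over."""
    return (2 ** counter_bits) / frequency_hz


def wraps_during_stall(counter_bits, frequency_hz, stall_seconds):
    """Whole number of rollovers that fit into a stall of that length."""
    period = wrap_period_seconds(counter_bits, frequency_hz)
    return int(stall_seconds // period)


if __name__ == "__main__":
    for bits in (24, 32):
        period = wrap_period_seconds(bits, ACPI_PM_HZ)
        missed = wraps_during_stall(bits, ACPI_PM_HZ, 60)
        print(f"{bits}-bit: wraps every {period:.2f} s, "
              f"{missed} wrap(s) in a 60 s stall")
```

With these assumed numbers a 24-bit counter rolls over roughly every
4.7 seconds, so a one-minute gap in sampling would span several wraps,
while a 32-bit counter at the same frequency would not wrap at all in
that time.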

> If there are some problems with timer interrupts, timecounters
> could wrap unnoticed, which would cause random time jumps.
>   - interactivity problems. I can't prove it is unrelated, but have no
> real ideas now.
>
> I would start from most obvious problems. I need to know more about
> crashes. As usual: how to trigger, stack backtraces, etc.

Triggering is easy: I can start a buildworld with -j2, and a build of
ports/www/firefox with FORCE_MAKE_JOBS, and within 30 minutes the system
will reboot. I posted a panic message relative to r220282 (-current
archives, 4/4), but kib said it didn't make any sense. Usually I don't
get a panic at all.

> As for the time problems, I would try to collect more data:
>   - show `sysctl kern.eventtimer`, `sysctl kern.timecounter` and verbose
> dmesg outputs;

http://people.freebsd.org/~dougb/dougb-current-r221566.txt

>   - what eventtimer is used now and does it help to switch to another
> one with kern.eventtimer.timer sysctl?

When I was trying to track down the problems last summer I vaguely
remember trying RTC, but eventually we realized that the real problem
was throttling, so I stopped specifying RTC and let it go back to the
default. What do you suggest I try?

>   - does the timer run in periodic or one-shot mode and does it help to
> switch to the other one?

How could I tell, and how would I switch?

>   - if full CPU load makes time stop, try to track what is going on
> with timer interrupts using `vmstat -i` and `systat -vm 1`. Under full
> CPU load in one-shot mode you should see a stable timer interrupt rate
> of about hz+stathz.

Ok, I'll do that tomorrow, tired now.
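
The interrupt-rate check suggested above can be sketched as follows. It
is only an illustration, and it assumes `vmstat -i` prints one line per
interrupt source ending in a total count and a rate column. The helper
names and the 10% tolerance are made up, and hz and stathz have to be
supplied by hand (for example from `sysctl kern.clockrate`).

```python
# Sketch: compare two `vmstat -i` snapshots taken interval_s seconds
# apart and check each timer interrupt's rate against hz + stathz.
# Assumption: every data line ends with "<total> <rate>".


def parse_vmstat_i(text):
    """Map interrupt name -> total count from `vmstat -i`-style text."""
    totals = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and anything not ending in two integers.
        if len(fields) < 3:
            continue
        if not (fields[-1].isdigit() and fields[-2].isdigit()):
            continue
        name = " ".join(fields[:-2])
        totals[name] = int(fields[-2])
    return totals


def timer_rates(before, after, interval_s):
    """Interrupts per second for each source whose name contains 'timer'."""
    first = parse_vmstat_i(before)
    second = parse_vmstat_i(after)
    rates = {}
    for name, total in second.items():
        if "timer" in name and name in first:
            rates[name] = (total - first[name]) / interval_s
    return rates


def looks_stable(rate, hz, stathz, tolerance=0.1):
    """True if rate is within the given fraction of hz + stathz."""
    expected = hz + stathz
    return abs(rate - expected) <= tolerance * expected
```

Taking one snapshot before the load starts and another while it is
running, then passing both to timer_rates(), would show whether any
timer source falls well short of the expected rate during the freeze.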

>   - if timer interrupts are not working well, you can build a kernel with
> options                KTR
> options                ALQ
> options                KTR_ALQ
> options                KTR_COMPILE=(KTR_SPARE2)
> options                KTR_ENTRIES=131072
> options                KTR_MASK=(KTR_SPARE2)
> to track event timer operation, and use ktrdump to save the trace when
> the problem exists (preferably when it begins).
>
> And let's experiment with fresh CURRENT.

Done and done. I'm up to r221566, and I added those options to my kernel 
config. I ran ktrdump -cH -o ktrdumpfile and posted the results here: 
http://people.freebsd.org/~dougb/ktrdumpfile.txt  This was shortly after 
boot, with no load. Not sure if it helps, but there you go.


Thanks again for your help,

Doug

-- 

	Nothin' ever doesn't change, but nothin' changes much.
			-- OK Go

	Breadth of IT experience, and depth of knowledge in the DNS.
	Yours for the right price.  :)  http://SupersetSolutions.com/
Received on Sat May 07 2011 - 07:22:02 UTC
