Re: My problems with stability on -current

From: Alexander Motin <mav_at_FreeBSD.org> Date: Sat, 07 May 2011 12:43:16 +0300

Doug Barton wrote:
> On 05/05/2011 13:55, Alexander Motin wrote:
>> I see several possibly unrelated problems there:
>>   - crashes are always crashes. They should be debugged.
>>   - calcru going backwards could have the same roots as lost wall clock
>> time.
> 
> I think you're right about that. What usually happens when the load
> maxes out is that the system visibly freezes for a minute or 2, and when
> it comes back to life the log is flooded with calcru messages. If it
> stays up long enough after that the wall clock drift becomes noticeable.
> This is in spite of running ntpd.

These system freezes are very suspicious. Most timecounters need only a
few seconds to overflow, some even less, so a freeze of a few minutes
will easily overflow most of them. The freezes are therefore probably the
cause of the time problems, but the question now is what causes the
freezes. You should try to investigate what is going on during them: does
the system do anything, are any interrupts working (compare `vmstat -i`
just before and after), are there any interrupt storms, etc.?
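
A rough sketch of that before/after comparison, assuming the usual
three-column `vmstat -i` layout (source name, total, rate). The function
name `irq_delta` and the file names are only illustrative:

```shell
# Save one snapshot before and one after a freeze, for example:
#   vmstat -i > before.txt ; ...freeze... ; vmstat -i > after.txt
#   irq_delta before.txt after.txt

# irq_delta BEFORE AFTER
# Print how many interrupts each source accumulated between two snapshots.
# Assumes both files are non-empty and list the same sources.
irq_delta() {
    awk '
        $1 == "interrupt" { next }        # skip the header line
        {
            total = $(NF - 1)             # second-to-last column is the total
            name = $0
            # Strip the trailing "total rate" columns; names may contain spaces.
            sub(/[ \t]+[0-9]+[ \t]+[0-9]+[ \t]*$/, "", name)
            if (FNR == NR)
                before[name] = total      # still reading the first file
            else
                printf "%-30s %d\n", name, total - before[name]
        }
    ' "$1" "$2"
}
```

A timer source whose count barely moves across the freeze would point at
lost timer interrupts; one that jumps by millions would suggest an
interrupt storm.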

>> If there are some problems with timer interrupts, timecounters
>> could wrap unnoticed, which would cause random time jumps.
>>   - interactivity problems. I can't prove it is unrelated, but have no
>> real ideas now.
>>
>> I would start from most obvious problems. I need to know more about
>> crashes. As usual: how to trigger, stack backtraces, etc.
> 
> Triggering is easy, I can start a buildworld with -j2, and a build of
> ports/www/firefox with FORCE_MAKE_JOBS, and within 30 minutes the system
> will reboot. I posted a panic message relative to r220282, (-current
> archives, 4/4) but kib said it didn't make any sense. Usually I don't
> get a panic at all.

Could you point me to the thread?

>> As for the time problems, I would try to collect more data:
>>   - show `sysctl kern.eventtimer`, `sysctl kern.timecounter` and verbose
>> dmesg outputs;
> 
> http://people.freebsd.org/~dougb/dougb-current-r221566.txt
> 
>>   - what eventtimer is used now, and does it help to switch to another
>> one with the kern.eventtimer.timer sysctl?
> 
> When I was trying to track down the problems last summer I vaguely
> remember trying RTC, but eventually we realized that the real problem
> was throttling, so I stopped specifying RTC and let it go back to the
> default. What do you suggest I try?

As I see, you are now using HPET (chosen automatically). I would try
switching to the LAPIC. Just make sure to disable C-states, if you have
enabled them, to be sure that the LAPIC timer won't stop.
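
Roughly, the switch could look like the sketch below. Limiting C-states
through `hw.acpi.cpu.cx_lowest=C1` is my assumption about how they were
enabled on this machine (if rc.conf sets them, change it there too), and
the `run` helper is only a dry-run wrapper invented for illustration:

```shell
# Dry-run by default: commands are printed, not executed.
# Set RUN=1 (as root, on the FreeBSD box) to actually apply them.
run() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi
}

switch_to_lapic() {
    # Keep the CPUs out of deep C-states so the LAPIC timer keeps ticking.
    run sysctl hw.acpi.cpu.cx_lowest=C1
    # Select the LAPIC event timer instead of the automatically chosen HPET.
    run sysctl kern.eventtimer.timer=LAPIC
    # Show the available timers and the resulting selection.
    run sysctl kern.eventtimer.choice kern.eventtimer.timer
}
```

Calling `switch_to_lapic` prints the three commands for review, and
`RUN=1 switch_to_lapic` executes them.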

>>   - does the timer run in periodic or one-shot mode, and does it help to
>> switch to the other one?
> 
> How could I tell, and how would I switch?

`sysctl kern.eventtimer.periodic`. And read eventtimers(4) please.

>>   - if full CPU load makes time stop, try to track what is going on
>> with timer interrupts using `vmstat -i` and `systat -vm 1`. Under full
>> CPU load in one-shot mode you should have a stable timer interrupt rate
>> of about hz+stathz.
> 
> Ok, I'll do that tomorrow, tired now.
> 
>>   - if timer interrupts are not working well, you can build kernel with
>> options                KTR
>> options                ALQ
>> options                KTR_ALQ
>> options                KTR_COMPILE=(KTR_SPARE2)
>> options                KTR_ENTRIES=131072
>> options                KTR_MASK=(KTR_SPARE2)
>> to track event timer operation, and use ktrdump to save the trace when
>> the problem exists (preferably when it begins).
>>
>> And let's experiment with fresh CURRENT.
> 
> Done and done. I'm up to r221566, and I added those options to my kernel
> config. I ran ktrdump -cH -o ktrdumpfile and posted the results here:
> http://people.freebsd.org/~dougb/ktrdumpfile.txt  This was shortly after
> boot, with no load. Not sure if it helps, but there you go.

The dump looks fine, but I need a dump taken specifically at the time of
the problem. Since time probably can't be trusted here, it would be nice
to make the dump as localized as possible: clear the buffer with
`sysctl debug.ktr.clear=1`, trigger a freeze for a few seconds, stop
collecting with `sysctl debug.ktr.mask=0`, and do the dump.
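
As a sketch, the sequence can be split into two steps around the freeze.
The function names, the `run` dry-run helper and the default output file
name are made up for illustration; the ktrdump flags are the ones you
already used:

```shell
# Dry-run by default: commands are printed, not executed.
# Set RUN=1 (as root, on the FreeBSD box) to actually run them.
run() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi
}

# Step 1: empty the trace buffer right before provoking the freeze.
ktr_start() {
    run sysctl debug.ktr.clear=1
}

# Step 2: a few seconds into the freeze, stop tracing and save the buffer.
ktr_stop() {
    out=${1:-ktrdump-freeze.txt}   # hypothetical output file name
    run sysctl debug.ktr.mask=0
    run ktrdump -cH -o "$out"
}
```

Run `ktr_start`, start the load that triggers the freeze, then run
`ktr_stop` as soon as the system responds again.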

-- 
Alexander Motin