New symptom, today (still running r221566) I compiled a small port, that
worked without any freezes or interactivity problems. Then I tried
compiling a larger port (java/openjdk6 if anyone cares) and still no
interactivity problems, but I got the "system wedge requiring power
cycle" problem I was seeing previously that I tracked to the one-shot
timer update.

More below.

On 05/07/2011 02:43, Alexander Motin wrote:
> Doug Barton wrote:
>> On 05/05/2011 13:55, Alexander Motin wrote:
>>> I see several possibly unrelated problems there:
>>> - crashes are always crashes. They should be debugged.
>>> - calcru going backwards could have the same roots as lost wall clock
>>> time.
>>
>> I think you're right about that. What usually happens when the load
>> maxes out is that the system visibly freezes for a minute or 2, and when
>> it comes back to life the log is flooded with calcru messages. If it
>> stays up long enough after that the wall clock drift becomes noticeable.
>> This is in spite of running ntpd.
>
> These system freezes are very suspicious. Most time counters need only
> few seconds to overflow, some even less. So freeze for few minutes will
> easily overflow most of them. So the freezes are probably the cause of
> time problems, but the question now is what the cause of freezes. You
> should try to investigate what is going on during freezes. Does the
> system do anything, are there any interrupts working (`vmstat -i` just
> before and after), are there any interrupt storms, etc?

Here is the output on a mostly-idle system, shortly after reboot:

vmstat -i
interrupt                          total       rate
irq1: atkbd0                        1784          0
irq9: acpi0                            1          0
irq14: ata0                       213355         89
irq15: ata1                           58          0
irq17: wpi0                        74331         31
irq20: hpet0 uhci0+               787767        331
irq22: uhci2                       21453          9
irq256: hdac0                         11          0
Total                            1098760        462

At a more opportune time I'll try crashing it again and get another
result.

>>> If there are some problems with timer interrupts, timecounters
>>> could wrap unnoticed that will cause random time jumps.
>>> - interactivity problems. I can't prove it is unrelated, but have no
>>> real ideas now.
>>>
>>> I would start from most obvious problems. I need to know more about
>>> crashes. As usual: how to trigger, stack backtraces, etc.
>>
>> Triggering is easy, I can start a buildworld with -j2, and a build of
>> ports/www/firefox with FORCE_MAKE_JOBS, and within 30 minutes the system
>> will reboot. I posted a panic message relative to r220282, (-current
>> archives, 4/4) but kib said it didn't make any sense. Usually I don't
>> get a panic at all.
>
> Could you hint me the thread?

Go to http://www.FreeBSD.org/
Click 'mailing lists'
Click 'listed in the FreeBSD Handbook.'
Click freebsd-current
Click freebsd-current Archives
Click April 2011
search for r220282

Voila! :)

>>> What's about time problems, I would try to collect more data:
>>> - show `sysctl kern.eventtimer`, `sysctl kern.timecounter` and verbose
>>> dmesg outputs;
>>
>> http://people.freebsd.org/~dougb/dougb-current-r221566.txt
>>
>>> - what eventtimer is used now and does it helps to switch to another
>>> one with kern.eventtimer.timer sysctl?
>>
>> When I was trying to track down the problems last summer I vaguely
>> remember trying RTC, but eventually we realized that the real problem
>> was throttling, so I stopped specifying RTC and let it go back to the
>> default. What do you suggest I try?
>
> As I see, now you are using HPET (chosen automatically). I would try
> switch to the LAPIC. Just make sure to disable C-states if you are
> enabled them to be sure that LAPIC timer won't stop.

Ok, so kern.eventtimer.timer="LAPIC" in /boot/loader.conf should do
that, right? I don't use C-states (in part as a result of previous
investigation) but I do use powerd as such:

powerd_flags="-a adaptive -b adaptive -n adaptive"

>>> - does the timer runs in periodic or one-shot mode and does it helps to
>>> switch to another one?
>>
>> How could I tell, and how would I switch?
>
> `sysctl kern.eventtimer.periodic`.
kern.eventtimer.periodic: 0

> And read eventtimers(4) please.

I did that, but I don't see anything in there as to which choice is
one-shot, and how to change to periodic. I assume 0 is the default,
which I also assume is one-shot. Does setting that to 1 change to
periodic? Also, can I safely do this while the system is running, or
should it be in /boot/loader.conf as well?

>>> - if full CPU load makes time to stop, try to track what is going on
>>> with timer interrupts using `vmstat -i` and `systat -vm 1`. Under full
>>> CPU load in one-shot mode you should have stable timer interrupt rate
>>> about hz+stathz.
>>
>> Ok, I'll do that tomorrow, tired now.
>>
>>> - if timer interrupts are not working well, you can build kernel with
>>> options KTR
>>> options ALQ
>>> options KTR_ALQ
>>> options KTR_COMPILE=(KTR_SPARE2)
>>> options KTR_ENTRIES=131072
>>> options KTR_MASK=(KTR_SPARE2)
>>> to track event timers operation and use ktrdump to save the trace when
>>> problem exist (preferably when it begins).
>>>
>>> And let's experiment with fresh CURRENT.
>>
>> Done and done. I'm up to r221566, and I added those options to my kernel
>> config. I ran ktrdump -cH -o ktrdumpfile and posted the results here:
>> http://people.freebsd.org/~dougb/ktrdumpfile.txt This was shortly after
>> boot, with no load. Not sure if it helps, but there you go.
>
> Dump looks fine, but I need dump specifically for the time of the
> problem. As soon as time probably can't be trusted here, it would be
> nice to make dump as localized as possible: clear buffer with `sysctl
> debug.ktr.clear=1`, trigger freeze for few seconds, stop collecting with
> `sysctl debug.ktr.mask=0` and do the dump.

Ok, I'll give that a try after work.


Thanks,

Doug

-- 

	Nothin' ever doesn't change, but nothin' changes much.
			-- OK Go

	Breadth of IT experience, and depth of knowledge in the DNS.
	Yours for the right price.  :)  http://SupersetSolutions.com/

Received on Tue May 10 2011 - 00:13:33 UTC
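
For the "`vmstat -i` just before and after" comparison requested in the thread, here is a minimal sketch of a helper that diffs two saved snapshots. It assumes each snapshot is the plain output of `vmstat -i` in the three-column layout shown in this message; the function names and the two-file command line are made up for illustration and are not part of any FreeBSD tool.

```python
import sys


def parse_vmstat_i(text):
    """Parse `vmstat -i` output into {interrupt name: total count}."""
    totals = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows end in two integers (total, rate); this skips the
        # "interrupt total rate" header and any blank lines.
        if len(fields) < 3 or not (fields[-1].isdigit() and fields[-2].isdigit()):
            continue
        # Everything before the two numbers is the name, e.g. "irq20: hpet0 uhci0+".
        totals[" ".join(fields[:-2])] = int(fields[-2])
    return totals


def interrupt_deltas(before_text, after_text):
    """Return {name: interrupts counted between the two snapshots}."""
    before = parse_vmstat_i(before_text)
    after = parse_vmstat_i(after_text)
    # A source that only appears in the second snapshot is counted from zero.
    return {name: after[name] - before.get(name, 0) for name in after}


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python vmstat_diff.py before.txt after.txt
    with open(sys.argv[1]) as f:
        before_text = f.read()
    with open(sys.argv[2]) as f:
        after_text = f.read()
    deltas = interrupt_deltas(before_text, after_text)
    for name, delta in sorted(deltas.items(), key=lambda item: -item[1]):
        print(f"{name:<32} {delta:>12}")
```

Dividing a delta by the seconds elapsed between the two snapshots gives a rate that can be compared with the "about hz+stathz" figure mentioned above. If wall-clock time itself is suspect during a freeze, the raw deltas are probably the more trustworthy number.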