My problems with stability on -current

From: Doug Barton <dougb_at_dougbarton.us>
Date: Thu, 05 May 2011 00:36:54 -0700

This is long, sorry. I wish I could condense things down to just the 
answer, or even just the question, but here goes. I've used HEAD on my 
main workstation(s) for many years. It's common for there to be ups and 
downs, and that's fine. Lately, however, the problems have been debilitating.

First a timeline. Since sometime before January 2008 I've been using a 
Dell Latitude D620 laptop as my primary system. It has a Core 2 Duo 
running at 2.33 GHz and 2 GB of RAM. I quad-boot it with Windows XP, 
FreeBSD -current (amd64), another FreeBSD (usually 8.N-RELEASE i386), 
and Ubuntu. 
On the first and last I don't do a lot of compiling obviously, but even 
under heavy load on 8.2-RELEASE I'm not seeing problems, so the problems 
I _am_ seeing are not hardware related.

I keep my system very close to stock. My kernel config is GENERIC minus 
devices I don't have, and plus the following:

options         EXT2FS
options         IEEE80211_DEBUG # enable debug msgs
options         VESA
device          atapicam
device          sound
device          snd_hda
device          snp

I was building with clang for a while, but when the problems started I 
went back to gcc. I still have INVARIANTS on, but I disabled WITNESS 
because with all the known but unfixed LORs (lock order reversals) it's 
kind of pointless. There's nothing interesting in make.conf or src.conf 
either; the latter is just a list of stuff not to build, KERNCONF, and 
MODULES_OVERRIDE.

Around December 2009 I started having problems under load with 
-current. I often reported them; sometimes problems were found, 
sometimes not. In the course of trying to debug those problems I 
disabled throttling, which helped. Switching to SCHED_4BSD also helped 
quite a bit with interactivity under load, although it was still worse 
than on 8.x.

In October of 2010 I was lucky enough to receive a donation of a Dell 
Optiplex desktop that I started using as my primary workstation. Around 
that same time there was some work being done in the scheduler(s) and 
various related systems, and my desktop (which had a slightly faster 
Core 2 Duo and 4 GB of RAM) was running great. I assumed that the 
problems were solved.

Then 2 months ago I packed up the desktop system and pulled out the 
laptop again. I updated to the latest -current on the laptop, and all 
heck broke loose. I couldn't do anything on my laptop that created even 
a moderate load without it crashing. Trying to do something like a 
buildworld (even without -j) would cause the system to absolutely crawl. 
I'd get tons of the dreaded "calcru" messages about time going 
backwards, and the system clock would lose literally minutes of wall 
clock time. At one point when I could keep it up long enough to build 
the world without crashing it had lost 40 minutes of wall clock time 
when it finished. I think that specific problem happened sometime 
between March 15 and r220282.

In trying to find that problem, I uncovered another, deeper problem with 
the "one-shot timers" from r212541. To make my binary search for the 
problem described above easier, I was using a -current snapshot CD from 
August 2010 that I had lying around. With that snapshot I could easily 
build world with -j2, run X, and do normal desktop stuff (firefox, 
thunderbird, pidgin, etc.) all at the same time. When I got closer to 
the more recent -current, it would crash as soon as I put a load on it. 
I eventually bisected down to that exact commit. I've been running on 
r212540 for over a week now without any problems, including lots of port 
builds with FORCE_MAKE_JOBS, etc.
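
For anyone who wants to repeat this kind of search, here is a minimal 
sketch of bisecting a revision range. It is an illustration only, not 
the procedure actually used above. The lower bound 210000 is a made-up 
stand-in for the August 2010 snapshot, whose real revision is not given 
here, and simulated_test() is a hypothetical placeholder for "build and 
boot this revision, then put it under load". The sketch assumes that 
every revision after the first bad one is also bad.

```python
def first_bad_revision(good, bad, is_bad):
    """Return the first revision in (good, bad] for which is_bad() is true.

    Assumes `good` is known good, `bad` is known bad, and that once a
    revision is bad, all later revisions are bad too.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid      # failure reproduces: culprit is at or before mid
        else:
            good = mid     # no failure: culprit is after mid
    return bad


def simulated_test(rev):
    # Hypothetical stand-in for a real build-boot-and-load test.
    # It pretends the breakage was introduced in r212541.
    return rev >= 212541


steps = []


def counting_test(rev):
    # Record each revision tested so the number of builds can be counted.
    steps.append(rev)
    return simulated_test(rev)


# 210000 is an assumed "known good" revision; 220282 is the known-bad one.
culprit = first_bad_revision(210000, 220282, counting_test)
print("first bad revision: r%d after %d test builds" % (culprit, len(steps)))
```

Halving the range each time means roughly ten thousand revisions need 
about fourteen test builds at most, which is what makes the search 
practical even when each step is a full build and reboot.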

Alexander suggested some knobs to twist for the timers, and I'll be glad 
to do that once he gets back to me with more concrete suggestions now 
that he knows more about my specific problems.


Doug

-- 

	Nothin' ever doesn't change, but nothin' changes much.
			-- OK Go

	Breadth of IT experience, and depth of knowledge in the DNS.
	Yours for the right price.  :)  http://SupersetSolutions.com/
Received on Thu May 05 2011 - 06:06:26 UTC
