Non-responsive 8.0-RC1

From: Peter Jeremy <peterjeremy_at_acm.org>
Date: Sun, 29 Nov 2009 08:22:26 +1100

My main server is running 8.0/amd64 from between RC1 and RC2 and I've
recently had a couple of long-duration hangs on it during which time
processes doing I/O stop responding.

The first time, it stopped responding for about 25 minutes and then
spontaneously corrected itself.  I was logged in remotely the whole
time and Ctrl-T was responding throughout (claiming the process was
'runnable').  I tried starting a second session - which got as far
as reporting the SSH banner I have configured and then did nothing.
The second time lasted about 5 minutes.

I can't find anything in any log files or dmesg.  'vmstat -m' output
looks sensible.  Unfortunately, I didn't have access to the console
on either occasion.

The system is a dual-core Athlon with the base OS (root/usr/var) on
UFS and the remaining filesystems on ZFS.  It's running SCHED_ULE.
It runs a pair of BOINC processes in the background.  The first time,
it should have been otherwise unused apart from a mairix (mail
indexing tool) process that I'd just started.  The second time, it
would have been running a buildkernel.

Based on it managing to report the ssh banner (which is stored in
/etc) but not getting to a shell prompt (my home directory is ZFS),
my initial suspicion was ZFS but it occurs to me that it could be
a priority-inversion problem with the BOINC processes.

Can anyone suggest where to go looking for a cause?

-- 
Peter Jeremy

Received on Sat Nov 28 2009 - 20:22:31 UTC
