Soft updates problems with 5.2.1-RC1

From: Greg 'groggy' Lehey <grog_at_FreeBSD.org>
Date: Wed, 18 Feb 2004 10:36:11 +1030

A couple of days ago, a large company here in Adelaide installed
5.2.1-RC1 on their main Internet gateway :-(  They're having problems,
which I'm looking at in conjunction with Andrew Rutherford (copied).
Looking for the problems is complicated by the fact that the machine
has to be kept running.

I'm posting what we've seen so far, in case this rings bells with
anybody else.

The main symptom is that round midday, when they get some large mail
messages in and the system load rises, natd stops working.  They can
sometimes get it restarted by reloading the firewall rules, but
sometimes they have to restart the processes.  This causes massive
loss of connections, of course.

At the same time they were getting console messages saying
"backtrace".  The backtrace itself was going into the KVM, of course,
in an unattended server room.  We sent somebody in yesterday
lunchtime, and he reported:

backtrace(c08e24f8,2,cb89fd08,0,22) at backtrace+0x17
getdirtybuf(d48c2bbc,0,1,cb89fd08,1) at getdirtybuf+0x30
flush_inodedep_deps(c6926000,13c78d,d48c2c10,c062f993,d48c2c40) at flush_inodedep_deps+0xa3
softdep_sync_metadata(d48c2a8,0,c086875d,124,0) at softdep_sync_metadata+0x87
ffs_fsync(d48c2ca8,0,c085aa45,beb,0) at ffs_fsync+0x3b9
fsync(c74f9c80,<at this another very similar instance scrolled up -
continuing from the same point in the new
instance>d48cd14,c086f793,3ee,1) at fsync+0x1d
syscall(2f,2f,2f,80b59e1) at syscall+0x2a0<several more flash past here>
Xint0x80_syscall() at Xint0x80_syscall+0x1d
...syscall(95) eip=0x282909af,esp=0xbfbfb2cc,ebp=0xbfbfbba8

This is handwritten, but it points about as far from ipfw and natd as
you could imagine.  Looking at the code, it's really trying to tell us
that one of the buffer headers passed in the bpp parameter to
getdirtybuf() has a null b_vp:

	/*
	 * XXX This code and the code that calls it need to be reviewed to
	 * verify its use of the vnode interlock.
	 */

	for (;;) {
		if ((bp = *bpp) == NULL)
			return (0);
		if (bp->b_vp == NULL)
			backtrace();

Given the comment, it looks like the vnode interlock is currently not
being used correctly.
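
For anybody who wants to see what that check does without a kernel
tree to hand, here is a minimal user-space sketch.  It is not the
kernel code: struct vnode, struct buf, check_dirtybuf() and
report_null_vp() are hypothetical stand-ins invented for illustration,
and the real getdirtybuf() goes on to do considerably more after this
point.  It only mirrors the two tests quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel's struct vnode and struct buf. */
struct vnode {
	int v_id;
};

struct buf {
	struct vnode *b_vp;	/* owning vnode; NULL is the suspicious case */
};

/* Counts how often a buffer without a vnode was seen. */
int null_vp_reports;

/* Stand-in for the kernel's backtrace(): log the event and count it. */
void
report_null_vp(const struct buf *bp)
{
	null_vp_reports++;
	fprintf(stderr, "buffer %p has a null b_vp\n", (const void *)bp);
}

/*
 * Mirrors the first two tests in the quoted loop body.
 * Returns 0 if *bpp is NULL (nothing to do), 1 otherwise.
 * A buffer with a null b_vp is reported but still counts as present,
 * just as the quoted code calls backtrace() and carries on.
 */
int
check_dirtybuf(struct buf **bpp)
{
	struct buf *bp;

	assert(bpp != NULL);
	if ((bp = *bpp) == NULL)
		return (0);
	if (bp->b_vp == NULL)
		report_null_vp(bp);
	return (1);
}
```

Calling check_dirtybuf() with a pointer to a buffer whose b_vp is NULL
bumps null_vp_reports and prints one line to stderr, which is the
user-space analogue of the "backtrace" messages on the console.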

Based on the fact that this happens when big mail messages are being
received, we've guessed that the file system in question is /var, and
we've turned off soft updates there.  We're both effectively out of
town for the rest of the week, and we'll continue looking after
that, but if anybody has any thoughts, we'd be grateful.

Greg
--
See complete headers for address and phone numbers.