A couple of days ago, a large company here in Adelaide installed 5.2.1-RC1 on their main Internet gateway :-( They're having problems, which I'm looking at in conjunction with Andrew Rutherford (copied). Looking for the problems is complicated by the fact that the machine has to be kept running. I'm posting what we've seen so far, in case this rings bells with anybody else.

The main symptom is that round midday, when they get some large mail messages in and the system load rises, natd stops working. They can sometimes get it restarted by reloading the firewall rules, but sometimes they have to restart the processes. This causes massive loss of connections, of course.

At the same time they were getting console messages saying "backtrace". The backtrace itself was going into the KVM, of course, in an unattended server room. We sent somebody in yesterday lunchtime, and he reported:

  backtrace(c08e24f8,2,cb89fd08,0,22) at backtrace+0x17
  getdirtybuf(d48c2bbc,0,1,cb89fd08,1) at getdirtybuf+0x30
  flush_inodedep_deps(c6926000,13c78d,d48c2c10,c062f993,d48c2c40) at flush_inodedep_deps+0xa3
  softdep_sync_metadata(d48c2a8,0,c086875d,124,0) at softdep_sync_metadata+0x87
  ffs_fsync(d48c2ca8,0,c085aa45,beb,0) at ffs_fsync+0x3b9
  fsync(c74f9c80,<at this another very similar instance scrolled up - continuing from the same point in the new instance>d48cd14,c086f793,3ee,1) at fsync+0x1d
  syscall(2f,2f,2f,80b59e1)at syscall+0x2a0
  <several more flash past here>
  Xint0x80_syscall()at Xint0x80_syscall+0x1d
  ...syscall(95) eip=0x282909af,esp=0xbfbfb2cc,ebp=0xbfbfbba8

This is handwritten, but it points about as far from ipfw and natd as you could imagine. Looking at the code, it's really trying to tell us that one of the buffer headers in the bpp parameter to getdirtybuf() has a null vp:

	/*
	 * XXX This code and the code that calls it need to be reviewed to
	 * verify its use of the vnode interlock.
	 */
	for (;;) {
		if ((bp = *bpp) == NULL)
			return (0);
		if (bp->b_vp == NULL)
			backtrace();

Given the comment, it looks like the vnode interlock is currently not being used correctly.

Based on the fact that this happens when big mail messages are being received, we've guessed that the file system in question is /var, and we've turned off soft updates there. We're both out of town effectively for the rest of the week, and we'll continue looking after that, but if anybody has any thoughts, we'd be grateful.

Greg
--
See complete headers for address and phone numbers.