Re: qmail uses 100% cpu after FreeBSD-5.0 to 5.1 upgrade

From: Don Lewis <truckman_at_FreeBSD.org>
Date: Mon, 16 Jun 2003 01:17:35 -0700 (PDT)
On 16 Jun, I wrote:
> On 16 Jun, Tim Robbins wrote:

>>> This looks like a bug in the named pipe code. Reverting
>>> sys/fs/fifofs/fifo_vnops.c to the RELENG_5_0 version makes the problem go
>>> away. I haven't tracked down exactly what change between RELENG_5_0 and
>>> RELENG_5_1 caused the problem.
>> 
>> Looks like revision 1.86 works, but it stops working with 1.87. Moving the
>> soclose() calls to fifo_inactive() may have caused it.
> 
> This is an interesting observation, but I'm not sure why it would make a
> difference.  I haven't looked at the qmail source, but it looks like it
> is doing a non-blocking open on the fifo, calling select() on the fd,
> and hoping that select() waits for a writer to open the fifo before
> returning with an indication that the descriptor is readable.
> 
> It looks like the select code is calling the soreadable() macro to
> determine if the fifo descriptor is readable, and the soreadable() macro
> returns a true value if the SS_CANTRCVMORE socket flag is set, which
> would indicate an EOF condition.
> 
> I could believe that I accidentally changed the setting of this flag,
> but I just compared fifo_vnops.c rev 1.78 with 1.87, and I believe this
> flag should be set the same way in both versions.
> 
> In both versions, fifo_close() always calls socantrcvmore(), which sets
> SS_CANTRCVMORE when the writer count drops to zero.  Prior to 1.87,
> fifo_close() also destroyed the sockets when the reference count dropped
> to zero, which caused fifo_open() to recreate the sockets when the fifo
> was opened again, and when it did, fifo_open() set the SS_CANTRCVMORE
> flag again.
> 
> The posted qmail syscall trace looks like what I would expect to see in
> the present implementation.  I can't explain why it would behave any
> differently prior to 1.87 ...
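The lifecycle described in the quoted analysis can be summarized in a toy user-space model. This is a sketch under stated assumptions, not the kernel code: struct model_fifo, its fields and all model_* functions are hypothetical, model_readable() only captures the "data queued or SS_CANTRCVMORE set" part of soreadable() (the real macro may check more), and it assumes that a writer's open clears the EOF flag. The destroy_on_last_close switch selects between the pre-1.87 behavior (sockets destroyed when the last reference goes away) and the 1.87 behavior (sockets kept).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model of one fifo's read side. None of these names
 * exist in fifo_vnops.c; they only mirror the behavior described above. */
struct model_fifo {
    bool   has_sockets;           /* sockets currently allocated */
    bool   cantrcvmore;           /* stands in for SS_CANTRCVMORE */
    size_t queued;                /* bytes buffered but not yet read */
    int    readers, writers;
    bool   destroy_on_last_close; /* true: pre-1.87 model, false: 1.87 */
};

/* Readability in the spirit of soreadable(): data queued, or EOF flagged. */
bool model_readable(const struct model_fifo *f)
{
    assert(f->has_sockets);
    return f->queued > 0 || f->cantrcvmore;
}

void model_open(struct model_fifo *f, bool for_write)
{
    if (!f->has_sockets) {
        /* Freshly created sockets: nothing queued, no writer yet. */
        f->has_sockets = true;
        f->queued = 0;
        f->cantrcvmore = true;
    }
    if (for_write) {
        /* Assumption: a writer's open clears the EOF flag. */
        f->writers++;
        f->cantrcvmore = false;
    } else {
        f->readers++;
    }
}

void model_write(struct model_fifo *f, size_t nbytes)
{
    assert(f->writers > 0);
    f->queued += nbytes;
}

void model_close(struct model_fifo *f, bool for_write)
{
    if (for_write) {
        assert(f->writers > 0);
        if (--f->writers == 0)
            f->cantrcvmore = true;   /* what socantrcvmore() does */
    } else {
        assert(f->readers > 0);
        f->readers--;
    }
    if (f->destroy_on_last_close && f->readers == 0 && f->writers == 0) {
        f->has_sockets = false;      /* pre-1.87: sockets and data go away */
        f->queued = 0;
    }
}

/* One qmail-like cycle: a writer writes a byte that the reader never
 * drains, both ends close, then the reader reopens the fifo. */
void model_cycle(struct model_fifo *f)
{
    model_open(f, false);
    model_open(f, true);
    model_write(f, 1);
    model_close(f, true);
    model_close(f, false);
    model_open(f, false);
}
```

Running model_cycle() on both variants, the model reports the reopened fifo as readable either way, which is what the quoted analysis predicts. The one difference it captures is that the 1.87 variant still holds the unread byte after the reopen.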

The plot thickens ...

I ran this bit of code on both 5.1-current, with version 1.88 of
fifo_vnops.c, and 4.8-stable:

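#include <sys/types.h>
#include <sys/time.h>
#include <sys/select.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
        int fd, n;
        fd_set readfds;

        /* Open the existing fifo the way qmail appears to: read-only, non-blocking. */
        fd = open("myfifo", O_RDONLY | O_NONBLOCK);
        if (fd == -1) {
                perror("open myfifo");
                exit(1);
        }

        printf("before the loop\n");
        while (1) {
                FD_ZERO(&readfds);
                FD_SET(fd, &readfds);
                /* No timeout: block until select() reports fd readable. */
                n = select(fd + 1, &readfds, NULL, NULL, NULL);
                printf("%d %d\n", fd, n);
        }
        /* NOTREACHED */
}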
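For repeated experiments it may help to have a probe that cannot block forever. The sketch below is a variant of the test, not part of the original: open_fifo_reader() and fd_readable() are hypothetical helpers, and the 0600 mode is arbitrary. fd_readable() makes a single select() call with a bounded timeout and returns 1 if the descriptor is reported readable, 0 on timeout and -1 on error.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: create the fifo if needed, then open it the way
 * the test program does (read-only, non-blocking). */
int open_fifo_reader(const char *path)
{
    if (mkfifo(path, 0600) == -1 && errno != EEXIST)
        return -1;
    return open(path, O_RDONLY | O_NONBLOCK);
}

/* Hypothetical helper: one select() call with a bounded timeout instead
 * of NULL. Returns 1 if fd is reported readable, 0 on timeout, -1 on error. */
int fd_readable(int fd, long timeout_usec)
{
    fd_set readfds;
    struct timeval tv;
    int n;

    assert(fd >= 0 && fd < FD_SETSIZE);   /* FD_SET is undefined otherwise */
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_usec / 1000000;
    tv.tv_usec = timeout_usec % 1000000;

    n = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (n == -1)
        return -1;
    return FD_ISSET(fd, &readfds) ? 1 : 0;
}
```

Calling fd_readable(fd, 0) right after open_fifo_reader(), before any writer has opened the fifo, should distinguish the two behaviors described next: 4.8-stable would answer 1, while 5.1-current would answer 0.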

On 4.8-stable, select() immediately returns a "1", whether or not the
fifo has ever been opened for writing.

On 5.1-current, select() waits forever, even if the fifo has been opened
for writing by another process. select() returns only once something
has actually been written to the fifo, and since this process never
reads anything from the fifo, it then spins on select() forever.

If some data is getting written to the fifo, it doesn't look like qmail
consumes it. Since fifo_close() in 1.87 no longer destroys the sockets,
the data apparently hangs around in the fifo while neither end is open,
and qmail stumbles across it when it calls select() after reopening the
fifo.

Now there are two questions that I can't answer:

	Why is my analysis of select() and the SS_CANTRCVMORE flag
	incorrect for 5.1-current with version 1.87 or 1.88 of
	fifo_vnops.c?

	Why doesn't qmail get stuck in a similar loop on 4.8-stable,
	since select() always returns true for reading on a fifo with
	no writers?
Received on Sun Jun 15 2003 - 23:17:44 UTC
