On 9 Nov, Bruce Evans wrote: > On Sat, 8 Nov 2003, Don Lewis wrote: > >> I've been seeing some wierd things for many months when using a serial >> console on my -CURRENT box. I finally had a chance to take a closer >> look today. >> >> It looks like the problem is some sort of interference between kernel >> output to the console and userland writes to /dev/console. I typically >> see syslogd output to the console get corrupted. Each message that >> syslogd writes seems to get truncated or otherwise corrupted. The most >> common thing I see is that each syslog message is reduced to a space and >> the first character of the month, or sometimes just a space, or >> sometimes nothing at all. > > This is (at least primarily) a longstanding bug in ttymsg(). It uses > nonblocking mode so that it doesn't block in write() or close(). For > the same reason, it doesn't wait for output to drain before close(). > If the close happens to be the last one on the device, this causes any > data buffered in the tty and lower software layers to be discarded > cleanly and any data in lower hardware layers to by discarded in a > driver plus hardware-dependent way (usually not so cleanly, especially > for the character being transmitted). I didn't think of a flush on close problem because I thought syslogd always kept the console open. >> This is totally consistent until I "kill >> -HUP" syslogd, which I believe causes syslogd to close and open >> /dev/console, after which the syslog output appears correct on the >> console. When the syslogd output is being corrupted, I can cat a file to >> /dev/console and the output appears to be correct. > > When I debugged this, syslogd didn't seem to keep the console open, > so the open()/close() in ttymsg() always caused the problem. I didn't > notice killing syslogd makes a difference. Perhaps it helps due to a > missing close. Holding the console open may be a workaround or even > the correct fix. It's not clear where this should be done (should all > clients of ttymsg() do it?). Running getty on the console or on the > underlying tty device should do it accidentally. It looks to me like syslogd keeps the console open in addition to the open()/close() in ttymsg(). cfline() calls open() on anything that begins with '/' and calls isatty() to figure out whether it should set the type to F_CONSOLE, F_TTY, or F_FILE, and init() closes the file descriptor for all of these when syslogd is HUPed. I wonder if the console descriptor is getting revoked ... >> I truss'ed syslogd, and it appears to be working normally, the writev() >> call that writes the data to the console appears to be writing the >> correct character count, so it would appear that the fault is in the >> kernel. > > If there are any kernel bugs in this area, then they would be that > last close of the console affects the underlying tty. The multiple > console changes are quite likely to have broken this if getty is run > on the underlying tty (they silently discarded the half-close of the > underlying tty which was needed to avoided trashing some of its state > when only the console is closed). I'm not running getty on my serial console. It is running on ttyv*. I'm only using the serial console to capture kernel stack traces, etc. >> The problem doesn't appear to be specific to syslogd, because I have >> seen the output from the shutdown scripts that goes to the console get >> truncated as well. > > Yes, in theory it should affect anything that uses ttymsg() or does > direct non-blocking writes without waiting for the output to drain. > Here are some half-baked fixes. The part that clears O_NONBLOCK is > wrong, and the usleep() part is obviously a hack. ttymsg() shouldn't > block even in close(), since if the close is in the parent ttymsg() > might block forever and if the close() is in a forked child then > blocking could create zillions of blocked children. > > Another part of the patch is concerned with limiting forked children. > If I were happy with that part then blocking would not be so bad. In > practice, I don't have enough system activity for blocked children to > be a problem. To see the problem with blocked children, do something > like the following: > - turn off clocal on the console so that the console can block better. > For sio consoles this often requires turning it off in the lock-state > device, since the driver defends against this foot shooting by locking > it on. > - hold the console open or otherwise avoid the original bug in this > thread, else messages will just be discarded in close() faster than > they can pile up. > - turn off your external console device or otherwise drop carrier. > - send lots of messages. I ran into this years ago when I was using actual terminals for console devices and someone typed a ^S and went away. The failure mode wasn't pretty. It was probably in the days when I was managing Masscomp boxes, but it might have been the first batch of Sun servers. > > For a non-half-baked fix, do somethng like: > - never block in ttymsg(), but always wait for output to drain using > tcdrain() in a single child process. It's probably acceptable for > this to not report errors to ttymsg()'s caller. > - limit children better. I think we now fork children iff writev() > returns EWOULDBLOCK and this happens mainly when the tty buffers > fill up due to clocal being off and the external console not > listening. Handling this right seems to require handing off the > messages to a single child process that can buffer the messages > in userland and can block writing and draining them. Blocked write()s > and tcdrain()s are easy enough to handle in a specialized process by > sending signals to abort them. I like this idea better. The userland process could be smart to drop entire lines of output if things started getting too backlogged.Received on Sun Nov 09 2003 - 01:30:25 UTC
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:37:28 UTC