Re: serial console oddity

From: Don Lewis <truckman_at_FreeBSD.org>
Date: Sun, 9 Nov 2003 02:30:16 -0800 (PST)
On  9 Nov, Bruce Evans wrote:
> On Sat, 8 Nov 2003, Don Lewis wrote:
> 
>> I've been seeing some wierd things for many months when using a serial
>> console on my -CURRENT box.  I finally had a chance to take a closer
>> look today.
>>
>> It looks like the problem is some sort of interference between kernel
>> output to the console and userland writes to /dev/console.  I typically
>> see syslogd output to the console get corrupted.  Each message that
>> syslogd writes seems to get truncated or otherwise corrupted.  The most
>> common thing I see is that each syslog message is reduced to a space and
>> the first character of the month, or sometimes just a space, or
>> sometimes nothing at all.
> 
> This is (at least primarily) a longstanding bug in ttymsg().  It uses
> nonblocking mode so that it doesn't block in write() or close().  For
> the same reason, it doesn't wait for output to drain before close().
> If the close happens to be the last one on the device, this causes any
> data buffered in the tty and lower software layers to be discarded
> cleanly and any data in lower hardware layers to by discarded in a
> driver plus hardware-dependent way (usually not so cleanly, especially
> for the character being transmitted).

I didn't think of a flush on close problem because I thought syslogd
always kept the console open.

>> This is totally consistent until I "kill
>> -HUP" syslogd, which I believe causes syslogd to close and open
>> /dev/console, after which the syslog output appears correct on the
>> console. When the syslogd output is being corrupted, I can cat a file to
>> /dev/console and the output appears to be correct.
> 
> When I debugged this, syslogd didn't seem to keep the console open,
> so the open()/close() in ttymsg() always caused the problem.  I didn't
> notice killing syslogd makes a difference.  Perhaps it helps due to a
> missing close.  Holding the console open may be a workaround or even
> the correct fix.  It's not clear where this should be done (should all
> clients of ttymsg() do it?).  Running getty on the console or on the
> underlying tty device should do it accidentally.

It looks to me like syslogd keeps the console open in addition to the
open()/close() in ttymsg().  cfline() calls open() on anything that
begins with '/' and calls isatty() to figure out whether it should set
the type to F_CONSOLE, F_TTY, or F_FILE, and init() closes the file
descriptor for all of these when syslogd is HUPed.

I wonder if the console descriptor is getting revoked ...

>> I truss'ed syslogd, and it appears to be working normally, the writev()
>> call that writes the data to the console appears to be writing the
>> correct character count, so it would appear that the fault is in the
>> kernel.
> 
> If there are any kernel bugs in this area, then they would be that
> last close of the console affects the underlying tty.  The multiple
> console changes are quite likely to have broken this if getty is run
> on the underlying tty (they silently discarded the half-close of the
> underlying tty which was needed to avoided trashing some of its state
> when only the console is closed).

I'm not running getty on my serial console.  It is running on ttyv*. I'm
only using the serial console to capture kernel stack traces, etc.

>> The problem doesn't appear to be specific to syslogd, because I have
>> seen the output from the shutdown scripts that goes to the console get
>> truncated as well.
> 
> Yes, in theory it should affect anything that uses ttymsg() or does
> direct non-blocking writes without waiting for the output to drain.

> Here are some half-baked fixes.  The part that clears O_NONBLOCK is
> wrong, and the usleep() part is obviously a hack.  ttymsg() shouldn't
> block even in close(), since if the close is in the parent ttymsg()
> might block forever and if the close() is in a forked child then
> blocking could create zillions of blocked children.
> 
> Another part of the patch is concerned with limiting forked children.
> If I were happy with that part then blocking would not be so bad.  In
> practice, I don't have enough system activity for blocked children to
> be a problem.  To see the problem with blocked children, do something
> like the following:
> - turn off clocal on the console so that the console can block better.
>   For sio consoles this often requires turning it off in the lock-state
>   device, since the driver defends against this foot shooting by locking
>   it on.
> - hold the console open or otherwise avoid the original bug in this
>   thread, else messages will just be discarded in close() faster than
>   they can pile up.
> - turn off your external console device or otherwise drop carrier.
> - send lots of messages.

I ran into this years ago when I was using actual terminals for console
devices and someone typed a ^S and went away.  The failure mode wasn't
pretty.  It was probably in the days when I was managing Masscomp boxes,
but it might have been the first batch of Sun servers.

> 
> For a non-half-baked fix, do somethng like:
> - never block in ttymsg(), but always wait for output to drain using
>   tcdrain() in a single child process.  It's probably acceptable for
>   this to not report errors to ttymsg()'s caller.
> - limit children better.  I think we now fork children iff writev()
>   returns EWOULDBLOCK and this happens mainly when the tty buffers
>   fill up due to clocal being off and the external console not
>   listening.  Handling this right seems to require handing off the
>   messages to a single child process that can buffer the messages
>   in userland and can block writing and draining them.  Blocked write()s
>   and tcdrain()s are easy enough to handle in a specialized process by
>   sending signals to abort them.

I like this idea better.  The userland process could be smart to drop
entire lines of output if things started getting too backlogged.
Received on Sun Nov 09 2003 - 01:30:25 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:37:28 UTC