Re: # Fssh_packet_write_wait: Connection to 77.183.250.3 port 22: Broken pipe

From: John Kennedy <warlock_at_phouka.net> Date: Wed, 30 Dec 2020 07:40:01 -0800

On Wed, Dec 30, 2020 at 08:04:03AM +0100, Hartmann, O. wrote:
> On recent 12-STABLE, 12.1-RELENG and 12.2-RELENG I face a very nasty problem which
> occurred a while ago after it seemed to have vanished for a while: running ssh in an xterm
> on FreeBSD boxes as mentioned at the beginning ends up very rapidly in a lost connection
> with
> 
> # Fssh_packet_write_wait: Connection to XXX.XXX.XXX.XXX port 22: Broken pipe
> 
> The backend is in most cases a CURRENT, 12.1-RELENG or 12.2-RELENG or 12-STABLE server. A
> couple of months ago we moved from 11.3-RELENG to 12.1-RELENG (server side, clients were
> always 13-CURRENT or 12-STABLE). With FreeBSD 11 as the backend, those broken pipes
> occurred, but not as frequently or as rapidly as they do now.
> 
> The "problem" can be mitigated somehow: running top or using the console prevents the
> broken pipe fault for a while, but it still occurs. Running "screen" (port
> sysutils/screen) does extend the usability of the console for a significant timespan; the
> broken pipe still occurs at random, but it takes considerably longer to appear.

  So, I do a LOT of ssh-in-xterm and I can't say that I've seen anything that
looks like it is FreeBSD's fault (vs ISP, work firewall, work VPN, etc).  For
my cloud host (12.2-p2) I do tend to use the screen program.  At work, in
pre-Covid times (so up to last March 18th or so, whatever that works out to in
versioning/revisions; probably 12.1 or 12.0), I'd have sessions open for a week+.
At home I'm all 13 at the moment.

  Because I'm running a lot of 13 at home (and before that, 12-stable) I tend
to reboot the box often for updates, so I rarely see how long a session would
last there.  Is it safe to assume that "very rapidly" means less than a day?

> My conclusion is: either there is a serious problem with FreeBSD since 12, or there is a
> config issue I'm not aware of, even with "vanilla" installations from official repository
> running unchanged.

  At work, my problems are all about crappy firewalls.  Even firewalls that
we've spent a LOT of money on (PaloAlto, the Juniper before it).  In all
fairness to them, we're running a University's worth of class-B through there
and they have all the state-tracking/deep-inspection goodness turned on trying
to protect everyone from the big bad internet so it's complicated.

  With PuTTY, I've had to turn on TCP keepalives and the sending of null packets.
The problem there just seems to be that the firewall hardware can only track
so many sessions and, when you stress it, it'll drop "idle" sessions (rather
than active ones, and rather than refusing to open new ones).  Hosts hemorrhage
connections all the time when something eats the final connection-close packet,
but they know their own state and can time the thing out.  The PaloAlto in my
case doesn't know which connections are dead, so it just starts reaping and
gets valid idle connections some of the time.  So all my tricks just involve
some amount of traffic to keep that session looking alive in the brain of the
state tracker sitting between the hosts.
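
  To illustrate what "turning on TCP keepalives" means at the socket level,
here is a minimal Python sketch.  The helper name enable_keepalive and the
timer values are assumptions made up for the example, not anything taken from
PuTTY or ssh.  The per-socket timer options differ between platforms, so the
sketch only sets the ones the local socket module exposes and otherwise falls
back to the system-wide defaults.

```python
import socket


def enable_keepalive(sock, idle=60, interval=60, probes=3):
    """Enable TCP keepalive on sock and return True if it is switched on.

    idle, interval and probes are illustrative values (hypothetical defaults).
    """
    # SO_KEEPALIVE is the portable switch that makes the kernel probe idle peers.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Timer tuning is platform-specific: apply only the options that exist here.
    tuning = (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is None:
            continue
        try:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
        except OSError:
            # The OS refused this option; keep the system default for it.
            pass

    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
```

  The point is only that the probes are traffic the kernel generates on an
otherwise silent connection, which is what keeps a state-tracking firewall
from counting the session as idle.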

  For SSH at work, I've set this up:

	host *
		TCPKeepAlive		yes
		ServerAliveInterval	60
		ServerAliveCountMax	3

  So, send TCP keepalive packets, have ssh send a message through the encrypted
channel whenever nothing has arrived from the server for 60 seconds, and tear
down the session after 3 of those go unanswered.  I'll note that at home I haven't
had to do that.  For that cloud 12.2 system, I've had a connection "idle" for
21 hours (but running with a screen going, which is getting some amount of
bidirectional traffic going because it has a date/time stamp that gets updated
once a minute).
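
  As a rough check on those numbers, here is a small Python sketch of the
arithmetic.  Per the ssh_config documentation, ssh disconnects from an
unresponsive server after approximately ServerAliveInterval times
ServerAliveCountMax seconds, and an interval of 0 means the messages are not
sent at all.  The function name is hypothetical.

```python
def seconds_until_disconnect(server_alive_interval, server_alive_count_max):
    """Approximate seconds before ssh drops a server that has gone silent.

    Returns None when the interval is 0, since server-alive messages are
    then disabled and ssh never gives up on this basis.
    """
    if server_alive_interval <= 0:
        return None
    # One message per interval; ssh disconnects once count_max go unanswered.
    return server_alive_interval * server_alive_count_max


if __name__ == "__main__":
    print(seconds_until_disconnect(60, 3))
```

  With the settings above that works out to about 180 seconds, so a genuinely
dead session is noticed within roughly three minutes while a live one gets
traffic once a minute.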

  Is 21 hours "significant" by your measurements?

  At home, I don't have a network firewall of any sort.  Probably the usual
unknowns with the ISP and crappyware NAT box they force me to use.

  My cloud system is running on DigitalOcean, for what that is worth.  I'm
not sure what they're doing for firewalls (I'm doing host firewalls out there,
so maybe nothing in my case).