Re: panic in rt_check_fib()

From: Giorgos Keramidas <keramida_at_freebsd.org>
Date: Sun, 14 Sep 2008 15:56:12 +0300
On Sat, 13 Sep 2008 23:28:51 -0700, Julian Elischer <julian_at_elischer.org> wrote:
> To recap on this: I rewrote this function a couple of weeks ago because I
> couldn't keep track of what was going on, and I thought it might
> have some bad edge cases.  A couple of days later Giorgos contacted me
> saying that he had a fairly reproducible situation
> where this was triggered, and it appeared to be an edge case in
> this function that allowed it to try to lock the same lock twice.
>
> I immediately thought "ah-hah! I may have a solution to this",
> and gave him a copy of my new function, and indeed it DOES fix that
> panic.  However, after deleting and recreating interfaces a few hundred
> times without crashing in rt_check_fib(), it then fails somewhere else
> (actually it leaks some resources and eventually networking stops).
>
> I'm not convinced that this is a problem with the new or old rt_check(), but
> it did stop me from just committing the new code.
>
> Re-reading how the function worked (and still works), it
> occurred to me that there was a large flaw in its logic.
>
> It dropped the lock on one route while it went off and did something
> else that might block.  On returning, it blindly re-grabbed that lock,
> completely ignoring the fact that the route might not even be valid any
> more (or any of several other things that may have changed while
> it was away, possibly sleeping).
>
> The code Giorgos is referring to is a patch I suggested to him to
> fix this oversight, not the one that I originally tested and
> had suggested to fix the edge case.
>
> I do, however, ask that some other people look at this patch!

Exactly.  Thanks for summarizing this so well :)
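The flaw described above is a general locking pitfall.  Below is a minimal userland sketch of the safer pattern, using POSIX threads rather than kernel primitives: pin the object with a reference before dropping its lock, and revalidate it after re-acquiring the lock instead of assuming nothing changed.  `struct route`, its `refcnt`/`valid`/`gen` fields, and the helper functions are hypothetical illustrations, not the actual rt_check_fib() patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a routing entry (NOT the kernel's struct rtentry). */
struct route {
    pthread_mutex_t lock;
    int refcnt;          /* holders keep the entry from being freed */
    bool valid;          /* cleared when the route is deleted */
    unsigned long gen;   /* bumped on every modification */
};

/* Hypothetical blocking work that leaves the route alone. */
static void harmless_work(struct route *rt)
{
    (void)rt;
}

/* Hypothetical simulation of another thread deleting the route while we
 * were away; like any writer, it has to take the lock. */
static void concurrent_delete(struct route *rt)
{
    pthread_mutex_lock(&rt->lock);
    rt->valid = false;
    rt->gen++;
    pthread_mutex_unlock(&rt->lock);
}

/* Drops rt->lock around work() that may block, then revalidates.
 * Returns true with rt->lock held if the route is still usable,
 * false with rt->lock released if the caller must redo its lookup. */
static bool use_route(struct route *rt, void (*work)(struct route *))
{
    pthread_mutex_lock(&rt->lock);
    rt->refcnt++;                  /* pin the entry across the unlocked window */
    unsigned long seen = rt->gen;
    pthread_mutex_unlock(&rt->lock);

    work(rt);                      /* may sleep; rt may change underneath us */

    pthread_mutex_lock(&rt->lock);
    assert(rt->refcnt > 0);
    rt->refcnt--;                  /* a real implementation would free a deleted
                                      entry here if this was the last reference */
    if (!rt->valid || rt->gen != seen) {
        pthread_mutex_unlock(&rt->lock);
        return false;
    }
    return true;
}
```

On a `false` return the caller is expected to repeat its lookup rather than keep using a stale pointer.  The generation counter is only one way to detect that the entry changed while unlocked; it is an assumption of this sketch.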

I have booted a kernel with your latest patch (from the message quoted
above), and I can no longer panic it with the script that used to
trigger the panic in a semi-reliable manner:

% root@kobe:/root# while true ; do \
%         sh home.sh > /dev/null 2>&1 ; \
%         vmstat -z | sed -n -e 1p -e /rt/p ; \
%         sleep 1 ; \
%     done
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       19,       77,       43,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       20,       76,       47,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       21,       75,       51,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       23,       73,       55,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       24,       72,       59,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       25,       71,       62,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       26,       70,       65,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       27,       69,       69,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       29,       67,       73,        0
% ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
% rtentry:                  120,        0,       30,       66,       76,        0
% ^C
% root@kobe:/root# sh home.sh

The USED count of rtentry objects goes up by one or two every time I
cycle through the script, which essentially brings down both the
wireless and the wired interfaces of my laptop and then brings up only
the wired one.  The core of the script is currently:

  # network interface options
  export ifconfig_re0="inet 192.168.1.10/24"
  export defaultrouter='192.168.1.1'

  echo '## Stopping network interfaces.'
  /etc/rc.d/netif stop re0  && ifconfig re0  delete
  /etc/rc.d/netif stop iwn0 && ifconfig iwn0 delete

  echo '## Bringing up network interface.'
  /etc/rc.d/netif start re0

  echo "## Reloading firewall rules."
  /etc/rc.d/pf reload

  # The default route may be pointing to another interface.  Find out
  # the IP address of the default gateway, delete it and point to the
  # default gateway configured as ${defaultrouter}.
  if [ -n "${defaultrouter}" ]; then
          echo '## Setting default router.'
          _oldrouter=`netstat -rn -f inet | awk '$1 == "default" {print $2}'`
          if [ -n "${_oldrouter}" ]; then
                  route delete default "${_oldrouter}"
                  unset _oldrouter
          fi
          route add default "$defaultrouter"
  fi

With your version of rt_check_fib() I have had no panics so far.  This
doesn't mean we don't have a bug elsewhere, or that it will not panic
tomorrow, but it's nice that things seem a bit more stable now.  The old
version of rt_check_fib() used to panic about one third of the times I
ran my 'home.sh' script...

Now, an interesting question: is it `normal' that the USED count of
rtentry objects keeps going up at every interface restart, i.e. that (at
least at first glance) they are not reclaimed as fast as they are
allocated?
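To put a number on the growth, the rtentry lines from the loop above can be parsed and compared.  A minimal C sketch follows, assuming the six comma-separated columns shown in this `vmstat -z` output (other FreeBSD versions may format it differently); `struct zone_sample`, `parse_rtentry()` and `used_growth()` are hypothetical helpers, not part of any FreeBSD API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical container for one rtentry line of `vmstat -z`, using the
 * column order printed above: SIZE, LIMIT, USED, FREE, REQUESTS, FAILURES. */
struct zone_sample {
    unsigned long size, limit, used, free, requests, failures;
};

/* Parses a line such as
 *   "rtentry:   120,   0,   19,   77,   43,   0"
 * Returns false for the ITEM header or any other non-matching line. */
static bool parse_rtentry(const char *line, struct zone_sample *s)
{
    return sscanf(line, "rtentry: %lu, %lu, %lu, %lu, %lu, %lu",
                  &s->size, &s->limit, &s->used, &s->free,
                  &s->requests, &s->failures) == 6;
}

/* Net change in USED and REQUESTS between an earlier and a later sample. */
static void used_growth(const struct zone_sample *first,
                        const struct zone_sample *last,
                        long *used_delta, long *requests_delta)
{
    /* REQUESTS is cumulative, so it must not decrease between samples. */
    assert(last->requests >= first->requests);
    *used_delta = (long)last->used - (long)first->used;
    *requests_delta = (long)(last->requests - first->requests);
}
```

Fed the first and last samples above, it reports USED up by 11 (19 to 30) while REQUESTS rose by 33 (43 to 76) across the nine script runs between those samples, so roughly a third of the rtentry allocations made in that window were still outstanding.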
Received on Sun Sep 14 2008 - 10:56:53 UTC
