--- On Thu, 3/12/09, Scott Long <scottl_at_samsco.org> wrote:

> From: Scott Long <scottl_at_samsco.org>
> Subject: Re: Interrupt routine usage not shown by top in 8.0
> To: barney_cordoba_at_yahoo.com
> Cc: current_at_freebsd.org
> Date: Thursday, March 12, 2009, 8:35 PM
>
> Barney Cordoba wrote:
> > --- On Thu, 3/12/09, Scott Long <scottl_at_samsco.org> wrote:
> >
> >> From: Scott Long <scottl_at_samsco.org>
> >> Subject: Re: Interrupt routine usage not shown by top in 8.0
> >> To: barney_cordoba_at_yahoo.com
> >> Cc: current_at_freebsd.org
> >> Date: Thursday, March 12, 2009, 7:42 PM
> >>
> >> Barney Cordoba wrote:
> >>> I'm firing 400Kpps at a udp blackhole port. I'm getting 6000
> >>> interrupts per second on em3:
> >>>
> >>> testbox# vmstat -i; sleep 1; vmstat -i
> >>> interrupt                          total       rate
> >>> irq1: atkbd0                           1          0
> >>> irq6: fdc0                             1          0
> >>> irq17: uhci1+                       2226          9
> >>> irq18: uhci2 ehci+                     9          0
> >>> cpu0: timer                       470507       1993
> >>> irq256: em0                          665          2
> >>> irq259: em3                      1027684       4354
> >>> cpu1: timer                       470272       1992
> >>> cpu3: timer                       470273       1992
> >>> cpu2: timer                       470273       1992
> >>> Total                            2911911      12338
> >>>
> >>> interrupt                          total       rate
> >>> irq1: atkbd0                           1          0
> >>> irq6: fdc0                             1          0
> >>> irq17: uhci1+                       2226          9
> >>> irq18: uhci2 ehci+                     9          0
> >>> cpu0: timer                       472513       1993
> >>> irq256: em0                          668          2
> >>> irq259: em3                      1033703       4361
> >>> cpu1: timer                       472278       1992
> >>> cpu3: timer                       472279       1992
> >>> cpu2: timer                       472279       1992
> >>> Total                            2925957      12345
> >>>
> >>> top -SH shows:
> >>>
> >>>  PID STATE  C   TIME     CPU COMMAND
> >>>   10 CPU3   3   7:32 100.00% idle
> >>>   10 CPU2   2   7:32 100.00% idle
> >>>   10 RUN    0   7:31 100.00% idle
> >>>   10 CPU1   1   7:31 100.00% idle
> >>>
> >>> This implies that CPU usage is substantially under-reported in
> >>> general by the system. Note that I've modified em_irq_fast() to
> >>> call em_handle_rxtx() directly rather than scheduling a task, to
> >>> illustrate the problem.
> >>
> >> With unmodified code, what do you see? Are you sending valid UDP
> >> frames with valid checksums and a valid port, or is everything
> >> that you're blasting at the interface getting dropped right away?
> >> Calling em_handle_rxtx() directly will cause a very quick panic
> >> once you start handling real traffic and you encounter a lock.
> >>
> >> Scott
> >
> > I think you're mistaken. I'm also accessing the system via an em
> > port (and running top), and em_handle_rxtx() is self-contained
> > lock-wise. The taskqueue doesn't obtain a lock before calling the
> > routine.
>
> I understand perfectly how the code works, as I wrote it. While there
> are no locks in the RX path of the driver, there are certainly locks
> higher up in the network stack RX path. You're not going to hit them
> in your test, but in the real world you will.
>
> > As I mentioned, they're being dumped into a udp blackhole, which
> > implies that I have udp.blackhole set and the port is unused. I can
> > see the packets hit the udp socket, so it's working as expected:
> >
> > 853967872 dropped due to no socket
> >
> > With unmodified code, the taskq shows 25% usage or so.
> >
> > I'm not sure what the point of your criticism is for what clearly
> > is a test. Are you implying that the system can receive 400K pps
> > with 6000 ints/sec and record 0% usage because of a coding
> > imperfection? Or are you implying that the 25% usage is all due to
> > launching tasks unnecessarily and process switching?
>
> Prior to FreeBSD 5, interrupt processing time was counted in the
> %intr stat. With FreeBSD 5 and beyond, most interrupts moved to full
> processing contexts called ithreads, and the processing time spent in
> the ithread was counted in the %intr stat. The time spent in
> low-level interrupts was merely counted against the process that got
> interrupted. This wasn't a big deal because low-level interrupts were
> only used to launch ithreads and to process low-latency interrupts
> for a few drivers. Moving to the taskq model breaks this accounting
> model.
>
> What's happening in your test is that the system is almost completely
> idle, so the only thing that is being interrupted by the low-level
> if_em handler is the cpu idle thread. Since you're also bogusly
> bypassing the deferral to the taskq, all stack processing is also
> happening in this low-level context, and it's being counted against
> the CPU idle thread. However, the process accounting code knows not
> to charge idle thread time against the normal stats, because doing so
> would result in the system always showing 100% busy. So your test is
> exploiting this; you're stealing all of your cycles from the idle
> threads, and they aren't being accounted for because it's hard to
> know when the idle thread is having its cycles stolen.
>
> So no, 25% of a CPU isn't going to "launching tasks unnecessarily and
> process switching." It's going to processing 400k packets/sec off of
> the RX ring and up the stack to the UDP layer. I think that if you
> studied how the code worked, and devised more useful benchmarks,
> you'd see that the taskq deferral method is usually a significant
> gain in performance over polling or simple ithreads. There is
> certainly room for more improvement, and my taskq scheme isn't the
> only way to get good performance, but it does work fairly well.
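For anyone skimming the thread, the deferral model being described is
roughly the following. This is a simplified sketch rather than the
actual if_em.c code; "my_softc", "my_handle_rxtx" and the other "my_*"
names are placeholders.

/*
 * Simplified sketch of the fast-filter + taskqueue deferral pattern
 * described above.  NOT the actual if_em.c code; names and the softc
 * layout are placeholders.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/malloc.h>
#include <sys/bus.h>
#include <sys/priority.h>
#include <sys/taskqueue.h>

struct my_softc {
        device_t                dev;
        struct resource        *irq_res;
        void                   *irq_tag;
        struct taskqueue       *tq;
        struct task             rxtx_task;
};

static void
my_handle_rxtx(struct my_softc *sc)
{
        /* Drain the RX ring and pass packets up the stack; network
         * stack locks may be taken here.  Details elided. */
}

/* Runs in a taskqueue thread; time spent here shows up in top -SH. */
static void
my_rxtx_task(void *context, int pending)
{
        struct my_softc *sc = context;

        my_handle_rxtx(sc);
}

/* Runs in low-level (filter) context: acknowledge and defer the work. */
static int
my_irq_fast(void *arg)
{
        struct my_softc *sc = arg;

        taskqueue_enqueue(sc->tq, &sc->rxtx_task);
        return (FILTER_HANDLED);
}

/* Attach-time setup: create the task, its thread, and the fast handler. */
static int
my_setup_intr(struct my_softc *sc)
{

        TASK_INIT(&sc->rxtx_task, 0, my_rxtx_task, sc);
        sc->tq = taskqueue_create_fast("my_taskq", M_NOWAIT,
            taskqueue_thread_enqueue, &sc->tq);
        taskqueue_start_threads(&sc->tq, 1, PI_NET, "%s taskq",
            device_get_nameunit(sc->dev));
        return (bus_setup_intr(sc->dev, sc->irq_res,
            INTR_TYPE_NET | INTR_MPSAFE, my_irq_fast, NULL, sc,
            &sc->irq_tag));
}

The filter only acknowledges the interrupt and schedules the task; the
ring processing runs in the taskqueue thread, which is where the ~25%
shows up in top with the unmodified driver.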
It's difficult to have "better benchmarks" when the system being tested
doesn't have accounting that works. My test is designed to isolate the
driver receive function in a controlled way, so it doesn't much matter
whether the data is real or not, as long as the tests generate a
consistent load. The only thing obviously "bogus" is that FreeBSD is
launching 16,000 tasks per second (an interrupt plus a taskqueue task
8000 times per second), plus 2000 timer interrupts, and reporting 0%
CPU usage. So I'm to assume that the system will never show 100% usage,
as the entire overhead of the scheduler is not accounted for?

Calling handle_rxtx() directly was a timesaver to determine the
overhead of forcing 8000 context switches per second (16,000 in a
router setup) for apparently no reason. Since the OS doesn't account
for these, there seems to be no way to make this determination. It's
convenient to say it works well, or better than something else, when
there is no way to actually find out via measurement. I don't see how
launching 8000 tasks per second could be faster than not launching
8000 tasks per second, but I'm also not up on the newest math.
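Concretely, the change I tested amounts to something like the
following (again only a sketch, using the placeholder names from
above, not the literal diff I'm running):

/*
 * Sketch of the test modification discussed above: do the RX/TX work
 * directly in the low-level filter instead of enqueueing the task.
 * Scott's caveat about network-stack locks in filter context applies;
 * the point here is only to isolate the cost of the deferral.
 */
static int
my_irq_fast_direct(void *arg)
{
        struct my_softc *sc = arg;

        my_handle_rxtx(sc);     /* was: taskqueue_enqueue(sc->tq, &sc->rxtx_task) */
        return (FILTER_HANDLED);
}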
Since you know how things work better than any of us regular
programmers, if you could please answer these questions it would save
a lot of time and may result in better drivers for us all:

1) MSIX interrupt routines readily do "work" and pass packets up the
IP stack, while you claim that MSI interrupts cannot? Please explain
the locking differences between MSI and MSIX, and what locks may be
encountered by an MSI interrupt routine with "real traffic" that will
not be a problem for MSIX or taskqueue-launched tasks. It's certainly
not obvious from any code or docs that I've seen.

2) The bge and fxp (and many other) drivers happily pass traffic up
the IP stack directly from their interrupt routines, so why is it
bogus for em to do so? And why do these drivers not use the taskqueue
approach that you claim is superior?

2b) Does this also imply that systems with bge or other network
drivers that do the "work" in the interrupt handler will yield
completely bogus CPU usage numbers?

3) The em driver drops packets well before 100% CPU usage is realized.
Of course I'm relying on the wrong CPU usage stats, so I may be
mistaken. Is there a way (or what is the preferred way) to increase
the priority of a task relative to other system processes (rather than
relative to tasks in the queue), so that packets can avoid being
dropped while the system runs other, non-essential tasks?

3b) Is there a way to lock down a task such as a NIC receive task to
give it absolute priority or exclusive use of a CPU? The goal is to
make certain that the task doesn't yield before it completes some
minimum amount of work. (A userland sketch of the kind of pinning I
mean is in the P.S. below.)

The reason I'm doing what you consider "bogus" is to get a handle on
various overheads, cache trade-offs of spreading across CPUs, etc. So
please don't berate me too badly for being a crappy programmer, as I
actually do know what I'm doing. One problem is the lack of
documentation, so much of the learning has to be done by trial and
error. If there's a document on the 8.0 scheduler, I'm sure many of us
would like to see it.

In my world, working "fairly well" isn't good enough, and I don't take
anyone's word that something is better if they can't demonstrate it
with actual numbers that prove it, particularly when the claim defies
logic. Most people do benchmarks completely wrong. A driver's
efficiency is measured by how much of the CPU it uses to complete a
particular workload. A good driver will happily trade off some
per-connection latency for a 20% increase in overall efficiency, or at
least make that tunable for various environments.

It's my view that it would be better to just suck packets out of the
ring and queue them for the upper layers, but I don't yet have a
handle on the trade-offs. Currently the system drops too many packets
unnecessarily at extremely high load.

BTW, I ran netperf and a fetch loop overnight ("real" data) routed by
a machine with the "bogus" em setup without encountering any panics or
data loss.

Barney
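P.S. Regarding 3b: to be concrete about the kind of pinning I mean,
the userland analogue is cpuset_setaffinity(2); what I'm after is the
equivalent for a kernel taskqueue thread or ithread. A minimal sketch:

/*
 * Userland analogue of the pinning asked about in 3b: bind the
 * calling thread to a single CPU via cpuset_setaffinity(2).
 */
#include <sys/param.h>
#include <sys/cpuset.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
        cpuset_t mask;

        CPU_ZERO(&mask);
        CPU_SET(2, &mask);              /* run only on CPU 2 */

        if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
            sizeof(mask), &mask) != 0)
                err(1, "cpuset_setaffinity");

        printf("pinned to CPU 2\n");
        /* ... latency-sensitive work would go here ... */
        return (0);
}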