Barney Cordoba wrote:
>
> --- On Thu, 3/12/09, Scott Long <scottl_at_samsco.org> wrote:
>
>> From: Scott Long <scottl_at_samsco.org>
>> Subject: Re: Interrupt routine usage not shown by top in 8.0
>> To: barney_cordoba_at_yahoo.com
>> Cc: current_at_freebsd.org
>> Date: Thursday, March 12, 2009, 7:42 PM
>> Barney Cordoba wrote:
>>> I'm firing 400Kpps at a udp blackhole port. I'm getting 6000
>>> interrupts per second on em3:
>>>
>>> testbox# vmstat -i; sleep 1; vmstat -i
>>> interrupt                          total       rate
>>> irq1: atkbd0                           1          0
>>> irq6: fdc0                             1          0
>>> irq17: uhci1+                       2226          9
>>> irq18: uhci2 ehci+                     9          0
>>> cpu0: timer                       470507       1993
>>> irq256: em0                          665          2
>>> irq259: em3                      1027684       4354
>>> cpu1: timer                       470272       1992
>>> cpu3: timer                       470273       1992
>>> cpu2: timer                       470273       1992
>>> Total                            2911911      12338
>>>
>>> interrupt                          total       rate
>>> irq1: atkbd0                           1          0
>>> irq6: fdc0                             1          0
>>> irq17: uhci1+                       2226          9
>>> irq18: uhci2 ehci+                     9          0
>>> cpu0: timer                       472513       1993
>>> irq256: em0                          668          2
>>> irq259: em3                      1033703       4361
>>> cpu1: timer                       472278       1992
>>> cpu3: timer                       472279       1992
>>> cpu2: timer                       472279       1992
>>> Total                            2925957      12345
>>>
>>>
>>> top -SH shows:
>>>
>>>  PID STATE  C   TIME     CPU COMMAND
>>>   10 CPU3   3   7:32 100.00% idle
>>>   10 CPU2   2   7:32 100.00% idle
>>>   10 RUN    0   7:31 100.00% idle
>>>   10 CPU1   1   7:31 100.00% idle
>>>
>>> This implies that CPU usage is substantially under-reported in
>>> general by the system. Note that I've modified em_irq_fast() to
>>> call em_handle_rxtx() directly rather than scheduling a task to
>>> illustrate the problem.
>>>
>> With unmodified code, what do you see? Are you sending valid UDP
>> frames with valid checksums and a valid port, or is everything that
>> you're blasting at the interface getting dropped right away? Calling
>> em_handle_rxtx() directly will cause a very quick panic once you
>> start handling real traffic and you encounter a lock.
>>
>> Scott
>
> I think you're mistaken. I'm also accessing the system via an em port
> (and running top), and em_handle_rxtx() is self-contained lock-wise.
> The taskqueue doesn't obtain a lock before calling the routine.
> I understand perfectly how the code works, as I wrote it.

While there are no locks in the RX path of the driver, there are
certainly locks higher up in the network stack RX path. You're not
going to hit them in your test, but in the real world you will.

> As I mentioned, they're being dumped into a udp blackhole, which
> implies that I have udp.blackhole set and the port is unused. I can
> see the packets hit the udp socket, so it's working as expected:
>
> 853967872 dropped due to no socket
>
> With unmodified code, the taskq shows 25% usage or so.
>
> I'm not sure what the point of your criticism is for what is clearly
> a test. Are you implying that the system can receive 400K pps with
> 6000 ints/sec and record 0% usage because of a coding imperfection?
> Or are you implying that the 25% usage is all due to launching tasks
> unnecessarily and process switching?

Prior to FreeBSD 5, interrupt processing time was counted in the %intr
stat. With FreeBSD 5 and beyond, most interrupts moved to full
processing contexts called ithreads, and the processing time spent in
the ithread was counted in the %intr stat. The time spent in low-level
interrupts was merely counted against the process that got interrupted.
This wasn't a big deal, because low-level interrupts were only used to
launch ithreads and to process low-latency interrupts for a few
drivers. Moving to the taskq model breaks this accounting model.
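For readers without the driver source handy, the deferral pattern
described above looks roughly like the sketch below. This is not the
actual if_em code; the my_*() names, the softc layout, and the hardware
access are placeholders, so only the shape of the hand-off matters: the
fast filter runs in low-level interrupt context and does almost
nothing, while the packet processing runs in a taskqueue thread that
the scheduler (and top) can see.

/*
 * Sketch only, not the if_em driver: placeholder names and no real
 * hardware access.  Shows the fast-filter + taskqueue deferral shape.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/bus.h>
#include <sys/malloc.h>
#include <sys/priority.h>
#include <sys/taskqueue.h>

struct my_softc {
        device_t          dev;
        struct resource  *irq_res;
        void             *intr_cookie;
        struct task       rxtx_task;
        struct taskqueue *tq;
        /* rings, register mappings, locks, ... */
};

/*
 * Low-level filter: runs in primary interrupt context, borrowing the
 * context of whichever thread was interrupted.  It only masks the NIC
 * and hands the real work to the taskqueue.
 */
static int
my_intr_filter(void *arg)
{
        struct my_softc *sc = arg;

        /* ... mask further interrupts from the NIC ... */
        taskqueue_enqueue(sc->tq, &sc->rxtx_task);
        return (FILTER_HANDLED);
}

/*
 * Taskqueue handler: runs in its own kernel thread, so the time spent
 * draining the RX ring and pushing packets up the stack is charged to
 * that thread and is visible in top -SH.
 */
static void
my_handle_rxtx(void *arg, int pending)
{
        struct my_softc *sc = arg;

        (void)pending;
        /*
         * ... drain the RX ring, pass packets up the stack, clean the
         * TX ring, then re-enable the NIC interrupt ...
         */
        (void)sc;
}

static int
my_setup_intr(struct my_softc *sc)
{
        TASK_INIT(&sc->rxtx_task, 0, my_handle_rxtx, sc);
        sc->tq = taskqueue_create_fast("my_taskq", M_NOWAIT,
            taskqueue_thread_enqueue, &sc->tq);
        taskqueue_start_threads(&sc->tq, 1, PI_NET, "%s taskq",
            device_get_nameunit(sc->dev));

        /* Filter-only registration: the ithread handler argument is NULL. */
        return (bus_setup_intr(sc->dev, sc->irq_res,
            INTR_TYPE_NET | INTR_MPSAFE, my_intr_filter, NULL, sc,
            &sc->intr_cookie));
}

With this split, the filter's cost is tiny and hard to account for, but
the bulk of the work lands in the taskq thread, which is why the
unmodified driver shows up as roughly 25% against the taskq in top -SH.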
What's happening in your test is that the system is almost completely
idle, so the only thing that is being interrupted by the low-level
if_em handler is the cpu idle thread. Since you're also bogusly
bypassing the deferral to the taskq, all stack processing is also
happening in this low-level context, and it's being counted against the
CPU idle thread. However, the process accounting code knows not to
charge idle thread time against the normal stats, because doing so
would result in the system always showing 100% busy. So your test is
exploiting this; you're stealing all of your cycles from the idle
threads, and they aren't being accounted for because it's hard to know
when the idle thread is having its cycles stolen.

So no, 25% of a CPU isn't going to "launching tasks unnecessarily and
process switching." It's going to processing 400k packets/sec off of
the RX ring and up the stack to the UDP layer.

I think that if you studied how the code worked, and devised more
useful benchmarks, you'd see that the taskq deferral method is usually
a significant gain in performance over polling or simple ithreads.
There is certainly room for more improvement, and my taskq scheme isn't
the only way to get good performance, but it does work fairly well.

Scott
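For contrast, the modified path used in the test effectively collapses
everything into the filter. A rough sketch, reusing the placeholder
names from the block above:

static int
my_intr_filter_inline(void *arg)
{
        struct my_softc *sc = arg;

        /*
         * No deferral: the RX drain and the walk up the stack all run
         * here, in the context of whichever thread was interrupted.
         * None of that time is charged to a schedulable thread, and
         * locks higher up the RX path can't safely be taken from this
         * context under real traffic, as noted in the reply above.
         */
        my_handle_rxtx(sc, 0);
        return (FILTER_HANDLED);
}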