--- On Thu, 3/12/09, Scott Long <scottl_at_samsco.org> wrote:

> From: Scott Long <scottl_at_samsco.org>
> Subject: Re: Interrupt routine usage not shown by top in 8.0
> To: barney_cordoba_at_yahoo.com
> Cc: current_at_freebsd.org
> Date: Thursday, March 12, 2009, 8:35 PM
>
> Barney Cordoba wrote:
> > --- On Thu, 3/12/09, Scott Long <scottl_at_samsco.org> wrote:
> >
> >> From: Scott Long <scottl_at_samsco.org>
> >> Subject: Re: Interrupt routine usage not shown by top in 8.0
> >> To: barney_cordoba_at_yahoo.com
> >> Cc: current_at_freebsd.org
> >> Date: Thursday, March 12, 2009, 7:42 PM
> >>
> >> Barney Cordoba wrote:
> >>> I'm firing 400Kpps at a udp blackhole port. I'm getting 6000
> >>> interrupts per second on em3:
> >>>
> >>> testbox# vmstat -i; sleep 1; vmstat -i
> >>> interrupt                          total       rate
> >>> irq1: atkbd0                           1          0
> >>> irq6: fdc0                             1          0
> >>> irq17: uhci1+                       2226          9
> >>> irq18: uhci2 ehci+                     9          0
> >>> cpu0: timer                       470507       1993
> >>> irq256: em0                          665          2
> >>> irq259: em3                      1027684       4354
> >>> cpu1: timer                       470272       1992
> >>> cpu3: timer                       470273       1992
> >>> cpu2: timer                       470273       1992
> >>> Total                            2911911      12338
> >>>
> >>> interrupt                          total       rate
> >>> irq1: atkbd0                           1          0
> >>> irq6: fdc0                             1          0
> >>> irq17: uhci1+                       2226          9
> >>> irq18: uhci2 ehci+                     9          0
> >>> cpu0: timer                       472513       1993
> >>> irq256: em0                          668          2
> >>> irq259: em3                      1033703       4361
> >>> cpu1: timer                       472278       1992
> >>> cpu3: timer                       472279       1992
> >>> cpu2: timer                       472279       1992
> >>> Total                            2925957      12345
> >>>
> >>> top -SH shows:
> >>>
> >>>  PID STATE  C   TIME     CPU COMMAND
> >>>   10 CPU3   3   7:32 100.00% idle
> >>>   10 CPU2   2   7:32 100.00% idle
> >>>   10 RUN    0   7:31 100.00% idle
> >>>   10 CPU1   1   7:31 100.00% idle
> >>>
> >>> This implies that CPU usage is substantially under-reported in
> >>> general by the system. Note that I've modified em_irq_fast() to
> >>> call em_handle_rxtx() directly rather than scheduling a task, to
> >>> illustrate the problem.
> >>
> >> With unmodified code, what do you see? Are you sending valid UDP
> >> frames with valid checksums and a valid port, or is everything
> >> that you're blasting at the interface getting dropped right away?
> >> Calling em_handle_rxtx() directly will cause a very quick panic
> >> once you start handling real traffic and you encounter a lock.
> >>
> >> Scott
> >
> > I think you're mistaken. I'm also accessing the system via an em
> > port (and running top), and em_handle_rxtx() is self-contained
> > lock-wise. The taskqueue doesn't obtain a lock before calling the
> > routine.
>
> I understand perfectly how the code works, as I wrote it. While there
> are no locks in the RX path of the driver, there are certainly locks
> higher up in the network stack RX path. You're not going to hit them
> in your test, but in the real world you will.
>
> > As I mentioned, they're being dumped into a udp blackhole, which
> > implies that I have udp.blackhole set and the port is unused. I can
> > see the packets hit the udp socket, so it's working as expected:
> >
> > 853967872 dropped due to no socket
> >
> > With unmodified code, the taskq shows 25% usage or so.
> >
> > I'm not sure what the point of your criticism is for what clearly
> > is a test. Are you implying that the system can receive 400K pps
> > with 6000 ints/sec and record 0% usage because of a coding
> > imperfection? Or are you implying that the 25% usage is all due to
> > launching tasks unnecessarily and process switching?
>
> Prior to FreeBSD 5, interrupt processing time was counted in the
> %intr stat. With FreeBSD 5 and beyond, most interrupts moved to full
> processing contexts called ithreads, and the processing time spent in
> the ithread was counted in the %intr stat. The time spent in
> low-level interrupts was merely counted against the process that got
> interrupted. This wasn't a big deal because low-level interrupts were
> only used to launch ithreads and to process low-latency interrupts
> for a few drivers. Moving to the taskq model breaks this accounting
> model.
>
> What's happening in your test is that the system is almost completely
> idle, so the only thing that is being interrupted by the low-level
> if_em handler is the cpu idle thread. Since you're also bogusly
> bypassing the deferral to the taskq, all stack processing is also
> happening in this low-level context, and it's being counted against
> the CPU idle thread. However, the process accounting code knows not
> to charge idle thread time against the normal stats, because doing so
> would result in the system always showing 100% busy. So your test is
> exploiting this; you're stealing all of your cycles from the idle
> threads, and they aren't being accounted for because it's hard to
> know when the idle thread is having its cycles stolen.
>
> So no, 25% of a CPU isn't going to "launching tasks unnecessarily and
> process switching." It's going to processing 400k packets/sec off of
> the RX ring and up the stack to the UDP layer. I think that if you
> studied how the code worked, and devised more useful benchmarks,
> you'd see that the taskq deferral method is usually a significant
> gain in performance over polling or simple ithreads. There is
> certainly room for more improvement, and my taskq scheme isn't the
> only way to get good performance, but it does work fairly well.
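For anyone skimming the thread, the deferral model being described is
roughly the following. This is a simplified sketch rather than the
actual if_em.c code; "my_softc", "my_handle_rxtx" and the other "my_*"
names are placeholders.

/*
 * Simplified sketch of the fast-filter + taskqueue deferral pattern
 * described above.  NOT the actual if_em.c code; names and the softc
 * layout are placeholders.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/malloc.h>
#include <sys/bus.h>
#include <sys/priority.h>
#include <sys/taskqueue.h>

struct my_softc {
        device_t                dev;
        struct resource        *irq_res;
        void                   *irq_tag;
        struct taskqueue       *tq;
        struct task             rxtx_task;
};

static void
my_handle_rxtx(struct my_softc *sc)
{
        /* Drain the RX ring and pass packets up the stack; network
         * stack locks may be taken here.  Details elided. */
}

/* Runs in a taskqueue thread; time spent here shows up in top -SH. */
static void
my_rxtx_task(void *context, int pending)
{
        struct my_softc *sc = context;

        my_handle_rxtx(sc);
}

/* Runs in low-level (filter) context: acknowledge and defer the work. */
static int
my_irq_fast(void *arg)
{
        struct my_softc *sc = arg;

        taskqueue_enqueue(sc->tq, &sc->rxtx_task);
        return (FILTER_HANDLED);
}

/* Attach-time setup: create the task, its thread, and the fast handler. */
static int
my_setup_intr(struct my_softc *sc)
{

        TASK_INIT(&sc->rxtx_task, 0, my_rxtx_task, sc);
        sc->tq = taskqueue_create_fast("my_taskq", M_NOWAIT,
            taskqueue_thread_enqueue, &sc->tq);
        taskqueue_start_threads(&sc->tq, 1, PI_NET, "%s taskq",
            device_get_nameunit(sc->dev));
        return (bus_setup_intr(sc->dev, sc->irq_res,
            INTR_TYPE_NET | INTR_MPSAFE, my_irq_fast, NULL, sc,
            &sc->irq_tag));
}

The filter only acknowledges the interrupt and schedules the task; the
ring processing runs in the taskqueue thread, which is where the ~25%
shows up in top with the unmodified driver.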
It's difficult to have "better benchmarks" when the system being tested
doesn't have accounting that works. My test is designed to isolate the
driver receive function in a controlled way, so it doesn't much matter
whether the data is real or not, as long as the tests generate a
consistent load. The only thing obviously "bogus" is that FreeBSD is
launching 16,000 tasks per second (an interrupt plus a taskqueue task
8000 times per second), plus 2000 timer interrupts, and reporting 0%
CPU usage. So I'm to assume that the system will never show 100% usage,
as the entire overhead of the scheduler is not accounted for?

Calling handle_rxtx() directly was a timesaver to determine the
overhead of forcing 8000 context switches per second (16,000 in a
router setup) for apparently no reason. Since the OS doesn't account
for these, there seems to be no way to make this determination. It's
convenient to say it works well, or better than something else, when
there is no way to actually find out via measurement. I don't see how
launching 8000 tasks per second could be faster than not launching
8000 tasks per second, but I'm also not up on the newest math.
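Concretely, the change I tested amounts to something like the
following (again only a sketch, using the placeholder names from
above, not the literal diff I'm running):

/*
 * Sketch of the test modification discussed above: do the RX/TX work
 * directly in the low-level filter instead of enqueueing the task.
 * Scott's caveat about network-stack locks in filter context applies;
 * the point here is only to isolate the cost of the deferral.
 */
static int
my_irq_fast_direct(void *arg)
{
        struct my_softc *sc = arg;

        my_handle_rxtx(sc);     /* was: taskqueue_enqueue(sc->tq, &sc->rxtx_task) */
        return (FILTER_HANDLED);
}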
Since you know how things work better than any of us regular
programmers, if you could please answer these questions it would save
a lot of time and may result in better drivers for us all:

1) MSIX interrupt routines readily do "work" and pass packets up the
IP stack, while you claim that MSI interrupts cannot? Please explain
the locking differences between MSI and MSIX, and what locks may be
encountered by an MSI interrupt routine with "real traffic" that will
not be a problem for MSIX or taskqueue-launched tasks. It's certainly
not obvious from any code or docs that I've seen.

2) The bge and fxp (and many other) drivers happily pass traffic up
the IP stack directly from their interrupt routines, so why is it
bogus for em to do so? And why do these drivers not use the taskqueue
approach that you claim is superior?

2b) Does this also imply that systems with bge or other network
drivers that do the "work" in the interrupt handler will yield
completely bogus CPU usage numbers?

3) The em driver drops packets well before 100% CPU usage is realized.
Of course I'm relying on the wrong CPU usage stats, so I may be
mistaken. Is there a way (or what is the preferred way) to increase
the priority of a task relative to other system processes (rather than
relative to tasks in the queue), so that packets can avoid being
dropped while the system runs other, non-essential tasks?

3b) Is there a way to lock down a task such as a NIC receive task to
give it absolute priority or exclusive use of a CPU? The goal is to
make certain that the task doesn't yield before it completes some
minimum amount of work. (A userland sketch of the kind of pinning I
mean is in the P.S. below.)

The reason I'm doing what you consider "bogus" is to get a handle on
various overheads, cache trade-offs of spreading across CPUs, etc. So
please don't berate me too badly for being a crappy programmer, as I
actually do know what I'm doing. One problem is the lack of
documentation, so much of the learning has to be done by trial and
error. If there's a document on the 8.0 scheduler, I'm sure many of us
would like to see it.

In my world, working "fairly well" isn't good enough, and I don't take
anyone's word that something is better if they can't demonstrate it
with actual numbers that prove it, particularly when the claim defies
logic. Most people do benchmarks completely wrong. A driver's
efficiency is measured by how much of the CPU it uses to complete a
particular workload. A good driver will happily trade off some
per-connection latency for a 20% increase in overall efficiency, or at
least make that tunable for various environments.

It's my view that it would be better to just suck packets out of the
ring and queue them for the upper layers, but I don't yet have a
handle on the trade-offs. Currently the system drops too many packets
unnecessarily at extremely high load.

BTW, I ran netperf and a fetch loop overnight ("real" data) routed by
a machine with the "bogus" em setup without encountering any panics or
data loss.

Barney
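P.S. Regarding 3b: to be concrete about the kind of pinning I mean,
the userland analogue is cpuset_setaffinity(2); what I'm after is the
equivalent for a kernel taskqueue thread or ithread. A minimal sketch:

/*
 * Userland analogue of the pinning asked about in 3b: bind the
 * calling thread to a single CPU via cpuset_setaffinity(2).
 */
#include <sys/param.h>
#include <sys/cpuset.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
        cpuset_t mask;

        CPU_ZERO(&mask);
        CPU_SET(2, &mask);              /* run only on CPU 2 */

        if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
            sizeof(mask), &mask) != 0)
                err(1, "cpuset_setaffinity");

        printf("pinned to CPU 2\n");
        /* ... latency-sensitive work would go here ... */
        return (0);
}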