On Tue, 4 May 2004, Gerrit Nagelhout wrote:

> I ran the following fragment of code to determine the cost of a LOCK &
> UNLOCK on both UP and SMP:
>
> #define EM_LOCK(_sc)    mtx_lock(&(_sc)->mtx)
> #define EM_UNLOCK(_sc)  mtx_unlock(&(_sc)->mtx)
>
> unsigned int startTime, endTime, delta;
>
> startTime = rdtsc();
> for (i = 0; i < 100; i++)
> {
>     EM_LOCK(adapter);
>     EM_UNLOCK(adapter);
> }
> endTime = rdtsc();
> delta = endTime - startTime;
> printf("delta %u start %u end %u \n", (unsigned int)delta, startTime,
>     endTime);
>
> On a single hyperthreaded Xeon 2.8GHz, it took ~30 cycles per
> LOCK&UNLOCK pair (dividing the total by 100) under UP, and ~300 cycles
> under SMP. Assuming 10 locks for every packet (which is conservative),
> at 500Kpps this accounts for 300 * 10 * 500000 = 1.5 billion cycles
> (out of 2.8 billion cycles). Any comments?

One of the sets of changes I have in a local branch coalesces interface
unlock/lock operations. Right now, if you look at the incoming packet
handling in interface code, it tends to read:

    struct mbuf *m;

    while (packets_ready(sc)) {
        m = read_packet(sc);
        /* Drop the driver lock around each call into the stack. */
        XX_UNLOCK(sc);
        ifp->if_input(ifp, m);
        XX_LOCK(sc);
    }

I revised the structure for some testing as follows:

    struct mbuf *m, *mqueue, *mqueue_tail;

    /* Drain the hardware into a local queue while holding the lock. */
    mqueue = mqueue_tail = NULL;
    while (packets_ready(sc)) {
        m = read_packet(sc);
        if (mqueue != NULL) {
            mqueue_tail->m_nextpkt = m;
            mqueue_tail = m;
        } else
            mqueue = mqueue_tail = m;
    }

    /* Then hand the whole batch to the stack with the lock dropped. */
    if (mqueue != NULL) {
        XX_UNLOCK(sc);
        while (mqueue != NULL) {
            m = mqueue;
            mqueue = mqueue->m_nextpkt;
            m->m_nextpkt = NULL;
            ifp->if_input(ifp, m);
        }
        XX_LOCK(sc);
    }

Obviously, if done properly, you'd want to bound the size of the
temporary queue (a sketch of a bounded variant appears below), but even
in basic testing I wasn't able to measure an improvement on the hardware
I had on hand at the time. However, I need to re-run this in a
post-netperf world and with 64-bit PCI to see whether it helps now.
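For concreteness, here is a minimal sketch of what that bounding might
look like. It reuses the illustrative packets_ready()/read_packet()
helpers and the XX_LOCK/XX_UNLOCK macros from above; XX_BATCH_MAX is a
hypothetical tunable, not an existing constant:

    struct mbuf *m, *mqueue, *mqueue_tail;
    int batch;

    while (packets_ready(sc)) {
        /* Gather at most XX_BATCH_MAX packets under the lock. */
        mqueue = mqueue_tail = NULL;
        batch = 0;
        while (batch < XX_BATCH_MAX && packets_ready(sc)) {
            m = read_packet(sc);
            if (mqueue != NULL) {
                mqueue_tail->m_nextpkt = m;
                mqueue_tail = m;
            } else
                mqueue = mqueue_tail = m;
            batch++;
        }
        if (mqueue == NULL)
            break;
        /* Deliver the batch with the lock dropped, preserving order. */
        XX_UNLOCK(sc);
        while (mqueue != NULL) {
            m = mqueue;
            mqueue = mqueue->m_nextpkt;
            m->m_nextpkt = NULL;
            ifp->if_input(ifp, m);
        }
        XX_LOCK(sc);
    }

The bound limits both the burst handed to the stack and how long packets
sit in the local queue, while still amortizing each lock drop over
several packets.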
One important thing in this process, though, is to avoid reordering of
packets -- they need to remain serialized by source interface. Doing
that at this queue is easy, but if we start passing chains of packets
into other pieces of the stack, we'll need to be careful wherever
multiple queues get involved. Even simple and relatively infrequent
packet reordering can make TCP quite unhappy.

The fact that the above didn't help performance suggests two things:
first, that my testbed has other bottlenecks, such as PCI bus bandwidth,
and second, that the primary cost currently involved isn't in these
mutexes.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert_at_fledge.watson.org   Senior Research Scientist, McAfee Research