On Tue, 4 May 2004, Gerrit Nagelhout wrote:

> I ran the following fragment of code to determine the cost of a LOCK &
> UNLOCK on both UP and SMP:
>
> #define EM_LOCK(_sc)    mtx_lock(&(_sc)->mtx)
> #define EM_UNLOCK(_sc)  mtx_unlock(&(_sc)->mtx)
>
> unsigned int startTime, endTime, delta;
>
> startTime = rdtsc();
> for (i = 0; i < 100; i++)
> {
>     EM_LOCK(adapter);
>     EM_UNLOCK(adapter);
> }
> endTime = rdtsc();
> delta = endTime - startTime;
> printf("delta %u start %u end %u \n", (unsigned int)delta, startTime,
>     endTime);
>
> On a single hyperthreaded Xeon 2.8GHz, it took ~30 cycles per
> LOCK&UNLOCK pair (dividing the total by 100) under UP, and ~300 cycles
> under SMP. Assuming 10 locks for every packet (which is conservative),
> at 500Kpps this accounts for 300 * 10 * 500000 = 1.5 billion cycles
> (out of 2.8 billion cycles). Any comments?

One of the sets of changes I have in a local branch coalesces interface
unlock/lock operations. Right now, if you look at the incoming packet
handling in interface code, it tends to read:

    struct mbuf *m;

    while (packets_ready(sc)) {
        m = read_packet(sc);
        /* Drop the driver lock around each call into the stack. */
        XX_UNLOCK(sc);
        ifp->if_input(ifp, m);
        XX_LOCK(sc);
    }

I revised the structure for some testing as follows:

    struct mbuf *m, *mqueue, *mqueue_tail;

    /* Drain the hardware into a local queue while holding the lock. */
    mqueue = mqueue_tail = NULL;
    while (packets_ready(sc)) {
        m = read_packet(sc);
        if (mqueue != NULL) {
            mqueue_tail->m_nextpkt = m;
            mqueue_tail = m;
        } else
            mqueue = mqueue_tail = m;
    }

    /* Then hand the whole batch to the stack with the lock dropped. */
    if (mqueue != NULL) {
        XX_UNLOCK(sc);
        while (mqueue != NULL) {
            m = mqueue;
            mqueue = mqueue->m_nextpkt;
            m->m_nextpkt = NULL;
            ifp->if_input(ifp, m);
        }
        XX_LOCK(sc);
    }

Obviously, if done properly, you'd want to bound the size of the
temporary queue (a sketch of a bounded variant appears below), but even
in basic testing I wasn't able to measure an improvement on the hardware
I had on hand at the time. However, I need to re-run this in a
post-netperf world and with 64-bit PCI to see whether it helps now.
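For concreteness, here is a minimal sketch of what that bounding might
look like. It reuses the illustrative packets_ready()/read_packet()
helpers and the XX_LOCK/XX_UNLOCK macros from above; XX_BATCH_MAX is a
hypothetical tunable, not an existing constant:

    struct mbuf *m, *mqueue, *mqueue_tail;
    int batch;

    while (packets_ready(sc)) {
        /* Gather at most XX_BATCH_MAX packets under the lock. */
        mqueue = mqueue_tail = NULL;
        batch = 0;
        while (batch < XX_BATCH_MAX && packets_ready(sc)) {
            m = read_packet(sc);
            if (mqueue != NULL) {
                mqueue_tail->m_nextpkt = m;
                mqueue_tail = m;
            } else
                mqueue = mqueue_tail = m;
            batch++;
        }
        if (mqueue == NULL)
            break;
        /* Deliver the batch with the lock dropped, preserving order. */
        XX_UNLOCK(sc);
        while (mqueue != NULL) {
            m = mqueue;
            mqueue = mqueue->m_nextpkt;
            m->m_nextpkt = NULL;
            ifp->if_input(ifp, m);
        }
        XX_LOCK(sc);
    }

The bound limits both the burst handed to the stack and how long packets
sit in the local queue, while still amortizing each lock drop over
several packets.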
One important thing in this process, though, is to avoid reordering of
packets -- they need to remain serialized by source interface. Doing
that at this queue is easy, but if we start passing chains of packets
into other pieces of the stack, we'll need to be careful wherever
multiple queues get involved. Even simple and relatively infrequent
packet reordering can make TCP quite unhappy.

The fact that the above didn't help performance suggests two things:
first, that my testbed has other bottlenecks, such as PCI bus bandwidth,
and second, that the primary cost currently involved isn't in these
mutexes.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert_at_fledge.watson.org   Senior Research Scientist, McAfee Research