On Sun, 2004-12-05 at 14:08, Robert Watson wrote:
> On Sun, 5 Dec 2004, Sean McNeil wrote:
>
> > > It replaces non-atomic maintenance of the counters with atomic
> > > maintenance. However, this adds measurably to the cost of
> > > allocation, so I've been reluctant to commit it. The counters
> > > maintained by UMA are likely sufficient to generate the desired
> > > mbuf output now that we have mbuma, but I haven't had an
> > > opportunity to walk through the details of it. I hope to do so
> > > once I get closer to merging patches to use critical sections to
> > > protect UMA per-cpu caches, since I need to redo parts of the
> > > sysctl code then anyway. You might want to give this patch, or
> > > one much like it, a spin to confirm that the race is the one I
> > > think it is. The race in updating mbuf allocator statistics is
> > > one I hope to get fixed prior to 5.4.
> >
> > Since they appear not to be required for actual system use (given
> > that a negative value doesn't cause problems), could the counts be
> > computed for display instead?
>
> This is pretty much what UMA does with its per-CPU caches. It pulls
> and pushes statistics from the caches in a couple of situations:
>
> - When pulling a new bucket into or out of the cache, it has to
>   acquire the zone mutex, so it also pushes statistics.
> - When a timer fires every few seconds, all the caches are checked
>   to update the global zone statistics.
> - When the sysctl runs, it replicates the logic in the timer code to
>   also update the zone statistics for display.
>
> And you can already extract pretty much all of the interesting
> allocation information for mbufs from vmstat -z, as the mbufs are
> now stored using UMA. In the critical-section-protected version of
> the code, I haven't yet decided if the timers should run per-cpu,
> and/or how the sysctl should coalesce the information for display.
> I hope to have much of this resolved shortly. My current leaning is
> that a small amount of localized and temporary inconsistency in the
> stats isn't a problem, so simply doing a set of lockless reads
> across the per-cpu caches to update stats for presentation should be
> fine, and that we can probably drop the timer updates of statistics,
> since the cache bucket balancing keeps things pretty much in sync.
>
> I haven't committed the move to critical sections yet, as it's
> currently a performance pessimization for the UP case: entering a
> critical section on UP is more expensive than acquiring a mutex.
> John Baldwin has patches that remedy this but hasn't yet merged them
> (there's also an instability with them that I've seen). I know that
> Stephan Uphoff has also been investigating this issue.

I am wondering if critical sections are overkill for this application,
since the interrupt-blocking capability is not needed. Perhaps a
mutex_lock_local/mutex_unlock_local pair that works on per-CPU mutexes
and requires the current thread to be pinned to a CPU would be more
appropriate. The functions would have the same functionality as
mutex_lock/mutex_unlock but without using atomic operations. The
"local" mutex would be a clone of the UP sleep mutex, and the SMP
sleep mutex and the local mutex could even be used together in any
order (with working priority inheritance). I think I will try this in
my p4 mutex branch in the next few days.

A "do not preempt" section (interrupts enabled) would be another
possibility.

Stephan
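
A rough sketch of what the per-CPU "local" mutex proposed above could
look like follows. To be clear, this is an assumed illustration, not
an existing FreeBSD interface: mtx_local, mtx_lock_local(), and
mtx_unlock_local() are invented names, struct thread and curthread are
stubbed so the fragment compiles standalone, and the sleep/wakeup and
priority-inheritance machinery is elided to comments. It only
demonstrates the central invariant: with every caller pinned to the
mutex's CPU, the owner field is never touched from another CPU, so
plain loads and stores suffice, as in the UP sleep mutex it clones.

	#include <assert.h>
	#include <stddef.h>

	/* Stub for the kernel's thread structure. */
	struct thread {
		int	td_cpu;		/* CPU the thread is pinned to */
	};

	struct mtx_local {
		struct thread	*ml_owner;	/* only accessed from ml_cpu */
		int		 ml_cpu;	/* CPU this mutex belongs to */
	};

	/* Stand-in for the kernel's curthread. */
	static struct thread *curthread;

	void
	mtx_lock_local(struct mtx_local *m)
	{
		/* Precondition: the caller is pinned to the mutex's CPU. */
		assert(curthread->td_cpu == m->ml_cpu);

		/*
		 * A real implementation would also have to make the
		 * test-and-set below safe against preemption by another
		 * thread on this same CPU, e.g. via a short "do not
		 * preempt" window of the kind mentioned above.
		 */
		while (m->ml_owner != NULL) {
			/*
			 * Contention can only come from another thread on
			 * this CPU, so block until the owner releases --
			 * this is where the turnstile/priority-inheritance
			 * logic of a real sleep mutex would go.
			 */
		}
		m->ml_owner = curthread;	/* plain store, no atomics */
	}

	void
	mtx_unlock_local(struct mtx_local *m)
	{
		assert(m->ml_owner == curthread);
		m->ml_owner = NULL;		/* plain store, no atomics */
		/* Wake any same-CPU waiters here. */
	}

Unlike a critical section, this leaves interrupts enabled and relies
entirely on the pinning precondition for correctness, which is the
point of the proposal.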
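
Further up, Robert describes having the sysctl do "a set of lockless
reads across the per-cpu caches to update stats for presentation".
Below is a minimal userland sketch of that idea, with per-CPU counters
that are summed only at read time; the names (pcpu_mbstat, NCPU,
coalesce_stats) and layout are invented for illustration and are not
the actual UMA or mbuf structures.

	#include <stdint.h>
	#include <stdio.h>

	#define NCPU 4

	struct pcpu_mbstat {
		uint64_t allocs;	/* written only by its own CPU */
		uint64_t frees;		/* likewise -- no atomics needed */
		char	 pad[48];	/* pad to a 64-byte cache line */
	};

	static struct pcpu_mbstat pcpu_stats[NCPU];

	/* Allocation fast path: touch only the local CPU's counters. */
	static void
	alloc_on_cpu(int cpu)
	{
		pcpu_stats[cpu].allocs++;
	}

	static void
	free_on_cpu(int cpu)
	{
		pcpu_stats[cpu].frees++;
	}

	/*
	 * Presentation path (what the sysctl handler would do): lockless
	 * reads across all per-CPU caches.  Individual counters may be
	 * slightly stale, so the totals are only approximately
	 * consistent -- fine for statistics.
	 */
	static void
	coalesce_stats(uint64_t *allocs, uint64_t *frees)
	{
		*allocs = *frees = 0;
		for (int cpu = 0; cpu < NCPU; cpu++) {
			*allocs += pcpu_stats[cpu].allocs;
			*frees += pcpu_stats[cpu].frees;
		}
	}

	int
	main(void)
	{
		uint64_t allocs, frees;

		/* Simulate a workload spread across the CPUs. */
		for (int i = 0; i < 1000; i++)
			alloc_on_cpu(i % NCPU);
		for (int i = 0; i < 400; i++)
			free_on_cpu(i % NCPU);

		coalesce_stats(&allocs, &frees);
		printf("in use: %llu (allocs %llu, frees %llu)\n",
		    (unsigned long long)(allocs - frees),
		    (unsigned long long)allocs, (unsigned long long)frees);
		return (0);
	}

The padding keeps each CPU's counters on their own cache line so the
no-atomics fast path avoids false sharing. Since each counter has a
single writer, updates are never lost, which addresses the
negative-count symptom discussed at the top; the reader merely sees
the small, localized, temporary inconsistency deemed acceptable above.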