Re: vmstat's entries type

From: Michal Mertl <mime_at_traveller.cz>
Date: Wed, 26 Jul 2006 03:12:13 +0200

Oliver Fromme wrote:
> John Baldwin <jhb at freebsd.org> wrote:
>  > On Sunday 23 July 2006 20:03, Sten Daniel Sørsdal wrote:
>  > > sthaug at nethelp.no wrote:
>  > > > > > One approach that we could use for 64-bit counters would be to just
>  > > > > > use 32-bits one, and poll them for overflow and bump an overflow
>  > > > > > count.  This assumes that the 32-bit counters overflow much less often
>  > > > > > than the polling interval, and easily triples the amount of storage
>  > > > > > for each of them...  It is ugly :-(
>  > > > > > 
>  > > > > What's wrong with the add+adc (asm) approach found on any i386?
>  > > > 
>  > > > Presumably the fact that add + adc isn't an atomic operation. So if
>  > > > you want to guarantee 64 bit consistency, you need locking or similar.
>  > > > 
>  > > 
>  > > Would it not be necessary to do this locking anyway?
>  > > I don't see how polling for overflow would help this consistency.
>  > > Are both suggestions insufficient?
>  > 
>  > I actually think that add + adc is ok for the case of incrementing simple 
>  > counters.  You can even do 'inc ; addc $0'
> 
> (I'm familiar with asm programming, but I'm not a low-level
> threading or SMP expert, so please excuse me if this is a 
> dumb question ...)
> 
> If you just do add+adc (or inc+adc) and another thread (on
> the same or different processor, I don't know) happens to
> read the counter value at the same time (i.e. after the
> lower 32bit have overflowed, but before the upper 32bit get
> incremented), then that other thread would get a value
> that's off by 2^32.
> 
> What am I missing?

I don't remember all the details, but when I proposed (with patches)
the change to the network counters several years ago, I gave in to the
(possibly right) opposition. Probably from BDE, I don't remember.

Your explanation of a possible failure scenario is just one example of
what can go wrong there.
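
To make that scenario concrete, here is a minimal single-threaded
sketch that replays the bad interleaving by hand. It is not taken from
any patch; split_counter, split_add_lo, split_adc_hi and torn_read_demo
are made-up names, and the two halves stand in for the 'add' and 'adc'
steps on a 32-bit machine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 64-bit counter kept as two 32-bit halves, as on i386. */
struct split_counter {
	uint32_t lo;
	uint32_t hi;
};

/* What a reader sees if it loads both halves at this instant. */
static uint64_t
split_read(const struct split_counter *c)
{
	return (((uint64_t)c->hi << 32) | c->lo);
}

/* Step 1 of the increment ("add"): bump the low half, return the carry. */
static uint32_t
split_add_lo(struct split_counter *c)
{
	c->lo++;
	return (c->lo == 0);
}

/* Step 2 of the increment ("adc"): fold the carry into the high half. */
static void
split_adc_hi(struct split_counter *c, uint32_t carry)
{
	assert(carry <= 1);
	c->hi += carry;
}

/*
 * Replay the interleaving: a reader samples the counter between the
 * two steps.  Returns the value that reader would observe.
 */
static uint64_t
torn_read_demo(struct split_counter *c)
{
	uint32_t carry = split_add_lo(c);
	uint64_t seen = split_read(c);	/* the other thread runs here */

	split_adc_hi(c, carry);
	return (seen);
}
```

Starting from lo = 0xffffffff and hi = 0, the reader in the middle
observes 0 while the finished counter holds 2^32, which is exactly the
"off by 2^32" case described above.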

It (an 'add' instruction followed by 'adc', or more generally working
with 64-bit counters on a 32-bit architecture, or even more generally
working with an integer in kernel context in a multiprocessor
environment) can go wrong in more "exotic" ways on architectures
without implicitly coherent caches: you can read an old value of
something, modify it, and write it back, overwriting a much more recent
copy, or something like that.

Even a simple increment may not be fully safe (it is also, in the end,
a read-modify-write operation, which can be, in theory at least,
interrupted between any two of its steps). I have not studied it
enough, but it makes sense to me and I believe these were among the
reasons why 64-bit counters on 32-bit i386 were rejected at the time.
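
As an illustration of the read-modify-write problem, the sketch below
splits one increment into its load, add and store steps and replays a
single bad ordering of two increments. It is a deterministic
simulation with invented names (shared_counter, struct rmw,
lost_update_demo), not real concurrent code, and it assumes the
increment really is carried out as three separate steps.

```c
#include <assert.h>

/* Hypothetical shared counter: a plain unsigned int, no protection. */
static unsigned int shared_counter;

/* Private scratch register of one incrementing context. */
struct rmw {
	unsigned int tmp;
};

static void rmw_load(struct rmw *r)  { r->tmp = shared_counter; }
static void rmw_add(struct rmw *r)   { r->tmp += 1; }
static void rmw_store(struct rmw *r) { shared_counter = r->tmp; }

/*
 * Replay one bad interleaving of two increments, A and B.  Both load
 * the same old value, so one of the two updates is lost.
 */
static unsigned int
lost_update_demo(unsigned int start)
{
	struct rmw a, b;

	shared_counter = start;
	rmw_load(&a);
	rmw_load(&b);		/* B reads before A has stored */
	assert(a.tmp == b.tmp);	/* both saw the same stale value */
	rmw_add(&a);
	rmw_store(&a);
	rmw_add(&b);
	rmw_store(&b);		/* overwrites A's result */
	return (shared_counter);
}
```

Two increments starting from 10 leave the counter at 11 instead of 12.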

The modifications of the counters may be wrapped into preprocessor
macros though. The right implementation of the macro can be 100%
correct, but it will add a big overhead - e.g. the lock instruction
prefix (needed on i386 SMP) possibly takes hundreds of cycles to
execute.
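
A rough sketch of such a macro wrapper follows. It is not the macro
from my patch: COUNTER_SAFE, COUNTER_ADD, COUNTER_INC, COUNTER_READ,
counter_t and ipackets are names invented for this example, and the
"safe" expansion assumes a C11 compiler with <stdatomic.h> rather than
the kernel's own atomic primitives.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Define COUNTER_SAFE to pay for an atomic read-modify-write on every
 * update; leave it undefined for the cheap "hope for the best" form.
 */
#ifdef COUNTER_SAFE
#include <stdatomic.h>
typedef _Atomic uint64_t counter_t;
#define	COUNTER_ADD(c, n) \
	atomic_fetch_add_explicit(&(c), (n), memory_order_relaxed)
#define	COUNTER_READ(c) \
	atomic_load_explicit(&(c), memory_order_relaxed)
#else
typedef uint64_t counter_t;
#define	COUNTER_ADD(c, n)	((c) += (n))
#define	COUNTER_READ(c)		(c)
#endif
#define	COUNTER_INC(c)		COUNTER_ADD((c), 1)

/* Hypothetical statistic; callers never touch it except via the macros. */
static counter_t ipackets;

/* Count n packets and return the new total. */
static uint64_t
count_packets(unsigned int n)
{
	uint64_t before = COUNTER_READ(ipackets);

	for (unsigned int i = 0; i < n; i++)
		COUNTER_INC(ipackets);
	/* Single-threaded here, so neither expansion can lose an update. */
	assert(COUNTER_READ(ipackets) == before + n);
	return (COUNTER_READ(ipackets));
}
```

The point of the indirection is that the callers stay the same while
the cost/correctness trade-off is decided in one place.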

Therefore, I think that we should either go with per-CPU copies of the
counter in whatever size is appropriate and have the total be the sum
of the values (possibly also taking care of overflow) or we should just
accept the status quo - use something "natural" for the architecture
(e.g. int or long) and hope for the best (a wrong counter normally
doesn't cause any problems). It (int or sometimes long) has been good
enough for decades.
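
The per-CPU variant could look roughly like the sketch below. It is
only an illustration: NCPU, the 64-byte padding, and the names
pcpu_slot, pcpu_counter, pcpu_counter_inc and pcpu_counter_fetch are
assumptions made for the example, the CPU number is passed in by hand
instead of coming from the kernel, and overflow of an individual slot
is not handled.

```c
#include <assert.h>
#include <stdint.h>

#define	NCPU	4	/* hypothetical fixed CPU count */

/*
 * One slot per CPU in the architecture's natural size, padded to an
 * assumed 64-byte cache line so CPUs do not share a line.
 */
struct pcpu_slot {
	unsigned long value;
	char pad[64 - sizeof(unsigned long)];
};

struct pcpu_counter {
	struct pcpu_slot slot[NCPU];
};

/* Increment: touches only the given CPU's slot, no lock prefix. */
static void
pcpu_counter_inc(struct pcpu_counter *c, unsigned int cpu)
{
	assert(cpu < NCPU);
	c->slot[cpu].value++;
}

/*
 * Read: sum all slots into a 64-bit total.  The result may be slightly
 * stale, because slots can change while the loop is running.
 */
static uint64_t
pcpu_counter_fetch(const struct pcpu_counter *c)
{
	uint64_t total = 0;

	for (unsigned int cpu = 0; cpu < NCPU; cpu++)
		total += c->slot[cpu].value;
	return (total);
}
```

In a real kernel the increment would still have to cope with being
preempted or interrupted on its own CPU, which this sketch ignores.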

The first way (per-CPU counters) shouldn't be that difficult to do
(almost?) correctly either. I just wanted to propose the change with
less potential for a bikeshed (even a possibly fruitful one :-)) and
higher potential for a real change in the sources (and better value of
the counters for me - I believe that I won't build any new 32-bit
system, so long should be long enough for me). I believe I should be
able to code both ways (and can easily test on i386 and AMD64, but
these are not good architectures to test on AFAIK, as they ensure cache
coherency).

My patch for network counters wrapped every counter operation in a
macro which could have been expanded to different code. Doing the
operations absolutely safely was terribly inefficient. Even per-CPU
increments probably aren't 100% safe and that was the reason
PCPU_LAZY_INC was introduced - so that consumers know they can't really
rely on the counter value.

Michal