RE: 4.7 vs 5.2.1 SMP/UP bridging performance

From: Gerrit Nagelhout <gnagelhout_at_sandvine.com> Date: Wed, 5 May 2004 21:16:35 -0400

Andrew Gallatin wrote:
> Bruce Evans writes:
> 
>  > 
>  > Athlon XP2600 UP system:  !SMP case: 22 cycles   SMP case: 37 cycles
>  > Celeron 366 SMP system:              35                    48
>  > 
>  > The extra cycles for the SMP case are just the extra cost of a one lock
>  > instruction.  Note that SMP should cost twice as much extra, but the
>  > non-SMP atomic_store_rel_int(&slock, 0) is pessimized by using xchgl
>  > which always locks the bus.  After fixing this:
>  > 
>  > Athlon XP2600 UP system:  !SMP case:  6 cycles   SMP case: 37 cycles
>  > Celeron 366 SMP system:              10                    48
>  > 
>  > Mutexes take longer than simple locks, but not much longer unless the
>  > lock is contested.  In particular, they don't lock the bus any more
>  > and the extra cycles for locking dominate (even in the !SMP case due
>  > to the pessimization).
>  > 
>  > So there seems to be something wrong with your benchmark.  Locking the
>  > bus for the SMP case always costs about 20+ cycles, but this hasn't
>  > changed since RELENG_4 and mutexes can't be made much faster in the
>  > uncontested case since their overhead is dominated by the bus lock
>  > time.
>  > 
> 
> Actually, I think his tests are accurate and bus locked instructions
> take an eternity on P4.  See
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0109.3/0687.html 
> 
> For example, with your test above, I see 212 cycles for the UP case on
> a 2.53GHz P4.  Replacing the atomic_store_rel_int(&slock, 0) with a
> simple slock = 0; reduces that count to 18 cycles.
> 
> If it's really safe to remove the xchg* from non-SMP atomic_store_rel*,
> then I think you should do it.  Of course, that still leaves mutexes
> as very expensive on SMP (253 cycles on the 2.53GHz from above).
> 
> Drew
> 
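
The comparison discussed in the quoted text (releasing a lock with a
bus-locked exchange versus a plain store) can be approximated in user
space. The following is only a rough sketch, not the test program used by
Bruce or Drew: it uses C11 atomics instead of the kernel's
atomic_store_rel_int(), it times with clock_gettime() so the result is in
nanoseconds per lock/unlock pair rather than cycles, and the names
now_ns, acquire, ns_per_pair_xchg_release and ns_per_pair_store_release
are made up for illustration. On x86, an atomic exchange is normally
compiled to xchg (implicitly locked) and a release store to a plain mov.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Stand-in for the lock word in the quoted test: 0 = free, 1 = held. */
static atomic_int slock;

/* Hypothetical helper: monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Uncontested acquire via compare-and-swap (a locked instruction on x86).
 * With a single thread and a free lock this always succeeds. */
static void acquire(void)
{
    int expected = 0;
    atomic_compare_exchange_strong_explicit(&slock, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

/* Release with an atomic exchange: two locked operations per pair. */
double ns_per_pair_xchg_release(long iters)
{
    assert(iters > 0);
    uint64_t t0 = now_ns();
    for (long i = 0; i < iters; i++) {
        acquire();
        (void)atomic_exchange_explicit(&slock, 0, memory_order_release);
    }
    return (double)(now_ns() - t0) / (double)iters;
}

/* Release with a plain release store: one locked operation per pair. */
double ns_per_pair_store_release(long iters)
{
    assert(iters > 0);
    uint64_t t0 = now_ns();
    for (long i = 0; i < iters; i++) {
        acquire();
        atomic_store_explicit(&slock, 0, memory_order_release);
    }
    return (double)(now_ns() - t0) / (double)iters;
}
```

Calling both functions with the same iteration count (say, a few million)
and comparing the two results gives a rough idea of what the extra locked
instruction costs on a given CPU. Loop overhead is included in both
numbers, so only the difference is meaningful.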

I wonder if there is anything that can be done to make the locking more
efficient for the Xeon.  Are there any other locking types that could
be used instead?
This might also explain why we are seeing much worse system call 
performance under 4.7 in SMP versus UP.  Here is a table of results
for some system call tests I ran.  (The numbers are calls/s)

2.8GHz Xeon
                       UP         SMP
write              904427      661312
socket            1327692     1067743
select             554131      434390
gettimeofday      1734963      252479

1.3GHz PIII
                       UP         SMP
write              746705      532223
socket            1179819      977448
select             727811      556537
gettimeofday      1849862      186387
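
The program that produced the calls/s figures above is not included in
this message. A minimal sketch of how such a rate could be measured is
shown here; it is an assumption about the general approach, not the
actual test. The names monotonic_ns, op_gettimeofday and calls_per_second
are hypothetical. Note that on some newer systems gettimeofday() may be
served without entering the kernel, in which case the figure is not a
pure system call cost.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical helper: monotonic clock reading in nanoseconds. */
static uint64_t monotonic_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* One operation under test: a single gettimeofday() call. */
int op_gettimeofday(void)
{
    struct timeval tv;
    return gettimeofday(&tv, NULL);
}

/* Run op() iters times back to back and return the achieved calls/s. */
double calls_per_second(int (*op)(void), long iters)
{
    assert(op != NULL && iters > 0);
    uint64_t t0 = monotonic_ns();
    for (long i = 0; i < iters; i++)
        (void)op();
    uint64_t elapsed = monotonic_ns() - t0;
    if (elapsed == 0)
        elapsed = 1;   /* avoid dividing by zero on a very coarse clock */
    return (double)iters * 1e9 / (double)elapsed;
}
```

For example, calls_per_second(op_gettimeofday, 1000000) returns the
gettimeofday rate; other calls such as write, socket or select could be
wrapped in the same int (*)(void) shape and passed to the same harness.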

The really interesting one is gettimeofday.  On both the Xeon and the
PIII, UP is far faster than SMP (roughly 7x and 10x respectively), and
the UP result for the PIII is even better than the UP result for the
Xeon.  I may try to get the results for 5.2.1 later.  I can forward the
source code of this program to anyone who wants to try it out.
Thanks,

Gerrit