On Wed, 5 May 2004, Gerrit Nagelhout wrote:

> Andrew Gallatin wrote:
> > If it's really safe to remove the xchg* from non-SMP atomic_store_rel*,
> > then I think you should do it.  Of course, that still leaves mutexes
> > as very expensive on SMP (253 cycles on the 2.53GHz from above).

See my other reply [1 memory barrier but not 2 seems to be needed for
each lock/unlock pair in the !SMP case, and the xchgl accidentally (?)
provides it; perhaps [lms]fence would give a faster memory barrier].

More ideas on this:
- Compilers should probably now generate memory barrier instructions for
  volatile variables (so volatile variables would be even slower :-).
  I haven't seen gcc on i386's do this.
- jhb once tried changing mtx_lock_spin(mtx)/mtx_unlock_spin(mtx) to
  critical_enter()/critical_exit().  This didn't work because it broke
  mtx_assert().  It might also not work because it removes the memory
  barrier.  critical_enter() only has the very weak memory barrier in
  disable_intr() on i386's.

> I wonder if there is anything that can be done to make the locking more
> efficient for the Xeon.  Are there any other locking types that could
> be used instead?

I can't think of anything for the SMP case.  See above for the !SMP case.

> This might also explain why we are seeing much worse system call
> performance under 4.7 in SMP versus UP.  Here is a table of results
> for some system call tests I ran.  (The numbers are calls/s)
>
>                2.8GHz Xeon
>                UP         SMP
> write          904427     661312
> socket         1327692    1067743
> select         554131     434390
> gettimeofday   1734963    252479
>
>                1.3GHz PIII
>                UP         SMP
> write          746705     532223
> socket         1179819    977448
> select         727811     556537
> gettimeofday   1849862    186387

It's why the Xeon is relatively slower under -current and SMP.
-current just does more locking and more of other things.

> The really interesting one is gettimeofday.  For both the Xeon & PIII,
> the UP is much better than SMP, but the UP for PIII is better than that
> of the Xeon.
> I may try to get the results for 5.2.1 later.  I can forward the
> source code of this program to anyone else who wants to try it out.

gettimeofday() is slower for SMP because it uses a different
timecounter.  This is a hardware problem -- there is no good
timecounter available.  It looks like the TSC timecounter is being
used for the UP cases and either the i8254 or the ACPI-slow
timecounter for the SMP cases.

Reading the TSC takes about 10-12 cycles on most i386's (probably many
more on P4 ;-).  Syscall overhead adds a lot to this, but
gettimeofday() still takes much less than a microsecond.  The fastest
I've seen recently is 260nS/578 cycles for clock_gettime() on an
AthlonXP.  OTOH, reading the i8254 takes about 4000nS, so
clock_gettime() takes 4190nS with the i8254 timecounter on the same
AthlonXP system that takes 260nS with the TSC timecounter.  This
system also has a slow ACPI timer, so clock_gettime() takes 1397nS
with the ACPI-fast timecounter and about 3 times as long with the
ACPI-slow timecounter.  Recently-fixed bugs made it often use the
ACPI-slow timecounter although the ACPI-fast timecounter always works.

Slow timecounters mainly affect workloads that do too many context
switches or timestamps on tinygrams.  Probably for yours but not mine.
I only notice them when I run microbenchmarks.  The simplest one that
shows them is "ping -fq localhost".  There are normally 7 timestamps
per packet (1 to put in the packet in userland, 2 for bookkeeping in
userland, 2 for pessimization of netisrs in the kernel and 2 for
tripping on our own Giant foot in the kernel).  RELENG_4 only has the
userland ones.  With reasonably fast CPUs (1GHz+ or so) and slow
timecounters, making even one of these timestamps takes longer than
everything else.

Bruce

Received on Thu May 06 2004 - 01:19:12 UTC