Gerrit Nagelhout wrote:
> Andrew Gallatin wrote:
>
>>Bruce Evans writes:
>> >
>> > Athlon XP2600 UP system:  !SMP case: 22 cycles  SMP case: 37 cycles
>> > Celeron 366 SMP system:               35                    48
>> >
>> > The extra cycles for the SMP case are just the extra cost of a one lock
>> > instruction.  Note that SMP should cost twice as much extra, but the
>> > non-SMP atomic_store_rel_int(&slock, 0) is pessimized by using xchgl
>> > which always locks the bus.  After fixing this:
>> >
>> > Athlon XP2600 UP system:  !SMP case:  6 cycles  SMP case: 37 cycles
>> > Celeron 366 SMP system:               10                    48
>> >
>> > Mutexes take longer than simple locks, but not much longer unless the
>> > lock is contested.  In particular, they don't lock the bus any more
>> > and the extra cycles for locking dominate (even in the !SMP case due
>> > to the pessimization).
>> >
>> > So there seems to be something wrong with your benchmark.  Locking the
>> > bus for the SMP case always costs about 20+ cycles, but this hasn't
>> > changed since RELENG_4 and mutexes can't be made much faster in the
>> > uncontested case since their overhead is dominated by the bus lock
>> > time.
>> >
>>
>>Actually, I think his tests are accurate and bus locked instructions
>>take an eternity on P4.  See
>>http://www.uwsg.iu.edu/hypermail/linux/kernel/0109.3/0687.html
>>
>>For example, with your test above, I see 212 cycles for the UP case on
>>a 2.53GHz P4.  Replacing the atomic_store_rel_int(&slock, 0) with a
>>simple slock = 0; reduces that count to 18 cycles.
>>
>>If its really safe to remove the xchg* from non-SMP atomic_store_rel*,
>>then I think you should do it.  Of course, that still leaves mutexes
>>as very expensive on SMP (253 cycles on the 2.53GHz from above).
>>
>>Drew
>>
>
> I wonder if there is anything that can be done to make the locking more
> efficient for the Xeon.  Are there any other locking types that could
> be used instead?
> This might also explain why we are seeing much worse system call
> performance under 4.7 in SMP versus UP.  Here is a table of results
> for some system call tests I ran.  (The numbers are calls/s)

Int 0x80 system calls are known to be extremely expensive on a P4.  I
think that Jeff Roberson measured them as taking 300 cycles on average.
Some work was done on implementing the alternate sysenter/sysexit
method, but I don't think it was ever finished.  I think that it was
shown to have a modest speed improvement, but there was still a lot of
overhead that made it slow on a P4.

There are other optimizations that can be done like having a shared
page that lets you avoid calls like getpid and gettimeofday, but it
opens some security risks that have to be dealt with.

Scott

Received on Wed May 05 2004 - 16:55:25 UTC
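
For anyone who wants to reproduce the kind of comparison quoted above, here
is a minimal userland sketch using C11 atomics. It is not the FreeBSD
atomic.h code, and the helper names (try_lock, unlock_xchg, unlock_store,
ns_per_iter) are made up for illustration; only slock comes from the quoted
test. The assumption is that on x86 a compiler will typically emit xchg
(which is implicitly locked) for the exchange and a plain mov for the
release store, so the two unlock paths roughly correspond to the
"pessimized" and "fixed" atomic_store_rel_int variants discussed. It
reports wall-clock nanoseconds per iteration, not cycles.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <time.h>

/* Simple lock word, named after the slock in the quoted test. */
static atomic_int slock;

/* Acquire: compare-and-swap 0 -> 1 (a locked instruction on x86). */
static int try_lock(atomic_int *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        l, &expected, 1, memory_order_acquire, memory_order_relaxed);
}

/* Release with an exchange; on x86 this is typically xchg, which
 * always asserts the bus lock. */
static void unlock_xchg(atomic_int *l)
{
    (void)atomic_exchange_explicit(l, 0, memory_order_release);
}

/* Release with a store; on x86 this is typically a plain mov. */
static void unlock_store(atomic_int *l)
{
    atomic_store_explicit(l, 0, memory_order_release);
}

/* Time iters uncontested lock/unlock pairs and return the average
 * cost in nanoseconds per pair. */
static double ns_per_iter(void (*unlock)(atomic_int *), long iters)
{
    struct timespec t0, t1;

    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        while (!try_lock(&slock))
            ;   /* never spins when uncontested */
        unlock(&slock);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 +
            (t1.tv_nsec - t0.tv_nsec)) / iters;
}
```

Calling ns_per_iter(unlock_xchg, n) and ns_per_iter(unlock_store, n) from a
main() with a large n and comparing the two results should show whether the
extra locked instruction matters on a given CPU. The numbers will include
loop and function-pointer overhead, so treat them as relative only.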