Re: KSE and SMP problem in FreeBSD/amd64 5.3-BETA3, namely KSE doesn't make use of SMP.

From: Julian Elischer <julian_at_elischer.org>
Date: Sat, 11 Sep 2004 23:13:16 -0700
Firstly, I am very happy to see your mail.
We need all bug reports, even bad ones :-)

I have been working on fixing problems of this sort for 5.3 over the
last few weeks, and will be able to examine your work more closely
in a few days. I just want you to know that your email will be
worked on, even if you do not hear anything immediately.

More notes below.


NAKATA Maho wrote:
> Dear amd64 freaks, I noticed that there seems to be a bug
> in KSE with SMP configuration.
> 
> Here, I describe my problem in detail.
> 
> The math/atlas port utilizes SMP through threading: with 2 processors
> you can get nearly double the performance, so KSE is the key
> technology for SMP. However, on amd64, KSE doesn't utilize the
> second CPU at all.
> 
> My machine is:
> Tyan S2885
> Opteron 1.6GHz x 2
> 2G bytes of memory
> 
> I confirmed that:
> o On FreeBSD/amd64 5.2.1-RELEASE, the program doesn't work at all with KSE
> (it dumps core or gets a memory fault), while without KSE it works well but
> shows no performance gain (using libmap.conf; this is not shown here).

this is expected.

> 
> o On FreeBSD/amd64 5.3-BETA3, the program at least works with KSE, but
> doesn't utilize SMP.

I will try to examine this together with Peter and Dan over the next few days.
Please show me the output in 5.3 of sysctl kern.threads and kern.sched.

Also, there will be improvements in BETA4, I hope.

Which scheduler are you using?
Please also show the ldd output for your program.
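
For anyone following along, the requested information can be gathered in one go. The following is only a minimal sketch: it assumes the test binary is ./xdlutst_pt in the current directory, and the helper names diagnostic_commands and collect are made up for illustration. It does nothing more than wrap the sysctl and ldd invocations asked for above.

```python
import subprocess


def diagnostic_commands(binary="./xdlutst_pt"):
    """Return the commands asked for in this mail, as argv lists."""
    return [
        ["sysctl", "kern.threads"],
        ["sysctl", "kern.sched"],
        ["ldd", binary],  # path to the test binary is an assumption
    ]


def collect(commands):
    """Run each command and return (command line, output) pairs.

    A command that cannot be started is reported in the output text
    instead of raising, so one failure does not hide the other results.
    """
    results = []
    for argv in commands:
        try:
            proc = subprocess.run(argv, capture_output=True, text=True)
            output = proc.stdout + proc.stderr
        except OSError as exc:
            output = f"could not run: {exc}"
        results.append((" ".join(argv), output))
    return results


if __name__ == "__main__":
    for cmdline, output in collect(diagnostic_commands()):
        print(f"$ {cmdline}")
        print(output)
```

Run it from the THREADED directory on the affected machine and paste the whole output into a reply.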

> o FreeBSD/i386 5.2.1-RELEASE and 5.3-BETA3 both work well.
> 
> How to repeat:
> (it takes many hours to build math/atlas, so I put the work dir at)

at?


> 
> CVSup your ports tree, please use:
> # $FreeBSD: ports/math/atlas/Makefile,v 1.27 2004/09/02 00:25:45 maho Exp $
> 
> 0a. prepare opteron SMP machine, and install FreeBSD/amd64 5.3-BETA3.
> 1a. cd /usr/ports/math/atlas
> 2a. make
> 3a. wait for a long time
> 4a. cd /usr/ports/math/atlas/work/ATLAS/bin/THREADED 
> 5a. make xdlutst (it took only seconds)
> 6a. make xdlutst_pt (it took only seconds)
> 7a. type ./xdlutst -N 1000 2000 200  (this doesn't utilize SMP and KSE)
> NREPS  Major      M      N    lda  NPVTS      TIME     MFLOP     RESID
> =====  =====  =====  =====  =====  =====  ========  ========  ========
>     0  Col     1000   1000   1000    995     0.301  2210.755 3.821e-02
>     0  Col     1200   1200   1200   1194     0.504  2282.569 3.793e-02
>     0  Col     1400   1400   1400   1395     0.794  2303.707 2.843e-02
>     0  Col     1600   1600   1600   1595     1.156  2360.557 2.893e-02
>     0  Col     1800   1800   1800   1793     1.637  2374.130 2.803e-02
>     0  Col     2000   2000   2000   1990     2.192  2431.838 2.744e-02
> 
> 6 cases ran, 6 cases passed
> 
> 
> 8a. type ./xdlutst_pt -N 2000 3000 200
>  ./xdlutst_pt -N 2000 3000 200
> NREPS  Major      M      N    lda  NPVTS      TIME     MFLOP     RESID
> =====  =====  =====  =====  =====  =====  ========  ========  ========
>     0  Col     2000   2000   2000   1990     2.286  2332.527 2.744e-02
>     0  Col     2200   2200   2200   2194     2.764  2567.795 2.639e-02
>     0  Col     2400   2400   2400   2394     3.766  2446.449 2.721e-02
>     0  Col     2600   2600   2600   2593     4.722  2480.761 2.472e-02
>     0  Col     2800   2800   2800   2795     5.855  2499.038 2.441e-02
>     0  Col     3000   3000   3000   2992     7.302  2464.553 2.442e-02
> 
> 6 cases ran, 6 cases passed
> 
> Please see the MFLOP column. This indicates the FLOPS of the calculation.
> The performance of a 1.6GHz Opteron is 2.4 GFLOPS for LU decomposition,
> and as you can see there is no performance gain :(
>
> Typical output of top looks like this:
> 
>   PID USERNAME PRI NICE   SIZE    RES STATE  C   TIME   WCPU    CPU COMMAND
>   716 root     134    0   185M   179M CPU0   0   1:05 21.09% 21.09% xdlutst_pt
>   716 root     134    0   185M   179M RUN    0   1:05 19.53% 19.53% xdlutst_pt
>   716 root      20    0   185M   179M kserel 1   1:05  0.00%  0.00% xdlutst_pt
>   716 root      20    0   185M   179M ksesig 1   1:05  0.00%  0.00% xdlutst_pt
>   716 root      20    0   185M   179M kserel 0   1:05  0.00%  0.00% xdlutst_pt
> 
> The two threads of xdlutst_pt always run on *ONLY ONE CPU* (either CPU0 or CPU1)
> --------------------------------------------------------------------
> Next, I tried the i386 version.
> 
> 0i. prepare opteron SMP machine same as above, and install FreeBSD/i386
> 5.3-BETA3.
> CVSup your ports tree.
> 
> 1i. cd /usr/ports/math/atlas
> 2i. make
> 3i. wait for a long time
> 4i. cd /usr/ports/math/atlas/work/ATLAS/bin/THREADED 
> 5i. make xdlutst (it took only seconds)
> 6i. make xdlutst_pt (it took only seconds)
> 7i. type ./xdlutst -N 1000 2000 200  (this doesn't utilize SMP and KSE)
> ./xdlutst -N 1000 2000 200
> NREPS  Major      M      N    lda  NPVTS      TIME     MFLOP     RESID
> =====  =====  =====  =====  =====  =====  ========  ========  ========
>     0  Col     1000   1000   1000    995     0.307  2170.617 3.437e-02
>     0  Col     1200   1200   1200   1194     0.522  2204.335 3.482e-02
>     0  Col     1400   1400   1400   1395     0.799  2286.888 4.150e-02
>     0  Col     1600   1600   1600   1595     1.164  2345.104 3.598e-02
>     0  Col     1800   1800   1800   1793     1.616  2405.542 3.601e-02
>     0  Col     2000   2000   2000   1990     2.218  2403.157 3.436e-02
> 
> 6 cases ran, 6 cases passed
> 
> 8i. type ./xdlutst_pt -N 3000 4000 200 (this utilizes KSE and so makes
> full use of SMP)
> ./xdlutst_pt -N 3000 4000 200
> NREPS  Major      M      N    lda  NPVTS      TIME     MFLOP     RESID
> =====  =====  =====  =====  =====  =====  ========  ========  ========
>     0  Col     3000   3000   3000   2992     7.157  2514.351 3.650e-02
>     0  Col     3200   3200   3200   3186     5.127  4259.986 3.207e-02
>     0  Col     3400   3400   3400   3392     5.867  4465.006 3.528e-02
>     0  Col     3600   3600   3600   3589     6.791  4579.468 3.519e-02
>     0  Col     3800   3800   3800   3791     8.510  4297.730 3.285e-02
>     0  Col     4000   4000   4000   3995     9.207  4633.234 3.218e-02
> 
> 6 cases ran, 6 cases passed
> 
> Yes, there is a performance gain from utilizing SMP.
>
> Typical output of top looks like this:
> 
>   PID USERNAME PRI NICE   SIZE    RES STATE  C   TIME   WCPU    CPU COMMAND
>   714 root     139    0   301M   300M CPU1   1   2:16 66.41% 66.41% xdlutst_pt
>   714 root     139    0   301M   300M RUN    0   2:16 66.41% 66.41% xdlutst_pt
>   714 root      20    0   301M   300M kserel 1   2:16  0.00%  0.00% xdlutst_pt
>   714 root      20    0   301M   300M kserel 0   2:16  0.00%  0.00% xdlutst_pt
>   714 root      20    0   301M   300M ksesig 0   2:16  0.00%  0.00% xdlutst_pt
> 
> Summary:
> The differences between 8a and 8i are:
> o There is no performance gain in 8a, whereas 8i gains nearly double.
> o The top output indicates that with KSE on amd64 the two threads are
> created correctly, but the scheduling is somewhat odd: both running
> threads end up on the same processor, even though the threads appear
> to be spread over different processors.
> 
> You can try this easily; the work directories of these two builds are available at:
> http://people.freebsd.org/~maho/atlas/atlas-work-opteron_dual-amd64.tar.bz 
> http://people.freebsd.org/~maho/atlas/atlas-work-opteron_dual-i386.tar.bz
> 
> MD5 (atlas-work-opteron_dual-amd64.tar.bz) = 9d9d7e8b00b34a783b7d2172bc404e23
> MD5 (atlas-work-opteron_dual-i386.tar.bz) = 8076a753c7b3edaea7bd446c6473f120
> 
> Can anybody fix it?

Yes, we will try.
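
To put a number on the "no gain" versus "nearly double" observation, the MFLOP columns quoted above can be compared directly. This is only a rough sketch: the values are copied from the four tables in the report, the name peak_speedup is made up for illustration, and the serial and threaded runs used different matrix sizes, so the ratio is an estimate rather than a controlled measurement.

```python
# MFLOP values copied from the tables quoted in the report (steps 7a, 8a, 7i, 8i).
MFLOP = {
    ("amd64", "serial"): [2210.755, 2282.569, 2303.707, 2360.557, 2374.130, 2431.838],
    ("amd64", "threaded"): [2332.527, 2567.795, 2446.449, 2480.761, 2499.038, 2464.553],
    ("i386", "serial"): [2170.617, 2204.335, 2286.888, 2345.104, 2405.542, 2403.157],
    ("i386", "threaded"): [2514.351, 4259.986, 4465.006, 4579.468, 4297.730, 4633.234],
}


def peak_speedup(arch):
    """Best threaded MFLOP rate divided by best serial MFLOP rate."""
    return max(MFLOP[(arch, "threaded")]) / max(MFLOP[(arch, "serial")])


if __name__ == "__main__":
    for arch in ("amd64", "i386"):
        print(f"{arch}: {peak_speedup(arch):.2f}x")
```

With these figures the peak ratio comes out at about 1.06x on amd64 and about 1.93x on i386, which matches the description in the summary.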




> 
> Best regards,
> --nakata maho
> 
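
The two top listings quoted above can also be checked mechanically for which CPUs the runnable threads were on. A small sketch, assuming the 12-column layout shown in the report, and treating a thread as runnable when its STATE is RUN or CPUn; the function name running_cpus is made up for illustration. Note that this is a single snapshot, and the C column only reflects what top displayed at that moment.

```python
# Thread lines copied from the two top listings in the report.
AMD64_TOP = """\
  716 root     134    0   185M   179M CPU0   0   1:05 21.09% 21.09% xdlutst_pt
  716 root     134    0   185M   179M RUN    0   1:05 19.53% 19.53% xdlutst_pt
  716 root      20    0   185M   179M kserel 1   1:05  0.00%  0.00% xdlutst_pt
  716 root      20    0   185M   179M ksesig 1   1:05  0.00%  0.00% xdlutst_pt
  716 root      20    0   185M   179M kserel 0   1:05  0.00%  0.00% xdlutst_pt
"""

I386_TOP = """\
  714 root     139    0   301M   300M CPU1   1   2:16 66.41% 66.41% xdlutst_pt
  714 root     139    0   301M   300M RUN    0   2:16 66.41% 66.41% xdlutst_pt
  714 root      20    0   301M   300M kserel 1   2:16  0.00%  0.00% xdlutst_pt
  714 root      20    0   301M   300M kserel 0   2:16  0.00%  0.00% xdlutst_pt
  714 root      20    0   301M   300M ksesig 0   2:16  0.00%  0.00% xdlutst_pt
"""


def running_cpus(top_text):
    """Return the set of CPU numbers (C column) of runnable threads."""
    cpus = set()
    for line in top_text.splitlines():
        fields = line.split()
        if len(fields) != 12:
            continue  # not a thread line in the expected layout
        state, cpu = fields[6], fields[7]
        if state == "RUN" or state.startswith("CPU"):
            cpus.add(int(cpu))
    return cpus


if __name__ == "__main__":
    print("amd64:", sorted(running_cpus(AMD64_TOP)))
    print("i386: ", sorted(running_cpus(I386_TOP)))
```

On the amd64 listing both runnable threads show CPU 0, while on the i386 listing they show CPU 0 and CPU 1, consistent with the second point of the summary.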
Received on Sun Sep 12 2004 - 04:13:21 UTC