Re: Fine-grained locking for POSIX local sockets (UNIX domain sockets)

From: Kris Kennaway <kris_at_obsecurity.org> Date: Sun, 7 May 2006 17:04:26 -0400 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:38:55 UTC

On Sat, May 06, 2006 at 06:19:08PM -0400, Kris Kennaway wrote:

> x norwatson-8
> + rwatson-8
> +------------------------------------------------------------+
> |                                 +                          |
> |                               + + +               x       x|
> |+                           +  +++ +          x   xxxxx x  x|
> |                   |_________A___M______|        |___A___|  |
> +------------------------------------------------------------+
>     N           Min           Max        Median           Avg        Stddev
> x  10       2601.46       2700.26       2650.52      2653.441     30.758034
> +  10       2240.86       2516.87      2496.085      2468.468     81.868576
> Difference at 95.0% confidence
>         -184.973 +/- 58.1052
>         -6.97106% +/- 2.1898%
>         (Student's t, pooled s = 61.8406)
> 
> We see the drop in performance in both cases indicating that we are in
> the "overloaded" regime.  The fact that your patch seems to give worse
> performance is puzzling at first sight.
> 
> Running mutex profiling (and only keeping the unp mutex entries and
> the 10 most contended for clarity) shows the following:
> 
> norwatson, 8 clients:
> 
> debug.mutex.prof.stats:
>    max        total       count   avg     cnt_hold     cnt_lock name
>      5           40           9     4            0            3 kern/uipc_usrreq.c:170 (unp)
>      8            8           1     8            0            0 vm/uma_core.c:2101 (unpcb)
>     13          283          52     5            0            0 vm/uma_core.c:890 (unpcb)
>     14         1075         200     5            0            0 vm/uma_core.c:1885 (unpcb)
>      4           52          18     2            4            6 kern/uipc_usrreq.c:577 (unp)
>      5           39           9     4            4            2 kern/uipc_usrreq.c:534 (unp)
>      5           35          11     3            6            6 kern/uipc_usrreq.c:974 (unp)
>      5           45          11     4            7            4 kern/uipc_usrreq.c:210 (unp)
>    171         1164           9   129            7            2 kern/uipc_usrreq.c:917 (unp)
>     14           78          20     3           11      2872481 kern/uipc_usrreq.c:709 (unp)
>     70          156          11    14           13            4 kern/uipc_usrreq.c:895 (unp)
>     43          581          20    29           24            6 kern/uipc_usrreq.c:239 (unp)
>     44          429          18    23           26            8 kern/uipc_usrreq.c:518 (unp)
>     55          491          12    40           30           10 kern/uipc_usrreq.c:251 (unp)
> ...
>    449     20000519      320038    62        15158            0 kern/uipc_usrreq.c:431 (so_rcv)
>    459     86616085     2880079    30        15699         4944 kern/uipc_usrreq.c:319 (so_snd)
>    146      2273360      640315     3        27918        29789 kern/kern_sig.c:1002 (process lock)
>    387      3325481      640099     5        38143        47670 kern/kern_descrip.c:420 (filedesc structure)
>    150      1881990      640155     2        64111        49033 kern/kern_descrip.c:368 (filedesc structure)
>    496     13792853     3685885     3       101692       132480 kern/kern_descrip.c:1988 (filedesc structure)
>    207      4061793      551604     7       115427       118242 kern/kern_synch.c:220 (process lock)
>    391     10332282     3685885     2       194387       129547 kern/kern_descrip.c:1967 (filedesc structure)
>    465     25504709      320042    79      1632192       294498 kern/uipc_usrreq.c:364 (unp)
>    470    124263922     2880084    43     13222757      2685853 kern/uipc_usrreq.c:309 (unp)
> 
> i.e. there is indeed heavy contention on the unp lock (column 5 counts
> the number of times we tried to acquire it and failed because someone
> else had the lock) - in fact about 5 times as many contentions as
> successful acquisitions.
> 
> With your patch and the same load:
> 
>      3           20           9     2            0            0 kern/uipc_usrreq.c:1028 (unp_mtx)
>      3           22           9     2            0            0 kern/uipc_usrreq.c:1161 (unp_mtx)
>      5           29           9     3            0            2 kern/uipc_usrreq.c:1065 (unp_global_mtx)
>      5           53          18     2            0        76488 kern/uipc_usrreq.c:287 (unp_global_mtx)
>      6           33           9     3            0            0 kern/uipc_usrreq.c:236 (unp_mtx)
>      6           37           9     4            0            0 kern/uipc_usrreq.c:819 (unp_mtx)
>      7            7           1     7            0            0 vm/uma_core.c:2101 (unpcb)
>      8           49           9     5            0            0 kern/uipc_usrreq.c:1101 (unp_mtx)
>     11          136          18     7            0            1 kern/uipc_usrreq.c:458 (unp_global_mtx)
>     32          143           9    15            0            1 kern/uipc_usrreq.c:1160 (unp_global_mtx)
>     44          472          18    26            0            0 kern/uipc_usrreq.c:801 (unp_mtx)
>    123          310           9    34            0            0 kern/uipc_usrreq.c:1100 (unp_mtx)
>    147          452           9    50            0            0 kern/uipc_usrreq.c:1099 (unp_mtx)
>    172          748           9    83            0            0 kern/uipc_usrreq.c:473 (unp_mtx)
>    337         1592           9   176            0            0 kern/uipc_usrreq.c:1147 (unp_mtx)
>    350         1790           9   198            0            0 kern/uipc_usrreq.c:1146 (unp_mtx)
>    780     39405928      320038   123            0            0 kern/uipc_usrreq.c:618 (unp_mtx)
>     18          140           9    15            1            0 kern/uipc_usrreq.c:235 (unp_global_mtx)
>     70          717          18    39            1            3 kern/uipc_usrreq.c:800 (unp_global_mtx)
>    528         2444           9   271            1            1 kern/uipc_usrreq.c:1089 (unp_global_mtx)
>    158          616           9    68            2            2 kern/uipc_usrreq.c:476 (unp_mtx)
>    794    175382857     2880084    60            2         7686 kern/uipc_usrreq.c:574 (unp_mtx)
>      4           25           9     2            3            2 kern/uipc_usrreq.c:422 (unp_global_mtx)
>    186          874           9    97            3            3 kern/uipc_usrreq.c:472 (unp_global_mtx)
>    768     33783759      320038   105         7442            0 kern/uipc_usrreq.c:696 (unp_mtx)
> ...
>    465       913127      320045     2        43130        35046 kern/uipc_socket.c:1101 (so_snd)
>    483      2453927      628737     3        44768        46177 kern/kern_sig.c:1002 (process lock)
>    767    124298544     2880082    43        70037        59994 kern/uipc_usrreq.c:581 (so_snd)
>    794     45176699      320038   141        83252        72140 kern/uipc_usrreq.c:617 (unp_global_mtx)
>    549      9858281     3200210     3       579269       712643 kern/kern_resource.c:1172 (sleep mtxpool)
>    554     17122245      631715    27       641888       268243 kern/kern_descrip.c:420 (filedesc structure)
>    388      3009912      631753     4       653540       260590 kern/kern_descrip.c:368 (filedesc structure)
>    642     49626755     3681446    13      1642954       682669 kern/kern_descrip.c:1988 (filedesc structure)
>    530     13802687     3681446     3      1663244       616899 kern/kern_descrip.c:1967 (filedesc structure)
>    477     23472709     2810986     8      5671248      1900047 kern/kern_synch.c:220 (process lock)
> 
> The top 10 heavily contended mutexes are very different (but note the
> number of mutex acquisitions, column 3, is about the same).
> 
> There is not much contention on unp_global_mtx any longer, but there
> is a lot more on some of the other mutexes, especially the process
> lock via msleep().  Off-hand I don't know what is the cause of this
> bottleneck (note: libthr is used as threading library and libpthread
> is not ported to sparc64).
> 
> Also, a lot of the contention that used to be on the unp lock seems to
> have fallen through onto contending *two* of the filedesc locks (all
> about 1.6 million contentions).  This may also help to explain the
> performance drop.

I profiled the wait channels in msleep and found:

Before:
   1 getblk
   3 wait
1113 sigwait
2228 umtxq_busy
4505 fdesc
9405 umtx
14264 sbwait
-----
31515 total

After:
   1 getblk
   1 wait
1413 sigwait
2794 umtxq_busy
4638 fdesc
10779 umtx
11663 sbwait
-----
31287 total

i.e. the same number of msleep calls in both cases, but the patch
shuffles them from sbwait onto everything else.

Relative change:
sbwait     -19%
umtx       +14%
fdesc      +3%
umtxq_busy +25%
sigwait    +27%

This may help to understand why performance drops at high load when
unp locking is moved out of the way.

Kris