John Baldwin wrote:
>I figured out a deadlock I got on a 4 CPU test box last night, and I think I
>have tracked it down to a bug in 4BSD that can be worked around by turning
>off the idle halt via the machdep.cpu_idle_hlt sysctl. Specifically, it
>involves waking up halted CPUs when a thread that is pinned or bound to a
>specific CPU is made runnable. The details of what I saw today are below:
>
>FreeBSD/i386 (deimos.baldwin.cx) (ttyd0)
>
>login: KDB: enter: Line break on console
>[thread pid 13 tid 100006 ]
>Stopped at      kdb_enter+0x30: leave
>db> show pcpu
>cpuid        = 3
>curthread    = 0xc1dfd5c0: pid 11 "idle: cpu3"
>curpcb       = 0xdc608da0
>fpcurthread  = none
>idlethread   = 0xc1dfd5c0: pid 11 "idle: cpu3"
>APIC ID      = 3
>currentldt   = 0x28
>db> ps
>  pid   proc     uid  ppid  pgrp  flag    stat                                cmd
>79079 c2ebe1fc  1001 79067   610 0000012 [LOCK vm page queue mutex c1f90c40] make
>79078 c2f2ede4  1001 78894   610 0000002 [LOCK vm page queue mutex c1f90c40] sh
>79077 c3498000  1001 79043   610 0004002 [LOCK vm page queue mutex c1f90c40] make
>79075 c2f317f0  1001 78898   610 0004002 [LOCK vm page queue mutex c1f90c40] make
>79074 c2184de4  1001 78902   610 0004002 [LOCK vm page queue mutex c1f90c40] make
>79067 c23c71fc  1001 78955   610 0004002 [SLPQ ppwait 0xc23c71fc][SLP]       make
>79060 c23e35f4  1001 78915   610 0004002 [LOCK Giant c1e41800]               make
>79043 c2a50000  1001 79042   610 0004002 [SLPQ wait 0xc2a50000][SLP]         sh
>79042 c2f493f8  1001 78884   610 0004002 [SLPQ select 0xc075e024][SLP]       make
>78978 c28de7f0  1001 78975   610 0004002 [RUNQ]                              sh
>78975 c29bf1fc  1001 78970   610 0004002 [SLPQ select 0xc075e024][SLP]       make
>...
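For anyone hitting the same hang, the workaround mentioned above is a one-line sysctl; the `/etc/sysctl.conf` line for persistence is standard FreeBSD practice rather than anything stated in this thread:

```shell
# Stop idle CPUs from executing HLT, so they keep polling their run queues
# and notice newly runnable pinned threads even without a wakeup IPI.
# (Trades idle power for not depending on the missing IPI.)
sysctl machdep.cpu_idle_hlt=0

# To make it persist across reboots, add to /etc/sysctl.conf:
# machdep.cpu_idle_hlt=0
```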
>   40 c1e67000     0     0     0 0000204 [IWAIT]                 swi1: net
>   39 c1e671fc     0     0     0 0000204 [IWAIT]                 swi3: vm
>   38 c1e4b5f4     0     0     0 000020c [LOCK Giant c1e41800]   swi4: clock sio
>   37 c1e4b7f0     0     0     0 0000204 [IWAIT]                 irq13:
>   36 c1e4b9ec     0     0     0 0000204 [IWAIT]                 irq0: clk
>   35 c1e4bbe8     0     0     0 0000204 [IWAIT]                 irq23: ahc0 ahc1
>   34 c1e4bde4     0     0     0 0000204 [IWAIT]                 irq22:
>   33 c1e5b000     0     0     0 0000204 [IWAIT]                 irq21:
>   32 c1e5b1fc     0     0     0 0000204 [IWAIT]                 irq20:
>   31 c1e5b3f8     0     0     0 0000204 [LOCK Giant c1e41800]   irq19: fxp0 uhci0
>...
>   14 c1dfc5f4     0     0     0 000020c [Can run]               idle: cpu0
>   13 c1dfc7f0     0     0     0 000020c [CPU 1]                 idle: cpu1
>   12 c1dfc9ec     0     0     0 000020c [CPU 2]                 idle: cpu2
>   11 c1dfcbe8     0     0     0 000020c [CPU 3]                 idle: cpu3
>
>Thus, CPU 0 is running a make process while CPUs 1, 2, and 3 are all idle.
>Note that softclock() and irq19 (me trying to ssh in this morning over fxp0)
>are both blocked on Giant. Also note that there are 5 threads blocked on the
>vm page queue mutex. First, let's see who owns Giant and what they are doing.
>I was doing a benchmark, so I didn't have witness compiled in or turned on, so
>'show alllocks' wasn't an option.
>
>db> x/x Giant,10
>Giant:          c0730028  c06ffb2c  c06ffb2c  b0000  0
>Giant+0x14:     0         0         c2ebd8a2  0
>Giant+0x24:     0         c07300a0  0         0
>time_monitor:   92c86     7         420c903b
>
>The owner is 0xc2ebd8a2, but minus the 0x2 since that is a flag:
>
>db> show threads
>...
>  100286 (0xc2ebd8a0)  sched_switch(c2ebd8a0,0,1,77,fcb6cb7e) at sched_switch+0x170
>...
>  100189 (0xc28dfcf0)  sched_switch(c28dfcf0,0,2,dff58a7c,95a67b30) at sched_switch+0x170
>...
>db> tr 100286
>Tracing pid 79075 tid 100286 td 0xc2ebd8a0
>sched_switch(c2ebd8a0,0,1,77,fcb6cb7e) at sched_switch+0x170
>mi_switch(1,0,c07026d2,254,c075d568) at mi_switch+0x1d9
>turnstile_wait(c076bda0,c28dfcf0,c2ebd8a0,d17bc930,c21a79e8) at turnstile_wait+0x269
>_mtx_lock_sleep(c076bda0,c2ebd8a0,0,c0704083,dc5) at _mtx_lock_sleep+0x101
>_mtx_lock_flags(c076bda0,0,c0704083,dc5,c0704083) at _mtx_lock_flags+0x45
>vfs_clean_pages(d17bc930,0,c0704083,423,c21a79b4) at vfs_clean_pages+0x70
>bdwrite(d17bc930,c3bdfa00,80,2000,0) at bdwrite+0x2f0
>ffs_update(c46bf564,0,1,50,e00efb2c) at ffs_update+0x396
>ufs_inactive(e00efb44,e00efb5c,c05978ea,e00efb44,0) at ufs_inactive+0x225
>VOP_INACTIVE_AP(e00efb44,0,c0705028,768,c074e7a0) at VOP_INACTIVE_AP+0x4e
>vrele(c46bf564,c074e2a0,c46bf564,1,c255a700) at vrele+0x11a
>vn_close(c46bf564,1,c255a700,c2ebd8a0,0) at vn_close+0x50
>vn_closefile(c30e6318,c2ebd8a0,c06fddb9,846,c30e6318) at vn_closefile+0xf4
>fdrop_locked(c30e6318,c2ebd8a0,c06fddb9,831) at fdrop_locked+0x94
>fdrop(c30e6318,c2ebd8a0,c06fddb9,77c,c0515cfd,c0737ce0,1000,c2ebd8a0,e00efcac,c056009c,1,e00efc80,c255a700,0,e00efc90,e00efc90,c0513783,3f,c31b93c8,0,6,c31b9300,e00efcb0,c051389d,c31b9300,0,6) at fdrop+0x3c
>closef(c30e6318,c2ebd8a0,c06fddb9,3e8,c2ebd8a0) at closef+0x3f2
>close(c2ebd8a0,e00efd14,4,e00efd00,1) at close+0x1f2
>syscall(2f,2f,2f,809c200,0) at syscall+0x2a0
>Xint0x80_syscall() at Xint0x80_syscall+0x1f
>--- syscall (6, FreeBSD ELF32, close), eip = 0x805c743, esp = 0xbfbfd18c, ebp = 0xbfbfd1a8 ---
>
>Looking up the pid in the ps output above, I see that this thread is one of
>the threads blocked on the vm page queue mutex. Ok, so let's look at that
>lock's owner:
>
>db> x/x 0xc076bda0,10
>vm_page_queue_mtx:        c0730028  c07100af  c07100af  b0000
>vm_page_queue_mtx+0x10:   0         0         0         c28dfcf2
>vm_page_queue_mtx+0x20:   0         0         c1e673f8  f540
>vm_pages_needed:          0         0         0         1552
>
>The lock owner is 0xc28dfcf2.
>I looked that up via show threads and got:
>
>db> tr 100189
>Tracing pid 78978 tid 100189 td 0xc28dfcf0
>sched_switch(c28dfcf0,0,2,dff58a7c,95a67b30) at sched_switch+0x170
>mi_switch(2,0,c0701343,252,0) at mi_switch+0x1d9
>critical_exit(c074bcc0,c052ddd5,c0755f18,0,c28dfcf0) at critical_exit+0x7e
>intr_execute_handlers(c074bcc0,dff58b10,c076bc00,dff58b1c,c28dfcf0) at intr_execute_handlers+0xa4
>atpic_handle_intr(0) at atpic_handle_intr+0xaf
>Xatpic_intr0() at Xatpic_intr0+0x20
>--- interrupt, eip = 0xc054dd3c, esp = 0xdff58b54, ebp = 0xdff58b68 ---
>critical_exit(dff58b94,c06bbcbd,c075cae0,0,c0715535) at critical_exit+0xcc
>_mtx_unlock_spin_flags(c075cae0,0,c0715535,252,e) at _mtx_unlock_spin_flags+0x23
>pmap_invalidate_page(c29c869c,8064000,2,766,c19d6d10) at pmap_invalidate_page+0xdd
>pmap_enter(c29c869c,8064000,c19d6d10,3,0) at pmap_enter+0x33d
>vm_fault(c29c85dc,8064000,2,8,c28dfcf0) at vm_fault+0x13e7
>trap_pfault(dff58d48,1,80643e8,bfbfe060,80643e8) at trap_pfault+0xf1
>trap(2f,2f,2f,0,8069000) at trap+0x21a
>calltrap() at calltrap+0x5
>--- trap 0xc, eip = 0x80563cf, esp = 0xbfbfe0d0, ebp = 0xbfbfe0d8 ---
>
>So this is the thread that is in state RUNQ even though I have three idle
>CPUs. Very curious! I then wondered if my current CPU (CPU 3) thought
>that there were any runnable threads:
>
>db> call sched_runnable()
>0
>
>0 means no. That's when I figured that the runnable thread must be pinned,
>especially since it was interrupted while messing with the pmap, which
>usually requires pinning. I then tried an experiment to test my theory:
>turn off the idle halt and manually IPI all the other CPUs so that they
>would recheck their run queues, and whichever CPU needed to run the process
>owning the vm page queue mutex would run it. Unfortunately, I think I got
>the vector number wrong, as all 3 CPUs panicked.
>:-P
>
>db> x cpu_idle_hlt
>cpu_idle_hlt:   1
>db> w cpu_idle_hlt 0
>cpu_idle_hlt    0x1 = 0
>db> call lapic_ipi_raw(0x000c00e7, 0)
>0xd1622000
>db> c
>cccpppuuuiiiddd === 102;;; aaapppiiiccc iiiddd === 000102
>
>
>iiinnnssstttrrruuuccctttiiiooonnn pppoooiiinnnttteeerer r =
>= = 0 x008xx:80:8x0:cx00cx60ca50536bae055eb9e
>a5st
>as
>ctkas ctpkao cipkno tipenorti en rt e r = 0 x= 1 0 :0=0x
>x10d0xc:16000xed:cf0bex40dfc
>afb6r40abm
>...
> 857 (make)
>trap number            = 30
>panic: unknown/reserved trap
>cpuid = 0
>Uptime: 1d16h58m16s
>Cannot dump. No dump device defined.
>Automatic reboot in 15 seconds - press a key on the console to abort
>Rebooting...
>cpu_reset: Stopping other CPUs
>
>Thus, my theory is that when the pinned thread was preempted and put back on
>the run queue, the scheduler didn't IPI the CPU it was pinned to, to wake it
>up in case it was idle. The IPI is only needed if the CPUs are halted, which
>is why I think turning the idle halt off might work as a workaround. I don't
>know if ULE has this same issue, but I've cc'd Jeff and hopefully he can look
>into it.
>

There is an option (sysctl even):

kern.sched.ipiwakeup.htt2: 0
kern.sched.ipiwakeup.onecpu: 0    <--------------
kern.sched.ipiwakeup.useloop: 0
kern.sched.ipiwakeup.usemask: 1
kern.sched.ipiwakeup.delivered: 0
kern.sched.ipiwakeup.requested: 0
kern.sched.ipiwakeup.enabled: 1
kern.sched.quantum: 100000

If this is set to 0, it should wake up all idle CPUs, not just one of them.


Received on Fri Feb 11 2005 - 20:22:11 UTC
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:38:28 UTC