Re: HEADSUP: Turn off cpu_idle_hlt on SMP for now on x86

From: Julian Elischer <julian_at_elischer.org>
Date: Sun, 13 Feb 2005 17:28:33 -0800
John Baldwin wrote:
> I figured out a deadlock I got on a 4 CPU testbox last night and I think I
> have tracked it down to a bug in 4BSD that can be worked around by turning
> off idle halt via the machdep.cpu_idle_hlt sysctl.  Specifically, it involves
> waking up halted CPUs when a thread that is pinned or bound to a specific CPU
> is made runnable.  The details of what I saw today are below:


Did you try setting the sysctl that makes the idle wakeup IPI get sent to
ALL idle CPUs and not just the one that is selected?

> 
> FreeBSD/i386 (deimos.baldwin.cx) (ttyd0)
> 
> login: KDB: enter: Line break on console
> [thread pid 13 tid 100006 ]
> Stopped at      kdb_enter+0x30: leave
> db> show pcpu
> cpuid        = 3
> curthread    = 0xc1dfd5c0: pid 11 "idle: cpu3"
> curpcb       = 0xdc608da0
> fpcurthread  = none
> idlethread   = 0xc1dfd5c0: pid 11 "idle: cpu3"
> APIC ID      = 3
> currentldt   = 0x28
> db> ps
>   pid   proc     uid  ppid  pgrp  flag   stat  wmesg    wchan  cmd
> 79079 c2ebe1fc 1001 79067   610 0000012 [LOCK vm page queue mutex c1f90c40] make
> 79078 c2f2ede4 1001 78894   610 0000002 [LOCK vm page queue mutex c1f90c40] sh
> 79077 c3498000 1001 79043   610 0004002 [LOCK vm page queue mutex c1f90c40] make
> 79075 c2f317f0 1001 78898   610 0004002 [LOCK vm page queue mutex c1f90c40] make
> 79074 c2184de4 1001 78902   610 0004002 [LOCK vm page queue mutex c1f90c40] make
> 79067 c23c71fc 1001 78955   610 0004002 [SLPQ ppwait 0xc23c71fc][SLP] make
> 79060 c23e35f4 1001 78915   610 0004002 [LOCK  Giant c1e41800] make
> 79043 c2a50000 1001 79042   610 0004002 [SLPQ wait 0xc2a50000][SLP] sh
> 79042 c2f493f8 1001 78884   610 0004002 [SLPQ select 0xc075e024][SLP] make
> 78978 c28de7f0 1001 78975   610 0004002 [RUNQ] sh
> 78975 c29bf1fc 1001 78970   610 0004002 [SLPQ select 0xc075e024][SLP] make
> ...
>    40 c1e67000    0     0     0 0000204 [IWAIT] swi1: net
>    39 c1e671fc    0     0     0 0000204 [IWAIT] swi3: vm
>    38 c1e4b5f4    0     0     0 000020c [LOCK  Giant c1e41800] swi4: clock sio
>    37 c1e4b7f0    0     0     0 0000204 [IWAIT] irq13:
>    36 c1e4b9ec    0     0     0 0000204 [IWAIT] irq0: clk
>    35 c1e4bbe8    0     0     0 0000204 [IWAIT] irq23: ahc0 ahc1
>    34 c1e4bde4    0     0     0 0000204 [IWAIT] irq22:
>    33 c1e5b000    0     0     0 0000204 [IWAIT] irq21:
>    32 c1e5b1fc    0     0     0 0000204 [IWAIT] irq20:
>    31 c1e5b3f8    0     0     0 0000204 [LOCK  Giant c1e41800] irq19: fxp0 uhci0
> ...
>    14 c1dfc5f4    0     0     0 000020c [Can run] idle: cpu0
>    13 c1dfc7f0    0     0     0 000020c [CPU 1] idle: cpu1
>    12 c1dfc9ec    0     0     0 000020c [CPU 2] idle: cpu2
>    11 c1dfcbe8    0     0     0 000020c [CPU 3] idle: cpu3
> 
> Thus, CPU 0 is running a make process while CPUs 1, 2, and 3 are all idle.
> Note that softclock() and irq19 (me trying to ssh in this morning over fxp0)
> are both blocked on Giant.  Also note that there are 5 threads blocked on the
> vm page queue mutex.  First, let's see who owns Giant and what they are doing.
> I was doing a benchmark and didn't have witness compiled in or turned on, so
> 'show alllocks' wasn't an option.
> 
> db> x/x Giant,10
> Giant:  c0730028        c06ffb2c        c06ffb2c        b0000           0
> Giant+0x14:     0               0               c2ebd8a2        0
> Giant+0x24:     0               c07300a0        0               0
> time_monitor:   92c86           7               420c903b
> 
> The owner is 0xc2ebd8a2, but the low 0x2 bit is just a lock flag, so the
> actual thread pointer is 0xc2ebd8a0:
> 
> db> show threads
> ...
>   100286 (0xc2ebd8a0)  sched_switch(c2ebd8a0,0,1,77,fcb6cb7e) at sched_switch+0x170
> ...
>   100189 (0xc28dfcf0)  sched_switch(c28dfcf0,0,2,dff58a7c,95a67b30) at sched_switch+0x170
> ...
> db> tr 100286
> Tracing pid 79075 tid 100286 td 0xc2ebd8a0
> sched_switch(c2ebd8a0,0,1,77,fcb6cb7e) at sched_switch+0x170
> mi_switch(1,0,c07026d2,254,c075d568) at mi_switch+0x1d9
> turnstile_wait(c076bda0,c28dfcf0,c2ebd8a0,d17bc930,c21a79e8) at turnstile_wait+0x269
> _mtx_lock_sleep(c076bda0,c2ebd8a0,0,c0704083,dc5) at _mtx_lock_sleep+0x101
> _mtx_lock_flags(c076bda0,0,c0704083,dc5,c0704083) at _mtx_lock_flags+0x45
> vfs_clean_pages(d17bc930,0,c0704083,423,c21a79b4) at vfs_clean_pages+0x70
> bdwrite(d17bc930,c3bdfa00,80,2000,0) at bdwrite+0x2f0
> ffs_update(c46bf564,0,1,50,e00efb2c) at ffs_update+0x396
> ufs_inactive(e00efb44,e00efb5c,c05978ea,e00efb44,0) at ufs_inactive+0x225
> VOP_INACTIVE_AP(e00efb44,0,c0705028,768,c074e7a0) at VOP_INACTIVE_AP+0x4e
> vrele(c46bf564,c074e2a0,c46bf564,1,c255a700) at vrele+0x11a
> vn_close(c46bf564,1,c255a700,c2ebd8a0,0) at vn_close+0x50
> vn_closefile(c30e6318,c2ebd8a0,c06fddb9,846,c30e6318) at vn_closefile+0xf4
> fdrop_locked(c30e6318,c2ebd8a0,c06fddb9,831) at fdrop_locked+0x94
> fdrop(c30e6318,c2ebd8a0,c06fddb9,77c,c0515cfd,c0737ce0,1000,c2ebd8a0,e00efcac,c0
> 56009c,1,e00efc80,c255a700,0,e00efc90,e00efc90,c0513783,3f,c31b93c8,0,6,c31b9300,e00efcb0,c051389d,c31b9300,0,6) at fdrop+0x3c
> closef(c30e6318,c2ebd8a0,c06fddb9,3e8,c2ebd8a0) at closef+0x3f2
> close(c2ebd8a0,e00efd14,4,e00efd00,1) at close+0x1f2
> syscall(2f,2f,2f,809c200,0) at syscall+0x2a0
> Xint0x80_syscall() at Xint0x80_syscall+0x1f
> --- syscall (6, FreeBSD ELF32, close), eip = 0x805c743, esp = 0xbfbfd18c, ebp = 0xbfbfd1a8 ---
> 
> Looking up the pid in the ps output above, I see that this thread is one of the
> threads blocked on the vm page queue mutex.  Ok, so let's look at that lock's
> owner:
> 
> db> x/x 0xc076bda0,10
> vm_page_queue_mtx:      c0730028        c07100af        c07100af        b0000
> vm_page_queue_mtx+0x10: 0               0               0               c28dfcf2
> vm_page_queue_mtx+0x20: 0               0               c1e673f8        f540
> vm_pages_needed:        0               0               0               1552
> 
> Lock owner is 0xc28dfcf2.  I looked that up via show threads and got:
> 
> db> tr 100189
> Tracing pid 78978 tid 100189 td 0xc28dfcf0
> sched_switch(c28dfcf0,0,2,dff58a7c,95a67b30) at sched_switch+0x170
> mi_switch(2,0,c0701343,252,0) at mi_switch+0x1d9
> critical_exit(c074bcc0,c052ddd5,c0755f18,0,c28dfcf0) at critical_exit+0x7e
> intr_execute_handlers(c074bcc0,dff58b10,c076bc00,dff58b1c,c28dfcf0) at intr_execute_handlers+0xa4
> atpic_handle_intr(0) at atpic_handle_intr+0xaf
> Xatpic_intr0() at Xatpic_intr0+0x20
> --- interrupt, eip = 0xc054dd3c, esp = 0xdff58b54, ebp = 0xdff58b68 ---
> critical_exit(dff58b94,c06bbcbd,c075cae0,0,c0715535) at critical_exit+0xcc
> _mtx_unlock_spin_flags(c075cae0,0,c0715535,252,e) at _mtx_unlock_spin_flags+0x23
> pmap_invalidate_page(c29c869c,8064000,2,766,c19d6d10) at pmap_invalidate_page+0xdd
> pmap_enter(c29c869c,8064000,c19d6d10,3,0) at pmap_enter+0x33d
> vm_fault(c29c85dc,8064000,2,8,c28dfcf0) at vm_fault+0x13e7
> trap_pfault(dff58d48,1,80643e8,bfbfe060,80643e8) at trap_pfault+0xf1
> trap(2f,2f,2f,0,8069000) at trap+0x21a
> calltrap() at calltrap+0x5
> --- trap 0xc, eip = 0x80563cf, esp = 0xbfbfe0d0, ebp = 0xbfbfe0d8 ---
> 
> So this is the thread that is in state RUNQ even though I have three idle
> CPUs.  Very curious!  I then wondered if my current CPU (CPU 3) thought
> that there were any runnable threads:
> 
> db> call sched_runnable()
> 0
> 
> 0 means no.  That's when I figured that the runnable thread must be pinned,
> especially since it was interrupted while messing with the pmap, which
> usually requires pinning.  I then tried an experiment to test my theory
> by turning off the idle halt and manually IPI'ing all the other CPUs so that
> they would recheck their run queues, and whichever CPU needed to run the
> process owning the vm page queue mutex would run.  Unfortunately, I think I
> got the vector number wrong, as all 3 CPUs panicked. :-P
> 
> db> x cpu_idle_hlt
> cpu_idle_hlt:   1
> db> w cpu_idle_hlt 0
> cpu_idle_hlt                 0x1        =              0
> db> call lapic_ipi_raw(0x000c00e7, 0)
> 0xd1622000
> db> c
> cccpppuuuiiiddd   ===   102;;;   aaapppiiiccc   iiiddd   ===   000102
> 
> 
> iiinnnssstttrrruuuccctttiiiooonnn   pppoooiiinnnttteerer        r       =
> = = 0 x008xx:80:8x0:cx00cx60ca50536bae055eb9e
> a5st
> as
> ctkas ctpkao cipkno tipenorti   en rt   e  r              =     0 x= 1  0 :0=0x
> x10d0xc:16000xed:cf0bex40dfc
> afb6r40abm
> ...
>   857 (make)
> trap number             = 30
> panic: unknown/reserved trap
> cpuid = 0
> Uptime: 1d16h58m16s
> Cannot dump. No dump device defined.
> Automatic reboot in 15 seconds - press a key on the console to abort
> Rebooting...
> cpu_reset: Stopping other CPUs
> 
> Thus, my theory is that when the pinned thread was preempted and put back on
> the run queue, the scheduler didn't IPI the CPU it was pinned to in order to
> wake it up in case it was halted.  The IPI is only needed if the CPUs are
> halted, which is why I think turning the idle halt off should work as a
> workaround.  I don't know if ULE has this same issue, but I've cc'd Jeff and
> hopefully he can look into it.
> 
Received on Mon Feb 14 2005 - 00:28:39 UTC
