I am currently running a stress test with about 30 postgres processes on a dual-Xeon box with an Adaptec RAID controller. I am trying to reproduce some kernel lockups, but in the process I keep getting into a state where no more I/O activity occurs and all the postgres processes appear to be stuck sleeping on a mutex, making no progress. Some of the time fsck_ufs is also running, because of an improper shutdown. The code is based on CURRENT from a couple of weeks ago. After enabling WITNESS, the following messages appear:

Jun 19 18:00:51 TPC-D7-23 lock order reversal
Jun 19 18:00:51 TPC-D7-23 1st 0xcab85294 vm object (vm object) @ /.amd_mnt/gnagelhout-pc3.sandvine.com/host/gerrit_bsd_5_main/fw-bsd/src/sys/vm/swap_pager.c:1313
Jun 19 18:00:51 TPC-D7-23 2nd 0xc0780ba0 swap_pager swhash (swap_pager swhash) @ /.amd_mnt/gnagelhout-pc3.sandvine.com/host/gerrit_bsd_5_main/fw-bsd/src/sys/vm/swap_pager.c:1799
Jun 19 18:00:51 TPC-D7-23 3rd 0xca966108 vm object (vm object) @ /.amd_mnt/gnagelhout-pc3.sandvine.com/host/gerrit_bsd_5_main/fw-bsd/src/sys/vm/uma_core.c:886
Jun 19 18:00:51 TPC-D7-23 Stack backtrace:
Jun 19 18:00:51 TPC-D7-23 backtrace(c06de7a0,ca966108,c06ef9dd,c06ef9dd,c06f05b8) at backtrace+0x17
Jun 19 18:00:51 TPC-D7-23 witness_checkorder(ca966108,9,c06f05b8,376,ca924e00) at witness_checkorder+0x5f3
Jun 19 18:00:51 TPC-D7-23 _mtx_lock_flags(ca966108,0,c06f05b8,376,ca924e14) at _mtx_lock_flags+0x32
Jun 19 18:00:51 TPC-D7-23 obj_alloc(ca924e00,1000,e6897a1b,101,e6897a30) at obj_alloc+0x3f
Jun 19 18:00:51 TPC-D7-23 slab_zalloc(ca924e00,1,ca924e14,8,c06f05b8) at slab_zalloc+0xb3
Jun 19 18:00:51 TPC-D7-23 uma_zone_slab(ca924e00,1,c06f05b8,68f,ca924eb0) at uma_zone_slab+0xda
Jun 19 18:00:51 TPC-D7-23 uma_zalloc_internal(ca924e00,0,1,5c4,1) at uma_zalloc_internal+0x3e
Jun 19 18:00:51 TPC-D7-23 uma_zalloc_arg(ca924e00,0,1,707,2) at uma_zalloc_arg+0x283
Jun 19 18:00:51 TPC-D7-23 swp_pager_meta_build(cab85294,5,0,2,0) at swp_pager_meta_build+0x12e
Jun 19 18:00:51 TPC-D7-23 swap_pager_putpages(cab85294,e6897be0,1,0,e6897b50) at swap_pager_putpages+0x306
Jun 19 18:00:51 TPC-D7-23 default_pager_putpages(cab85294,e6897be0,1,0,e6897b50) at default_pager_putpages+0x2e
Jun 19 18:00:51 TPC-D7-23 vm_pageout_flush(e6897be0,1,0,116,c073bda0) at vm_pageout_flush+0xdb
Jun 19 18:00:51 TPC-D7-23 vm_pageout_clean(c436cb30,0,c06f03a0,33b,0) at vm_pageout_clean+0x2a3
Jun 19 18:00:51 TPC-D7-23 vm_pageout_scan(0,0,c06f03a0,5b7,30d4) at vm_pageout_scan+0x5d5
Jun 19 18:00:51 TPC-D7-23 vm_pageout(0,e6897d48,c06d9172,328,0) at vm_pageout+0x31d
Jun 19 18:00:51 TPC-D7-23 fork_exit(c064ad69,0,e6897d48) at fork_exit+0x77
Jun 19 18:00:51 TPC-D7-23 fork_trampoline() at fork_trampoline+0x8
Jun 19 18:00:51 TPC-D7-23 --- trap 0x1, eip = 0, esp = 0xe6897d7c, ebp = 0 ---

What else can I do to further debug this problem? (The debug options I have in my kernel config are sketched in the P.S. below.)

A second problem I have noticed, with similar symptoms (no more I/O, everything blocked), is that all of my postgres processes end up in the "wdrain" sleep state. The code that is supposed to wake them up (runningbufwakeup) still gets called on occasion, but runningbufspace never drains back down to lorunningspace, and so the wakeup is never issued. I don't know whether this is due to a slow leak of runningbufspace or to some deadlock condition. Any ideas? (The wakeup logic as I read it is paraphrased in the P.P.S. below.)

Thanks,
Gerrit Nagelhout
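P.S. For reference, the debug-related options in the kernel config for these runs look roughly like the following (paraphrasing; the exact file has more in it):

    options INVARIANTS          # extra runtime consistency checks
    options INVARIANT_SUPPORT
    options WITNESS             # lock order checking (source of the LOR above)
    options WITNESS_SKIPSPIN    # skip spin mutexes to reduce overhead
    options DEBUG_LOCKS         # track lockmgr lock holders
    options DDB                 # kernel debugger, to poke at the machine once it wedges
    options BREAK_TO_DEBUGGER   # a serial break drops into DDB

With DDB compiled in it should at least be possible to break in after the hang and use ps, show locks, and show lockedvnods to see which thread holds whatever everyone else is sleeping on.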
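P.P.S. In case it helps, this is approximately the sleep/wakeup pair I am referring to, paraphrased from my reading of sys/kern/vfs_bio.c (rbreqlock, runningbufreq, and hirunningspace are the names as I read them in my checkout; the exact code may differ in yours):

    /*
     * Paraphrased from sys/kern/vfs_bio.c -- not a verbatim copy.
     * Writers block here once too much write I/O is in flight;
     * this is the "wdrain" sleep the postgres processes are stuck in.
     */
    static __inline void
    waitrunningbufspace(void)
    {
            mtx_lock(&rbreqlock);
            while (runningbufspace > hirunningspace) {
                    ++runningbufreq;
                    msleep(&runningbufreq, &rbreqlock, PVM, "wdrain", 0);
            }
            mtx_unlock(&rbreqlock);
    }

    /*
     * Called on write completion to credit back the in-flight space.
     */
    static __inline void
    runningbufwakeup(struct buf *bp)
    {
            if (bp->b_runningbufspace) {
                    atomic_subtract_int(&runningbufspace,
                        bp->b_runningbufspace);
                    bp->b_runningbufspace = 0;
                    mtx_lock(&rbreqlock);
                    /*
                     * The "wdrain" sleepers are only woken once the
                     * in-flight total has drained down to lorunningspace.
                     * If runningbufspace is ever leaked (credited back
                     * less than was charged), this test never succeeds
                     * and the sleepers never wake up.
                     */
                    if (runningbufreq && runningbufspace <= lorunningspace) {
                            runningbufreq = 0;
                            wakeup(&runningbufreq);
                    }
                    mtx_unlock(&rbreqlock);
            }
    }

Given that, a slow leak of runningbufspace (some completion path that skips runningbufwakeup, or charges b_runningbufspace without crediting it back) would produce exactly this permanent wdrain sleep.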