Re: clang compiled kernel panic when mounting zfs root on i386

From: Bruce Evans <brde@optusnet.com.au>
Date: Tue, 27 Nov 2012 08:21:05 +1100 (EST)
On Mon, 26 Nov 2012, Konstantin Belousov wrote:

> On Mon, Nov 26, 2012 at 06:31:34AM -0800, sig6247 wrote:
>>
>> Just checked out r243529, this only happens when the kernel is compiled
>> by clang, and only on i386, either recompiling the kernel with gcc or
>> booting from a UFS root works fine. Is it a known problem?
> It looks like clang uses more stack than gcc, and zfs makes quite
> deep call chains.
>
> It would be a waste, generally, to increase the init process kernel
> stack size only to pacify zfs.  And I suspect that it would not help
> in similar situations when the same procedure is initiated for
> non-root mounts.

Or to pacify clang...

>> ----------------------------------------------------------------------
>> WARNING: WITNESS option enabled, expect reduced performance.
>> Trying to mount root from zfs:zroot []...
>>
>> Fatal double fault:
>> eip = 0xc0adc37d
>> esp = 0xc86bffc8
>> ebp = 0xc86c003c
>> cpuid = 1; apic id = 01
>> panic: double fault
>> cpuid = 1
>> KDB: enter: panic
>> [ thread pid 1 tid 100002 ]
>> Stopped at      kdb_enter+0x3d: movl    $0,kdb_why
>> db> bt
>> Tracing pid 1 tid 100002 td 0xc89efbc0
>> kdb_enter(c1064aa4,c1064aa4,c10b806f,c139e3b8,f5eacada,...) at kdb_enter+0x3d
>> panic(c10b806f,1,1,1,c86c003c,...) at panic+0x14b
>> dblfault_handler() at dblfault_handler+0xab
>> --- trap 0x17, eip = 0xc0adc37d, esp = 0xc86bffc8, ebp = 0xc86c003c ---
>> witness_checkorder(c1fd7508,9,c109df18,7fa,0,...) at witness_checkorder+0x37d
>> __mtx_lock_flags(c1fd7518,0,c109df18,7fa,c135d918,...) at __mtx_lock_flags+0x87
>> uma_zalloc_arg(c1fd66c0,0,1,4d3,c86c0110,...) at uma_zalloc_arg+0x605
>> vm_map_insert(c1fd508c,c13dfc10,bb1f000,0,cba1e000,...) at vm_map_insert+0x499
>> kmem_back(c1fd508c,cba1e000,1000,3,c86c01d4,...) at kmem_back+0x76
>> kmem_malloc(c1fd508c,1000,3) at kmem_malloc+0x250
>> page_alloc(c1fd1d80,1000,c86c020b,3,c1fd1d80,...) at page_alloc+0x27
>> keg_alloc_slab(103,4,c109df18,870,cb99ef6c,...) at keg_alloc_slab+0xc3
>> keg_fetch_slab(103,c1fd1d80,cb99ef6c,c1fc8230,c86c02c0,...) at keg_fetch_slab+0xe2
>> zone_fetch_slab(c1fd1d80,c1fd0480,103,826,0,...) at zone_fetch_slab+0x43
>> uma_zalloc_arg(c1fd1d80,0,102,3,2,...) at uma_zalloc_arg+0x3f2
>> malloc(4c,c1686100,102,c86c0388,c173d09a,...) at malloc+0xe9
>> zfs_kmem_alloc(4c,102,cb618820,c89efbc0,cb618820,...) at zfs_kmem_alloc+0x20
>> vdev_mirror_io_start(cb8218a0,10,cb8218a0,1,0,...) at vdev_mirror_io_start+0x14a
>> zio_vdev_io_start(cb8218a0,c89efbc0,0,cb8218a0,c86c0600,...) at zio_vdev_io_start+0x228
>> zio_execute(cb8218a0,cb618000,cba1b640,cb900000,400,...) at zio_execute+0x106
>> spa_load_verify_cb(cb618000,0,cba1b640,cb884b40,c86c0600,...) at spa_load_verify_cb+0x89
>> traverse_visitbp(cb884b40,cba1b640,c86c0600,c86c0ba0,0,...) at traverse_visitbp+0x29f
>> traverse_dnode(cb884b40,0,0,8b,0,...) at traverse_dnode+0x92
>> traverse_visitbp(cb884bb8,cba07200,c86c0890,cb884bf4,c16ce7e0,...) at traverse_visitbp+0xe47
>> traverse_visitbp(cb884bf4,cb9bf840,c86c0968,c86c0ba0,0,...) at traverse_visitbp+0xf32
>> traverse_dnode(cb884bf4,0,0,0,0,...) at traverse_dnode+0x92
>> traverse_visitbp(0,cb618398,c86c0b50,2,cb9f1c78,...) at traverse_visitbp+0x96d
>> traverse_impl(0,0,cb618398,3e1,0,...) at traverse_impl+0x268
>> traverse_pool(cb618000,3e1,0,d,c1727830,...) at traverse_pool+0x79
>> spa_load(0,1,c86c0ec4,1e,0,...) at spa_load+0x1dde
>> spa_load(0,0,c13d8c94,1,3,...) at spa_load+0x11a5
>> spa_load_best(0,ffffffff,ffffffff,1,c0adc395,...) at spa_load_best+0x71
>> spa_open_common(c17e0e1e,0,0,c86c1190,c16f5a1c,...) at spa_open_common+0x11a
>> spa_open(c86c1078,c86c1074,c17e0e1e,c135d918,c1fd7798,...) at spa_open+0x27
>> dsl_dir_open_spa(0,cb770030,c17e11b1,c86c11f8,c86c11f4,...) at dsl_dir_open_spa+0x6c
>> dsl_dataset_hold(cb770030,cb613800,c86c1240,cb613800,cb613800,...) at dsl_dataset_hold+0x3a
>> dsl_dataset_own(cb770030,0,cb613800,c86c1240,c1684e30,...) at dsl_dataset_own+0x21
>> dmu_objset_own(cb770030,2,1,cb613800,c86c1290,...) at dmu_objset_own+0x2a
>> zfsvfs_create(cb770030,c86c13ac,c17ee05d,681,0,...) at zfsvfs_create+0x4c
>> zfs_mount(cb78ed20,c17f411c,c9ff4600,c89cae80,0,...) at zfs_mount+0x42c
>> vfs_donmount(c89efbc0,4000,0,c86c1790,cb6c0800,...) at vfs_donmount+0xc6d
>> kernel_mount(cb7700b0,4000,0,0,1,...) at kernel_mount+0x6b
>> parse_mount(cb7700e0,c1194498,0,1,0,...) at parse_mount+0x606
>> vfs_mountroot(c13d95b0,4,c105c042,2bb,0,...) at vfs_mountroot+0x6cf
>> start_init(0,c86c1d08,c105e94c,3db,0,...) at start_init+0x6a
>> fork_exit(c0a42090,0,c86c1d08) at fork_exit+0x7f
>> fork_trampoline() at fork_trampoline+0x8
>> --- trap 0, eip = 0, esp = 0xc86c1d40, ebp = 0 ---
>> db>

A call chain 43 deep (before the double fault) is disgusting, but even
if clang breaks stack alignment due to a wrong default, with no
-mpreferred-stack-boundary to fix it, that's still only about 8*43
extra bytes (8 being the average extra stack needed to align each
frame to 16 bytes).  Probably zfs is also putting large data
structures on the stack.
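
As a back-of-the-envelope illustration (mine, not measured, and the
struct size is an assumption): the usual i386 kernel stack is 2 pages
(8K), so 43 frames leave a budget of under about 190 bytes each, and
a single recursive function with large locals blows it easily:

#include <stdio.h>

/* Stand-in for a zfs block-pointer-sized local; 128 bytes is a guess. */
struct big_local {
	char pad[128];
};

/* Model the cost of a traverse_visitbp()-style recursion. */
static size_t
stack_needed(int depth, size_t overhead)
{

	return ((size_t)depth * (sizeof(struct big_local) + overhead));
}

int
main(void)
{

	/* 43 frames, ~64 bytes of retaddr/saved regs/alignment each. */
	printf("need %zu of 8192 bytes\n", stack_needed(43, 64));
	return (0);
}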

It would be useful if the stack trace printed the stack pointer on
every function call, so that you could see how much stack each
function used.
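
A sketch of how that could work on i386 (my code, not ddb's, assuming
the usual frame-pointer chain): each saved %ebp points at the caller's
frame, so the difference between successive frame pointers is one
frame's stack use.

#include <stdio.h>

struct i386_frame {
	struct i386_frame *f_frame;	/* saved %ebp (caller's frame) */
	void		  *f_retaddr;	/* return address */
};

static void
print_stack_use(struct i386_frame *fp)
{

	/* The stack grows down, so callers live at higher addresses. */
	while (fp->f_frame != NULL && fp->f_frame > fp) {
		printf("return %p: frame used %ld bytes\n", fp->f_retaddr,
		    (long)((char *)fp->f_frame - (char *)fp));
		fp = fp->f_frame;
	}
}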

All those ', ...' printed after 5 args show further apparent clang
lossage.  vfs_mountroot takes no args at all, but the above shows
it taking 5 known args and further unknown args.  ddb's stack tracer
does limited parsing of the bytes in a function's caller to guess the
number of args.  It usually guesses right for gcc.  It apparently
usually guesses wrong for clang.
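
The idea is roughly this (a simplified sketch of my own, not the
actual db_trace.c code): decode the instruction that follows the call
in the caller, and treat a stack pop there as the caller popping args.

/*
 * "addl $imm8, %esp" encodes as 83 C4 ib; each 4 bytes popped is
 * assumed to be one arg.  Anything else is unparseable, so fall
 * back to printing 5 args plus ", ...".
 */
static int
guess_nargs(const unsigned char *after_call)
{

	if (after_call[0] == 0x83 && after_call[1] == 0xc4)
		return (after_call[2] / 4);	/* addl $imm8, %esp */
	return (5);				/* give up */
}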

I can't test this for i386 since ref10-i386 is down, but for amd64 the
parsing seems to be already too difficult for gcc and impossible for
clang: for the function call in `void bar(void); void test(void) {
bar(); }', clang generates a tail call:

% test:                                   # @test
% 	.cfi_startproc
% # BB#0:                                 # %entry
% 	xorb	%al, %al
% 	jmp	bar                     # TAILCALL
% .Ltmp0:

The bytes after the tail call are unparseable.  The bytes before the
tail call are difficult to parse, and AFAIR ddb doesn't look at them
(the xorb says that there are no xmm args, but that is of no interest
in the kernel).

Gcc generates:

% test:
% .LFB2:
% 	subq	$8, %rsp
% .LCFI0:
% 	movl	$0, %eax
% 	call	bar
% 	addq	$8, %rsp
% 	ret

The addq $8,%rsp after the call looks like it is popping 1 arg and is
probably misguessed as such -- it is actually there to restore the
stack alignment before the return.  Actually, this guess doesn't even
apply on amd64 -- the first few args are passed in registers, where
their number is almost impossible to guess even for gcc if there are
none on the stack; only when there are some on the stack can the
number there possibly be guessed, and then the total number of args =
the number in registers + the number on the stack.
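
To illustrate the arithmetic (my example, amd64 SysV ABI): a call
passing 8 integer args puts the first 6 in %rdi, %rsi, %rdx, %rcx,
%r8, %r9 and the last 2 on the stack.  If the caller cleans up with

	call	baz
	addq	$16, %rsp		# pops the 2 stack args

then a parser can infer the 2 stack args, but getting the total of
8 = 6 in registers + 2 on the stack also requires knowing that all 6
register slots were used -- which the bytes don't say.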

With 1 arg: `void bar(int x); void foo(int x) { bar(x); }', the
generated code is identical for both compilers.  So to have any chance
of guessing the number of args for bar(), the bytes in bar()'s
caller's callers would have to be checked, recursively.  This is too
hard for ddb.  With tail calls, the checking is unbounded.
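
Presumably both compilers leave x in %edi and emit the same shape
around the call (my reconstruction, not verified output), so the bytes
give no hint that an arg was passed at all:

foo:
	subq	$8, %rsp		# align only; x already in %edi
	call	bar
	addq	$8, %rsp		# again looks like it pops an arg
	ret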

Tail calls also break the stack trace, but save space so running out of
kernel stack is less likely :-).

It is arguably a bug to compile kernels with options that make stack
tracing impossible.  Even passing args in registers makes correct arg
printing impossible -- ddb would have to assume that there are always
at least 3 (?) args on amd64.  The above shows 3 being printed for
fork_exit(), which is correct, 3 for kmem_malloc(), which is correct,
and none for fork_trampoline(), which depends on ddb having special
knowledge of this and a few other low-level/asm functions.  All the
others are broken.

A very old bug in this area is that functions written in asm don't
have normal frames (at least on amd64 and i386) unless profiling is
configured.  They are mostly leaf functions that shouldn't trap, so
this is not very important for interactive use of ddb.  But it makes
the args of a function like memcpy() harder to find in interactive
use, and it breaks stack traces if an interrupt or trap occurs in such
functions.  Compilers may also omit frame pointers for leaf functions,
but this is easy to prevent.  The efficiency gains from not using
frame pointers are small or negative, at least on modern x86, since
the frame pointer management can run in parallel and is especially
fast in leaf functions where the frame pointer is not actually used
(except for debugging), so frame pointers should never be omitted if
there is any chance that the code needs to be debugged.  The chance is
almost 100% for kernels, since they can be debugged if GDB or DDB or
stack tracing is configured, or if panic dumps are enabled and occur.
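
For reference, giving such an asm function a normal frame only takes
the standard prologue/epilogue (an i386 sketch using the ENTRY()/END()
macros from <machine/asm.h>; the function name is made up):

ENTRY(traceable_copy)
	pushl	%ebp			# save caller's frame pointer
	movl	%esp, %ebp		# establish our frame for ddb
	/* ... body ... */
	leave				# movl %ebp,%esp; popl %ebp
	ret
END(traceable_copy)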

A not-so-old bug in this area is that compiling with -O2 or just with
-march tends to break things.  Even on old i386 kernels compiled with
-O -march=i386, I mostly see 5 args (but no ', ...').  For just a few
functions, the calling sequence is "call foo; addl $4*N,%esp" and ddb
correctly guesses that this means N args.  Another sequence is "call
foo; movl %eax,%esi; addl $8,%esp".  ddb doesn't understand this at
all, and misguesses that there are 5 args.
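
A more tolerant parser could skip the scheduled register moves before
looking for the pop (again a sketch of mine, not ddb's code):

/*
 * Like guess_nargs(), but skip reg-to-reg "movl %reg, %reg"
 * (opcode 89 with ModRM mod=11), which -O2 schedules between the
 * call and the "addl $imm8, %esp" (83 C4 ib).
 */
static int
guess_nargs_opt(const unsigned char *p)
{

	while (p[0] == 0x89 && (p[1] & 0xc0) == 0xc0)
		p += 2;			/* movl %reg, %reg */
	if (p[0] == 0x83 && p[1] == 0xc4)
		return (p[2] / 4);	/* N/4 args */
	return (5);			/* still unparseable */
}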

Bruce