Re: Crash in accounting code: encode_long(), due to bad rusage data?

From: Jeff Roberson <jroberson_at_chesapeake.net> Date: Sun, 19 Aug 2007 14:50:55 -0700 (PDT) · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:39:16 UTC

On Sun, 19 Aug 2007, Robert Watson wrote:

>
> Diomidis, Jeff, Attilio,
>
> I recently upgraded two servers from FreeBSD 6-STABLE to FreeBSD 7-CURRENT in 
> anticipation of the forthcoming release.  Both of them run with accounting 
> enabled at all times.  When a large pine session was exiting on one of the 
> two boxes, I ran into the following panic:
>
> panic: encode_long: -ve value -32749
> KDB: enter: panic
> db> bt
> Tracing pid 81759 tid 100377 td 0xc4a76400
> kdb_enter(c0a9f844,1,c0a9ab3f,e6937b10,1,...) at kdb_enter+0x32
> panic(c0a9ab3f,ffff8013,e6937bb0,e6937c14,c0719dc2,...) at panic+0x124
> encode_long(e6937bb0,e6937bb8,e6937bc0,e6937bb8,0,...) at encode_long+0x1d
> acct_process(c4a76400,c0a9c0c2,162,129,e6937c50,...) at acct_process+0x1c2
> exit1(c4a76400,0,e6937d2c,c0a08303,c4a76400,...) at exit1+0x957
> sys_exit(c4a76400,e6937cfc,4,e6937d38,c0b4bed8,...) at sys_exit+0x1d
> syscall(e6937d38) at syscall+0x2b3
> Xint0x80_syscall() at Xint0x80_syscall+0x20
> --- syscall (1, FreeBSD ELF32, sys_exit), eip = 0x355f2093, esp = 0xbfbfe0ec, 
> ebp = 0xbfbfe108 ---
>
> kgdb was of mixed helpfulness:
>
> #2  0xc048d575 in db_command_loop () at /usr/src/sys/ddb/db_command.c:401
> #3  0xc048ece5 in db_trap (type=3, code=0) at /usr/src/sys/ddb/db_main.c:222
> #4  0xc0773d56 in kdb_trap (type=3, code=0, tf=0xe6937a90)
>    at /usr/src/sys/kern/subr_kdb.c:502
> #5  0xc0a08aeb in trap (frame=0xe6937a90) at 
> /usr/src/sys/i386/i386/trap.c:621
> #6  0xc09ede3b in calltrap () at /usr/src/sys/i386/i386/exception.s:139
> #7  0xc0773ed2 in kdb_enter (msg=0xc0a9f844 "panic") at cpufunc.h:60
> #8  0xc074c5d4 in panic (fmt=0xc0a9ab3f "encode_long: -ve value %ld")
>    at /usr/src/sys/kern/kern_shutdown.c:547
> #9  0xc0719bad in encode_long (val=18) at /usr/src/sys/kern/kern_acct.c:525
> #10 0xc0719dc2 in acct_process (td=0xc4a76400)
>    at /usr/src/sys/kern/kern_acct.c:391
> #11 0xc072d7e7 in exit1 (td=0xc4a76400, rv=0)
>    at /usr/src/sys/kern/kern_exit.c:355
> #12 0xc072e21d in sys_exit (td=Could not find the frame base for "sys_exit".
> ) at /usr/src/sys/kern/kern_exit.c:98
> #13 0xc0a08303 in syscall (frame=0xe6937d38)
>    at /usr/src/sys/i386/i386/trap.c:1008
> #14 0xc09edea0 in Xint0x80_syscall ()
>    at /usr/src/sys/i386/i386/exception.s:196
> #15 0x00000033 in ?? ()
> Previous frame inner to this frame (corrupt stack?)
> (kgdb) frame 9
> #9  0xc0719bad in encode_long (val=18) at /usr/src/sys/kern/kern_acct.c:525
> 525             KASSERT(val >= 0,  ("encode_long: -ve value %ld", val));
> (kgdb) print val
> $1 = 18
> (kgdb) list
> 520     encode_long(long val)
> 521     {
> 522             int norm_exp;   /* Normalized exponent */
> 523             int shift;
> 524
> 525             KASSERT(val >= 0,  ("encode_long: -ve value %ld", val));
> 526             if (val == 0)
> 527                     return (0);
> 528             norm_exp = fls(val) - 1;
> 529             shift = FLT_MANT_DIG - norm_exp - 1;
> (kgdb) up
> #10 0xc0719dc2 in acct_process (td=0xc4a76400)
>    at /usr/src/sys/kern/kern_acct.c:391
> 391                     acct.ac_mem = encode_long((ru.ru_ixrss + ru.ru_idrss

      ru_idrss     an integral value of the amount of unshared memory residing
                   in the data segment of a process (expressed in units of
                   kilobytes * ticks-of-execution).

That seems like it's pretty prone to overflow given that this is a 32bit 
machine.  This may just be an improper assert.  I'm shocked that noone has 
run into this so far.

Jeff

> +
> (kgdb) list
> 386             tmp = ut;
> 387             timevaladd(&tmp, &st);
> 388             /* Convert tmp (i.e. u + s) into hz units to match ru_i*. */
> 389             t = tmp.tv_sec * hz + tmp.tv_usec / tick;
> 390             if (t)
> 391                     acct.ac_mem = encode_long((ru.ru_ixrss + ru.ru_idrss 
> +
> 392                         + ru.ru_isrss) / t);
> 393             else
> 394                     acct.ac_mem = 0;
> 395
> (kgdb) inspect ru
> $2 = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0,
>    tv_usec = 0}, ru_maxrss = 509908, ru_ixrss = 26810576,
>  ru_idrss = -2016077424, ru_isrss = 1044992, ru_minflt = 372535,
>  ru_majflt = 42, ru_nswap = 0, ru_inblock = 58, ru_oublock = 278,
>  ru_msgsnd = 1907, ru_msgrcv = 4671, ru_nsignals = 32, ru_nvcsw = 19892,
>  ru_nivcsw = 4614}
>
> I find the large negative value in ru_idrss somewhat sad to contemplate, and 
> while this could well be a problem with capturing of process runtime 
> information, I'd like it if the accounting code were robust against this sort 
> of bug, rather than panicking, and I guess I'd also rather than the process 
> runtime information also be correctly captured :-).
>
> I have a usable core dump, and have disabled accounting in the interim.
>
> Robert N M Watson
> Computer Laboratory
> University of Cambridge
> _______________________________________________
> freebsd-current_at_freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe_at_freebsd.org"
>