Re: System, diagnose thyself: auto-documentation for crashes

From: John Baldwin <jhb_at_freebsd.org>
Date: Fri, 29 Aug 2008 11:44:54 -0400
On Friday 29 August 2008 11:13:57 am Kirk Strauser wrote:
> I was having flaky system problems that were driving me to  
> distraction.  Yesterday, I finally got a panic message with an  
> instruction pointer, used addr2line to see that the failure was in  
> uma_zfree_internal, searched Google, and learned that it was probably  
> due to bad RAM.  Half an hour later, memtest86 found the defective  
> stick and the problem was solved.
> 
> This led me to thinking, though: the OS already had all the  
> information needed to figure out where the problem was.  If there had  
> been an explanation inside that function definition, FreeBSD could  
> have automatically gone to the file, searched for that explanation,  
> and told me why my system had probably crashed.
> 
> I propose that we:
> 
> 1) Settle on a standard comment format for metainformation.  There are  
> already standards like Doxygen if we didn't want to home-roll something.
> 
> 2) Write a program that takes an instruction pointer and outputs the  
> comment for the associated function.
> 
> 3) Modify /etc/rc.d/savecore to run the program from #2.
> 
> For instance, suppose the comments in sys/vm/uma_core.c looked like:
> 
> /*
>   * Frees an item to an INTERNAL zone or allocates a free bucket
>   *
>   * Arguments:
>   *      zone   The zone to free to
>   *      item   The item we're freeing
>   *      udata  User supplied data for the dtor
>   *      skip   Skip dtors and finis
>   *
>   * Failure:
>   *      Failures in this function are commonly due to defective RAM.
>   */
> static void
> uma_zfree_internal(uma_zone_t zone, void *item, void *udata,
>      enum zfreeskip skip, int flags)
> {
> ...
> }
> 
> If I'd seen that failure message in my syslog, I would have avoided a  
> few days of teeth gnashing.  What do you think?  I think something  
> like this could be extremely useful.  Benefits:
> 
>   - There would be zero impact on performance because it would only  
> touch comments and not any running code whatsoever.
>   - It would require minimal work.
>   - It could be done incrementally.  Document known common failure  
> points and add others with time.
>   - It wouldn't affect any other systems.

See /usr/sbin/crashinfo for a start.  I have patches to enable it 
from /etc/rc.d/savecore after generating a patch (still need to test them 
though).

-- 
John Baldwin
Received on Fri Aug 29 2008 - 17:12:46 UTC
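
Step 2 of the quoted proposal (a program that outputs the comment for the function a crash landed in) could be sketched roughly as follows. This is only an illustration, not how crashinfo works: it assumes the instruction pointer has already been turned into a function name (for example with addr2line -f, as in the original post), that the source follows the BSD convention of putting the function name at column 0 of its definition, and that the comment uses the "Failure:" section proposed above. The names failure_note and SAMPLE are hypothetical.

```python
import re

# Sample source text using the comment layout proposed in the post.
SAMPLE = """\
/*
 * Frees an item to an INTERNAL zone or allocates a free bucket
 *
 * Arguments:
 *      zone   The zone to free to
 *      item   The item we're freeing
 *
 * Failure:
 *      Failures in this function are commonly due to defective RAM.
 */
static void
uma_zfree_internal(uma_zone_t zone, void *item, void *udata,
    enum zfreeskip skip, int flags)
{
}
"""


def failure_note(source, func_name):
    """Return the 'Failure:' text from the comment above func_name, or None."""
    lines = source.splitlines()

    # Assumption: the definition starts with the function name at column 0,
    # so indented calls to the same function are not matched.
    pattern = re.compile(re.escape(func_name) + r"\s*\(")
    start = next((i for i, l in enumerate(lines) if pattern.match(l)), None)
    if start is None:
        return None

    # Walk upward past the return-type line to the end of the comment.
    end = start - 1
    while end >= 0 and "*/" not in lines[end]:
        if "}" in lines[end] or ";" in lines[end]:
            return None  # hit other code first: no comment attached
        end -= 1
    if end < 0:
        return None

    # Keep walking upward to the line that opens the comment.
    begin = end
    while begin >= 0 and "/*" not in lines[begin]:
        begin -= 1
    if begin < 0:
        return None

    # Strip the comment decoration ("/*", " * ", " */") from each line.
    body = [re.sub(r"^\s*(/\*+|\*+/|\*)?\s?", "", l) for l in lines[begin:end + 1]]

    # Collect the lines after "Failure:" up to the next blank line.
    note, collecting = [], False
    for text in body:
        stripped = text.strip()
        if stripped == "Failure:":
            collecting = True
        elif collecting:
            if not stripped:
                break
            note.append(stripped)
    return " ".join(note) or None


if __name__ == "__main__":
    print(failure_note(SAMPLE, "uma_zfree_internal"))
```

A savecore hook along the lines of step 3 would then only need to pass the function name and the matching source file to such a helper and append the result to the crash report.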
