System, diagnose thyself: auto-documentation for crashes

From: Kirk Strauser <kirk_at_strauser.com>
Date: Fri, 29 Aug 2008 10:13:57 -0500
I was having flaky system problems that were driving me to  
distraction.  Yesterday, I finally got a panic message with an  
instruction pointer, used addr2line to see that the failure was in  
uma_zfree_internal, searched Google, and learned that it was probably  
due to bad RAM.  Half an hour later, memtest86 found the defective  
stick and the problem was solved.

This led me to thinking, though: the OS already had all the  
information needed to figure out where the problem was.  If there had  
been an explanation inside that function definition, FreeBSD could  
have automatically gone to the file, searched for that explanation,  
and told me why my system had probably crashed.

I propose that we:

1) Settle on a standard comment format for metainformation.  Existing  
formats such as Doxygen are available if we don't want to roll our own.

2) Write a program that takes an instruction pointer and outputs the  
comment for the associated function.

3) Modify /etc/rc.d/savecore to run the program from step 2.
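
As a rough sketch of the lookup in step 2, assume addr2line (with -f) has
already turned the instruction pointer into a function name, and that the
source file has been read into memory.  The helper name
comment_for_function is made up for illustration, and it relies on the
style(9) habit of starting a definition line with the function name.  It
does not cope with "/*" inside string literals, and if a function has no
comment of its own it will return the previous function's comment:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: find the definition of func in src, assuming the
 * function name starts a line, and return the last block comment that
 * closes before it.  The comment's length, including both delimiters,
 * is stored in *len.  Returns NULL if nothing is found.
 */
static const char *
comment_for_function(const char *src, const char *func, size_t *len)
{
    const char *def, *open, *close, *best = NULL;
    size_t flen;

    assert(src != NULL && func != NULL && len != NULL);
    assert(func[0] != '\0');
    flen = strlen(func);

    /* Look for "func(" at the very start of a line. */
    def = strstr(src, func);
    while (def != NULL) {
        if ((def == src || def[-1] == '\n') && def[flen] == '(')
            break;
        def = strstr(def + 1, func);
    }
    if (def == NULL)
        return (NULL);

    /* Remember the last comment that closes before the definition. */
    open = strstr(src, "/*");
    while (open != NULL && open < def) {
        close = strstr(open + 2, "*/");
        if (close == NULL || close >= def)
            break;
        best = open;
        *len = (size_t)(close + 2 - open);
        open = strstr(close + 2, "/*");
    }
    return (best);
}
```

A caller would print the len bytes starting at the returned pointer, or
fall back to the bare function name when the result is NULL.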

For instance, suppose the comments in sys/vm/uma_core.c looked like:

/*
  * Frees an item to an INTERNAL zone or allocates a free bucket
  *
  * Arguments:
  *      zone   The zone to free to
  *      item   The item we're freeing
  *      udata  User supplied data for the dtor
  *      skip   Skip dtors and finis
  *
  * Failure:
  *      Failures in this function are commonly due to defective RAM.
  */
static void
uma_zfree_internal(uma_zone_t zone, void *item, void *udata,
     enum zfreeskip skip, int flags)
{
...
}
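
Pulling the hint out of a comment like that could look roughly like the
following.  This is only a sketch: the name failure_hint and the rule that
a blank comment line ends the "Failure:" section are my assumptions, not
an existing convention.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the text under a "Failure:" heading in a
 * block comment into out (at most outsz bytes, always NUL-terminated).
 * Indentation and the '*' decoration are dropped, and continuation
 * lines are joined with a single space.  Returns 1 if any hint text
 * was copied, 0 otherwise.
 */
static int
failure_hint(const char *comment, char *out, size_t outsz)
{
    const char *p, *eol;
    size_t used = 0, n;

    assert(comment != NULL && out != NULL && outsz > 0);
    out[0] = '\0';

    p = strstr(comment, "Failure:");
    if (p == NULL)
        return (0);
    p = strchr(p, '\n');            /* skip the heading line itself */

    while (p != NULL) {
        p++;                        /* first character of the next line */
        p += strspn(p, " \t");
        if (strncmp(p, "*/", 2) == 0)
            break;                  /* end of the comment */
        p += strspn(p, "* \t");     /* drop the '*' decoration */
        eol = strchr(p, '\n');
        n = (eol != NULL) ? (size_t)(eol - p) : strlen(p);
        if (n == 0)
            break;                  /* a blank line ends the section */
        if (used > 0 && used + 1 < outsz)
            out[used++] = ' ';      /* join continuation lines */
        if (n > outsz - 1 - used)
            n = outsz - 1 - used;   /* truncate to fit the buffer */
        memcpy(out + used, p, n);
        used += n;
        out[used] = '\0';
        p = eol;
    }
    return (used > 0);
}
```

The resulting string is what savecore would log next to the panic
message.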

If I'd seen that failure message in my syslog, I would have avoided a  
few days of teeth gnashing.  What do you think?  I think something  
like this could be extremely useful.  Benefits:

  - There would be zero impact on performance because the change  
touches only comments, never compiled code.
  - It would require minimal work.
  - It could be done incrementally.  Document known common failure  
points and add others with time.
  - It wouldn't affect any other systems.

-- 
Kirk Strauser
Received on Fri Aug 29 2008 - 13:14:02 UTC
