Re: Using Instruction Pointer address in debug interfaces [Was: Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]]

From: Julian Elischer <julian_at_freebsd.org> Date: Wed, 09 Nov 2011 09:39:13 -0800 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:20 UTC

On 11/8/11 9:29 PM, Arnaud Lacombe wrote:
> Hi,
>
> On Tue, Nov 8, 2011 at 9:35 PM, Julian Elischer<julian_at_freebsd.org>  wrote:
>> On 11/8/11 5:52 PM, Arnaud Lacombe wrote:
>>> Hi,
>>>
>>> On Tue, Nov 8, 2011 at 7:08 PM, Julian Elischer<julian_at_freebsd.org>
>>>   wrote:
>>>> On 11/8/11 10:49 AM, Arnaud Lacombe wrote:
>>>>> Hi,
>>>>> To avoid future complaints about the fact that I would be only "talk"
>>>>> without "action", I did implement what I suggested above. As it is
>>>>> quite a large patch-set, I will not post it directly here, however, it
>>>>> is available on github:
>>>>>
>>>>> https://github.com/lacombar/freebsd/tree/master/topic/kern-lock-debug
>>>>>
>>>>> It convert a bunch of debug interface to use the caller instruction
>>>>> pointer, as well as a proof-of-concept teaching printf(9) to convert
>>>>> IP to symbol_name+offset.
>>>>>
>>>>> It translates in a direct saving of about +250kB on i386's GENERIC,
>>>>> just in kernel text size. Even the worst case, ie LOCK_DEBUG == 0,
>>>>> translates to a save of +80kB.
>>>>>
>>>>> Please note that this is still WIP code.
>>>> A couple of comments.
>>>> Firstly, the idea of a printf method to print the IP as symbol+offset is
>>>> an
>>>> interesting idea
>>>> that should be followed up in its own right.
>>>>
>>> FWIW, I have no credit in this idea. It has been in Linux for ages and
>>> ages.
>> yeah as I said  at work I use linux and BSD...
>> the linux stuff that just prints out IP really annoys me.
>>
>> the list stuff and netgraph debug (which should be off in any production
>> system)
> this is, I guess, where we do not agree. You find it acceptable to run
> totally different code in production and during debug. I do not. This
> is completely insane, even more nowadays where heavy parallelism
> increases the likelihood of races, and subtle change in the code, even
> optimization, can cause total behavioral change (ie. Heisenbug).
>
> For the record, we have been tracking for more than 2 months (first
> occurrences happened a year ago) an mbuf corruption in the network
> stack, present in all released code since at least FreeBSD 7[0]. Each
> time we think it is fixed, we are proven wrong by customers within a
> few days when the system crashes again. Even the last attempt which
> was believed to be bullet-proof failed and crashes daily.
>
> All that to say that production code should embed enough facilities to
> allow the system to be fully debugged with a runtime cost as low as
> possible, and a code-size cost as low as possible[1]. I should be able
> to connect on a production machine, turn a knob, an see what is going
> wrong, without the customer noticing.
>
> In the worst case, when you have to enable debug-only code, it must
> not be done by making the non-debug case more expensive, but wrap
> around. The whole original point of the patches was that LOCK_FILE and
> LOCK_LINE are a bad answer to a wrong problem.
>
> `__FILE__, __LINE__' and the bloat introduced is not the problem,
> `const char *file, int line' in way too much prototypes is.
>
in netgraph if you turn off debugging you should not have any char * file
stuff. it should all go away. ( he says  from memory, not actually 
looking)

> Now, you make me realize that `const char *file, int line' should just
> be removed, not replaced by `unsigned long' or anything else. It's
> likely to be done in another iteration.
>
>> just require you to be able to see the console. and have sources nearby.
>> if you need the IP use gdb.
>>
> "console debugging" is yet another abomination which should be hunted
> down. Just try to do any useful work at high-pps on a serial
> console...

if you want to use gdb, then use gdb and if you want to use ktr, then 
use that.
don't just declare the rest of the universe incompetent.
these things are usually there for the developer of the module and 
done in whatever way
is most convenient fo rthem.

>> it's just what you are used to. You are obviously from the dark side
>> ^H^H^H^H^H^H linux.
>>
> My obedience is totally irrelevant to the problem.

> However, if you want to know, my heart tends to be with BSDs.
> Unfortunately, it's a sad love-story where your Beloved keeps
> deceiving you day after day. You want to change small bits at a time,
> make several iteration of progress to make things brighter, but your
> Beloved refuses any change because of too much inertia. Sad.

mostly it's because you keep attacking your loved one with a steak knife.
flowers might get you further.
>> so you are used to doing it that way.. but don't expect us to change just
>> because that's what Linux does.
>>
> again, mentioning Linux is totally irrelevant. Use of Instruction
> Pointer are implementation details for a not so intrusive solution to
> the problem I pointed out, and which you are totally missing.
since these modules were written many new options have come.
for example the option to throw out a stack backtrace is new.
For netgraph however, when I was debugging it, a file/line was
exactly what I needed for the type of error I was looking for at the time.

> Now, please answer this: do you find any of the bloat to the non-debug
> case (ie. passing a NULL pointer and a 0 integer, when `LOCK_DEBUG ==
> 0') worth the extra debugability comfort to be acceptable ?

not talking about lock debug (someone else's area), but in netgraph when
netgraph is not in debug mode, there is no extra data passed around.
similar for the list/queue macros.
> If you do, then your focus is on making things comfortable for
> developers, at the expense 100's of users, rather than making things
> comfortable for 100's of users, at the expense of developers.
>
>> When we have a problem at work on teh Linux driver, my first step is always
>> to try duplicate it on FreeBSD because:
>>
> well, you're lucky FreeBSD supports your device! Lately, we got lately
> a shiny multi-queue network cards with bypass mechanism... that is not
> supported in FreeBSD. So currently, we got an expensive paper-weight.

well write a driver for it.. what do you think I'm doing with the 
driver I'm talking about?
I wrote several bypass network card drivers when I was at 
cisco/ironport.. it's not rocket science,
though it would be nice if we were to come up with a standard 
interface for bypass interfaces.
That is a different topic though..

>> 1/ half the time freebsd will just immediatly assert on something and
>> present you with the bug.. done.
>>
> well, certainly not from a release build; assertion are disabled.

During development. we NEVER have bugs in production ;-)

>> 2/ I can run gdb through firewire on it on ANY standard unmodified kernel
>> and find it, where on Linux I need to
>> get a whole universe of stupid patches all aligned and MAYBE I might be able
>> to see what is going on.
>> if it's on redhat I need to do this, on ubuntu that, on suse something else
>> ,and on different revisions
>> of the kernel it all changes anyhow..
>>
> machine (even x86-64 machines) I run FreeBSD on have no firewire,
> neither do my desktops, this limit the usability of the feature.
> Moreover, I do not use mass-distros either, except for desktop[3], but
> small embedded firmware (ala. openwrt), even on "middle-end" system,
> and stick to vanilla kernel (when not playing with -next).

I also use serial, but having a few spare firewire cards sitting 
around helps.
>>> That said, IP address are barely used in FreeBSD, there is no legacy.
>>> As such, the API should not use `unsigned long' but `void *'[0]; this
>>> is the natural type returned by `__builtin_return_address()' and the
>>> `&&' operator. This would allow to introduce a modifier to `%p' to do
>>> the translation.
>> possibly intptr_t is what should be used. but I'd expect Bruce to drop in
>> here and let us us know.
>>
> whatever will suit you :) I started by using `uintptr_t', but they
> cannot be printed in a portable manner and the printf(9) stuff was not
> ready yet.
>
>   - Arnaud
>
> [0]: I am able to crash any kernel between 7-STABLE to 9.STABLE within
> minutes, with the right pattern and (mainstream and well supported)
> hardware.

and who have you reported that to?  bug number?

> [1]: for example, when I read in `sys/kern/kern_fail.c':
>
>   * Failpoints allow for injecting fake errors into running code on the fly,
>   * without modifying code or recompiling with flags.  Failpoints are always
>   * present, and are very efficient when disabled.
>
> and see that "very efficient when disabled" translates into a
> __predict_false() conditional, well, sorry, but I am very dubious.
> Entering the branch _is_ still a cost. A most efficient way of doing
> this would be using gcc's `asm goto' statement, in which case the hot
> path only ever see a `nop', eventually patched at runtime.
>
> [2]: soekris-like boards do not qualify for "embedded" :)
>
> [3]: FreeBSD (8-STABLE) is way to limited and un-integrated to be
> anywhere but useful, not to speak about kernel bug which leave the
> system so fracked up that you have no other choice but to hard-reboot.

bug number?

>