Re: kernel crashes and portupgrade

From: Terry Lambert <tlambert2_at_mindspring.com>
Date: Wed, 30 Apr 2003 19:09:46 -0700

Lars Eggert wrote:
> On 4/30/2003 4:28 PM, Terry Lambert wrote:
> > If you are panic'ing, and it's repeatable, then you should
> > minimally post:
> 
> Done already:
> 
>         Message-ID: <3EAC5950.7040306_at_isi.edu>
>         Date: Sun, 27 Apr 2003 15:27:28 -0700
>         From: Lars Eggert <larse_at_ISI.EDU>
>         Subject: Re: Kernel panic during portupdrade [ffs_blkfree:
>                  freeing free block]
> 
> (Panic message was in an earlier post to the same thread.)

FWIW, Message-ID does me no good; it's not a searchable field
for me.  If you are going to give me anything other than a URL
for the message in the mailing list archive, you probably want
to give me (in order of importance):

1)	The mailing list it was sent to
2)	The date
3)	The sender
4)	The subject

--

That's yours, not Kent's.  It's pretty obvious from looking at
your message and the code what's happening there: you are trying
to free a frag of a block whose bit is not set in the cylinder
group bitmap.  To fix it, you have to ask yourself how it's even
possible to get into that situation in the first place.
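
To make the failure mode concrete, here is a toy model of that check.
It is only a sketch, not the ffs_blkfree() code: the toy_* names and
the map size are made up for illustration, and it assumes a map in
which a set bit means "in use" (the polarity of the real cylinder
group map may differ).  The point is just that a free routine which
finds the fragment already marked free has detected an inconsistency:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NFRAGS 64   /* hypothetical: fragments tracked by this toy map */

/* Toy in-use map: one bit per fragment, set bit == fragment allocated. */
struct toy_cg {
        unsigned char inuse[TOY_NFRAGS / CHAR_BIT];
};

static bool
toy_isset(const struct toy_cg *cg, size_t frag)
{
        return (cg->inuse[frag / CHAR_BIT] >> (frag % CHAR_BIT)) & 1u;
}

static void
toy_alloc_frag(struct toy_cg *cg, size_t frag)
{
        assert(frag < TOY_NFRAGS);
        cg->inuse[frag / CHAR_BIT] |= (unsigned char)(1u << (frag % CHAR_BIT));
}

/*
 * Returns 0 on success, -1 if the fragment is out of range or was not
 * marked in use.  The second case is the one where a kernel would
 * panic instead of returning an error.
 */
static int
toy_free_frag(struct toy_cg *cg, size_t frag)
{
        if (frag >= TOY_NFRAGS || !toy_isset(cg, frag))
                return -1;
        cg->inuse[frag / CHAR_BIT] &= (unsigned char)~(1u << (frag % CHAR_BIT));
        return 0;
}
```

In this model, freeing a fragment twice, or freeing one whose
allocation never reached the map, both fail the same way, which is
why the interesting question is how the map and the caller got out of
step.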

Theoretically, this is not permitted to happen, because the CG
bitmap is supposed to be written out last.  Practically, there
are several ways to cause this in -current; any one of them
could be your culprit (e.g. you are running with the sched_sync()
patches for fsync that were posted, or you crashed and used a BG
fsck instead of a full fsck, and trusted it to do the right thing,
etc.).  Let's assume that none of those are true at this point,
and that you can repeat the problem after doing a full fsck on
the FS in question from single-user mode, and rebooting.

So...

The first question we need to answer is why sched_sync is your
callout in fork_exit(); seems pretty daft to me.  I would think
this was indicative of stack corruption... or, it's indicative
of something being allowed to run that shouldn't run while a
cleanup is in progress, but not yet committed to the soft updates
list (meaning the CG bit should have been set, but wasn't).

Permit me to suspect 1.193 and 1.192 of /sys/kern/kern_fork.c,
and 1.442 and 1.443 of /sys/kern/vfs_subr.c; particularly, the
conversion from tsleep() to msleep().

A possible workaround might be to modify fork_exit(); there's
code in the function that reads:

        if (PCPU_GET(switchtime.sec) == 0)
                binuptime(PCPU_PTR(switchtime));
        PCPU_SET(switchticks, ticks);
        mtx_unlock_spin(&sched_lock);

        /*
         * cpu_set_fork_handler intercepts this function call to
         * have this call a non-return function to stay in kernel mode.
         * initproc has its own fork handler, but it does return.
         */
        KASSERT(callout != NULL, ("NULL callout in fork_exit"));
        callout(arg, frame);

Change it to read:

        if (PCPU_GET(switchtime.sec) == 0)
                binuptime(PCPU_PTR(switchtime));
        PCPU_SET(switchticks, ticks);

        /*
         * cpu_set_fork_handler intercepts this function call to
         * have this call a non-return function to stay in kernel mode.
         * initproc has its own fork handler, but it does return.
         */
        KASSERT(callout != NULL, ("NULL callout in fork_exit"));
        callout(arg, frame);
        mtx_unlock_spin(&sched_lock);

Instead.  Let me know what happens; it will probably complain about
an LOR or a lock being held that's "not supposed to be held, because
otherwise the kernel wouldn't panic" or whatever...
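
For anyone unfamiliar with the term: an LOR (lock order reversal) is
what WITNESS reports when two locks are taken in one order on one
code path and in the opposite order on another.  The following is a
minimal user-space sketch of that idea only; it is not how WITNESS is
implemented, and the toy_* names and the small fixed lock count are
hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NLOCKS 4    /* hypothetical: number of lock classes tracked */

struct toy_witness {
        bool held[TOY_NLOCKS];                  /* locks currently held */
        bool before[TOY_NLOCKS][TOY_NLOCKS];    /* before[a][b]: a was held when b was taken */
};

/*
 * Record acquisition of lock `id`.  Returns true if this acquisition
 * reverses an order seen earlier, i.e. some lock h that is held now
 * was previously acquired while `id` was held.
 */
static bool
toy_acquire(struct toy_witness *w, int id)
{
        bool reversal = false;

        assert(id >= 0 && id < TOY_NLOCKS);
        for (int h = 0; h < TOY_NLOCKS; h++) {
                if (!w->held[h] || h == id)
                        continue;
                if (w->before[id][h])
                        reversal = true;
                w->before[h][id] = true;
        }
        w->held[id] = true;
        return reversal;
}

static void
toy_release(struct toy_witness *w, int id)
{
        assert(id >= 0 && id < TOY_NLOCKS);
        w->held[id] = false;
}
```

Taking lock 0 then lock 1 records the order 0 -> 1; a later path that
takes lock 1 and then lock 0 makes toy_acquire() return true.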

-- Terry