Re: 5.3-RELEASE TODO

From: Kris Kennaway <kris_at_obsecurity.org>
Date: Thu, 15 Jul 2004 15:04:47 -0700
On Thu, Jul 15, 2004 at 10:24:39AM -0400, Robert Watson wrote:

>  This is a list of open issues that need to be resolved for FreeBSD 5.3. If
>  you have any updates for this list, please e-mail re_at_FreeBSD.org.
> 
> Show stopper defects for 5.3-RELEASE

These are the bugs I'm currently tracking (those I can remember right
now, at least):

* SMP is unusable for me because of the following frequent panic
(actually a panic and another kernel printf interleaved).  Here is the
untangled version:

panic: APIC: Previous IPI is stuck
cpuid = 0;
Debugger("pani[...]

pmap_lazyfix: spun for 50000000

jhb says:

> Seems the two CPUs are deadlocked waiting on each other.  The first sent a
> pmap_lazyfixup IPI to the second but the second has interrupts disabled as it
> is trying to send an IPI as well.

He suggested a patch, but it did not fix the problem.
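
The deadlock jhb describes can be illustrated with a toy model. This is
only a sketch under stated assumptions: it is userland C, not kernel
code, and every name in it (cpu_model, send_ipi, step,
ipi_exchange_completes) is hypothetical. It assumes a sender disables
interrupts and spins until the peer handles its IPI, and that a CPU can
only handle an incoming IPI while its own interrupts are enabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one CPU; none of these names come from the kernel. */
struct cpu_model {
    bool intr_enabled;  /* can this CPU take an incoming IPI right now? */
    bool ipi_pending;   /* an IPI from the peer is waiting to be handled */
    bool waiting_ack;   /* this CPU is spinning until its own IPI is handled */
};

/* Start a send: disable interrupts, post the IPI to the peer, begin spinning. */
static void send_ipi(struct cpu_model *self, struct cpu_model *peer)
{
    self->intr_enabled = false;
    peer->ipi_pending = true;
    self->waiting_ack = true;
}

/* One time step: a pending IPI is handled only if interrupts are enabled. */
static void step(struct cpu_model *self, struct cpu_model *peer)
{
    if (self->ipi_pending && self->intr_enabled) {
        self->ipi_pending = false;
        peer->waiting_ack = false;  /* the sender sees completion */
        peer->intr_enabled = true;  /* and re-enables its interrupts */
    }
}

/* Returns true if every sender is released within max_spins steps. */
bool ipi_exchange_completes(bool cpu1_also_sends, int max_spins)
{
    struct cpu_model c0 = { true, false, false };
    struct cpu_model c1 = { true, false, false };

    assert(max_spins > 0);
    send_ipi(&c0, &c1);
    if (cpu1_also_sends)
        send_ipi(&c1, &c0);

    for (int i = 0; i < max_spins; i++) {
        step(&c0, &c1);
        step(&c1, &c0);
        if (!c0.waiting_ack && !c1.waiting_ack)
            return true;
    }
    return false;
}
```

In this model, ipi_exchange_completes(false, n) succeeds on the first
step, while ipi_exchange_completes(true, n) never succeeds for any n,
because neither CPU re-enables interrupts until the other has handled
its IPI.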

* linprocfs panics with a page fault in pfs_getattr():

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x8
fault code              = supervisor read, page not present
instruction pointer     = 0x8:0xc04e1870
stack pointer           = 0x10:0xf11e6b50
frame pointer           = 0x10:0xf11e6b6c
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 23938 (mtree)
kernel: type 12 trap, code=0
Stopped at      pfs_getattr+0x130:      movl    0x8(%eax),%eax
db> trace
pfs_getattr(f11e6b78,c06fda00,cf397b2c,f11e6b98,d23e8a80) at pfs_getattr+0x130
vn_stat(cf397b2c,f11e6c80,d23e8a80,0,c5eb0c60) at vn_stat+0x4f
lstat(c5eb0c60,f11e6d14,2,2,297) at lstat+0x6a
syscall(2f,2f,2f,805a200,805a248) at syscall+0x217
Xint0x80_syscall() at Xint0x80_syscall+0x1f
--- syscall (190, FreeBSD ELF32, lstat), eip = 0x280ac664, esp = 0xbfbf7594, ebp = 0xbfbf7620 ---

dosirak# addr2line -e kernel.debug 0xc04e1870
/usr/src/sys/i386/compile/DOSIRAK/../../../fs/pseudofs/pseudofs_vnops.c:200

[...]
        if (pvd->pvd_pid != NO_PID) {
                if ((proc = pfind(pvd->pvd_pid)) == NULL)
                        PFS_RETURN (ENOENT);
-->             vap->va_uid = proc->p_ucred->cr_ruid;

rwatson has a patch that works around this particular null pointer
deref, but the underlying cause is not addressed.
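
The fault address 0x8 together with "movl 0x8(%eax),%eax" suggests that
%eax was 0, i.e. that proc->p_ucred was NULL when cr_ruid was read. The
sketch below checks that arithmetic and shows one possible shape of a
guard. It is not rwatson's patch. The struct layouts are assumptions (a
cut-down stand-in written from memory, with 32-bit fields as on i386),
and ucred_model, proc_model, null_cred_fault_address and model_get_ruid
are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * ASSUMED layout: a stand-in for the first fields of the kernel's
 * struct ucred on i386, not copied from the source tree.
 */
struct ucred_model {
    uint32_t cr_ref;   /* reference count */
    uint32_t cr_uid;   /* effective user id */
    uint32_t cr_ruid;  /* real user id */
};

/* Stand-in for the one field of struct proc that matters here. */
struct proc_model {
    struct ucred_model *p_ucred;
};

/* Address the CPU would touch when reading cr_ruid through a NULL cred. */
size_t null_cred_fault_address(void)
{
    return offsetof(struct ucred_model, cr_ruid);
}

/* Hypothetical guard: report failure instead of dereferencing a NULL cred. */
int model_get_ruid(const struct proc_model *p, uint32_t *ruid_out)
{
    assert(ruid_out != NULL);
    if (p == NULL || p->p_ucred == NULL)
        return -1;
    *ruid_out = p->p_ucred->cr_ruid;
    return 0;
}
```

Under the assumed layout, null_cred_fault_address() is 0x8, which
matches the reported fault virtual address. A guard like
model_get_ruid() only hides the symptom; it does not explain why a
process found by pfind() has no credentials.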

* ULE has lots of problems (poor performance on HTT, unable to disable
HTT, incorrect load average reporting on SMP machines, ...).  Should
be turned off until an active maintainer is found.

* Frequent panic at boot time (when starting syslogd?)

panic: mutex Giant not owned at ../../../kern/vfs_subr.c:1365
at line 729 in file ../../../kern/kern_mutex.c
Debugger("panic")
Stopped at      Debugger+0x54:  xchgl   %ebx,in_Debugger.0
db> trace
Debugger(c0766179,c07d1e80,2d9,c0765560,100) at Debugger+0x54
__panic(c0765560,2d9,c07656c8,c0765803,c076e07d) at __panic+0xf5
_mtx_assert(c07d09e0,1,c076e07d,555,c68544ec) at _mtx_assert+0x11c
gbincore(c6889514,0,0,985,c07d5980) at gbincore+0x36
getblk(c6889514,0,0,800,0) at getblk+0xf8
breadn(c6889514,0,0,800,0) at breadn+0x52
bread(c6889514,0,0,800,0) at bread+0x4c
ffs_blkatoff(c6889514,0,0,0,e0f87998) at ffs_blkatoff+0x105
ufs_lookup(e0f87a50,e0f87a8c,c05c77e1,e0f87a50,e0f87bc0) at ufs_lookup+0x270
ufs_vnoperate(e0f87a50,e0f87bc0,e0f87bd4,c076e07d,c61d62a0) at ufs_vnoperate+0x18
vfs_cache_lookup(e0f87ad0,e0f87aec,c05cca32,e0f87ad0,c61d62a0) at vfs_cache_lookup+0x301
ufs_vnoperate(e0f87ad0,c61d62a0,0,c61d62a0,c61d62a0) at ufs_vnoperate+0x18
lookup(e0f87bac,0,c076dac5,a2,c61d62a0) at lookup+0x312
namei(e0f87bac,c62088b2,d,c62088c0,0) at namei+0x27e
unp_bind(c6a09000,c62088b0,c61d62a0,e0f87ca0,c05b5e23) at unp_bind+0xb1
uipc_bind(c6427a50,c62088b0,c61d62a0,e0f87cc8,c05ba0e7) at uipc_bind+0x2b
sobind(c6427a50,c62088b0,c61d62a0,0,c6427a50) at sobind+0x23
kern_bind(c61d62a0,3,c62088b0,c62088b0,0) at kern_bind+0x87
bind(c61d62a0,e0f87d14,c,434,3) at bind+0x43
syscall(2f,2f,2f,bfbfee10,0) at syscall+0x2a0

I added a GIANT_REQUIRED to namei() and confirmed that Giant is held
there, so it is being dropped somewhere between namei() and gbincore(),
i.e. in the frames above namei() in the stack trace.
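
The approach of placing ownership assertions at successive frames to
bracket where a lock is dropped can be sketched in userland C. This is
a toy model, not kernel code: lock_model, frame_fn, frame_keeps,
frame_drops and first_frame_without_lock are all hypothetical names,
and the "lock" is just a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a lock whose ownership can be queried. */
struct lock_model {
    bool held;
};

/* One frame of a call chain; it may release the lock before calling down. */
typedef void (*frame_fn)(struct lock_model *lk);

/* Example frame that leaves the lock alone. */
void frame_keeps(struct lock_model *lk)
{
    (void)lk;
}

/* Example frame that (wrongly) drops the lock before calling down. */
void frame_drops(struct lock_model *lk)
{
    lk->held = false;
}

/*
 * Run the frames in call order, checking ownership on entry to each one.
 * Returns the index of the first frame entered without the lock, or -1
 * if the lock is held on entry to every frame.
 */
int first_frame_without_lock(struct lock_model *lk, const frame_fn *frames,
    size_t n)
{
    assert(lk != NULL && frames != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!lk->held)
            return (int)i;
        frames[i](lk);
    }
    return -1;
}
```

With the chain { frame_keeps, frame_drops, frame_keeps } and the lock
initially held, the check first fails on entry to frame 2, so the
culprit is the frame just before it.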

* Page fault in schedcpu (4BSD scheduler):

Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0x104
fault code              = supervisor read, page not present
instruction pointer     = 0x8:0xc058a8cf
stack pointer           = 0x10:0xdcb34cc4
frame pointer           = 0x10:0xdcb34cec
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = resume, IOPL = 0
current process         = 50 (schedcpu)
trap number             = 12
panic: page fault

syncing disks, buffers remaining... panic: mi_switch: switch in a critical section

addr2line says the panic was in kern/sched_4bsd.c:327

                                /*
                                 * The kse slptimes are not touched in wakeup
                                 * because the thread may not HAVE a KSE.
                                 */
                                if (ke->ke_state == KES_ONRUNQ) {
                                        awake = 1;
                                        ke->ke_flags &= ~KEF_DIDRUN;
--->                            } else if ((ke->ke_state == KES_THREAD) &&
                                    (TD_IS_RUNNING(ke->ke_thread))) {
                                        awake = 1;

gdb -k got confused and couldn't make anything out of the backtrace.

* Machines with 4GB RAM do not auto-tune kernel memory parameters well
and panic easily under load.  The panic message gives no hint about
what is wrong or how to fix it.

* 8 Feb 2004
[bug report to -current, confirmed locally]

After typing "truss -f fsck -p /", I see nothing.  I press ^Z
and type kill -9 % (killing truss).

I now have these fine processes hanging dead in memory.  They are immune
to kill -9 and don't respond to kill -CONT either.  ps axl shows:

  UID   PID  PPID CPU PRI NI   VSZ  RSS MWCHAN STAT  TT       TIME COMMAND
    0 56974     1   0   8  0  1256  744 ppwait D     p1    0:00.00 fsck -p /
    0 56975 56974   0   8  0  1256  744 stopev DV    p1    0:00.00 fsck -p /

* ATA tends to panic the system when error conditions occur, e.g. 

ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=45113664
ad0: DMA limited to UDMA33, non-ATA66 cable or device
ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=45113664
ad0: WARNING - removed from configuration
ata0-master: FAILURE - WRITE_DMA timed out

Fatal trap 12: page fault while in kernel mode
[...]

OTOH ATA on 4.x also gives panics with INVARIANTS when this kind of
thing happens.

[sparc64]

* Likes to panic with "panic: ipi_send: couldn't send ipi"

tmm suggested bumping the value of

./include/smp.h:#define IPI_RETRIES     100

This may have "fixed" the problem, or at least reduced the frequency.
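
Why a larger retry count can reduce the frequency without removing the
cause can be shown with a toy model. This is a sketch under stated
assumptions, not the sparc64 code: ipi_send_model is a hypothetical
function, and it assumes the target stays busy for a fixed number of
polls, after which a send attempt succeeds.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model: the target stays busy for busy_polls polls, after
 * which a send attempt succeeds.  Returns true if the send succeeds
 * within `retries` attempts, false where the real code would panic.
 */
bool ipi_send_model(int busy_polls, int retries)
{
    assert(busy_polls >= 0 && retries > 0);
    for (int attempt = 0; attempt < retries; attempt++) {
        if (attempt >= busy_polls)
            return true;
    }
    return false;
}
```

In this model a target that is busy for 100 polls defeats a limit of
100 retries but not a limit of 5000, while a target busy for 10000
polls defeats both.  A larger limit only moves the threshold.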

* syscons does not work on ultra30 any more; looks like it might be
related to differences in the keyboard controller on the u30.  marius
and kensmith are knowledgeable about this.

Kris
Received on Thu Jul 15 2004 - 20:04:53 UTC
