On Thu, Jul 15, 2004 at 10:24:39AM -0400, Robert Watson wrote:
> This is a list of open issues that need to be resolved for FreeBSD 5.3. If
> you have any updates for this list, please e-mail re_at_FreeBSD.org.
>
> Show stopper defects for 5.3-RELEASE

These are the bugs I'm currently tracking (those I can remember right now,
at least):

* SMP is unusable for me because of the following frequent panic (actually
  a panic and another kernel printf interleaved). Here is the untangled
  version:

panic: APIC: Previous IPI is stu c k p m a _ l a z y f i x : s p u c p u i d = 0 ; n f o r 5 0 0 0 0 0 0 0 c D e b u g g e r ( " p a n i

jhb says:

> Seems the two CPUs are deadlocked waiting on each other. The first sent a
> pmap_lazyfixup IPI to the second but the second has interrupts disabled as it
> is trying to send an IPI as well.

He suggested a patch, but it did not fix the problem.

* linprocfs

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x8
fault code              = supervisor read, page not present
instruction pointer     = 0x8:0xc04e1870
stack pointer           = 0x10:0xf11e6b50
frame pointer           = 0x10:0xf11e6b6c
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 23938 (mtree)
kernel: type 12 trap, code=0
Stopped at      pfs_getattr+0x130:      movl    0x8(%eax),%eax
db> trace
pfs_getattr(f11e6b78,c06fda00,cf397b2c,f11e6b98,d23e8a80) at pfs_getattr+0x130
vn_stat(cf397b2c,f11e6c80,d23e8a80,0,c5eb0c60) at vn_stat+0x4f
lstat(c5eb0c60,f11e6d14,2,2,297) at lstat+0x6a
syscall(2f,2f,2f,805a200,805a248) at syscall+0x217
Xint0x80_syscall() at Xint0x80_syscall+0x1f
--- syscall (190, FreeBSD ELF32, lstat), eip = 0x280ac664, esp = 0xbfbf7594, ebp = 0xbfbf7620 ---

dosirak# addr2line -e kernel.debug 0xc04e1870
/usr/src/sys/i386/compile/DOSIRAK/../../../fs/pseudofs/pseudofs_vnops.c:200

[...]
        if (pvd->pvd_pid != NO_PID) {
                if ((proc = pfind(pvd->pvd_pid)) == NULL)
                        PFS_RETURN (ENOENT);
-->             vap->va_uid = proc->p_ucred->cr_ruid;

rwatson has a patch that works around this particular null pointer deref,
but the underlying cause is not addressed.

* ULE has lots of problems (poor performance on HTT, unable to disable HTT,
  incorrect load average reporting on SMP machines, ...). Should be turned
  off until an active maintainer is found.

* Frequent panic at boot time (when starting syslogd?)

panic: mutex Giant not owned at ../../../kern/vfs_subr.c:1365
at line 729 in file ../../../kern/kern_mutex.c
Debugger("panic")
Stopped at      Debugger+0x54:  xchgl   %ebx,in_Debugger.0
db> trace
Debugger(c0766179,c07d1e80,2d9,c0765560,100) at Debugger+0x54
__panic(c0765560,2d9,c07656c8,c0765803,c076e07d) at __panic+0xf5
_mtx_assert(c07d09e0,1,c076e07d,555,c68544ec) at _mtx_assert+0x11c
gbincore(c6889514,0,0,985,c07d5980) at gbincore+0x36
getblk(c6889514,0,0,800,0) at getblk+0xf8
breadn(c6889514,0,0,800,0) at breadn+0x52
bread(c6889514,0,0,800,0) at bread+0x4c
ffs_blkatoff(c6889514,0,0,0,e0f87998) at ffs_blkatoff+0x105
ufs_lookup(e0f87a50,e0f87a8c,c05c77e1,e0f87a50,e0f87bc0) at ufs_lookup+0x270
ufs_vnoperate(e0f87a50,e0f87bc0,e0f87bd4,c076e07d,c61d62a0) at ufs_vnoperate+0x18
vfs_cache_lookup(e0f87ad0,e0f87aec,c05cca32,e0f87ad0,c61d62a0) at vfs_cache_lookup+0x301
ufs_vnoperate(e0f87ad0,c61d62a0,0,c61d62a0,c61d62a0) at ufs_vnoperate+0x18
lookup(e0f87bac,0,c076dac5,a2,c61d62a0) at lookup+0x312
namei(e0f87bac,c62088b2,d,c62088c0,0) at namei+0x27e
unp_bind(c6a09000,c62088b0,c61d62a0,e0f87ca0,c05b5e23) at unp_bind+0xb1
uipc_bind(c6427a50,c62088b0,c61d62a0,e0f87cc8,c05ba0e7) at uipc_bind+0x2b
sobind(c6427a50,c62088b0,c61d62a0,0,c6427a50) at sobind+0x23
kern_bind(c61d62a0,3,c62088b0,c62088b0,0) at kern_bind+0x87
bind(c61d62a0,e0f87d14,c,434,3) at bind+0x43
syscall(2f,2f,2f,bfbfee10,0) at syscall+0x2a0

I added a GIANT_REQUIRED to namei() and confirmed that Giant is being held
there, so it's
being lost higher up in the stack trace.

* ---
Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0x104
fault code              = supervisor read, page not present
instruction pointer     = 0x8:0xc058a8cf
stack pointer           = 0x10:0xdcb34cc4
frame pointer           = 0x10:0xdcb34cec
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = resume, IOPL = 0
current process         = 50 (schedcpu)
trap number             = 12
panic: page fault

syncing disks, buffers remaining... panic: mi_switch: switch in a critical section

addr2line says the panic was in kern/sched_4bsd.c:327

                /*
                 * The kse slptimes are not touched in wakeup
                 * because the thread may not HAVE a KSE.
                 */
                if (ke->ke_state == KES_ONRUNQ) {
                        awake = 1;
                        ke->ke_flags &= ~KEF_DIDRUN;
--->            } else if ((ke->ke_state == KES_THREAD) &&
                    (TD_IS_RUNNING(ke->ke_thread))) {
                        awake = 1;

gdb -k got confused and couldn't make anything out of the backtrace.

* Machines with 4GB RAM do not auto-tune kernel memory parameters optimally
  and easily panic under load, with a panic message that gives no hint of
  what may be wrong or how to fix it.

* 8 Feb 2004 [bug report to -current, confirmed locally]

  After typing "truss -f fsck -p /", I see nothing. I press ^Z and type
  kill -9 % (killing truss). I now have these fine processes hanging dead
  in memory; they are immune to kill -9 and don't respond to kill -CONT
  either. ps axl:

UID   PID  PPID CPU PRI NI  VSZ RSS MWCHAN STAT TT     TIME COMMAND
  0 56974     1   0   8  0 1256 744 ppwait D    p1  0:00.00 fsck -p /
  0 56975 56974   0   8  0 1256 744 stopev DV   p1  0:00.00 fsck -p /

* ATA tends to panic the system when error conditions occur, e.g.

ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=45113664
ad0: DMA limited to UDMA33, non-ATA66 cable or device
ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=45113664
ad0: WARNING - removed from configuration
ata0-master: FAILURE - WRITE_DMA timed out

Fatal trap 12: page fault while in kernel mode
[...]
OTOH ATA on 4.x also gives panics with INVARIANTS when this kind of thing
happens.

[sparc64]

* Likes to panic with "panic: ipi_send: couldn't send ipi"

tmm suggested bumping the value of

./include/smp.h:#define IPI_RETRIES 100

This may have "fixed" the problem, or at least reduced the frequency.

* syscons does not work on ultra30 any more; looks like it might be related
  to differences in the keyboard controller on the u30. marius and kensmith
  are knowledgeable about this.

Kris
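
A note on the linprocfs item above: the trap shows `movl 0x8(%eax),%eax`
faulting at virtual address 0x8, which is consistent with reading a field at
offset 8 through a NULL pointer. Since the quoted code has already checked
`proc` against NULL, the NULL pointer is presumably `proc->p_ucred`. The
following is a minimal userland sketch of the kind of guard a workaround
could add. It is NOT rwatson's patch and not kernel code: the `*_sketch`
structures and `fill_owner()` are hypothetical stand-ins that carry only the
fields needed for the illustration, and returning ENOENT is an assumption
borrowed from the neighbouring `pfind()` check.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the credential; only illustrative fields. */
struct ucred_sketch {
	unsigned int cr_ref;
	unsigned int cr_uid;
	unsigned int cr_ruid;
};

/* Hypothetical stand-in for the process; p_ucred may be NULL. */
struct proc_sketch {
	struct ucred_sketch *p_ucred;
};

/* Hypothetical stand-in for the attribute structure being filled in. */
struct vattr_sketch {
	unsigned int va_uid;
};

/*
 * Copy the real uid into the attributes, but refuse to dereference a
 * missing process or credential. Returns 0 on success, ENOENT otherwise
 * (the error value is an assumption made for this sketch).
 */
static int
fill_owner(const struct proc_sketch *proc, struct vattr_sketch *vap)
{
	assert(vap != NULL);

	if (proc == NULL || proc->p_ucred == NULL)
		return (ENOENT);
	vap->va_uid = proc->p_ucred->cr_ruid;
	return (0);
}
```

As the item says, a guard like this only hides the symptom. Why the
credential pointer is NULL at that moment is still the open question.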