One of my colleagues brought down a node on my cluster while running an MPI job. The kernel coredump shows:

Script started on Mon May 21 17:02:53 2007
node12:root[201] kgdb kernel.debug vmcore.0
[GDB will not be able to debug user-mode threads: /usr/lib/libthread_db.so: Undefined symbol "ps_pglobal_lookup"]

Unread portion of the kernel message buffer:
panic: sbflush_internal: cc 4294965848 || mb 0 || mbcnt 0
cpuid = 0
Uptime: 7h6m34s
Physical memory: 16119 MB
Dumping 631 MB: 616 600 584 568 552 536 520 504 488 472 456 440 424 408 392 376 360 344 328 312 296 280 264 248 232 216 200 184 168 152 136 120 104 88 72 56 40 24 8

#0  doadump () at pcpu.h:171
171     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) bt
#0  doadump () at pcpu.h:171
#1  0xffffffff802a01eb in boot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:409
#2  0xffffffff802a08cc in panic (fmt=0xffffff03157e0d20 "")
    at /usr/src/sys/kern/kern_shutdown.c:563
#3  0xffffffff802f4d23 in sbflush_internal (sb=0xffffff031243ab68)
    at /usr/src/sys/kern/uipc_sockbuf.c:808
#4  0xffffffff802f50cb in sbflush (sb=0xffffff031243ab68)
    at /usr/src/sys/kern/uipc_sockbuf.c:825
#5  0xffffffff803b7246 in tcp_disconnect (tp=0xffffff03101f73e0)
    at /usr/src/sys/netinet/tcp_usrreq.c:1496
#6  0xffffffff803b7539 in tcp_usr_disconnect (so=0xffffff0311a04690)
    at /usr/src/sys/netinet/tcp_usrreq.c:584
#7  0xffffffff802f67f2 in soclose (so=0xffffff031243aae0)
    at /usr/src/sys/kern/uipc_socket.c:642
#8  0xffffffff802de133 in soo_close (fp=0xffffff0312402258, td=0x0)
    at /usr/src/sys/kern/sys_socket.c:296
#9  0xffffffff8027479f in fdrop (fp=0xffffff0312402258, td=0xffffff03157e0d20)
    at file.h:297
#10 0xffffffff80274aaf in closef (fp=0xffffff0312402258, td=0xffffff03157e0d20)
    at /usr/src/sys/kern/kern_descrip.c:1928
#11 0xffffffff80275f54 in fdfree (td=0xffffff03157e0d20)
    at /usr/src/sys/kern/kern_descrip.c:1638
#12 0xffffffff8027f537 in exit1 (td=0xffffff03157e0d20, rv=9)
    at /usr/src/sys/kern/kern_exit.c:271
#13 0xffffffff802a578f in sigexit (td=0xffffff03157e0d20, sig=9)
    at /usr/src/sys/kern/kern_sig.c:2862
#14 0xffffffff802a63ac in postsig (sig=9)
    at /usr/src/sys/kern/kern_sig.c:2741
#15 0xffffffff802d3547 in ast (framep=0xffffffffb0580c70)
    at /usr/src/sys/kern/subr_trap.c:271
#16 0xffffffff804787f0 in Xfast_syscall ()
---Type <return> to continue, or q <return> to quit---
    at /usr/src/sys/amd64/amd64/exception.S:283
#17 0x00000003c0c7294c in ?? ()
Previous frame inner to this frame (corrupt stack?)
(kgdb) quit

I have the debug kernel and vmcore file, and can make them available.

The dmesg for the node that panicked is:

Copyright (c) 1992-2007 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 7.0-CURRENT #6: Fri May 18 10:19:43 PDT 2007
    kargl_at_node10.cimu.org:/usr/obj/usr/src/sys/HPC
ACPI APIC Table: <A M I OEMAPIC >
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Dual Core AMD Opteron(tm) Processor 280 (2391.55-MHz K8-class CPU)
  Origin = "AuthenticAMD"  Id = 0x20f12  Stepping = 2
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x1<SSE3>
  AMD Features=0xe2500800<SYSCALL,NX,MMX+,FFXSR,LM,3DNow!+,3DNow!>
  AMD Features2=0x3<LAHF,CMP>
  Cores per package: 2
usable memory = 16902705152 (16119 MB)
avail memory = 16387166208 (15628 MB)
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
 cpu0 (BSP): APIC ID: 0
 cpu1 (AP): APIC ID: 1
 cpu2 (AP): APIC ID: 2
 cpu3 (AP): APIC ID: 3
MADT: Forcing active-low polarity and level trigger for SCI
ioapic0 <Version 1.1> irqs 0-23 on motherboard
ioapic1 <Version 1.1> irqs 24-27 on motherboard
ioapic2 <Version 1.1> irqs 28-31 on motherboard
acpi0: <A M I OEMXSDT> on motherboard
acpi0: [ITHREAD]
acpi_hpet0: <High Precision Event Timer> iomem 0xfec01000-0xfec013ff on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 2000
acpi0: Power Button (fixed)
acpi0: reservation of 0, a0000 (3) failed
acpi0: reservation of 100000, eff00000 (3) failed
Timecounter "ACPI-safe" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x1008-0x100b on acpi0
cpu0: <ACPI CPU> on acpi0
acpi_throttle0: <ACPI CPU Throttling> on cpu0
cpu1: <ACPI CPU> on acpi0
cpu2: <ACPI CPU> on acpi0
cpu3: <ACPI CPU> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pcib1: <ACPI PCI-PCI bridge> at device 6.0 on pci0
pci3: <ACPI PCI bus> on pcib1
ohci0: <OHCI (generic) USB controller> mem 0xfeafc000-0xfeafcfff irq 19 at device 0.0 on pci3
ohci0: [GIANT-LOCKED]
ohci0: [ITHREAD]
usb0: OHCI version 1.0, legacy support
usb0: SMM does not respond, resetting
usb0: <OHCI (generic) USB controller> on ohci0
usb0: USB revision 1.0
uhub0: <AMD OHCI root hub, class 9/0, rev 1.00/1.00, addr 1> on usb0
device_attach: uhub0 attach returned 6
usb0: port 0, set config at addr 1 failed
usb0: root hub problem, error=4
ohci1: <OHCI (generic) USB controller> mem 0xfeafd000-0xfeafdfff irq 19 at device 0.1 on pci3
ohci1: [GIANT-LOCKED]
ohci1: [ITHREAD]
usb1: OHCI version 1.0, legacy support
usb1: SMM does not respond, resetting
usb1: <OHCI (generic) USB controller> on ohci1
usb1: USB revision 1.0
uhub1: <AMD OHCI root hub, class 9/0, rev 1.00/1.00, addr 1> on usb1
uhub1: 3 ports with 3 removable, self powered
atapci0: <SiI 3114 SATA150 controller> port 0xbc00-0xbc07,0xb400-0xb403,0xb000-0xb007,0xac00-0xac03,0xa800-0xa80f mem 0xfeafec00-0xfeafefff irq 17 at device 5.0 on pci3
atapci0: [ITHREAD]
ata2: <ATA channel 0> on atapci0
ata2: [ITHREAD]
ata3: <ATA channel 1> on atapci0
ata3: [ITHREAD]
ata4: <ATA channel 2> on atapci0
ata4: [ITHREAD]
ata5: <ATA channel 3> on atapci0
ata5: [ITHREAD]
vgapci0: <VGA-compatible display> port 0xb800-0xb8ff mem 0xfd000000-0xfdffffff,0xfeaff000-0xfeafffff irq 18 at device 6.0 on pci3
isab0: <PCI-ISA bridge> at device 7.0 on pci0
isa0: <ISA bus> on isab0
atapci1: <AMD 8111 UDMA133 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xffa0-0xffaf at device 7.1 on pci0
ata0: <ATA channel 0> on atapci1
ata0: [ITHREAD]
ata1: <ATA channel 1> on atapci1
ata1: [ITHREAD]
amdsmb0: <AMD-8111 SMBus 2.0 Controller> port 0xcc00-0xcc1f irq 19 at device 7.2 on pci0
smbus0: <System Management Bus> on amdsmb0
smb0: <SMBus generic I/O> on smbus0
amdpm0: <AMD 756/766/768/8111 Power Management Controller> port 0x10e0-0x10ff at device 7.3 on pci0
smbus1: <System Management Bus> on amdpm0
smb1: <SMBus generic I/O> on smbus1
pcib2: <ACPI PCI-PCI bridge> at device 10.0 on pci0
pci2: <ACPI PCI bus> on pcib2
pci2:9:0: bad VPD cksum, remain 72
bge0: <Broadcom Gigabit Ethernet Controller, ASIC rev. 0x2003> mem 0xfc8c0000-0xfc8cffff,0xfc8b0000-0xfc8bffff irq 24 at device 9.0 on pci2
miibus0: <MII bus> on bge0
brgphy0: <BCM5704 10/100/1000baseTX PHY> PHY 1 on miibus0
brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
bge0: Ethernet address: 00:e0:81:34:e1:4c
bge0: [ITHREAD]
pci2:9:1: bad VPD cksum, remain 72
bge1: <Broadcom Gigabit Ethernet Controller, ASIC rev. 0x2003> mem 0xfc8f0000-0xfc8fffff,0xfc8e0000-0xfc8effff irq 25 at device 9.1 on pci2
miibus1: <MII bus> on bge1
brgphy1: <BCM5704 10/100/1000baseTX PHY> PHY 1 on miibus1
brgphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
bge1: Ethernet address: 00:e0:81:34:e1:4d
bge1: [ITHREAD]
pcib3: <ACPI PCI-PCI bridge> at device 11.0 on pci0
pci1: <ACPI PCI bus> on pcib3
acpi_button0: <Power Button> on acpi0
atkbdc0: <Keyboard controller (i8042)> port 0x60,0x64 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
atkbd0: [ITHREAD]
sio0: configured irq 4 not in bitmap of probed irqs 0
sio0: port may not be enabled
sio0: configured irq 4 not in bitmap of probed irqs 0
sio0: port may not be enabled
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
sio0: [FILTER]
sio1: configured irq 3 not in bitmap of probed irqs 0
sio1: port may not be enabled
sio1: configured irq 3 not in bitmap of probed irqs 0
sio1: port may not be enabled
sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
sio1: type 16550A
sio1: [FILTER]
fdc0: <floppy drive controller (FDE)> port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: does not respond
device_attach: fdc0 attach returned 6
ppc0: <Parallel port> port 0x378-0x37f irq 7 on acpi0
ppc0: Generic chipset (NIBBLE-only) in COMPATIBLE mode
ppbus0: <Parallel port bus> on ppc0
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
ppc0: [GIANT-LOCKED]
ppc0: [ITHREAD]
fdc0: <floppy drive controller (FDE)> port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: does not respond
device_attach: fdc0 attach returned 6
orm0: <ISA Option ROMs> at iomem 0xc0000-0xc7fff,0xc8000-0xcc7ff,0xcc800-0xcdfff,0xce000-0xcf7ff,0xcf800-0xd07ff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <8 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounters tick every 1.000 msec
ad4: 239372MB <WDC WD2500YD-01NVB1 10.02E01> at ata2-master SATA150
SMP: AP CPU #1 Launched!
SMP: AP CPU #2 Launched!
SMP: AP CPU #3 Launched!
hwpmc: TSC/1/0x20<REA> K8/4/0x1ff<INT,USR,SYS,EDG,THR,REA,WRI,INV,QUA>
Trying to mount root from ufs:/dev/ad4s1a
WARNING: / was not properly dismounted

-- 
Steve
http://troutmask.apl.washington.edu/~kargl/

Received on Mon May 21 2007 - 22:15:47 UTC