Lawrence Stewart wrote:
> The SACK code puts a global cap on the amount of memory that can be
> used for SACK accounting. The variable V_tcp_sack_globalholes tracks
> how many SACK holes are currently allocated across all active TCP
> connections. It gets incremented in tcp_sackhole_alloc() and
> decremented in tcp_sackhole_free() in netinet/tcp_sack.c.
>
> It turns out that there is currently no lock synchronising access to
> the variable, and the incrementing/decrementing is not being done
> atomically. In Kamigishi's case, the server had a traffic profile
> consisting of a large number of clients simultaneously connecting over
> cruddy links which was giving the SACK accounting a real workout. The
> inevitable race would strike one or more times, leaving the count of
> holes not in tune with reality, and eventually when traffic died down
> the variable would decrement down below 0, triggering the panic. Note
> that this panic only occurs if INVARIANTS is compiled into the kernel
> so the issue has been around for some time but not noticed.
>
> The attached patch makes use of the atomic(9) KPI to ensure
> incrementing/decrementing the variable is done atomically, which
> should fix the bug.
>
> Reviews/testing would be good so that we can get this into 8.0.

After applying the patch and rebuilding the kernel I've been getting
(similar) kernel panics way too often.
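For anyone following the thread, the idea in the quoted message (make the
counter updates atomic so concurrent increments and decrements cannot be
lost) can be sketched in userland. This is not the attached patch. It
assumes C11 <stdatomic.h> and POSIX threads as stand-ins for the kernel's
atomic(9) KPI, and the names globalholes, hole_alloc(), hole_free(),
worker() and run_workers() are made up for illustration only.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 4
#define NITER    100000

/* Hypothetical userland stand-in for V_tcp_sack_globalholes. */
static atomic_int globalholes;

/* Analogue of the increment done in tcp_sackhole_alloc(). */
static void hole_alloc(void)
{
    atomic_fetch_add(&globalholes, 1);
}

/* Analogue of the decrement done in tcp_sackhole_free(). */
static void hole_free(void)
{
    /* atomic_fetch_sub returns the value held before the subtraction. */
    int prev = atomic_fetch_sub(&globalholes, 1);

    /* Similar in spirit to an INVARIANTS check: never drop below zero. */
    assert(prev > 0);
}

/* Each worker allocates and frees one "hole" per iteration. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITER; i++) {
        hole_alloc();
        hole_free();
    }
    return NULL;
}

/* Runs NTHREADS concurrent workers and returns the final counter value. */
static int run_workers(void)
{
    pthread_t threads[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    return atomic_load(&globalholes);
}
```

Called from a main() and built with -pthread, run_workers() should return
0 because every increment is paired with a decrement and neither can be
lost. With a plain int updated by ++ and --, the final value may drift
away from 0, which is the kind of mismatch the quoted message describes.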
Two backtraces follow (note the uptime; it can vary from 4 minutes to
5-7 hours, but average time between traps is approximately 2 hours):

Unread portion of the kernel message buffer:
kernel trap 12 with interrupts disabled

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x1321288
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff8058ac15
stack pointer           = 0x28:0xffffff80403d45f0
frame pointer           = 0x28:0xffffff80403d4620
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = resume, IOPL = 0
current process         = 2350 (mysqld)
trap number             = 12
panic: page fault
cpuid = 0
Uptime: 3h28m59s

(kgdb) bt
#0  doadump () at pcpu.h:223
#1  0xffffffff80599a63 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:419
#2  0xffffffff80599ebc in panic (fmt=Variable "fmt" is not available.
) at /usr/src/sys/kern/kern_shutdown.c:575
#3  0xffffffff80861f8d in trap_fatal (frame=0xc, eva=Variable "eva" is not available.
) at /usr/src/sys/amd64/amd64/trap.c:852
#4  0xffffffff80862c25 in trap (frame=0xffffff80403d4540) at /usr/src/sys/amd64/amd64/trap.c:345
#5  0xffffffff80848f13 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:223
#6  0xffffffff8058ac15 in _mtx_lock_sleep (m=0xffffffff80e98863, tid=18446742975233083168, opts=Variable "opts" is not available.
) at /usr/src/sys/kern/kern_mutex.c:407
#7  0xffffffff8058ad6e in _mtx_lock_flags (m=Variable "m" is not available.
) at /usr/src/sys/kern/kern_mutex.c:203
#8  0xffffffff80647125 in netisr_queue_internal (proto=1, m=0xffffff0004ed1e00, cpuid=Variable "cpuid" is not available.
) at /usr/src/sys/net/netisr.c:830
#9  0xffffffff80647209 in netisr_queue_src (proto=1, source=Variable "source" is not available.
) at /usr/src/sys/net/netisr.c:860
#10 0xffffffff80643180 in if_simloop (ifp=0xffffff0004609800, m=0xffffff0004ed1e00, af=2, hlen=0) at /usr/src/sys/net/if_loop.c:400
#11 0xffffffff806432d6 in looutput (ifp=0xffffff0004609800, m=0xffffff0004ed1e00, dst=0xffffff80403d47a0, ro=Variable "ro" is not available.
) at /usr/src/sys/net/if_loop.c:296
#12 0xffffffff806a2237 in ip_output (m=0xffffff0004ed1e00, opt=Variable "opt" is not available.
) at /usr/src/sys/netinet/ip_output.c:624
#13 0xffffffff80707874 in tcp_output (tp=0xffffff00137292d8) at /usr/src/sys/netinet/tcp_output.c:1188
#14 0xffffffff80713f29 in tcp_usr_send (so=0xffffff002bc32550, flags=0, m=Variable "m" is not available.
) at tcp_offload.h:269
#15 0xffffffff805fd197 in sosend_generic (so=0xffffff002bc32550, addr=0x0, uio=0xffffff80403d4b10, top=0xffffff0004efeb00, control=0x0, flags=Variable "flags" is not available.
) at /usr/src/sys/kern/uipc_socket.c:1257
#16 0xffffffff805e18a7 in soo_write (fp=Variable "fp" is not available.
) at /usr/src/sys/kern/sys_socket.c:102
#17 0xffffffff805dad25 in dofilewrite (td=0xffffff003db34720, fd=30, fp=0xffffff0013d35a50, auio=0xffffff80403d4b10, offset=Variable "offset" is not available.
) at file.h:239
#18 0xffffffff805dc300 in kern_writev (td=0xffffff003db34720, fd=30, auio=0xffffff80403d4b10) at /usr/src/sys/kern/sys_generic.c:445
#19 0xffffffff805dc405 in write (td=Variable "td" is not available.
) at /usr/src/sys/kern/sys_generic.c:361
#20 0xffffffff808624cf in syscall (frame=0xffffff80403d4c90) at /usr/src/sys/amd64/amd64/trap.c:984
#21 0xffffffff808491a0 in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:364
#22 0x00000008014d8efc in ?? ()
Previous frame inner to this frame (corrupt stack?)

Unread portion of the kernel message buffer:
kernel trap 12 with interrupts disabled

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x1321288
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff8058ac15
stack pointer           = 0x28:0xffffff8040145600
frame pointer           = 0x28:0xffffff8040145630
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = resume, IOPL = 0
current process         = 1932 (lighttpd)
trap number             = 12
panic: page fault
cpuid = 1
Uptime: 1h0m9s

(kgdb) bt
#0  doadump () at pcpu.h:223
#1  0xffffffff80599a63 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:419
#2  0xffffffff80599ebc in panic (fmt=Variable "fmt" is not available.
) at /usr/src/sys/kern/kern_shutdown.c:575
#3  0xffffffff80861f8d in trap_fatal (frame=0xc, eva=Variable "eva" is not available.
) at /usr/src/sys/amd64/amd64/trap.c:852
#4  0xffffffff80862c25 in trap (frame=0xffffff8040145550) at /usr/src/sys/amd64/amd64/trap.c:345
#5  0xffffffff80848f13 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:223
#6  0xffffffff8058ac15 in _mtx_lock_sleep (m=0xffffffff80e98863, tid=18446742974276744992, opts=Variable "opts" is not available.
) at /usr/src/sys/kern/kern_mutex.c:407
#7  0xffffffff8058ad6e in _mtx_lock_flags (m=Variable "m" is not available.
) at /usr/src/sys/kern/kern_mutex.c:203
#8  0xffffffff80647125 in netisr_queue_internal (proto=1, m=0xffffff002453d400, cpuid=Variable "cpuid" is not available.
) at /usr/src/sys/net/netisr.c:830
#9  0xffffffff80647209 in netisr_queue_src (proto=1, source=Variable "source" is not available.
) at /usr/src/sys/net/netisr.c:860
#10 0xffffffff80643180 in if_simloop (ifp=0xffffff0004609800, m=0xffffff002453d400, af=2, hlen=0) at /usr/src/sys/net/if_loop.c:400
#11 0xffffffff806432d6 in looutput (ifp=0xffffff0004609800, m=0xffffff002453d400, dst=0xffffff80401457b0, ro=Variable "ro" is not available.
) at /usr/src/sys/net/if_loop.c:296
#12 0xffffffff806a2237 in ip_output (m=0xffffff002453d400, opt=Variable "opt" is not available.
) at /usr/src/sys/netinet/ip_output.c:624
#13 0xffffffff80707874 in tcp_output (tp=0xffffff0014bb5b60) at /usr/src/sys/netinet/tcp_output.c:1188
#14 0xffffffff80713f29 in tcp_usr_send (so=0xffffff00047fb2a8, flags=0, m=Variable "m" is not available.
) at tcp_offload.h:269
#15 0xffffffff805fd197 in sosend_generic (so=0xffffff00047fb2a8, addr=0x0, uio=0xffffff000f4d4100, top=0xffffff00236c5d00, control=0x0, flags=Variable "flags" is not available.
) at /usr/src/sys/kern/uipc_socket.c:1257
#16 0xffffffff805e18a7 in soo_write (fp=Variable "fp" is not available.
) at /usr/src/sys/kern/sys_socket.c:102
#17 0xffffffff805dad25 in dofilewrite (td=0xffffff0004b2b720, fd=10, fp=0xffffff0070f18a50, auio=0xffffff000f4d4100, offset=Variable "offset" is not available.
) at file.h:239
#18 0xffffffff805dc300 in kern_writev (td=0xffffff0004b2b720, fd=10, auio=0xffffff000f4d4100) at /usr/src/sys/kern/sys_generic.c:445
#19 0xffffffff805dc381 in writev (td=0xffffff0004b2b720, uap=0xffffff8040145c00) at /usr/src/sys/kern/sys_generic.c:431
#20 0xffffffff808624cf in syscall (frame=0xffffff8040145c90) at /usr/src/sys/amd64/amd64/trap.c:984
#21 0xffffffff808491a0 in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:364
#22 0x0000000800c5aacc in ?? ()
Previous frame inner to this frame (corrupt stack?)

-- 
Kamigishi Rei
KREI-RIPE

Received on Sun Jul 05 2009 - 08:47:05 UTC