Hello,

I now have an additional piece of information about this bug.

Today I noticed that the second system -- which had not exhibited the problem so far -- has also started accumulating sockets in the SYN_RCVD state ("netstat -n"). Extrapolating from the current count and growth rate, it must have started on Saturday. The machine then had an uptime of 25 days -- about the same uptime as the first machine when it started to show this problem. Whatever triggers the bug, it seems to be uptime-related.

Both machines are running with HZ=1000 (the default). A signed int variable counting at HZ speed would overflow after 2^31 ticks, i.e. 2^31/HZ seconds, which at HZ=1000 happens to be 24.9 days ...

So it seems this is what's happening: Somewhere in the kernel (probably the TCP syncache code) there's a piece of code using uptime information in HZ resolution for timing purposes or whatever. However, it uses a signed int, maybe just for intermediate results, causing an overflow after 2^31/HZ seconds, which leads to wrong results and finally to sockets hanging in the SYN_RCVD state.

Could anyone familiar with that code help me locate the bug? I'm pretty sure that my analysis isn't far from the truth. I'm also pretty sure that a type cast in the right place will fix it. The problem is finding the right place.

Best regards
Oliver

-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Commercial register: Registergericht München, HRA 74606, Management: secnetix Verwaltungsgesellsch. mbH, Commercial register: Registergericht München, HRB 125758, Managing directors: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD services, products and more: http://www.secnetix.de/bsd

"C is quirky, flawed, and an enormous success." -- Dennis M. Ritchie.

Received on Mon Nov 26 2007 - 18:37:17 UTC
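The overflow arithmetic in the message can be checked with a small C sketch. It is not taken from the kernel: the function names are hypothetical, and it only illustrates the suspected class of bug, namely comparing absolute tick values as signed 32-bit integers, next to the usual wrap-safe comparison of a signed difference. It assumes a two's-complement platform, where converting a uint32_t above INT32_MAX to int32_t yields a negative value.

```c
#include <stdint.h>

/* Seconds until a signed 32-bit tick counter passes INT32_MAX:
 * 2^31 ticks divided by the tick rate hz (ticks per second). */
static double seconds_until_wrap(int hz)
{
    return 2147483648.0 / hz;
}

/* Hypothetical buggy check: compares absolute tick values as signed ints.
 * Once the deadline crosses 2^31 it turns negative, so the check reports
 * "expired" immediately. */
static int expired_naive(int32_t now, int32_t deadline)
{
    return now >= deadline;
}

/* Wrap-safe check: subtract in unsigned arithmetic (well defined on
 * overflow), then interpret the difference as signed. Valid as long as
 * the two values are less than 2^31 ticks apart. */
static int expired_wrapsafe(uint32_t now, uint32_t deadline)
{
    return (int32_t)(now - deadline) >= 0;
}
```

With hz = 1000, seconds_until_wrap() gives 2147483.648 seconds, about 24.86 days, which matches the observed uptime. For now = 0x7FFFFFF0 and a deadline 100 ticks later, expired_naive() already reports the deadline as passed, while expired_wrapsafe() does not.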