On 10 Jul, Robert Watson wrote: > > On Sat, 10 Jul 2004, Daniel Lang wrote: > >> So I come back to the issue. As I already wrote, I guess I can rule out >> hardware problems now. I did a very thorough test with the Dell >> diagnosis utilities which showed no problems. > > Thanks! > >> Also, after John's patch I did not see any WITNESS related problems (so >> far) again. But I had the m_copy panic again (see subject). This time I >> did file a PR and did some more detailed gdb analysis. It is all >> documented at: >> >> http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/68889 >> >> I am puzzled, because the stack frame on entering m_copym has 0x0 as >> first argument (m), however in the previous frame when m_copy() is >> called, the struct mbuf* argument is valid. >> >> Ok, I just realized that there is a difference m_copy() and m_copym() >> are apparently different functions. Is this a makro/#define discrepancy >> it seems that that m_copym() is the function which is called in this >> line of code. >> >> Ah, I found it: >> >> sys/mbuf.h:#define m_copy(m, o, l) m_copym((m), (o), (l), M_DONTWAIT) >> >> so, the puzzle remains, since the arguments passed are kept, except that >> M_DONTWAIT flag is added. >> >> Is this a trashed stack? > > Possibly, but notice that the m_copym() function modifies its copy of 'm' > in the stack as part of its work -- it uses 'm' to iterate the mbuf chain > passed in in order to move to the necessary starting offset for the copy. > Note that the requested offset ('off0') is 737, and the requested 'len' is > at least 1222, so the loop starting at line 369 will walk until it either > gets far enough or the "offset > size" assertion triggers: > > while (off > 0) { > KASSERT(m != NULL, ("m_copym, offset > size of mbuf chain")); > if (off < m->m_len) > break; > off -= m->m_len; > m = m->m_next; > } > > Since that assertion didn't trigger, we can assume m_copym() successfully > walked at least 'off0' (737) bytes. The problem appears to be that it ran > out of mbufs in which to find data to copy, as it hit the end of the chain > (m == NULL): > > while (len > 0) { > if (m == NULL) { > KASSERT(len == M_COPYALL, > ("m_copym, length > size of mbuf chain")); > break; > } > > So the initial conclusion is that the caller requested that more data be > copied from the chain than is actually present in the chain. This > suggests a bug in socket buffer management or the TCP code. It's > interesting to note that the socket buffer believes it contains less than > the requested length -- 'so_snd.sb_mbcnt' is 1536, which is arguably less > than 737 + 1222 (although we don't know, I think, if it's iterated or not > and therefore decreased the value of 'len'). Could you print the value of > 'top' in the m_copym() stack? That will tell us if it's on the first mbuf > or not. > > It sounds like the socket buffer state may be inconsistent with the TCP > PCB state, or that the expectations in tcp_offset() are wrong. I've CC'd > Paul because he's had his hands in the new SACK code that was merged, and > it has its hands in that bit of the output code. Here are some things you > might want to try: > > (1) Try running with TCP SACK disabled. Set the > 'net.inet.tcp.sack.enable' sysctl to 0 to try this. > > (2) Try adding some assertions just before the copy to m_copy() in > tcp_output(). I'd suggest something like the following: I'm very suspicious of the SACK code. In the non-SACK case, len gets set here: if (!sack_rxmit) len = ((long)ulmin(so->so_snd.sb_cc, sendwin) - off); but when the system panics len+off > sb_cc. It would be interesting to look at *tp and *p in the tcp_output stack frame. If I had to guess, I'd say that either tp->snd_recover-tp->snd_una or p->end-tp->snd_una is greater than so->so_snd.sb_cc.Received on Sat Jul 10 2004 - 21:24:57 UTC
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:38:01 UTC