On 10 Jul, Daniel Lang wrote:
> Hi Robert,
>
> Robert Watson wrote on Wed, Jul 07, 2004 at 12:24:59PM -0400:
> [..]
>> Just to try ruling out possibilities -- have you run an extensive set of
>> hardware diagnostics? Most server class hardware ships with a decent
>> diagnostics disk, and I'm sure we can find some for you in the event your
>> hardware didn't come with some. While it's quite possibly a software
>> problem, tracking hardware problems using software symptoms constitutes
>> undesirable pain and so it wouldn't hurt to give that a spin. I remember
>> seing your earlier e-mails about running with WITNESS increasing the
>> chances of pain -- this could be a bug in WITNESS as you suggest, or it
>> could be that WITNESS increases the opportunities for a variety of locking
>> related races by increasing the cost of lock/unlock operations.
> [..]
>
> So I come back to the issue. As I already wrote, I guess I can
> rule out hardware problems now. I did a very thorough test with
> the Dell diagnosis utilities which showed no problems.
>
> Also, after John's patch I did not see any WITNESS related
> problems (so far) again. But I had the m_copy panic again
> (see subject). This time I did file a PR and did some more detailed
> gdb analysis. It is all documented at:
>
> http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/68889
>
> I am puzzled, because the stack frame on entering m_copym has
> 0x0 as first argument (m), however in the previous frame
> when m_copy() is called, the struct mbuf* argument is valid.

m_copym() overwrites its first and third arguments as it walks the mbuf
chain.

struct mbuf *
m_copym(struct mbuf *m, int off0, int len, int wait)
{
[snip]
        while (off > 0) {
                KASSERT(m != NULL, ("m_copym, offset > size of mbuf chain"));
                if (off < m->m_len)
                        break;
                off -= m->m_len;
                m = m->m_next;
        }
[snip]
        while (len > 0) {
                if (m == NULL) {
                        KASSERT(len == M_COPYALL,
                            ("m_copym, length > size of mbuf chain"));
                        break;
                }
[snip]
                if (len != M_COPYALL)
                        len -= n->m_len;
                off = 0;
                m = m->m_next;
                np = &n->m_next;
        }

The interesting bits would seem to be in stack frame 11, tcp_output().
Check the arguments being passed to m_copym():

#10 0xc0551805 in m_copym (m=0x0, off0=737, len=1222, wait=1)
    at /usr/src/sys/kern/uipc_mbuf.c:380

We don't know the original value of len that was passed to m_copym(),
because it could have been decremented if m_copym() iterated a few times
before it panicked, but it was at least 1222. If we add that to off0,
then the length of the original mbuf chain passed to m_copym() should
have been at least 737 + 1222 = 1959.

Now take a look at the call to m_copy():

#11 0xc059ed5a in tcp_output (tp=0xc3f50000)
    at /usr/src/sys/netinet/tcp_output.c:748
748             m->m_next = m_copy(so->so_snd.sb_mb, off, (int) len);

It would be interesting to see the value of len in stack frame 11, so
that we know the original value passed to m_copym().

The contents of *so are also interesting.

(kgdb) p *so
[snip]
    sb_cc = 975,
    sb_hiwat = 33580,
    sb_mbcnt = 1536,
    sb_mbmax = 262144,

I'm not sure whether sb_cc or sb_mbcnt is the important member, but I
think it is sb_cc. If so, the mbuf chain contains 975 bytes of data,
but tcp_output() is telling m_copy() to copy (at least) 1222 bytes of
data starting at offset 737. It looks to me like tcp_output() is passing
a bogus len value to m_copy().

Received on Sat Jul 10 2004 - 20:25:57 UTC
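
For anyone who wants to check the arithmetic above outside the kernel, here
is a rough userland sketch. It is not the kernel code: struct seg,
chain_bytes() and range_fits() are made-up names, and struct seg only stands
in for the m_len and m_next fields of struct mbuf. It walks a chain the same
way the two loops in m_copym() do, without copying anything, and reports
whether the requested range stays inside the chain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mbuf: only the fields the walk needs. */
struct seg {
	int		 len;	/* bytes of data in this segment (like m_len) */
	struct seg	*next;	/* next segment in the chain (like m_next) */
};

/* Total bytes held by the chain; this is the number to compare with sb_cc. */
static int
chain_bytes(const struct seg *s)
{
	int total = 0;

	for (; s != NULL; s = s->next)
		total += s->len;
	return (total);
}

/*
 * Mimic the two loops of m_copym() without copying anything.
 * Returns 1 if [off, off + len) lies inside the chain, 0 if the walk
 * would run off the end (the case where the real code sees m == NULL).
 */
static int
range_fits(const struct seg *s, int off, int len)
{
	assert(off >= 0 && len >= 0);

	/* Skip whole segments that lie before the offset. */
	while (off > 0) {
		if (s == NULL)
			return (0);
		if (off < s->len)
			break;
		off -= s->len;
		s = s->next;
	}
	/* Consume len bytes, starting off bytes into the current segment. */
	while (len > 0) {
		if (s == NULL)
			return (0);
		int avail = s->len - off;
		len -= (avail < len) ? avail : len;
		off = 0;
		s = s->next;
	}
	return (1);
}
```

With a two-segment chain of 512 and 463 bytes (the split is an assumption;
the kgdb output only gives the total of 975), range_fits(chain, 737, 1222)
returns 0, while range_fits(chain, 737, 238) returns 1, since 737 + 238 is
exactly 975.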