Re: panic: m_copym, length > size of mbuf chain

From: Daniel Lang <dl_at_leo.org> Date: Sat, 10 Jul 2004 18:38:25 +0200 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:38:01 UTC

Hi Robert,

Robert Watson wrote on Sat, Jul 10, 2004 at 10:27:23AM -0400:
[..]
> > Is this a trashed stack? 
> 
> Possibly, but notice that the m_copym() function modifies its copy of 'm'
> in the stack as part of its work -- it uses 'm' to iterate the mbuf chain
> passed in in order to move to the necessary starting offset for the copy.

Ah well. I thought the value of the parameters, which is displayed
in the stack trace is the value on function call. But of course
it's the value at this moment. Sorry for the confusion.

> Note that the requested offset ('off0') is 737, and the requested 'len' is
> at least 1222, so the loop starting at line 369 will walk until it either
> gets far enough or the "offset > size" assertion triggers:
> 
>         while (off > 0) {
>                 KASSERT(m != NULL, ("m_copym, offset > size of mbuf chain"));
>                 if (off < m->m_len)
>                         break;
>                 off -= m->m_len;
>                 m = m->m_next;
>         }
> 
> Since that assertion didn't trigger, we can assume m_copym() successfully
> walked at least 'off0' (737) bytes.  The problem appears to be that it ran
> out of mbufs in which to find data to copy, as it hit the end of the chain
> (m == NULL):
> 
>         while (len > 0) {
>                 if (m == NULL) {
>                         KASSERT(len == M_COPYALL,
>                             ("m_copym, length > size of mbuf chain"));
>                         break;
>                 }
> 
> So the initial conclusion is that the caller requested that more data be
> copied from the chain than is actually present in the chain.  This
> suggests a bug in socket buffer management or the TCP code.  It's

I understand.

> interesting to note that the socket buffer believes it contains less than
> the requested length -- 'so_snd.sb_mbcnt' is 1536, which is arguably less
> than 737 + 1222 (although we don't know, I think, if it's iterated or not
> and therefore decreased the value of 'len').  Could you print the value of
> 'top' in the m_copym() stack?  That will tell us if it's on the first mbuf
> or not.

(kgdb) p *top
$2 = {m_hdr = {mh_next = 0xca90c300, mh_nextpkt = 0x0,
    mh_data = 0xc3cbea18 "220- -------\r\n220- \r\n220-      Please check:\r\n220-        http://www.leo.org/neues/maintenance_en.html\r\n220-      first, if you experience any problems.\r\n220- \r\n\n\r\n220- \r\n", mh_len = 161,
    mh_flags = 0, mh_type = 1}, M_dat = {MH = {MH_pkthdr = {
        rcvif = 0x2d303232, len = 757935392, header = 0x2d2d2d2d,
        csum_flags = 842140173, csum_data = 220212528, tags = {
          slh_first = 0x3032320a}}, MH_dat = {MH_ext = {ext_buf = 0x2020202d---Can't read userspace from dump, or kernel process---

Hmmm, this looks a bit like this one:

(kgdb) p *so->so_snd.sb_mb->m_hdr.mh_next->m_hdr.mh_next->m_hdr.mh_next->m_hdr.mh_next
$7 = {m_hdr = {mh_next = 0xca423300, mh_nextpkt = 0x0,
    mh_data = 0xca617630 "220- -------\r\n220- \r\n220-      Please check:\r\n220-        http://www.leo.org/neues/maintenance_en.html\r\n220-      first, if you experience any problems.\r\n220- \r\nØŠ^®}Tè=Æq\216×\222ÝapZÍ\r2.Ê\030º\"ÛÜ#t\016\001ò\025þ\t", mh_len = 161, mh_flags = 2, mh_type = 1}, M_dat = {MH = {
      MH_pkthdr = {rcvif = 0x0, len = 14, header = 0x2d2d2d2d, csum_flags = 0,
        csum_data = 16, tags = {slh_first = 0x0}}, MH_dat = {MH_ext = {
          ext_buf = 0x2d303232---Can't read userspace from dump, or kernel process---

But not exactly? Hmm, I guess there are a lot of mbufs with similar
content, if there are many clients connecting. Hm. Would there be
a simple way to find it, if it would be useful at all....

> It sounds like the socket buffer state may be inconsistent with the TCP
> PCB state, or that the expectations in tcp_offset() are wrong.  I've CC'd
> Paul because he's had his hands in the new SACK code that was merged, and
> it has its hands in that bit of the output code.  Here are some things you
> might want to try:
> 
> (1) Try running with TCP SACK disabled.  Set the
>     'net.inet.tcp.sack.enable' sysctl to 0 to try this.

Ok, set in /boot/loader.conf

> (2) Try adding some assertions just before the copy to m_copy() in
>     tcp_output().  I'd suggest something like the following:
> 

[ patch ]

Ok, added, thanks.

>     These printfs are oriented at making sure the socket buffer state is
>     consistent, and don't attempt to dump the TCP state used to construct
>     'len' or 'off'.  If the buffer is consistent (i.e., sb_cc ==
>     m_length()), then the next logical thing to do is figure out how
>     (len + off) became greater than the available buffer space.  If you
>     look up a bit higher in tcp_output(), you'll see the values being set,
>     trimmed, etc, based on various values in the tcpcb.

Ok, so I run with this assertion and wait if this new one is
triggered or again the m_copym() one.

> (3) If the socket buffer is inconsistent -- i.e., the length of the buffer
>     in the socket buffer meta-data doesn't match the actual length of the
>     mbufs in the buffer -- try compiling in "options SOCKBUF_DEBUG", which
>     turns on additional socket buffer diagnostics. 

Ok will do. Many thanks.

Best regards,
 Daniel
-- 
IRCnet: Mr-Spock              - Truth lies in the eye of the beholder - 
 Daniel Lang * dl_at_leo.org * +49 89 289 18532 * http://www.leo.org/~dl/