curious results

From: Randall Stewart <rrs_at_cisco.com>
Date: Tue, 12 Dec 2006 08:00:19 -0500
All:

I have two machines I am testing with... an Intel Xeon
2.8 GHz w/hyperthreading... and an Intel P4D dual core.

Now I am testing SCTP and how it interacts with SMP.. or
that is my intention. I have a snapshot of the MPI code
that one of my friends at UBC has been working on
with Argonne Labs... This uses SCTP :-)

He has written a series of tests which all now pass
(after a LOT of bugs and LORs.. all kinds of fun :D)

Now a separate test he has written is something called
mywaitall. Basically you set up a number of processes, and
all of them get up and settled in. Then they coordinate
(near as I can tell) by sharing SCTP port and address info
with each other via TCP. Then they switch over and
use the SCTP one-to-many model.. sending data to each other
and setting up implicit connections.

This means that running -np 10.. I have 10 endpoints with
90 associations ... I am doing this only on the local
host side.

I run this in a loop:
while true
do
mpiexec -np 10 ./mywaitall
echo "-------"
done


Now on the Xeon machine I see a very curious failure. After
about a day of running this, I get two endpoints stuck:
one has data to be transmitted and the other is waiting for
it.. (the way the program works is they all send/rcv some
info and then say goodbye to each other).

Now I am seeing loss because the app version I have is
buggy... the author did not handle sending in non-blocking
mode correctly. He expected a -1/EAGAIN when the send could not
proceed.. but he gets a 0/0 back instead.. so he ends up in a
tight loop doing
while (sent > -1) {
    ret = dothesend();
    if (ret < 0)
       break; // error
    sent += ret;
}


Which means we peg the CPU sending with full send windows.. He
has fixed this.. but I keep testing with the buggy version since
it finds so many unique problems :-)
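
For reference, here is a minimal sketch of a send loop that does not spin.
It is not the fix that went into the MPI code (I have not shown that here);
the helper name send_all_nonblocking is hypothetical, and it is written for
a generic stream-style socket where partial sends are possible, which is an
assumption. The point is only that a 0 return and a -1/EAGAIN are both
treated as "no room right now", and the caller waits for the socket to
become writable instead of retrying immediately.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Send all of buf on a (possibly non-blocking) socket.
 * Hypothetical helper for illustration, not the code from the MPI test.
 * Returns 0 once everything is sent, -1 on a real error (errno is set).
 */
int
send_all_nonblocking(int fd, const char *buf, size_t len)
{
	size_t sent = 0;

	assert(buf != NULL || len == 0);

	while (sent < len) {
		ssize_t ret = send(fd, buf + sent, len - sent, 0);

		if (ret > 0) {
			sent += (size_t)ret;	/* made progress, keep going */
			continue;
		}
		if (ret < 0 && errno == EINTR)
			continue;		/* interrupted, just retry */
		if (ret < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
			return (-1);		/* real error */

		/* ret == 0 or would-block: sleep until writable, then retry. */
		struct pollfd pfd = { .fd = fd, .events = POLLOUT };
		if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
			return (-1);
	}
	return (0);
}
```

With this shape the process sleeps in poll() while the send window is full
rather than pegging the CPU.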

But back to my scenario. Now I have, in the past, fixed many
bugs like this that were an SCTP problem :-) but I don't think
this one is anymore..

When I find and look at the assoc's in question the sender
has some outstanding chunks (4 in the last instance) and
its timer is running as far as it is concerned. Here is
the actual callout structure:

$6 = {c_links = {sle = {sle_next = 0x0},
                  tqe = {tqe_next = 0x0,
                  tqe_prev = 0xc6dd02a8}},
        c_time = 264796819,
        c_arg = 0xc27201ec,
        c_func = 0xc0748458 <sctp_timeout_handler>,
        c_mtx = 0x0,
        c_flags = 22}

Now there is another part to the structure (the c_arg) and if
I look in there I see things like it being started 1 second
before (which one would expect... I save the ticks of
when it was started). I also have a stopped_from field
that gets set any time someone does a stop of the timer
and when the callout is called it sets it to various
values. The timer structure is opaque to the SCTP code so
it does not play with these values.. and when you look at
the ticks, it's long past expiration..
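
The "long past expiration" check can be sketched roughly like this. The
struct and field names below are hypothetical stand-ins for the bookkeeping
described above (they are not the real SCTP or callout definitions), and I
am assuming the tick counter is an int that may wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the per-timer bookkeeping; the field names
 * are illustrative only.
 */
struct timer_note {
	int started_ticks;	/* tick count saved when the timer was started */
	int expire_ticks;	/* tick count at which it should fire (c_time) */
};

/*
 * True if the expiry tick has already passed. The subtraction is done in
 * unsigned arithmetic and the result is read back as signed, so the
 * comparison still works when the tick counter has wrapped.
 */
bool
timer_overdue(const struct timer_note *t, int now_ticks)
{
	assert(t != NULL);
	return ((int)((unsigned)now_ticks - (unsigned)t->expire_ticks) > 0);
}
```

A callout that still claims to be pending while a check like this returns
true is the situation I am describing.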

Note that the 22 indicates NO_MTX | PENDING | ACTIVE.
And yet the linked-list pointers in c_links are NOT set to
anything like what I normally see in these dudes..
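
To spell out the flag arithmetic: 22 is 0x16, which is 0x10 | 0x04 | 0x02.
The sketch below decodes that. The bit-to-name mapping is an assumption
based on my reading above (in particular, 0x10 is simply the bit I am
calling NO_MTX); the X_ prefixed macros are hypothetical, so check
sys/callout.h on the kernel in question before trusting the names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed bit values, for illustration only. */
#define X_CALLOUT_ACTIVE	0x02	/* assumed: callout is active */
#define X_CALLOUT_PENDING	0x04	/* assumed: callout is waiting to fire */
#define X_CALLOUT_NO_MTX	0x10	/* the bit read as NO_MTX in the post */

/* Bits in flags that none of the three assumed names account for. */
int
unknown_callout_bits(int flags)
{
	return (flags &
	    ~(X_CALLOUT_ACTIVE | X_CALLOUT_PENDING | X_CALLOUT_NO_MTX));
}

/* Write the names of the set bits into out and return out. */
const char *
decode_callout_flags(int flags, char *out, size_t outlen)
{
	assert(out != NULL && outlen > 0);
	snprintf(out, outlen, "%s%s%s",
	    (flags & X_CALLOUT_ACTIVE) ? "ACTIVE " : "",
	    (flags & X_CALLOUT_PENDING) ? "PENDING " : "",
	    (flags & X_CALLOUT_NO_MTX) ? "NO_MTX " : "");
	return (out);
}
```

Under those assumed values, c_flags = 22 has all three bits set and nothing
left over.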

Now I did put an extra global SCTP lock in before starting/stopping
the timer. This did make it take 2-3 days to hit this case.. but
it still happens....

Has anyone seen this ?? I have looked at the timer code and I
do see a mutex spin lock.. but I can't see how SCTP would be
causing this... I am stumped .. any suggestions would be welcome ;-)

--------------------

My second problem is even more bizarre.. if that's possible...:-D

The other machine, the P4D, runs along fine for a day or so .. and
then it will just stop.. and I mean stop.. if you have a top window
up (I have X off.. to panic it when I want :-D) you see the time
frozen. No updates... it just freezes...

If you type in anything.. the machine picks up again and starts
running as if nothing had happened. The last time I created
this the time had been frozen for at least 12 hours before
I got to it :-D

I dropped directly into KDB and pulled a crash dump...

Looking at all of the SCTP assoc's there was NOTHING
happening.. no data in flight..  nothing...

Now in the past I would type any key, or change to another set
of windows ... and ta-da.. off it runs..

I do see a few TCP timeout alarms in the app (remember
the app talks TCP to set up the SCTP stuff)...

Very weird...

Any ideas or suggestions would be welcome.

I just did an update in prep of doing a patch (currently
passed to gnn for approval).. so my cores are invalid.. but
I can recreate them pretty easily .. it just takes a
day or so :-)

I can also let anyone who is interested onto the machine
when problem one occurs... and let them
peruse the timers or whatever of the running kernel.. or
take a crash dump and let you look at that..

If anyone has heard of anything like this I would appreciate
some pointers.. it could be something SCTP is doing... at
least the timer one..

Thanks for any suggestions..

R


-- 
Randall Stewart
NSSTG - Cisco Systems Inc.
803-345-0369 <or> 803-317-4952 (cell)
Received on Tue Dec 12 2006 - 12:01:39 UTC
