Re: curious results

From: Randall Stewart <rrs_at_cisco.com> Date: Thu, 14 Dec 2006 12:59:49 -0500

All:

Ok, the second problem I noted.. where the system freezes... I can
reproduce pretty regularly... the problem is I can't seem
to gain any information from it.

If I hit ANY key it starts back up again... for example if I
leave it on the screen running top.. and I strike
Ctl (on the way to a Ctl-alt-esc to drop to ddb)... bam, the
time updates (jumping an hour in the last instance)...

So when I look at the core.. everything looks normal..

At the same instant I am seeing
"Limiting closed port RST response from 257 to 200 packets/sec"
and some
"calcru: runtime went backwards... " messages as well..

Any suggestions on how I can gather any data from this event...

Oh, one other note.. the machine cannot be pinged when it
reaches this state... so a network interrupt will not revive
it ..

Very weird..

R

Randall Stewart wrote:
> All:
> 
> I have two machines I am testing with... an Intel Xeon
> 2.8 Gig w/hyperthreading... and an Intel P4D dual core.
> 
> Now I am testing SCTP and how it interacts with SMP.. or
> that is my intention. I have a snapshot of the MPI code
> that one of my friends at UBC has been working on
> with Argonne Labs... This uses SCTP :-)
> 
> He has written a series of tests which all now pass
> (after a LOT of bugs and LOR's.. all kinds of fun :D)
> 
> Now a separate test he has written is something called
> mywaitall. Basically you set up a number of processes,
> all of them get up and settled in. Then they coordinate
> (near as I can tell) sharing SCTP port and address info
> with each other via TCP. Then they switch over and
> use the SCTP one-2-many model.. sending data to each other
> setting up implicit connections.
> 
> This means that running -np 10.. I have 10 endpoints with
> 90 associations ... I am doing this only on the local
> host side.
> 
> I run this in a
> while true
> do
> mpiexec -np 10 ./mywaitall
> echo "-------"
> done
> 
> 
> Now on the Xeon machine I see a very curious failure. After
> about a day of running this, I get two endpoints stuck:
> one has data to be transmitted, the other is waiting for
> it.. (the way the program works is they all send/rcv some
> info and then say goodbye to each other).
> 
> Now I am seeing loss because the app version I have is
> buggy... the author did not handle the sending in non-blocking
> mode correctly. He thought he would get a -1/EAGAIN.. instead
> he gets a 0/0 back.. so he ends up in a tight loop doing
> while (sent > -1)
>    ret = dothesend()
>    if(ret < 0)
>       break // error
>    sent += ret;
> 
> 
> Which means we peg the CPU sending with full send windows.. He
> has fixed this.. but I keep testing with the buggy version since
> it finds so many unique problems :-)
> 
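[A minimal sketch of a send loop for non-blocking mode that does not
spin the way the loop quoted above does. This is not the author's
actual fix, which is not shown in this thread. The names send_fn,
fake_window, fake_send and send_until_blocked are hypothetical and
exist only for illustration; the one assumption about the real send
call is what the message states, namely that it can return 0 on a
full send window rather than -1/EAGAIN.]

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical callback standing in for the application's real send call. */
typedef ssize_t (*send_fn)(const char *buf, size_t len, void *ctx);

/* Hypothetical stand-in for a socket: accepts at most `room` more bytes. */
struct fake_window {
	size_t room;
};

static ssize_t
fake_send(const char *buf, size_t len, void *ctx)
{
	struct fake_window *w = ctx;
	size_t n = len < w->room ? len : w->room;

	(void)buf;
	w->room -= n;
	return (ssize_t)n;	/* returns 0 once the window is full */
}

/*
 * Push up to `len` bytes through a non-blocking sender.
 * Returns the number of bytes accepted (possibly less than `len` when
 * the send window fills up), or -1 on a hard error.
 */
static ssize_t
send_until_blocked(send_fn do_send, void *ctx, const char *buf, size_t len)
{
	size_t sent = 0;

	while (sent < len) {
		ssize_t ret = do_send(buf + sent, len - sent, ctx);

		if (ret > 0) {
			sent += (size_t)ret;
			continue;
		}
		/* 0, or -1 with EAGAIN/EWOULDBLOCK: window full, stop looping. */
		if (ret == 0 || errno == EAGAIN || errno == EWOULDBLOCK)
			break;
		if (errno == EINTR)
			continue;	/* interrupted, try again */
		return -1;		/* hard error */
	}
	return (ssize_t)sent;
}
```

[With a fake window of 5 bytes and a 10-byte buffer, the loop accepts
5 bytes and then returns instead of spinning. A real caller would
presumably wait for the socket to become writable (for example with
poll or select) before sending the remainder.]
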
> But back to my scenario. Now I have, in the past, fixed many
> bugs like this that were an SCTP problem :-) but this one
> I don't think so anymore..
> 
> When I find and look at the assoc's in question the sender
> has some outstanding chunks (4 in the last instance) and
> its timer is running as far as it is concerned. Here is
> the actual callout structure:
> 
> $6 = {c_links = {sle = {sle_next = 0x0},
>                  tqe = {tqe_next = 0x0,
>                  tqe_prev = 0xc6dd02a8}},
>        c_time = 264796819,
>        c_arg = 0xc27201ec,
>        c_func = 0xc0748458 <sctp_timeout_handler>,
>        c_mtx = 0x0,
>        c_flags = 22}
> 
> Now there is another part to the structure (the c_arg) and if
> I look in there I see things like it being started 1 second
> before (which one would expect... I save the ticks of
> when it was started). I also have a stopped_from field
> that gets set any time someone does a stop of the timer
> and when the callout is called it sets it to various
> values. The time structure is opaque to the SCTP code so
> it does not play with these values.. and when you look at
> the ticks, it's long past expiration..
> 
> Note that the 22 indicates NO_MTX | PENDING and ACTIVE.
> And yet the linked lists in c_links are NOT set to anything
> like what I normally see for these dudes..
> 
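[A small sketch of the c_flags decode mentioned above. The bit values
here are assumptions chosen so that 22 (0x16) breaks down into the
three flags the message names; FLAG_NO_MTX is a hypothetical label for
the 0x10 bit, and the real names and values should be checked against
sys/callout.h in the kernel tree being debugged.]

```c
#include <stdio.h>
#include <string.h>

/* Assumed bit values (not taken from the message); verify in sys/callout.h. */
#define FLAG_ACTIVE	0x02
#define FLAG_PENDING	0x04
#define FLAG_NO_MTX	0x10	/* hypothetical name for the 0x10 bit */

/* Write a "A|B|C" style description of `flags` into `out` and return it. */
static char *
decode_flags(int flags, char *out, size_t outlen)
{
	static const struct {
		int bit;
		const char *name;
	} names[] = {
		{ FLAG_ACTIVE, "ACTIVE" },
		{ FLAG_PENDING, "PENDING" },
		{ FLAG_NO_MTX, "NO_MTX" },
	};
	size_t used = 0;
	size_t i;

	if (outlen == 0)
		return out;
	out[0] = '\0';
	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		int n;

		if ((flags & names[i].bit) == 0)
			continue;
		n = snprintf(out + used, outlen - used, "%s%s",
		    used ? "|" : "", names[i].name);
		if (n < 0 || (size_t)n >= outlen - used)
			break;	/* output buffer too small, stop here */
		used += (size_t)n;
	}
	return out;
}
```

[Under those assumed values, decode_flags(22, buf, sizeof(buf)) yields
"ACTIVE|PENDING|NO_MTX", matching the reading given in the message: the
callout claims to be pending and active even though c_links is not on
any list.]
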
> Now I did put an extra global SCTP lock in before starting/stopping
> the timer. This did make it take 2-3 days to hit this case.. but
> it still happens....
> 
> Has anyone seen this ?? I have looked at the timer code and I
> do see a mutex spin lock.. but I can't see how SCTP would be
> causing this... I am stumped .. any suggestions would be welcome ;-)
> 
> --------------------
> 
> My second problem is even more bizarre.. if that's possible...:-D
> 
> The other p4d runs along fine for a day or so .. and then it will
> just stop.. and I mean stop.. if you have a top window up (I have
> X off.. to panic it when I want :-D) you see the time frozen. No
> updates... it just freezes...
> 
> If you type in anything.. the machine picks up again and starts
> running as if nothing had happened. The last time I created
> this the time had been frozen for at least 12 hours before
> I got to it :-D
> 
> I dropped directly into KDB and pulled a crash dump...
> 
> Looking at all of the SCTP assoc's there was NOTHING
> happening.. no data in flight..  nothing...
> 
> Now, as in the past, type any key or change to another set
> of windows ... and ta-da.. off it runs..
> 
> I do see a few TCP timeout alarms in the app (remember
> the app talks TCP to set up the SCTP stuff)...
> 
> Very weird...
> 
> Any ideas or suggestions would be welcome.
> 
> I just did an update in prep of doing a patch (currently
> passed to gnn for approval).. so my cores are invalid.. but
> I can recreate them pretty easily .. it just takes a
> day or so :-)
> 
> I can also give anyone who is interested access to the
> machine when problem one occurs... and let them
> peruse the timers or whatever of the running kernel.. or
> take a crash dump and let you look at that..
> 
> If anyone has heard of anything like this I would appreciate
> some pointers.. it could be something SCTP is doing... at
> least the timer one..
> 
> Thanks for any suggestions..
> 
> R
> 
> 

-- 
Randall Stewart
NSSTG - Cisco Systems Inc.
803-345-0369 <or> 803-317-4952 (cell)