All:

I have two machines I am testing with: an Intel Xeon 2.8 GHz w/hyperthreading, and an Intel P4D dual core.

I am testing SCTP and how it interacts with SMP, or that is my intention. I have a snapshot of the MPI code that one of my friends at UBC has been working on with Argonne Labs. This uses SCTP :-) He has written a series of tests, which all now pass (after a LOT of bugs and LORs.. all kinds of fun :D)

Now a separate test he has written is something called mywaitall. Basically you set up a number of processes, and all of them get up and settled in. Then they coordinate (near as I can tell) by sharing SCTP port and address info with each other via TCP. Then they switch over and use the SCTP one-to-many model, sending data to each other and setting up implicit connections. This means that running with -np 10, I have 10 endpoints with 90 associations. I am doing this only on the local host.

I run this in a loop:

    while true
    do
        mpiexec -np 10 ./mywaitall
        echo "-------"
    done

Now on the Xeon machine I see a very curious failure. After about a day of running this, I get two endpoints stuck: one has data to be transmitted, the other is waiting for it (the way the program works is that they all send/rcv some info and then say goodbye to each other).

Now I am seeing loss because the app version I have is buggy: the author did not handle sending in non-blocking mode correctly. He expected to get a -1/EAGAIN back, but instead gets a 0/0, so he ends up in a tight loop doing

    while (sent > -1) {
        ret = dothesend();
        if (ret < 0)
            break;      // error
        sent += ret;
    }

Which means we peg the CPU sending with full send windows. He has fixed this, but I keep testing with the buggy version since it finds so many unique problems :-)

But back to my scenario. In the past I have fixed many bugs like this that were an SCTP problem :-) but this one I don't think is, anymore.
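For anyone following along, here is a minimal sketch of a non-blocking send loop that avoids the tight spin described above. It is not the MPI author's actual fix, which I am not reproducing here. The names send_all_nonblocking, set_nonblocking and timeout_ms are hypothetical, and a plain send() on a connected socket stands in for whatever SCTP send call the application really uses. The idea is simply to treat both a 0 return and a -1/EAGAIN as "no progress" and sleep in poll() until the socket is writable.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: put a descriptor into non-blocking mode. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Hypothetical helper: send all len bytes on a non-blocking socket.
 * Returns 0 when everything was queued, -1 on error or timeout (errno set).
 * send() is an assumption standing in for the application's real send call.
 */
static int send_all_nonblocking(int fd, const char *buf, size_t len,
                                int timeout_ms)
{
    size_t sent = 0;

    assert(buf != NULL || len == 0);

    while (sent < len) {
        ssize_t ret = send(fd, buf + sent, len - sent, 0);

        if (ret > 0) {
            sent += (size_t)ret;        /* made progress, keep going */
            continue;
        }
        if (ret < 0 && errno == EINTR)
            continue;                   /* interrupted, just retry */

        /* Treat "0 bytes" and -1/EAGAIN alike: nothing was accepted. */
        if (ret == 0 ||
            (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))) {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            int n = poll(&pfd, 1, timeout_ms);  /* sleep, do not spin */

            if (n < 0 && errno == EINTR)
                continue;
            if (n == 0)
                errno = ETIMEDOUT;      /* never became writable */
            if (n <= 0)
                return -1;
            continue;                   /* writable (or error): retry send */
        }
        return -1;                      /* hard error, errno set by send() */
    }
    return 0;
}
```

The design choice is that the loop condition is "bytes still left to send" rather than "no error yet", so a return of 0 can never keep the loop spinning at full CPU.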
When I find and look at the assocs in question, the sender has some outstanding chunks (4 in the last instance) and its timer is running as far as it is concerned. Here is the actual callout structure:

    $6 = {c_links = {sle = {sle_next = 0x0},
                     tqe = {tqe_next = 0x0, tqe_prev = 0xc6dd02a8}},
          c_time = 264796819, c_arg = 0xc27201ec,
          c_func = 0xc0748458 <sctp_timeout_handler>,
          c_mtx = 0x0, c_flags = 22}

Now there is another part to the structure (the c_arg), and if I look in there I see things like the timer having been started 1 second before (which one would expect; I save the ticks of when it was started). I also have a stopped_from field that gets set any time someone does a stop of the timer, and when the callout is called it sets it to various values. The timer structure is opaque to the SCTP code, so it does not play with these values. And when you look at the ticks, it is long past expiration.

Note that the 22 indicates NO_MTX | PENDING and ACTIVE. And yet the linked list in c_links is NOT set to anything like I normally see for these dudes.

Now I did put an extra global SCTP lock in before starting/stopping the timer. This did make it take 2-3 days to hit this case, but it still happens.

Has anyone seen this?? I have looked at the timer code and I do see a mutex spin lock, but I can't see how SCTP would be causing this. I am stumped.. any suggestions would be welcome ;-)

--------------------

My second problem is even more bizarre, if that's possible :-D

The other machine, the P4D, runs along fine for a day or so, and then it will just stop. And I mean stop: if you have a top window up (I have X off, so I can panic it when I want :-D) you see the time frozen. No updates, it just freezes. If you type in anything, the machine picks up again and starts running as if nothing had happened.

The last time I reproduced this, the time had been frozen for at least 12 hours before I got to it :-D I dropped directly into KDB and pulled a crash dump.
Looking at all of the SCTP assocs, there was NOTHING happening: no data in flight, nothing. As in the past, type any key or change to another set of windows and ta-da, off it runs. I do see a few TCP timeout alarms in the app (remember the app talks TCP to set up the SCTP stuff).

Very weird. Any ideas or suggestions would be welcome.

I just did an update in prep for doing a patch (currently passed to gnn for approval), so my cores are invalid, but I can recreate them pretty easily. It just takes a day or so :-)

I can also let anyone who is interested onto the machine when problem one occurs, and let them peruse the timers or whatever else in the running kernel, or I can take a crash dump and let you look at that.

If anyone has heard of anything like this I would appreciate some pointers. It could be something SCTP is doing, at least the timer one.

Thanks for any suggestions.

R

--
Randall Stewart
NSSTG - Cisco Systems Inc.
803-345-0369 <or> 803-317-4952 (cell)

Received on Tue Dec 12 2006 - 12:01:39 UTC