All:

Ok, the second problem I noted, where the system freezes... I can
reproduce it pretty regularly. The problem is I can't seem to gain
any information from it.

If I hit ANY key it starts back up again... for example, if I leave it
on the screen running top and strike Ctl- to do a Ctl-alt-esc to
drop to ddb... bam, the time updates (jumping an hour in the last
instance)...

So when I look at the core, everything looks normal..

I am seeing at the same instance
"Limiting closed port RST response from 257 to 200 packets/sec"
and some "calcru: runtime went backwards... " messages as well..

Any suggestions on how I can gather any data from this event?

Oh, one other note.. the machine cannot be pinged when it reaches
this state... so a network interrupt will not revive it ..

Very weird..

R

Randall Stewart wrote:
> All:
>
> I have two machines I am testing with... an Intel Xeon
> 2.8 Gig w/hyperthreading... and an Intel P4D dual core.
>
> Now I am testing SCTP and how it interacts with SMP.. or
> that is my intention. I have a snapshot of the MPI code
> that one of my friends at UBC has been working on
> with Argonne Labs... This uses SCTP :-)
>
> He has written a series of tests which all now pass
> (after a LOT of bugs and LOR's.. all kinds of fun :D)
>
> Now a separate test he has written is something called
> mywaitall. Basically you set up a number of processes,
> all of them get up and settled in. Then they coordinate
> (near as I can tell) sharing SCTP port and address info
> with each other via TCP. Then they switch over and
> use the SCTP one-2-many model.. sending data to each other
> setting up implicit connections.
>
> This means that running -np 10.. I have 10 endpoints with
> 90 associations ... I am doing this only on the local
> host side.
>
> I run this in a
>   while true
>   do
>     mpiexec -np 10 ./mywaitall
>     echo "-------"
>   done
>
>
> Now on the xeon machine I see a very curious failure. After
> about a day of running this.
> I get two endpoints stuck:
> one has data to be transmitted, the other is waiting for
> it.. (the way the program works is they all send/rcv some
> info and then say goodbye to each other).
>
> Now I am seeing loss because the app version I have is
> buggy... the author did not handle the sending in non-blocking
> mode correctly. He thought he got a -1/EAGAIN.. instead of
> a 0/0 back.. so he ends up in a tight loop doing
>   while (sent > -1)
>       ret = dothesend()
>       if(ret < 0)
>           break // error
>       sent += ret;
>
>
> Which means we peg the CPU sending with full send windows.. He
> has fixed this.. but I keep testing with the buggy version since
> it finds so many unique problems :-)
>
> But back to my scenario. Now I have, in the past, fixed many
> bugs like this that were an SCTP problem :-) but this one
> I don't think so anymore..
>
> When I find and look at the assoc's in question the sender
> has some outstanding chunks (4 in the last instance) and
> its timer is running as far as it is concerned. Here is
> the actual callout structure:
>
> $6 = {c_links = {sle = {sle_next = 0x0},
>                  tqe = {tqe_next = 0x0,
>                         tqe_prev = 0xc6dd02a8}},
>       c_time = 264796819,
>       c_arg = 0xc27201ec,
>       c_func = 0xc0748458 <sctp_timeout_handler>,
>       c_mtx = 0x0,
>       c_flags = 22}
>
> Now there is another part to the structure (the c_arg) and if
> I look in there I see things like it being started 1 second
> before (which one would expect... I save the ticks of
> when it was started). I also have a stopped_from field
> that gets set any time someone does a stop of the timer
> and when the callout is called it sets it to various
> values. The time structure is opaque to the SCTP code so
> it does not play with these values.. and when you look at
> the ticks, it's long past expiration..
>
> Note that the 22 indicates NO_MTX | PENDING and ACTIVE.
> And yet the linked lists in c_links are NOT set to anything
> like I normally see these dudes..
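
For anyone wanting to double-check the c_flags reading above, here is a minimal sketch of the bit arithmetic. The three flag values are assumptions modeled on FreeBSD's sys/callout.h and should be checked against the header of the kernel actually running. The names FLAG_ACTIVE, FLAG_PENDING and FLAG_0X10, and the helper functions, are hypothetical and exist only for this illustration. The 0x10 bit is labeled NO_MTX only because that is what the message calls it.

```c
#include <stdio.h>

/* Assumed values; verify against sys/callout.h on the kernel in question. */
#define FLAG_ACTIVE   0x0002  /* assumed: callout is marked active */
#define FLAG_PENDING  0x0004  /* assumed: callout is waiting to fire */
#define FLAG_0X10     0x0010  /* assumed: the bit the message calls NO_MTX */

/* Return nonzero if every bit in mask is set in flags. */
int has_flag(int flags, int mask)
{
    return (flags & mask) == mask;
}

/* Return the bits of flags that none of the three assumed flags cover. */
int unknown_bits(int flags)
{
    return flags & ~(FLAG_ACTIVE | FLAG_PENDING | FLAG_0X10);
}

/* Write a readable decode of flags into buf and return buf. */
char *describe_flags(int flags, char *buf, size_t len)
{
    snprintf(buf, len, "0x%02x =%s%s%s", (unsigned)flags,
        has_flag(flags, FLAG_0X10) ? " 0x10(NO_MTX?)" : "",
        has_flag(flags, FLAG_PENDING) ? " PENDING" : "",
        has_flag(flags, FLAG_ACTIVE) ? " ACTIVE" : "");
    return buf;
}
```

Under those assumed values, 22 is 0x16 = 0x10 + 0x04 + 0x02, so unknown_bits(22) is 0 and all three bits named in the message are set with nothing left over.
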
>
> Now I did put an extra global SCTP lock in before starting/stopping
> the timer. This did make it take 2-3 days to hit this case.. but
> it still happens....
>
> Has anyone seen this ?? I have looked at the timer code and I
> do see a mutex spin lock.. but I can't see how SCTP would be
> causing this... I am stumped .. any suggestions would be welcome ;-)
>
> --------------------
>
> My second problem is even more bizarre.. if that's possible...:-D
>
> The other p4d runs along fine for a day or so .. and then it will
> just stop.. and I mean stop.. if you have a top window up (I have
> X off.. to panic it when I want :-D) you see the time frozen. No
> updates... it just freezes...
>
> If you type in anything.. the machine picks up again and starts
> running as if nothing had happened. The last time I created
> this the time had been frozen for at least 12 hours before
> I got to it :-D
>
> I dropped directly into KDB and pulled a crash dump...
>
> Looking at all of the SCTP assoc's there was NOTHING
> happening.. no data in flight.. nothing...
>
> Now in the past type any key, change to another set
> of windows ... and ta-da.. off it runs..
>
> I do see a few TCP timeout alarms in the app (remember
> the app talks TCP to set up the SCTP stuff)...
>
> Very weird...
>
> Any ideas or suggestions would be welcome.
>
> I just did an update in prep of doing a patch (currently
> passed to gnn for approval).. so my cores are invalid.. but
> I can recreate them pretty easily .. it just takes a
> day or so :-)
>
> I can also let anyone that is interested onto the machine
> when the problem one event occurs... and let them
> peruse the timers or whatever of the running kernel.. or
> take a crash dump and let you look at that..
>
> If anyone has heard of anything like this I would appreciate
> some pointers.. it could be something SCTP is doing... at
> least the timer one..
>
> Thanks for any suggestions..
>
> R
>
>

-- 
Randall Stewart
NSSTG - Cisco Systems Inc.
803-345-0369 <or> 803-317-4952 (cell)

Received on Thu Dec 14 2006 - 17:08:37 UTC