> Jun 29 09:21:58 node11 kernel: TCP: [192.168.0.12]:54528 to [192.168.0.11]:526

OK - I can see the packets corresponding to this error by doing something like:

  % tcpdump -S -r synfinrstdata -n port 62391 and port 60621
  17:22:01.607876 192.168.0.15.62391 > 192.168.0.11.60621: S 106955928:106955928(0) win 65535 <mss 1460,nop,wscale 8,sackOK,timestamp 419300 0>
  17:22:01.607967 192.168.0.11.60621 > 192.168.0.15.62391: S 2273558377:2273558377(0) ack 106955929 win 65535 <mss 1460,nop,wscale 8,nop,nop,timestamp 4153368860 419300>
  17:22:01.608514 192.168.0.15.62391 > 192.168.0.11.60621: F 106955941:106955941(0) ack 2273558378 win 260 <nop,nop,timestamp 419301 4153368860>
  17:22:01.609638 192.168.0.11.60621 > 192.168.0.15.62391: F 2273558378:2273558378(0) ack 106955942 win 260 <nop,nop,timestamp 4153368862 419301>
  17:22:01.609697 192.168.0.11.60621 > 192.168.0.15.62391: F 2273558378:2273558378(0) ack 106955942 win 260 <nop,nop,timestamp 4153368862 419301>
  17:22:01.610103 192.168.0.11.60621 > 192.168.0.15.62391: R 2273558379:2273558379(0) win 0

The start of this looks like a perfectly normal TCP connection - it opens normally, transfers about 12 bytes in one direction (the FIN from 192.168.0.15 is at sequence number 106955941, 12 past the first data byte at 106955929) and then closes. Strangely, 192.168.0.11 then sends two identical FIN packets, followed by a reset.

Whatever triggered the kernel's error message should have produced a reset in response, but I'm not sure I can see quite enough to tell what happened. We could try to get all of the packets in the connection by doing:

  tcpdump -i whatever_interface -w /tmp/fulldump -s 80

then wait for an error (one that is not local to this machine - I think the local ones are going to lo0). Then note the port numbers and do:

  tcpdump -r /tmp/fulldump port _port1_ and port _port2_

With regard to the truss output:

> poll({4/POLLIN 5/POLLIN 6/POLLIN 7/POLLIN 9/POLLIN 10/POLLIN 11/POLLIN 13/POLL

It looks like MPI is only waiting for file descriptors to become ready for reading. I'd guess one of the file descriptors is in an error state, but MPI isn't checking for that, so it is spinning.

David.

Received on Fri Jun 29 2007 - 19:27:08 UTC
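To illustrate the guess about poll() above: poll() reports POLLERR, POLLHUP and POLLNVAL in the returned events even when only POLLIN was requested, so a loop that tests only for POLLIN can have poll() return immediately on every call and end up spinning. The following is a minimal Python sketch of checking for those conditions. It is not MPI's actual code, and the helper names (describe_events, needs_attention, demo) are made up for illustration.

```python
import select
import socket

# Flags in a fixed order so the output is predictable.
FLAG_NAMES = ("POLLIN", "POLLPRI", "POLLOUT", "POLLERR", "POLLHUP", "POLLNVAL")


def describe_events(revents):
    """Return the names of the poll flags set in revents (hypothetical helper)."""
    return [name for name in FLAG_NAMES if revents & getattr(select, name)]


def needs_attention(revents):
    """True if the descriptor reported an error, hangup or invalid state.

    These three flags are delivered by poll() whether or not they were
    requested, so a caller that ignores them can loop forever.
    """
    return bool(revents & (select.POLLERR | select.POLLHUP | select.POLLNVAL))


def demo():
    """Register one end of a socket pair for POLLIN only, then close the peer."""
    a, b = socket.socketpair()
    poller = select.poll()
    poller.register(a.fileno(), select.POLLIN)  # ask for readability only
    b.close()                                   # the peer goes away
    for fd, revents in poller.poll(1000):       # timeout in milliseconds
        print(fd, describe_events(revents), "needs attention:",
              needs_attention(revents))
    poller.unregister(a.fileno())
    a.close()


if __name__ == "__main__":
    demo()
```

The exact flags reported after the peer closes vary between operating systems, so the demo prints whatever it gets rather than assuming a particular combination.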