On Thu, Jun 28, 2007 at 08:14:43AM -0700, Julian Elischer wrote:
> Steve Kargl wrote:
> >On Thu, Jun 28, 2007 at 02:50:40PM +0400, Eygene Ryabinkin wrote:
> >>Steve, good day.
> >>
> >>Wed, Jun 27, 2007 at 06:43:11PM -0700, Steve Kargl wrote:
> >>>Any advice on how to isolate or avoid?
> >>>
> >>>Jun 27 18:31:19 node11 kernel: TCP: [192.168.0.11]:59661 to
> >>>[192.168.0.11]:63266 tcpflags 0x10<ACK>; syncache_expand: Segment failed
> >>>SYNCOOKIE authentication, segment rejected (probably spoofed)
> >>According to Andre Oppermann, these are harmless:
> >>  http://lists.freebsd.org/pipermail/freebsd-net/2007-June/014401.html
> >>
> >>But I am experiencing some problems related to the other messages
> >>like 'tcp_input: Listen socket: Spurious RST, segment rejected'.
> >>Though it seems not to be your case, but my problems are documented
> >>in the aforementioned thread.  Just in case you're curious...
> >>--
> >
> >Andre certainly knows more about TCP/IP than I, but no, these
> >are not harmless.  Every time one of these messages appears
> >on the console, my MPI application hangs and must be restarted.
> >My large numerical simulations randomly die anywhere from
> >15 minutes to 25 hours after launching the job.
>
> is the app on that machine or another machine?
>

It's a message passing interface (MPI) application.  I have 6 nodes
in a cluster.  Each node has 4 CPUs.  Each node gets 4 processes.
There are a total of 24 processes, and communication between nodes
is over a GigE network.
This is top(1) output on node11:

last pid:  2919;  load averages:  4.76,  4.56,  4.55   up 0+12:48:23  09:45:04
34 processes:  5 running, 29 sleeping
CPU states: 23.6% user,  0.0% nice, 66.4% system, 10.0% interrupt,  0.0% idle
Mem: 4587M Active, 588M Inact, 263M Wired, 596K Cache, 214M Buf, 10G Free
Swap: 17G Total, 17G Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
  896 kargl       1 130    0  1567M  1428M CPU3   2 659:31 86.77% AVL_PS_mpi
  897 kargl       1 130    0  1201M  1061M CPU2   3 655:01 86.33% AVL_PS_mpi
  898 kargl       1 130    0  1201M  1061M RUN    1 653:25 86.18% AVL_PS_mpi
  899 kargl       1 139    0  1201M  1061M RUN    2 655:00 85.89% AVL_PS_mpi

When I get the SYNCOOKIE authentication error message, CPU state
shows 0% user and 99.9% system.  All 4 processes show WCPU 99.99%.
This occurs on all the nodes.  AFAICT, the processes are spinning
waiting for info from other processes.  This info never comes.

--
Steve

Received on Thu Jun 28 2007 - 14:50:36 UTC