Unable to connect to an MDCS cluster

Javier on 28 Apr 2013
Hi,
I have a MATLAB cluster configured to run workers on two machines running Windows 7. When I connect from my client and ask for 16 workers, which can all be taken from a single machine, everything works. However, when I ask for 17 or more workers with matlabpool, the MATLAB scheduler gives the error below (names and IPs were removed on purpose). Any idea how to solve this? Note: the workers in the MJS appear as connected and idle, and the firewalls are off on both Windows 7 machines. The client validates the cluster perfectly.
Thanks,
Javier
>> matlabpool open MYSCHED 17
Starting matlabpool using the 'MYSCHED' profile ... stopped.
Error using matlabpool (line 144)
Failed to open matlabpool. (For information in addition to the causing
error, validate the profile 'MYSCHED' in the Cluster Profile Manager.)
Caused by:
Error using
distcomp.interactiveclient/pGetSockets>iThrowIfBadParallelJobStatus
(line 101)
The interactive communicating job errored with the following
message: Cannot rerun task because there are no rerun attempts left
(task has no rerun attempts left).
Original cancel message:
The task was cancelled by user "SYSTEM" on machine
"machine2" with message: "MPI initialisation
failed:
A failure occurred during connect/accept. The MPI error was: Other
MPI error, error stack:
MPI_Comm_connect(120)............................:
MPI_Comm_connect(port="tag=0 port=28355
description=machine1 ifname=machine1ip ",
MPI_INFO_NULL, root=0, comm=0xc4000005, newcomm=_______)
failed
MPID_Comm_connect(191)...........................:
MPIDI_Comm_connect(379)..........................:
MPIDI_Create_inter_root_communicator_connect(134):
MPIDI_CH3I_Connect_to_root_sock(304).............:
MPIDU_Sock_post_connect(1231)....................: unable to
connect to machine1 on port 28355, exhausted all
endpoints (errno -1)
MPIDU_Sock_post_connect(1247)....................: gethostbyname
failed, The requested name is valid, but no data of the requested
type was found. (errno 11004).".

Answers (1)

Sam Marshalik on 10 May 2013
This is likely caused by misconfigured DNS on one of the hosts. Please log in to the erroring node on the cluster and try to look up the node it is trying to connect to using the nslookup command. If this command fails, you will likely need to set up DNS properly on the erroring node.
- Sam
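
For anyone who wants to script the lookup instead of running nslookup by hand, here is a rough sketch in Python, assuming Python is available on the erroring node (machine2 in the log above, which could not resolve machine1). The function name check_hosts and its resolver parameter are made up for illustration and are not part of MATLAB or MDCS. It simply asks the operating system to resolve each name, which is roughly the step that failed with errno 11004 in the error message.

```python
import socket


def check_hosts(hostnames, resolver=socket.getaddrinfo):
    """Try to resolve each hostname and report the outcome.

    Returns a dict mapping each name to a sorted list of IP addresses,
    or to a string describing the lookup failure.
    `resolver` is a hypothetical hook so the function can be exercised
    without a real DNS server; by default it uses the OS resolver.
    """
    results = {}
    for name in hostnames:
        try:
            infos = resolver(name, None)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the IP address as a string.
            results[name] = sorted({info[4][0] for info in infos})
        except socket.gaierror as exc:
            results[name] = "lookup failed: {}".format(exc)
    return results


# Example usage on the erroring node ("machine1" is the placeholder
# name from the error log, not a real host):
#   for host, outcome in check_hosts(["machine1"]).items():
#       print(host, "->", outcome)
```

If the name comes back as "lookup failed" on one machine but resolves on the other, that would point to the DNS setup on the failing machine, as suggested above.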
