MATLAB cluster - validation - parpool stuck

Hi,
I am trying out MATLAB's parallel computing features for the first time. I have a cluster on my local network with 8 cluster nodes and a maximum of 32 workers.
I created the cluster profile and ran "Validate". If I validate with only 1 cluster node (4 workers), everything is fine, and the validation finishes in a couple of minutes or less. However, if I try with 2 cluster nodes, the validation stops at "Parallel pool test (parpool)" and keeps running indefinitely; I have waited up to 1 hour. If I click "Stop", nothing happens, and I have to kill MATLAB from the Task Manager. I also tried running the 2 cluster nodes with only 1 worker each, but the same thing happens. I get no error and no report either, because the parpool test never finishes. If I try cluster node 2 alone, everything works fine again. The problem arises only when I start both cluster nodes 1 and 2 (and, of course, also when I start all 8 cluster nodes).
I am new to parallel computing in MATLAB and currently have no clue where to start looking. Could someone give me a hint about what the problem could be?

Accepted Answer

A couple of questions:
  • What scheduler is your cluster running? MJS? HPC Server? PBS/Slurm/etc.?
  • It sounds like all the stages pass except the last one (parpool) if and only if the workers span more than one node; with all workers on a single node, the last stage passes too. Do I have that right?
Without knowing more, I'm betting that you don't have passwordless SSH set up between the compute nodes and that mpiexec is hung. Log on to a compute node and ssh to another. Are you prompted for your password? If so, you've reproduced the issue.
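To make that check repeatable, here is a minimal shell sketch. It assumes the compute nodes run Linux with the OpenSSH client, and the node names you pass in (for example node2) are placeholders, not names from your cluster. The `BatchMode=yes` option tells ssh to fail rather than prompt for a password, so a missing passwordless setup shows up as FAILED instead of a waiting prompt.

```shell
# Check whether this node can ssh to a peer without any password prompt.
# Usage: check_passwordless_ssh HOST   (HOST is a placeholder node name)
check_passwordless_ssh() {
    # BatchMode=yes: fail instead of asking for a password or passphrase.
    # ConnectTimeout=5: do not hang on an unreachable host.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true 2>/dev/null; then
        echo "$1: OK (no password prompt)"
    else
        echo "$1: FAILED (password prompt, refused, or unreachable)"
        return 1
    fi
}

# Check every host given on the command line; keep going after a failure.
for peer in "$@"; do
    check_passwordless_ssh "$peer" || true
done
```

Saved under a name of your choice (say `check_ssh.sh`, a hypothetical name), it could be run on one compute node as `sh check_ssh.sh node2 node3`, presumably as the same user that the workers run as.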

4 Comments

Maria on 21 May 2021
Edited: Maria on 21 May 2021
Hi Raymond,
thanks for your support!
  • We're running the MJS scheduler.
  • Yes, that's correct. As soon as we enable more than one node, parpool gets stuck. Our setup has a dedicated head node, which does not run any workers, only the license manager and the scheduler.
Your hint about passwordless SSH between the nodes sounds very promising. You're right that we haven't configured passwordless SSH between the nodes. The only thing we've done is set up certificate-based SSH access from the head node to all compute nodes, so that we can easily administer the nodes from the head node.
Do we need to configure the same certificate-based access between the compute nodes, or some configuration without any authentication at all? In either case, do we need this for the root user on the compute nodes (we start the mjs service and the workers as root there)?
Best,
Maria
Since you're running MJS, it might be a firewall issue between the compute nodes. Contact Technical Support (support@mathworks.com) and they can walk you through it.
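One way to look for a blocked port between two compute nodes is a quick TCP probe. The sketch below assumes `nc` (netcat) is installed on the nodes and supports the `-z` and `-w` options; the host name and port in the example are placeholders. 27350 is, as far as I know, the default MJS base port, but MJS and its workers use additional ports, so check your MJS configuration or ask Technical Support for the ports actually in use.

```shell
# Report whether a TCP port on a peer node accepts connections.
# Usage: check_port HOST PORT   (both values are placeholders)
check_port() {
    # -z: only test for a listener, send no data.
    # -w 3: give up after 3 seconds instead of hanging.
    if nc -z -w 3 "$1" "$2" 2>/dev/null; then
        echo "$1:$2 reachable"
    else
        echo "$1:$2 blocked or closed"
        return 1
    fi
}

# Example, run on one compute node while probing another:
# check_port node2 27350
```

A "blocked or closed" result does not by itself prove a firewall problem, since it also appears when nothing is listening on that port; it is most useful for comparing a node pair that works against one that hangs.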
Thank you, I will do so!
I have disabled all firewalls, but the problem persists. I am now waiting to hear back from Technical Support. In the meantime, do you have any other ideas about what it could be?


Release: R2021a
Asked: 20 May 2021
Commented: 1 Jun 2021
