Why does my job on the cluster stop producing output?

Hey, I am using the Parallel Computing Toolbox on a Linux cluster (istan nodes and SLURM scheduler). The main routine (the parfor loop section) looks as follows. A 2-D array (MASKE) is used to select the time series that contain values, and the function core_eQM is applied to these time series: ...
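cd(getenv('WORKDIR'))
pc = parcluster('local')
% MATLAB does not expand shell variables inside strings, so read them with getenv
pc.JobStorageLocation = fullfile(getenv('WORKDIR'), getenv('SLURM_JOB_ID'))
% start the matlabpool with maximum available workers
% control how many workers by setting ntasks in your sbatch script
% getenv returns a string, so convert it to a number for the pool size
matlabpool(pc, str2double(getenv('SLURM_CPUS_ON_NODE')))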
...
pardim=size(MASKE,2);
XX1=NaN(7305,pardim);
parfor ii=1:pardim
    if ~isnan(MASKE(1,ii))
        fprintf('%i \t \n', ii); %shows me progress of job (creates files, which are empty)
        if meth == 1
            xx1 = core_eQM(squeeze(VARIABLE_BSE(ii,jj,1:3653)),squeeze(WRF_VARIABLE(ii,jj,1:3653)),squeeze(WRF_VARIABLE(ii,jj,3654:7305)));
        elseif meth == 2
            ...
        end
    end
end
I use the following script to submit the job to the cluster:
#!/bin/bash
#SBATCH --job-name=imat_par_test
#SBATCH --output=matlab_parfor.out
#SBATCH --error=matlab_parfor.err
#SBATCH --partition=ivy
#SBATCH --time=72:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=20
source /etc/profile.d/00-modules.sh
module load app/matlab2014b
cd $WORKDIR
# Create a local work directory
mkdir -p $WORKDIR/$SLURM_JOB_ID
#cd $WORKDIR/$SLURM_JOB_ID
# Kick off matlab in the background and wait for it to finish before cleaning up
matlab -nodesktop < script_apply_BC.m &
wait
# Cleanup local work directory
rm -rf "$WORKDIR/$SLURM_JOB_ID"
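The job storage directory should only be removed once MATLAB has exited, since the pool keeps its files there while it runs. A minimal sketch of one way to enforce that ordering is below. The function name `run_with_scratch` is hypothetical and not part of the original script; it simply runs whatever command it is given in the foreground and cleans up afterwards.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): create a per-job
# directory, run a command in the foreground, then remove the directory.
run_with_scratch() {
    local scratch=$1
    shift
    # Refuse an empty path so that rm -rf can never act on an unset variable.
    [ -n "$scratch" ] || return 1
    mkdir -p "$scratch" || return 1
    "$@"                      # blocks until the command has exited
    local status=$?
    rm -rf -- "$scratch"      # cleanup runs only after the command returned
    return "$status"          # pass the command's exit status back to the caller
}

# Possible use inside the sbatch script (stdin is inherited by the command):
# run_with_scratch "$WORKDIR/$SLURM_JOB_ID" matlab -nodesktop < script_apply_BC.m
```

Returning the command's exit status means a failing MATLAB run still shows up as a failed job step rather than being masked by the cleanup.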
During the first few hours the job runs fine. pardim is 420. After the progress counter reaches approx. 250, the procedure slows down and eventually stalls: the job is still listed as running but no longer produces output files. No problems are reported in the matlab_parfor.err file, so I do not know how to analyse what is going wrong.
Any ideas?
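One way to narrow down where the loop stalls is to check which loop indices have never printed a progress line. The sketch below assumes that the `fprintf` progress lines end up in the SLURM output file (matlab_parfor.out), which may not hold on every setup. The function name `missing_indices` is hypothetical. Note that parfor does not run iterations in order, and columns skipped by the MASKE test never print, so they are listed as missing too.

```shell
#!/bin/bash
# Hypothetical helper: print every index in 1..TOTAL that has no progress
# line in LOG. A progress line is an integer followed only by blanks or tabs,
# which is what fprintf('%i \t \n', ii) writes.
missing_indices() {
    local log=$1 total=$2
    awk -v total="$total" '
        /^[0-9]+[ \t]*$/ { seen[$1 + 0] = 1 }   # remember each reported index
        END {
            for (i = 1; i <= total; i++)
                if (!(i in seen)) print i       # never reported so far
        }
    ' "$log"
}

# Possible use while the job is running:
# missing_indices matlab_parfor.out 420
```

Running this a few times while the job appears stuck shows whether the list of unreported indices is still shrinking, and which columns the workers are likely still processing.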
  5 Comments
Patrick Laux on 9 Jul 2021
Unfortunately not, Simone. I just gave up.
If you find out more, I would be happy if you let me know.
Patrick
Simone Stünzi on 9 Jul 2021
I've increased idleTimeout to Inf and will let you know if that solves my issue.
Best, Simone

Answers (0)
