MATLAB Answers

Reported Task State not accurate when running on MS HPC grid

Asked by emarch
on 19 Aug 2019
Latest activity Answered by Edric Ellis
on 20 Aug 2019
Hi,
We are using MATLAB R2018a with the Parallel Computing Toolbox in conjunction with a MATLAB parallel server that uses MS HPC Server 2012 as the scheduler. We've noticed that when we retrieve task states using the following construct, incorrect states are commonly returned:
obj.Job.Tasks.State
For example, when we first start a job it will report pending, then briefly switch to failed before accurately reporting running. Are there any tricks to getting these task states reported properly?
Thanks for any help.


1 Answer

Answer by Edric Ellis
on 20 Aug 2019
 Accepted Answer

Unfortunately, getting accurate state information back from the cluster can be tricky, because there are multiple sources of state information. First, there are the "JobX/TaskY.state.mat" files on disk in your JobStorageLocation. These are created in state pending, the client moves them to queued on submission, and then the worker MATLAB processes set them to running, and finally finished. Second, there is the information that comes back from querying the underlying scheduling system. These two sources can occasionally (and usually transiently) conflict with each other, which leads to spurious states being observed. (Querying the underlying scheduling system is necessary to handle the case where the worker MATLAB crashes before it can set the state file to running or finished.)
If you can, I would recommend using Job.wait as your primary means of waiting for results to become available, perhaps with the timeout parameter. This method ought to be more reliable than querying the task State properties directly, as it performs more detailed (and more expensive) checks.
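As a rough illustration of the wait-based approach, the sketch below submits a trivial job and blocks on wait with a timeout instead of polling Task.State. The default cluster profile, the single @rand task, and the 600-second timeout are assumptions chosen for the example, not details from the question; substitute your own job and values.

```matlab
% Sketch only: the default profile, the @rand task, and the timeout value
% are placeholders, not details taken from the original question.
c = parcluster();                 % assumes the default profile points at the HPC Server cluster
job = createJob(c);
createTask(job, @rand, 1, {3});   % one task returning a 3-by-3 random matrix
submit(job);

timeoutSeconds = 600;             % arbitrary illustrative timeout

% wait returns true if the job reached the requested state, and false if
% the timeout elapsed first.
finishedInTime = wait(job, 'finished', timeoutSeconds);

if finishedInTime
    % Only read task-level information once wait has returned.
    taskStates = {job.Tasks.State};
    taskHasError = ~cellfun(@isempty, {job.Tasks.Error});
    if any(taskHasError)
        warning('%d task(s) reported an error.', nnz(taskHasError));
    else
        results = fetchOutputs(job);
    end
else
    warning('Job did not finish within %d seconds.', timeoutSeconds);
end
```

Reading the task State and Error properties only after wait returns avoids acting on the transient pending/failed/running readings described in the question.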
