MDCE Admin Center is OK, but parallel computing does not recognize workers

Hi All,
I have a desktop running Windows XP SP3 (32-bit) and a laptop running Windows 7 (64-bit). Both have MATLAB Distributed Computing Server installed.
I start the job manager on the XP machine and a worker on both computers. Using Admin Center I can check that everything is OK. But checking with the Cluster Profile Manager confuses me: its validation passes, but it doesn't recognize all the workers (i.e. both of them).
What is the problem? Thanks in advance.

Answers (2)

Do you have a 64-bit worker and a 32-bit worker? This is not a recommended configuration: the underlying technology that PCT/MDCS uses requires all workers to have matching word sizes and processor endianness, and you'll see exactly the behavior you are seeing.
From the system requirements page:
Homogeneous cluster configurations are recommended. Parallel processing constructs that work on the infrastructure enabled by matlabpool—parfor, spmd, distributed arrays, and message passing functions—cannot be used on a heterogeneous cluster configuration. The underlying MPI infrastructure requires that all cluster computers have matching word sizes and processor endianness. A limited set of functions in Parallel Computing Toolbox can work in heterogeneous cluster configurations.
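One way to confirm a word-size mismatch is to ask each worker which platform it runs on. A minimal sketch, assuming a matlabpool can be opened on whatever subset of workers does connect (profile name is a placeholder):

```matlab
% Ask every lab in the pool for its platform string. On Windows,
% computer returns 'PCWIN' for 32-bit MATLAB and 'PCWIN64' for
% 64-bit MATLAB -- mixed answers confirm a heterogeneous cluster.
matlabpool('open', 'Profile1');
spmd
    fprintf('lab %d of %d is running on %s\n', ...
        labindex, numlabs, computer);
end
matlabpool close
```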
Thanks for replying,
I found (after many tries) that workers see each other in the matlab pool only when they have the same CPU architecture (x86 or x64). This is true when the grid is as follows (entries marked * run a worker):

Physical PC    | Virtual PC on it
---------------|-----------------
x86 XP SP3 *   | x86 XP SP3 *
x86 XP SP3 *   | -
x64 Win7       | x86 XP SP3 *
So I have 4 workers. Now the question is: when I start the cluster with 3 workers and submit a job to it, there is a job there that confuses me. See below, please:
>> myCluster.Jobs

ans =

 Job: 2-by-1
 ============

    #    ID         Type     State   FinishTime   Username   #tasks
   -----------------------------------------------------------------
    1   134         pool   running                  hormoz        3
    2   135  independent    queued                  hormoz       20

>> myCluster.Jobs(1).Tasks

ans =

 MJSTask: 3-by-1
 ================

    #   ID    State   FinishTime        Function   Error
   -------------------------------------------------------
    1    1  running                @distcomp.nop
    2    2  running                @distcomp.nop
    3    3  running                @distcomp.nop
So my job never starts and the program hangs. Please help me.

11 Comments

This looks normal if you have three workers and three tasks: the three tasks get worked on, and the scheduler then moves on to the next 20, picking them up as work on the first job completes. If something in the first job is not working, you'll need to check that job for errors.
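A sketch of how you might inspect a task for errors (property names are from the MJS task interface; adjust the job index to the stuck job):

```matlab
% Look at the first task of the stuck job. An empty ErrorMessage
% means the task has not failed -- it is simply still running
% (or waiting on something).
t = myCluster.Jobs(1).Tasks(1);
t.State          % e.g. 'running', 'finished', 'failed'
t.ErrorMessage   % populated only if the task errored
t.Error          % the full MException object, if any
```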
Also, you can spawn other workers on your host with the "startworker" command, e.g.
startworker -name worker1
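In its fuller form, you can point a new worker at an existing job manager explicitly. A sketch, run from matlabroot\toolbox\distcomp\bin; the worker, job manager, and host names here are hypothetical:

```shell
# Register a second worker with a running MJS job manager.
startworker -name worker2 -jobmanager MyJobManager -jobmanagerhost xp-desktop
```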
There's no need to have a VM to have another worker. It just adds overhead and hurts performance. There might be another reason you have a VM, but it's not strictly necessary for this example.
If possible, I'd recommend you get all the machines on the same OS, most likely Windows 7 64-bit. You'll be able to get the worker counts you expect and benefit from the larger word sizes.
Dear Jason,
I have a problem with the @distcomp.nop function. As soon as I start my workers, some of them are busy. Jobs are submitted just fine and finish normally, but all new tasks run only on the idle workers.
Right now I have a busy worker with the following data:
>> myCluster.Jobs

ans =

 Job: 4-by-1
 ============

    #   ID         Type      State   FinishTime   Username   #tasks
   ------------------------------------------------------------------
    1    9  independent    pending                  hormoz        0
    2   10  independent   finished            -    hormoz       20
    3   11         pool    running                  hormoz        1
    4   12  independent   finished            -    hormoz       20
then:
>> myCluster.Jobs(3).Tasks

ans =

 Task ID 1 from Job 11 Information
 =================================

               State: running
            Function: @distcomp.nop
           StartTime: Mon Oct 22 19:12:03 GMT+03:30 2012
    Running Duration: 0 days 2h 6m 6s

 - Task Result Properties

     ErrorIdentifier:
        ErrorMessage:
and:

>> myCluster.Jobs(3).Tasks.Function

ans =

@distcomp.nop

For timing we have:

>> myCluster.Jobs(3).StartTime

ans =

Mon Oct 22 19:11:35 GMT+03:30 2012

and now the time is 21:23. What is this?
What happens if you use something like rand?
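For example, a minimal independent job built around rand might look like this, sketched against the same myCluster object:

```matlab
% Submit a trivial job: one task returning a 1x1 random number.
% If this hangs the same way @distcomp.nop does, the problem is
% the cluster setup, not the function being run.
j = createJob(myCluster);
createTask(j, @rand, 1, {1});
submit(j);
wait(j);                 % blocks until the job finishes
out = fetchOutputs(j)    % should be a 1x1 cell holding a scalar
delete(j);
```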
This is not mine, Jason. This is a by-default job; it never stops. After a worker starts, it automatically processes this annoying job. :(
You should be able to just kill/cancel the job, then.
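Cancelling and removing the stuck job is a one-liner pair; a sketch, assuming the hung pool job is still at index 3:

```matlab
% Cancel the hung pool job, then delete it from the scheduler
% so the workers do not pick it up again.
j = myCluster.Jobs(3);
cancel(j);
delete(j);
```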
I don't know why that no-op function would get in such a state ... it's not doing much of anything.
If you run Admin Center and run the connectivity tests, does anything fail?
The Admin Center tests pass as usual, and the reports show that the workers executing the @distcomp.nop task are busy.
BTW, nothing fails.
There might also be something interesting in the debug log. You can also cancel the job, then look at the ErrorMessage and OutputArguments. Although I bet they will just say "cancelled by the user", the debug log might shed more light on what's getting hung up.
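A sketch of pulling the worker-side log after cancelling; getDebugLog is available on MJS cluster objects and accepts either a job or a task:

```matlab
% Cancel the hung job, then retrieve the worker-side debug log
% for the whole job and for its first task.
j = myCluster.Jobs(3);
cancel(j);
getDebugLog(myCluster, j)            % whole-job log
getDebugLog(myCluster, j.Tasks(1))   % per-task log
j.Tasks(1).ErrorMessage              % likely "cancelled by user"
```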
There are some interesting points:
  • Without the matlabpool command, jobs and tasks can be submitted and finish normally, so you can fetch the outputs.
  • Invoking cancel or delete on a job or task causes the matlabpool connection to be lost, so I then have to call matlabpool open Profile1 again.
  • Without any matlabpool session, cancel and delete work without any problem and can remove any job/task.
What do you think?
matlabpool and job/task are two different ways of working.
If you open a matlabpool, workers will be consumed while it is open. The number will be equal to the size of the pool. You can close the matlabpool and the workers should be freed.
The job and task interface allows more control, you can specify the number of workers you want, and tasks can be queued. The workers will be freed when they complete their work.
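The two interfaces side by side, as a sketch (profile name is a placeholder):

```matlab
% Interactive pool: holds N workers for parfor/spmd until closed.
matlabpool('open', 'Profile1', 2);
r = zeros(1, 10);
parfor i = 1:10
    r(i) = i^2;
end
matlabpool close          % frees the pool workers

% Job/task interface: workers are taken per task and freed as
% each task completes; nothing is held open between jobs.
c = parcluster('Profile1');
j = createJob(c);
for i = 1:10
    createTask(j, @(x) x^2, 1, {i});
end
submit(j);
wait(j);
out = fetchOutputs(j);    % 10x1 cell of squared values
delete(j);
```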
Dear Jason,
Thanks for all your replies. Another question has come up. The local profile uses all cores of the local CPU: 2 labs.
But when I use matlabpool with profile1 (which knows 2 computers: my computer and a virtual machine with a 2-core processor), I should have 4 labs (two PCs, each with a dual-core processor). Unfortunately, matlabpool only sees 2 labs. Is it possible to use 4 labs, and how?
Yes, you can start more workers on the machines and you will be able to access more labs. You can do this using Admin Center or via the "startworker" command in matlabroot\toolbox\distcomp\bin.
Be careful with starting more, though -- a good starting point for worker count is one per (compute, not virtual/hyperthreaded) core and 2 GB of RAM per worker. If you start exceeding those, it's possible your performance will actually decrease as you could run out of RAM (and use much slower swap -- especially on a VM), processor capacity, network bandwidth, etc.
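A rough sizing check before adding workers, as a sketch; feature('numcores') is undocumented but long-standing, and memory is Windows-only:

```matlab
% Estimate a sane worker count for this machine: one worker per
% physical core, capped by available RAM at ~2 GB per worker.
ncores = feature('numcores')
m = memory;
maxWorkers = min(ncores, floor(m.MemAvailableAllArrays / 2^31))
```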


Asked on 19 Oct 2012
