Send Deep Learning Batch Job to Cluster
This example shows how to send deep learning training batch jobs to a cluster so that you can continue working or close MATLAB® during training.
Training deep neural networks often takes hours or days. To use time efficiently, you can train neural networks as batch jobs and fetch the results from the cluster when they are ready. You can continue working in MATLAB while computations take place or close MATLAB and obtain the results later using the Job Monitor. You can optionally monitor the jobs during training and, after the job is complete, you can fetch the trained networks and compare their accuracies.
Requirements
Before you can run this example, you need to configure a cluster and upload your data to the cloud. In MATLAB, you can create clusters in the cloud directly from the MATLAB desktop. On the Home tab, in the Parallel menu, select Create and Manage Clusters. In the Cluster Profile Manager, click Create Cloud Cluster. Alternatively, you can use MathWorks Cloud Center to create and access compute clusters. For more information, see Getting Started with Cloud Center. For this example, ensure that your desired cloud cluster is set as the default parallel environment on the MATLAB Home tab, in Parallel > Select Parallel Environment. Then, upload your data to an Amazon S3 bucket so that you can use it directly from MATLAB. This example uses a copy of the CIFAR-10 data set that is already stored in Amazon S3. For instructions, see Work with Deep Learning Data in AWS.
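As a minimal sketch of that last step, once your images are in a bucket you can point an imageDatastore at the S3 location and use it like a local datastore. The bucket path below is a hypothetical placeholder; see Work with Deep Learning Data in AWS for the full workflow.

% Hypothetical bucket path; replace with your own S3 location.
imds = imageDatastore("s3://mybucket/cifar10/train", ...
    IncludeSubfolders=true,LabelSource="foldernames");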
Submit Batch Job
You can send a function or a script as a batch job to the cluster by using the batch (Parallel Computing Toolbox) function. By default, the cluster allocates one worker to execute the contents of the job. If the code in the job benefits from extra workers, for example, because it includes automatic parallel support or a parfor-loop, you can specify more workers by using the Pool name-value argument of the batch function.
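For example, a call of the following form requests a pool of workers for a job whose code contains a parfor-loop. The function name myParallelFcn is a hypothetical placeholder, and c is a cluster object such as the one created with parcluster later in this example.

% Placeholder function name; c is a cluster object created with parcluster.
% Pool=4 requests four workers for the parfor-loop, in addition to the
% worker that runs the job itself.
poolJob = batch(c,"myParallelFcn",1,{},Pool=4);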
When you submit a batch job as a script, by default, workspace variables are copied from the client to the workers. To avoid copying workspace variables to the workers, submit batch jobs as functions.
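For instance, submitting a script copies the variables that it uses from your client workspace to the worker, whereas submitting a function passes data explicitly as input arguments. The script name trainScript and function name trainFcn below are hypothetical placeholders.

% Script form: miniBatchSize is copied from the client workspace to the worker.
miniBatchSize = 128;
scriptJob = batch(c,"trainScript");

% Function form: data is passed explicitly, so no workspace variables are copied.
fcnJob = batch(c,"trainFcn",1,{miniBatchSize});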
The trainConvNet function is provided as a supporting file with this example. To access the function, open the example as a live script. The function trains a single network using a given mini-batch size and returns the trained network and its accuracy. To perform a parameter sweep across mini-batch sizes, send the function as a batch job to the cluster four times, specifying a different mini-batch size for each job. When sending a function as a batch job, specify the number of outputs of the function and the input arguments.
c = parcluster("MyClusterInTheCloud");
miniBatchSize = [64 128 256 512];
numBatchJobs = numel(miniBatchSize);
for idx=1:numBatchJobs
    job(idx) = batch(c,"trainConvNet",2,{idx,miniBatchSize(idx)});
end
Training each network in an individual batch job, instead of training all of the networks in a single batch job with a parallel pool, avoids the overhead required to start a parallel pool on the cluster and allows you to use the Job Monitor to observe the progress of each network computation individually.
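For comparison, the single-job alternative might look like the following sketch. The wrapper function trainAllNets is hypothetical and would be saved as its own supporting file; it sweeps over the mini-batch sizes in a parfor-loop, so the job must request a pool of workers.

% Hypothetical wrapper function: train every network inside one job by
% sweeping over mini-batch sizes in a parfor-loop.
function [nets,accuracies] = trainAllNets(miniBatchSize)
numNets = numel(miniBatchSize);
nets = cell(numNets,1);
accuracies = zeros(numNets,1);
parfor idx = 1:numNets
    [nets{idx},accuracies(idx)] = trainConvNet(idx,miniBatchSize(idx));
end
end

% Submit one job; Pool=4 provides the workers for the parfor-loop.
sweepJob = batch(c,"trainAllNets",2,{miniBatchSize},Pool=4);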
You can submit additional jobs to the cluster. If the cluster is not available because it is running other jobs, any new job you submit remains queued until the cluster becomes available.
Monitor Training Progress
You can see the current status of your job in the cluster by checking the Job Monitor. In the Environment section on the Home tab, select Parallel > Monitor Jobs to open the Job Monitor.
You can optionally monitor the progress of training in detail by sending data from the workers running the batch jobs to the MATLAB client. In the trainConvNet function, the output function sendTrainingProgress is called after each iteration to add the current iteration and training accuracy to a ValueStore (Parallel Computing Toolbox). A ValueStore stores data owned by a specific job, and each data entry consists of a value and a corresponding key.
function stop = sendTrainingProgress(info)
if info.State == "iteration" && ~isempty(info.TrainingAccuracy)
    % Get the ValueStore object of the current job.
    store = getCurrentValueStore;
    % Store the training results in the job ValueStore object with a
    % unique key. Here, idx is the job index that batch passes to
    % trainConvNet as its first input argument.
    key = idx;
    store(key) = struct(iteration=info.Iteration,accuracy=info.TrainingAccuracy);
end
stop = false;
end
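If you want to check what a job has written so far without setting up a callback, you can query its ValueStore directly. A minimal sketch:

% Inspect the entries that the first job has stored so far.
store = job(1).ValueStore;
storedKeys = keys(store);              % keys added by sendTrainingProgress
if ~isempty(storedKeys)
    latest = store(storedKeys{end})    % struct with iteration and accuracy fields
end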
Create a figure for displaying the training accuracy of the networks and, for each job submitted:
- Create a subplot to display the accuracy of the network being trained.
- Get the ValueStore object of the job.
- Specify a callback function to execute each time the job adds an entry to the ValueStore. The callback function updatePlot is provided at the end of this example and plots the current training accuracy of a network.
figure
for i=1:numBatchJobs
    subplot(2,2,i)
    xlabel("Iteration");
    ylabel("Accuracy (%)");
    ylim([0 100])
    lines(i) = animatedline;
    store{i} = job(i).ValueStore;
    store{i}.KeyUpdatedFcn = @(store,key) updatePlot(lines(i), ...
        store(key).iteration,store(key).accuracy);
end
Fetch Results Programmatically
After submitting jobs to the cluster, you can continue working in MATLAB while computations take place. If the rest of your code depends on the completion of a job, block MATLAB by using the wait function. In this case, wait for the first job to finish.
wait(job(1))
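If the code that follows needs every job in the sweep, you can wait for all of them instead, for example:

% Block until every job in the sweep has finished.
for idx = 1:numBatchJobs
    wait(job(idx));
end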
After the job finishes, fetch the results by using the fetchOutputs function. In this case, fetch the trained networks and their accuracies.
for idx=1:numBatchJobs
    results{idx} = fetchOutputs(job(idx));
end
results{:}
ans=1×2 cell array
{1×1 dlnetwork} {[0.6866]}
ans=1×2 cell array
{1×1 dlnetwork} {[0.5964]}
ans=1×2 cell array
{1×1 dlnetwork} {[0.6542]}
ans=1×2 cell array
{1×1 dlnetwork} {[0.6230]}
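Because each entry of results is a 1-by-2 cell array containing the trained network and its accuracy, you can, as a brief sketch, compare the accuracies and keep the best network.

% Extract the accuracy (second cell) from each result and find the best run.
accuracy = cellfun(@(r) r{2},results);
[bestAccuracy,bestIdx] = max(accuracy);
bestNet = results{bestIdx}{1};    % trained dlnetwork with the highest accuracy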
If you close MATLAB, you can still recover the jobs in the cluster to fetch the results either while the computation is taking place or after the computation is complete. Before closing MATLAB, make a note of the job ID and then retrieve the job later by using the findJob function.
To retrieve a job, first create a cluster object for your cluster by using the parcluster function. Then, provide the job ID to findJob. In this case, the job ID is 3.
c = parcluster("MyClusterInTheCloud");
job = findJob(c,ID=3);
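If you did not record the job ID, you can, as a sketch, list the jobs on the cluster and filter them by state instead.

% List all of your jobs on the cluster, then select the finished ones.
c.Jobs
finishedJobs = findJob(c,State="finished");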
Delete a job when you no longer need it. The job is removed from the Job Monitor.
delete(job(1));
To delete all jobs submitted to a particular cluster, pass all of the jobs associated with the cluster to the delete function.
delete(c.Jobs);
Use Job Monitor to Fetch Results
When you submit batch jobs, all the computations happen in the cluster and you can safely close MATLAB. You can check the status of your jobs by using the Job Monitor in another MATLAB session.
When a job is done, you can retrieve the results from the Job Monitor. In the Environment section on the Home tab, select Parallel > Monitor Jobs to open the Job Monitor. Then right-click a job to display the context menu. From this menu, you can:
- Load the job into the workspace by clicking Show Details
- Fetch the trained networks and their accuracies by clicking Fetch Outputs
- Delete the job when you are done by clicking Delete
Supporting Functions
The updatePlot function adds a point to one of the subplots, indicating the current training accuracy of a network. The function receives an animated line object, the current iteration, and the accuracy of a network.
function updatePlot(line,iteration,accuracy)
addpoints(line,iteration,accuracy);
drawnow limitrate nocallbacks
end
See Also
batch (Parallel Computing Toolbox) | ValueStore (Parallel Computing Toolbox)
Related Examples
- Use parfor to Train Multiple Deep Learning Networks
- Work with Deep Learning Data in AWS
- Work with Deep Learning Data in Azure
- Offload Experiments as Batch Jobs to a Cluster
More About
- Batch Processing (Parallel Computing Toolbox)