Problems with multi-gpus

Andres Ramirez on 4 Nov 2017
Edited: Joss Knight on 20 Nov 2017
I am using this function to train a CNN:
function [trainedNet,trainingSet,testSet] = OurNetCBIR
outputFolder = fullfile('Database');
rootFolder=fullfile(outputFolder, 'Oliva');
imds = imageDatastore(fullfile(rootFolder),'IncludeSubfolders', true, 'LabelSource', 'foldernames');
imds.ReadFcn = @(filename)readAndPreprocessImage(filename);
function Iout = readAndPreprocessImage(filename)
I = imread(filename);
if ismatrix(I)
I = cat(3,I,I,I);
end
Iout = imresize(I, [227 227]);
end
[trainingSet,testSet] = splitEachLabel(imds, 0.7, 'randomize');
layers = [
imageInputLayer([227 227 3],'DataAugmentation','none') % (1)
convolution2dLayer(7,50,'Stride', 2, 'Padding', 0,'Name','Conv1') % (2)111x111x50
reluLayer('Name','ReLu1') % (3)111x111x50
maxPooling2dLayer(3,'Stride', 2,'Padding', 0,'Name','maxPooling1') % (4)55x55x50
crossChannelNormalizationLayer(5,'Alpha', 0.00002,'Beta', 0.75,'K',1,'Name','Norm1') % (5)55x55x50
convolution2dLayer(5,100,'Stride', 1, 'Padding', 2,'Name','Conv2') % (6)55x55x100
reluLayer('Name','ReLu2') % (7)55x55x100
maxPooling2dLayer(3,'Stride', 2,'Padding', 0,'Name','maxPooling2') % (8)27x27x100
crossChannelNormalizationLayer(5,'Alpha', 0.00002,'Beta', 0.75,'K',1,'Name','Norm2') % (9)27x27x100
convolution2dLayer(3,256,'Stride', 1,'Padding', 1,'Name','Conv3') % (10)27x27x256
reluLayer('Name','ReLu3') % (11)27x27x256
maxPooling2dLayer(3,'Stride', 2,'Padding', 0,'Name','maxPooling3') % (12)13x13x256
crossChannelNormalizationLayer(5,'Alpha', 0.00002,'Beta', 0.75,'K',1,'Name','Norm3') % (13)13x13x256
convolution2dLayer(3,400,'Stride', 1,'Padding', 1,'Name','Conv4') % (14)13x13x400
reluLayer('name','ReLu4') % (15)13x13x400
convolution2dLayer(3,400,'Stride', 1,'Padding', 1,'Name','Conv5') % (16)13x13x400
reluLayer('Name','ReLu5') % (17)13x13x400
convolution2dLayer(3,256,'Stride', 1,'Padding', 1,'Name','Conv6') % (18)13x13x256
reluLayer('Name','ReLu6') % (19)13x13x256
maxPooling2dLayer(3,'Stride', 2,'Padding', 0,'Name','maxPooling4') % (20)6x6x256
fullyConnectedLayer(4800,'Name','fc1') % (21)1x1x4800
reluLayer('Name','ReLu7') % (22)1x1x4800
dropoutLayer(0.5,'Name','dropout1') % (23)1x1x4800
fullyConnectedLayer(2400,'Name','fc2') % (24)1x1x2400
reluLayer('Name','ReLu8') % (25)1x1x2400
dropoutLayer(0.5,'Name','dropout2') % (26)1x1x2400
fullyConnectedLayer(8,'Name','fc3') % (27)
softmaxLayer()
classificationLayer()];
options = trainingOptions('sgdm',...
'InitialLearnRate',0.001,...
'LearnRateSchedule','piecewise',...
'LearnRateDropFactor',0.1,...
'LearnRateDropPeriod',30,...
'MaxEpochs',10,...
'Momentum',0.9,...
'L2Regularization',0.0005,...
'MiniBatchSize',25,...
'ExecutionEnvironment','gpu');
trainedNet = trainNetwork(trainingSet,layers,options);
end
I have no problem training with a single GPU, but when I try to train with multiple GPUs, MATLAB generates the following error:
Starting parallel pool (parpool) using the 'local' profile ...
connected to 4 workers.
Error using trainNetwork (line 140)
An invalid indexing request was made.
Error in OurNetCBIR (line 110)
trainedNet = trainNetwork(trainingSet,layers,options);
Caused by:
Error using Composite/subsasgn (line 103)
An invalid indexing request was made.
Struct contents reference from a non-struct array object.
The client lost connection to worker 1. This might be due to network problems, or the interactive communicating job might have
errored.
Can someone help me please?

Answers (3)

Joss Knight on 5 Nov 2017
I can reproduce your issue. It seems the issue is your use of an anonymous function to call a nested function for your datastore ReadFcn. Something about that is causing a crash when the datastore is deserialised on your pool worker (i.e. copied to it). This is a bug which we will investigate - thanks very much for bringing it to our attention.
Still, your issue is easily fixed. Reference your nested function directly rather than via an anonymous function:
imds.ReadFcn = @readAndPreprocessImage;
However, in R2017b you should be using augmentedImageSource to resize your images, since use of a ReadFcn cripples performance. This doesn't give you a way to convert grayscale images to RGB, but the best solution is to do that offline and save new files.
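The offline grayscale-to-RGB conversion suggested above could look something like the following sketch. It assumes the folder layout from the question (one subfolder per class) and JPEG files; the function name convertGrayToRGB and the destination folder are hypothetical, not part of any toolbox.

```matlab
function convertGrayToRGB(srcFolder, dstFolder)
% Copy every image from srcFolder/<class>/ to dstFolder/<class>/,
% replicating single-channel images into three channels on the way.
% Assumes one subfolder per class and .jpg files (adjust as needed).
classes = dir(srcFolder);
classes = classes([classes.isdir] & ~ismember({classes.name}, {'.', '..'}));
for c = 1:numel(classes)
    inDir  = fullfile(srcFolder, classes(c).name);
    outDir = fullfile(dstFolder, classes(c).name);
    if ~exist(outDir, 'dir')
        mkdir(outDir);
    end
    files = dir(fullfile(inDir, '*.jpg'));
    for k = 1:numel(files)
        I = imread(fullfile(inDir, files(k).name));
        if ismatrix(I)
            I = cat(3, I, I, I);   % grayscale -> three identical channels
        end
        imwrite(I, fullfile(outDir, files(k).name));
    end
end
end
```

It would be run once, for example as convertGrayToRGB(fullfile('Database','Oliva'), fullfile('Database','OlivaRGB')), after which the imageDatastore can point at the new folder and no ReadFcn is needed. Note that re-saving as JPEG re-encodes the images; a lossless format such as PNG avoids that if it matters.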

Andres Ramirez on 19 Nov 2017
Edited: Andres Ramirez on 19 Nov 2017
Hello Joss:
I modified my program to resize the images with the augmentedImageSource function (lowercase, not AugmentedImageSource) instead of a ReadFcn. The new script is:
outputFolder = fullfile('Database');
rootFolder=fullfile(outputFolder, 'Oliva');
imds = imageDatastore(fullfile(rootFolder),'IncludeSubfolders', true, 'LabelSource', 'foldernames');
[trainingSet,testSet] = splitEachLabel(imds, 0.7, 'randomize');
imageSize = [227 227 3];
datasourcetraining = augmentedImageSource(imageSize,trainingSet,'BackgroundExecution',false);
%% Define the CNN layers
layers = [
imageInputLayer([227 227 3]) % (1)
convolution2dLayer(7,50,'Stride', 2, 'Padding', 0,'Name','Conv1') % (2)111x111x50
reluLayer('Name','ReLu1') % (3)111x111x50
maxPooling2dLayer(3,'Stride', 2,'Padding', 0,'Name','maxPooling1') % (4)55x55x50
crossChannelNormalizationLayer(5,'Alpha', 0.00002,'Beta', 0.75,'K',1,'Name','Norm1') % (5)55x55x50
convolution2dLayer(5,100,'Stride', 1, 'Padding', 2,'Name','Conv2') % (6)55x55x100
reluLayer('Name','ReLu2') % (7)55x55x100
maxPooling2dLayer(3,'Stride', 2,'Padding', 0,'Name','maxPooling2') % (8)27x27x100
crossChannelNormalizationLayer(5,'Alpha', 0.00002,'Beta', 0.75,'K',1,'Name','Norm2') % (9)27x27x100
convolution2dLayer(3,256,'Stride', 1,'Padding', 1,'Name','Conv3') % (10)27x27x256
reluLayer('Name','ReLu3') % (11)27x27x256
maxPooling2dLayer(3,'Stride', 2,'Padding', 0,'Name','maxPooling3') % (12)13x13x256
crossChannelNormalizationLayer(5,'Alpha', 0.00002,'Beta', 0.75,'K',1,'Name','Norm3') % (13)13x13x256
convolution2dLayer(3,400,'Stride', 1,'Padding', 1,'Name','Conv4') % (14)13x13x400
reluLayer('name','ReLu4') % (15)13x13x400
convolution2dLayer(3,400,'Stride', 1,'Padding', 1,'Name','Conv5') % (16)13x13x400
reluLayer('Name','ReLu5') % (17)13x13x400
convolution2dLayer(3,256,'Stride', 1,'Padding', 1,'Name','Conv6') % (18)13x13x256
reluLayer('Name','ReLu6') % (19)13x13x256
maxPooling2dLayer(3,'Stride', 2,'Padding', 0,'Name','maxPooling4') % (20)6x6x256
fullyConnectedLayer(4800,'Name','fc1') % (21)1x1x4800
reluLayer('Name','ReLu7') % (22)1x1x4800
dropoutLayer(0.5,'Name','dropout1') % (23)1x1x4800
fullyConnectedLayer(2400,'Name','fc2') % (24)1x1x2400
reluLayer('Name','ReLu8') % (25)1x1x2400
dropoutLayer(0.5,'Name','dropout2') % (26)1x1x2400
fullyConnectedLayer(8,'Name','fc3') % (27)
softmaxLayer
classificationLayer];
options = trainingOptions('sgdm',...
'InitialLearnRate',0.001,...
'LearnRateSchedule','piecewise',...
'LearnRateDropFactor',0.1,...
'LearnRateDropPeriod',30,...
'MaxEpochs',100,...
'Momentum',0.9,...
'L2Regularization',0.0005,...
'MiniBatchSize',128,...
'Verbose',true,...
'ExecutionEnvironment','multi-gpu');
%% Training
trainedNet = trainNetwork(datasourcetraining,layers,options);
When 'BackgroundExecution' is true and 'ExecutionEnvironment' is 'auto', the network trains without problems using the machine's 8 CPU cores. When 'BackgroundExecution' is false and 'ExecutionEnvironment' is 'gpu', the network trains without problems on a single GPU. But when I change 'ExecutionEnvironment' to 'multi-gpu', the network begins training on the 4 GPUs and, after a number of iterations, the training is interrupted and MATLAB throws the following messages:
Starting parallel pool (parpool) using the 'local' profile ...
connected to 4 workers.
Initializing image normalization.
|=========================================================================================|
| Epoch | Iteration | Time Elapsed | Mini-batch | Mini-batch | Base Learning|
| | | (seconds) | Loss | Accuracy | Rate |
|=========================================================================================|
| 1 | 1 | 4.98 | 2.0800 | 14.06% | 0.0010 |
| 4 | 50 | 196.97 | 2.0736 | 14.84% | 0.0010 |
| 8 | 100 | 393.91 | 2.0451 | 11.72% | 0.0010 |
| 11 | 150 | 591.06 | 1.8704 | 26.56% | 0.0010 |
| 15 | 200 | 788.22 | 1.5337 | 47.66% | 0.0010 |
| 18 | 250 | 984.79 | 1.4195 | 48.44% | 0.0010 |
Lab 2:
Warning: An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In nnet.internal.cnn.ParallelTrainer/trainLocal (line 165)
In spmdlang.remoteBlockExecution (line 50)
Error using trainNetwork (line 140)
An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
Error in afr_prueba (line 75)
trainedNet = trainNetwork(datasourcetraining,layers,options);
Caused by:
Error using nnet.internal.cnn.ParallelTrainer/train (line 69)
Error detected on worker 2.
Error using parallel.internal.mpi.gopReduce (line 44)
An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
  1 Comment
Joss Knight on 19 Nov 2017
Edited: Joss Knight on 19 Nov 2017
Well, strictly speaking this is a different question, but okay. Timeouts are a consequence of using graphics cards in WDDM mode on Windows. A quick search would give you your answer.
You can turn off timeouts, or reduce the amount of work your GPUs are doing so they don't occur.
I don't know why you're getting timeouts in multi-gpu mode but not on a single GPU. Are your other GPUs much lower powered than your main one, or are they all the same?
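One way to compare the cards is to list what MATLAB reports for each of them. The loop below is a minimal sketch that assumes Parallel Computing Toolbox is installed; it uses parallel.gpu.GPUDevice.getDevice so that the currently selected device is not changed. A KernelExecutionTimeout of 1 suggests the operating system may time out long-running kernels on that device.

```matlab
% List every CUDA device MATLAB can see, without changing the selected one.
for idx = 1:gpuDeviceCount
    d = parallel.gpu.GPUDevice.getDevice(idx);
    fprintf('%d: %s, %.1f GB, KernelExecutionTimeout = %d\n', ...
        idx, d.Name, d.TotalMemory / 2^30, d.KernelExecutionTimeout);
end
```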

Andres Ramirez on 20 Nov 2017
Edited: Andres Ramirez on 20 Nov 2017
Hello, thanks for answering.
The 4 GPUs that I have are identical. Could the problem be related to the fact that one of the GPUs drives the machine's display?
CUDADevice with properties:
Name: 'GeForce GTX 1080'
Index: 1
ComputeCapability: '6.1'
SupportsDouble: 1
DriverVersion: 9
ToolkitVersion: 8
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 8.5899e+09
AvailableMemory: 7.0066e+09
MultiprocessorCount: 20
ClockRateKHz: 1733500
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
I looked into adjusting TdrLevel and reduced the WDDM TDR delay to 0. The results with multi-gpu are the following:
Starting parallel pool (parpool) using the 'local' profile ...
connected to 4 workers.
Initializing image normalization.
|=========================================================================================|
| Epoch | Iteration | Time Elapsed | Mini-batch | Mini-batch | Base Learning|
| | | (seconds) | Loss | Accuracy | Rate |
|=========================================================================================|
| 1 | 1 | 5.17 | 2.0800 | 14.06% | 0.0010 |
| 4 | 50 | 205.09 | 2.0736 | 14.84% | 0.0010 |
| 8 | 100 | 408.72 | 2.0451 | 11.72% | 0.0010 |
Lab 2:
Warning: An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In nnet.internal.cnn.ParallelTrainer/trainLocal (line 165)
In spmdlang.remoteBlockExecution (line 50)
Error using trainNetwork (line 140)
An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
Error in afr_prueba (line 75)
trainedNet = trainNetwork(datasourcetraining,layers,options);
Caused by:
Error using nnet.internal.cnn.ParallelTrainer/train (line 69)
Error detected on worker 2.
Error using parallel.internal.mpi.gopReduce (line 44)
An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
I have now disabled the WDDM TDR, and the network runs with multiple GPUs. I performed two tests: one with a single GPU, which gave an average time of 0.4119 seconds per iteration, and another with the four GPUs, which gave an average time of 0.9690 seconds per iteration.
With the 4 GPUs, the average time per iteration is more than twice that obtained with a single GPU. I do not understand why. Using more GPUs is supposed to reduce the time per iteration considerably.
  1 Comment
Joss Knight on 20 Nov 2017
Edited: Joss Knight on 20 Nov 2017
Unfortunately on Windows the delay for communication between GPUs is significant. You can only manage this by increasing the MiniBatchSize as much as possible, trying to get it to the maximum achievable with your available memory - this improves the compute/communication ratio. It depends on the hardware but it's not always possible on Windows to get multi-gpu to go faster than single GPU. The general advice is to keep the MiniBatchSize per GPU the same. You can also scale up the learning rate commensurately because a large batch size lets you train faster (although sometimes you need to 'boot' your network with a smaller learn rate at first). Also, if running Linux is an option for you that will ameliorate this issue.
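As a rough sketch of that advice, the options from the question could be scaled as follows. Keeping the per-GPU batch at 128 and multiplying the learning rate by the number of GPUs are both assumptions made for illustration; linear scaling is a common rule of thumb rather than a guarantee, and the batch size still has to fit in GPU memory.

```matlab
numGPUs         = 4;       % GPUs in the pool (from the question)
perGPUBatchSize = 128;     % batch size used on a single GPU
baseLearnRate   = 0.001;   % single-GPU learning rate from the question

miniBatchSize = perGPUBatchSize * numGPUs;   % total batch is split across the GPUs
learnRate     = baseLearnRate * numGPUs;     % assumption: linear scaling

options = trainingOptions('sgdm', ...
    'InitialLearnRate', learnRate, ...
    'MiniBatchSize', miniBatchSize, ...
    'Momentum', 0.9, ...
    'L2Regularization', 0.0005, ...
    'MaxEpochs', 100, ...
    'ExecutionEnvironment', 'multi-gpu');
```

If training diverges at the larger rate, starting with the original 0.001 for the first few epochs and raising it afterwards is one way to 'boot' the network.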
The behaviour of TDR is often confusing, with timeouts not necessarily being related (it seems) to the execution time of a single kernel. I don't know why the timeouts still seem to be occurring even after you've disabled them - I've only seen this before when the user has not rebooted after changing the registry keys. Did you reboot?
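To confirm what is actually stored in the registry, something like the sketch below may help. It is Windows only, and it assumes the value names TdrLevel and TdrDelay under the GraphicsDrivers key as described in Microsoft's TDR documentation. It only reads the values, and any change still needs a reboot to take effect.

```matlab
% Windows only: read the TDR settings used by the display driver.
key   = 'System\CurrentControlSet\Control\GraphicsDrivers';
names = {'TdrLevel', 'TdrDelay'};
for k = 1:numel(names)
    try
        v = winqueryreg('HKEY_LOCAL_MACHINE', key, names{k});
        fprintf('%s = %d\n', names{k}, v);
    catch
        % winqueryreg errors when the value does not exist
        fprintf('%s is not set (the Windows default applies)\n', names{k});
    end
end
```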
The fact that one of your cards is running graphics will definitely be interfering. You could try removing it from the pool. One way to do that is to set CUDA_VISIBLE_DEVICES on MATLAB startup to ensure only the non-display cards are used:
setenv('CUDA_VISIBLE_DEVICES','0,2,3')
...or whatever the indexes of those cards are (noting that the indices for this environment variable are 1 less than the indices shown by gpuDevice).
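The off-by-one conversion can be wrapped in a small helper. This is only a sketch: cudaVisibleDevices is a hypothetical name, and which gpuDevice indices belong to the non-display cards has to be checked on the machine itself.

```matlab
function value = cudaVisibleDevices(matlabIndices)
% Convert 1-based gpuDevice indices to the 0-based, comma-separated
% list expected by the CUDA_VISIBLE_DEVICES environment variable.
value = sprintf('%d,', matlabIndices - 1);
value = value(1:end-1);   % drop the trailing comma
end
```

For example, if gpuDevice shows the display card as index 2, calling setenv('CUDA_VISIBLE_DEVICES', cudaVisibleDevices([1 3 4])) in startup.m, before any GPU function runs, would hide it.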
