GPU out of memory issue appears with trainNetwork.

I have a Tesla P100 with 16 GB of RAM. Yesterday I ran trainNetwork() with several different layer architectures and a few different input data sets, and it worked. Then I tried a larger input data set and got an out-of-memory error:
Error using trainNetwork
GPU out of memory. Try reducing 'MiniBatchSize' using the trainingOptions function.
Error in A1_B1_C1a_D2 (line 152)
[net,netinfo] = trainNetwork(trainInput,trainTarget,Layers,options);
Caused by:
Error using gpuArray/hTimesTranspose
Out of memory on device. To view more detail about available memory on the GPU, use 'gpuDevice()'. If the problem persists, reset the GPU by calling 'gpuDevice(1)'.
I tried what the message suggests, but it doesn't help. I have tried many less intensive approaches, rebooted, and even returned to the scripts that used to work fine.
Now nothing works.
Any suggestions to troubleshoot hardware faults or a protective status somewhere?

 Accepted Answer

Matt J
Matt J on 3 May 2023
Edited: Matt J on 3 May 2023
Then I tried a larger input data set, but get the out of memory error:
If you make your data larger and larger, you will eventually run out of memory. Maybe reduce the MiniBatchSize setting.
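As a rough sketch (the solver choice and other settings below are placeholders, not your actual configuration), reducing MiniBatchSize looks like this:

```matlab
% Sketch: lower the mini-batch size until training fits in GPU memory.
options = trainingOptions('sgdm', ...
    'MiniBatchSize',32, ...              % try 32, then 16, 8, ... if it still fails
    'ExecutionEnvironment','gpu');
net = trainNetwork(trainInput,trainTarget,Layers,options);
```

A smaller mini-batch reduces the size of the activations held on the GPU at any one time, at the cost of noisier gradient estimates.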

13 Comments

This is what I did first. Didn't help. Then I reset with gpuDevice(1). Didn't help. Rebooted. Didn't help. Made the data smaller and smaller. Didn't help. Fewer filters and fewer layers. Didn't help. Then I returned to the scripts that ran with no problems yesterday, to see if I could at least run those. I couldn't.
Hi Mads. This is hard to diagnose without more detail. You'll need to provide 1) the output of gpuDevice before and after the error, 2) your MATLAB version, and 3) some example code that reproduces the issue.
The GPU doesn't usually run out of memory unless it is actually out of memory, so whatever you were doing with the larger dataset has probably left your GPU low on memory. If closing all applications and restarting MATLAB doesn't fix it, then it's likely you are inadvertently still running the code that is using up the memory. Perhaps, for example, you are loading all your data and putting it onto the GPU yourself instead of allowing trainNetwork to do that.
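One way to check, as a sketch (assuming your data are ordinary numeric arrays): keep the full arrays on the host and let trainNetwork move mini-batches to the device itself.

```matlab
% Sketch: make sure the full arrays live in host RAM, not on the GPU.
% gather is a no-op on arrays that are already on the host.
trainInput  = gather(trainInput);
trainTarget = gather(trainTarget);
% With 'ExecutionEnvironment','gpu', trainNetwork copies one
% mini-batch at a time to the device during training.
net = trainNetwork(trainInput,trainTarget,Layers,options);
```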
Hi Joss. Thanks. It is hard for me to give those details: gpuDevice reports full memory capacity again after the error, once the GPU returns to the idle state.
I was able to rerun/train an even older data set. The input data to the GPU was 1.6820e+10 bytes, with
TotalMemory: 17071734784 (17.07 GB)
AvailableMemory: 16340589072 (16.34 GB)
So if it is not the available memory that sets the limit, then I don't know what that means. As I said, I have trained networks with memory loads at or even above that limit, so I guess it must be possible (somehow) for MATLAB to split the data up and send only chunks to the GPU rather than the whole train and validate arrays...? But since I only increased the array sizes until I got the out-of-memory error, without changing any other parameters, I feel lost as to what is wrong.
We need to determine whether your problem is just that your network needs too much memory, or that there's something else, so you should re-confirm whether when you go back to your original code you continue to see the error, even after resetting the device.
You could also run analyzeNetwork on your input to see how many parameters your network has and the size of the intermediate activations during training.
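For example:

```matlab
% Opens the Network Analyzer, which reports the number of learnables
% per layer and the activation sizes for the given input size.
analyzeNetwork(Layers)
```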
Well, I managed to get it up and running. I'm not sure why it works.
As mentioned, I managed to run something. I did two trainings with large data sets split in two to keep memory low.
Then I took the trained net and transferred it to another net, in order to train on fewer, but slightly larger, cases on the output side.
This didn't work. While searching for the cause I scaled the new data set down to a ridiculously small size, ~300 MB, and used a mini-batch size of 10.
Array sizes were: ...validate...
s1 = 64 64 18 100
s2 = 100 10256
and ...train...
s3 = 64 64 18 800
s4 = 800 10256
TotSize = 302342400
But the error is the same:
Error using trainNetwork (line 184)
GPU out of memory. Try reducing 'MiniBatchSize' using the trainingOptions function.
Error in A1_B1_C1d_D1 (line 97)
[net,netinfo] = trainNetwork(trainInput,trainTarget,Layers,options);
Caused by:
Error using gpuArray/hTimesTranspose
Out of memory on device. To view more detail about available memory on the GPU, use
'gpuDevice()'. If the problem persists, reset the GPU by calling 'gpuDevice(1)'.
Clearly, the description of the error is wrong. But what is wrong?
Not sure why you think the description of the error is wrong. You are definitely running out of memory.
So your training data and validation data together come to about 75 million elements, roughly 300 MB in single precision (the TotSize printed above is already in bytes). Are you putting it all on the GPU before you start training?
Also why don't you post the output of analyzeNetwork so we can get an idea of how much data would have to be held in memory to train your network.
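For reference, a quick back-of-envelope using the array sizes posted above (these sizes are taken from the earlier comment):

```matlab
% Sketch: rough memory footprint of the train/validate arrays.
s1 = [64 64 18 100]; s2 = [100 10256];   % validate input / target
s3 = [64 64 18 800]; s4 = [800 10256];   % train input / target
elems = prod(s1)+prod(s2)+prod(s3)+prod(s4);
bytesAsSingle = elems*4   % ~302 MB
bytesAsDouble = elems*8   % ~605 MB
```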
Right, here is a screenshot from analyzeNetwork:
This is the script I'm running... or want to run:
temp = load('... some previous net....mat');
% this loads my training and validation data
[trainInput,trainTarget] = LoadInputTargetFiles(Folder_C_input_DL,[1],'train');
[validateInput,validateTarget] = LoadInputTargetFiles(Folder_C_input_DL,[1],'validate');

Nt = 641;
transferLayers = temp.net.Layers(1:6);
Layers = [
    transferLayers
    reluLayer
    fullyConnectedLayer(Nt*2*8)
    reluLayer
    fullyConnectedLayer(Nt*2*8)
    clipLayer(1,'myclip')
    regressionLayer
    ];
Layers(8).WeightLearnRateFactor = 10; % hints from video
Layers(8).WeightL2Factor = 1;
Layers(8).BiasLearnRateFactor = 20;
Layers(8).BiasL2Factor = 1;

options = trainingOptions('sgdm', ...
    'MaxEpochs',1000, ...
    'InitialLearnRate',0.006, ...
    'Momentum',0.95, ...
    'Shuffle','every-epoch', ...
    'ValidationData',{validateInput,validateTarget}, ...
    'ValidationPatience',Inf, ...
    'ValidationFrequency',500, ...
    'L2Regularization',1e-4, ...
    'Plots','training-progress', ...
    'CheckpointPath',Folder_D_run_DL_checkpoints, ...
    'ExecutionEnvironment','gpu', ...
    'MiniBatchSize',10);

gpu = gpuDevice();
reset(gpu);
gpu = gpuDevice();
disp(gpu)

s1 = size(validateInput)
s2 = size(validateTarget)
s3 = size(trainInput)
s4 = size(trainTarget)
TotSize = prod(s1)+prod(s2)+prod(s3)+prod(s4); TotSize = TotSize*4 % *4 because the data is type single

[net,netinfo] = trainNetwork(trainInput,trainTarget,Layers,options);
So the previous net was trained with the same input size (per example) 64x64x18 (single), but with 320e3 examples for training and 40e3 examples for validation.
On the output side it was size 1360x1 (single).
In the new training I tested with the script above with a small data set, loaded only 800 examples for training and 100 examples for validation. Here the input size is the same (per example) 64x64x18 (single), but the output size is 10256x1 (single).
For comparison, I have been able to train a network with that output size and with many more training examples loaded onto the GPU.
So...
This is a screenshot of such a net from one of the checkpoints.
Seems fairly clearcut to me. In your first image, fc2 alone takes up 7.4GB so you're definitely going to struggle, especially for training because you need 8GB for weights, 8GB for their gradients, and probably 8 more for temporaries while you're updating the weights. You need a smaller network. Try adding more convolution layers rather than relying on a massive fully connected layer to do most of the work. Look at the Total Number of Learnables at the top of the Network Analyzer window and multiply it by 4 to get the number of bytes your network will need.
Your other network is much smaller, a 'mere' 1.4GB for the fully connected layers.
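If the network object is available (e.g. loaded from a checkpoint), the same 4-bytes-per-learnable estimate can be scripted. This is a sketch; the factor of 3 for training (weights + gradients + update temporaries) is the rough rule of thumb from the comment above, not an exact figure:

```matlab
% Sketch: count learnable parameters and estimate training memory.
totalLearnables = 0;
for k = 1:numel(net.Layers)
    L = net.Layers(k);
    if isprop(L,'Weights'), totalLearnables = totalLearnables + numel(L.Weights); end
    if isprop(L,'Bias'),    totalLearnables = totalLearnables + numel(L.Bias);    end
end
weightBytes = totalLearnables*4;   % single precision
trainBytes  = 3*weightBytes        % weights + gradients + temporaries (rough)
```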
Oh... right... I hadn't accounted for the total learnables and that enormous FC layer. By inserting a conv layer before it, I managed to run it.
Phew...
Thanks


More Answers (0)

Release: R2023a
Asked: 3 May 2023
Commented: 15 May 2023
