MATLAB Answers


Initializing GPU on multiple workers causes an unknown error

I've noticed that the following simple code results in a weird error if I use R2016b on a machine with two GTX1080Ti and one K2200:
% start a _new_ Matlab instance first!
parpool(16);
fetchOutputs( parfevalOnAll(@() gather(gpuArray(1)),1) )
The error message I get:
Error using parallel.FevalOnAllFuture/fetchOutputs (line 69)
One or more futures resulted in an error.
Caused by:
Error using parallel.internal.pool.deserialize>@()gather(gpuArray(1))
An unexpected error occurred during CUDA execution. The CUDA error was:
unknown error
<-- repeated multiple times -->
After that, all GPU functionality gets completely broken:
>> a=gpuArray(1)
Error using gpuArray
An unexpected error occurred during CUDA execution. The CUDA error was:
unknown error
Even restarting Matlab won't help. The fix is to clear the CUDA JIT cache folder, "%USERPROFILE%\AppData\Roaming\NVIDIA\ComputeCache".
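For convenience, that cache can also be cleared from Matlab itself - a minimal sketch, assuming the default cache location (i.e. CUDA_CACHE_PATH is not set) and that no process is currently using the GPU:

```matlab
% Clear the NVIDIA CUDA JIT cache (default location on Windows).
cacheDir = fullfile(getenv('APPDATA'), 'NVIDIA', 'ComputeCache');
if exist(cacheDir, 'dir')
    rmdir(cacheDir, 's');   % remove the folder and all cached binaries
end
```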
However, the following "longer pre-initialization" works OK for me:
% start a _new_ Matlab instance first and clear CUDA JIT cache if there was an error.
gpuDevice(1)
gather(gpuArray(1))
parpool();
fetchOutputs( parfevalOnAll(@() gpuDevice(1),1) )
fetchOutputs(parfevalOnAll(@() gather(gpuArray(1)),1))
AFAIU:
  1. Matlab R2016b, which I use here, was designed for CUDA 7.5, so it ships no binaries for CUDA Compute Capability 6.1.
  2. That's why Matlab uses the CUDA JIT to recompile a ton (~400 MB) of stuff when the user calls any GPU-related function for the first time. (Which also causes many "gpuDevice() is slow" questions.)
  3. There's something wrong with that JIT when combined with parpool (a race condition?).
My system is: Windows 10, CUDA 8.0 (cuda_8.0.61_win10) with patch 2 (cuda_8.0.61.2_windows), nvidia driver r384.94. The CUDA_CACHE_MAXSIZE environment variable is set to 2147483647.
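For context, the driver/toolkit combination that Matlab actually sees can be checked from Matlab itself - a quick sketch using the standard gpuDevice properties (note that selecting each device this way also initializes it, which is exactly what triggers the JIT compile discussed above):

```matlab
% Print the name, compute capability, driver and toolkit versions
% for every CUDA device visible to Matlab.
for idx = 1:gpuDeviceCount
    d = gpuDevice(idx);   % note: this selects (and initializes) the device
    fprintf('GPU %d: %s, CC %s, driver %.1f, toolkit %.1f\n', ...
        idx, d.Name, d.ComputeCapability, d.DriverVersion, d.ToolkitVersion);
end
```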
My questions:
  1. Is my "longer pre-initialization" workaround actually "safe"? Is it a real workaround for that "race condition"? Or is it as good as the original (it might be stable on my specific system, but is likely to fail on some other)? Assume I have to stay with R2016b for now, targeting CUDA 8.0 and a Pascal GPU (building a dll).
  2. The same code works OK in R2017b-R2018a and above. Is that just because they don't use the CUDA JIT here? Or is the real underlying issue actually fixed? (I don't have a device with compute capability >6.x at hand, so I'm unable to check that.) R2017a behaves like R2016b here, even though it claims CUDA 8.0 support - it still writes something (but just ~40 MB) to the CUDA JIT cache, fails in test #1, and works in test #2.

  10 Comments

I don't really have any more ideas I'm afraid
Personally, my solution was to rewrite the GPU part in plain CUDA & C++ using mex, but without using mxInitGPU or gpuArray at all. This is somewhat hacky (since I have to remember not to use gpuArray), but it works. And it's faster.
-----------------------------------------------------------------------------------------------
Try downloading a newer driver, then try downloading an OLDER driver.
Yep, maybe I'll try different drivers later. I just don't have enough time for this now.
But I've just checked that the code from my original message fails on a completely different system (with two GTX1080 + K2200 as well, though), with Windows 7 x64 and the NVidia 388.19 driver. So, at least, it's not "just on my machine".
-----------------------------------------------------------------------------------------------
Nonetheless, I'll requisition one and check your code in 16b.
It would be very nice if you were able to reproduce the issue - that confirmation might help a lot if I file a bug report with NVidia.
-----------------------------------------------------------------------------------------------
since this works fine in later versions of MATLAB
Personally, I'm not sure the CUDA JIT works in newer Matlab versions... Isn't it working just because the JIT is not actually needed? If I had a Volta GPU, I could try to reproduce the same thing in R2017b as well...
-----------------------------------------------------------------------------------------------
I suspect it is but really, CUDA 7.5 is pretty old now and NVIDIA don't worry themselves too much about supporting the JIT pipeline for older cards, so there could be an issue in your driver that won't be fixed, or will never work because the PTX itself is faulty.
Well, it's definitely possible that this is only related to a particular "pipeline", tied to a particular PTX version. Even so, preventing different processes/threads from writing to the same file simultaneously (or whatever the cause is) looks like a fairly version-independent part of the code. But from what I currently know, it's possible that the bug is still there as well.
-----------------------------------------------------------------------------------------------
running some of the CUDA toolkit samples
That would be the best possible approach, if there were an easy way to produce 400 MB of CUDA binaries at once... With just a few KB, it might be difficult to reproduce the whole thing.
-----------------------------------------------------------------------------------------------
setting the environment variable CUDA_FORCE_PTX_JIT to 1
Sounds interesting... But this does not give me the same behaviour - the ComputeCache is still almost empty after running those commands, just a few KB. It looks like files are being added and instantly erased. Hmm... Could you please advise - am I doing something wrong here? Were you able to make it populate the ComputeCache?
It would be good to establish why upgrading MATLAB is not an option for you.
That's because in this particular case the request was to improve performance without any major changes, like adding new external dependencies (e.g. a newer MCR). For the next version, we'll definitely migrate to a newer Matlab.
-----------------------------------------------------------------------------------------------
This works for me but ... possibly only when your card's architecture is the maximum supported or higher, because if it were lower there would be no compatible PTX in the libraries. So you'll need to run R2017a or R2017b for your Pascal card.
Yep, I've used R2017b (because R2017a has the same issue as R2016b). But this trick does not work for me. I've just tried it once again - the ComputeCache size oscillates between 0 and a few MB while gpuDevice(1) is running (and it takes a few minutes, so it's definitely compiling something). In the end, the ComputeCache size is below 1 MB. That's strange. Just in case: I set the environment variable in cmd before starting Matlab, e.g.
Microsoft Windows [Version 10.0.15063]
(c) 2017 Microsoft Corporation. All rights reserved.
C:\>cd "C:\Program Files\MATLAB\R2017b\bin\"
C:\Program Files\MATLAB\R2017b\bin>echo %CUDA_CACHE_MAXSIZE%
2147483647
C:\Program Files\MATLAB\R2017b\bin>echo %CUDA_CACHE_DISABLE%
0
C:\Program Files\MATLAB\R2017b\bin>set CUDA_FORCE_PTX_JIT=1
C:\Program Files\MATLAB\R2017b\bin>echo %CUDA_FORCE_PTX_JIT%
1
C:\Program Files\MATLAB\R2017b\bin>matlab.exe
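For reference, the oscillating cache size can be tracked with something like this sketch (assuming the default cache location; the recursive "**" wildcard of dir requires R2016b or newer):

```matlab
% Report the total on-disk size of the CUDA JIT cache, in MB.
cacheDir = fullfile(getenv('APPDATA'), 'NVIDIA', 'ComputeCache');
listing  = dir(fullfile(cacheDir, '**', '*'));   % recursive listing
files    = listing(~[listing.isdir]);            % keep files only
fprintf('ComputeCache size: %.1f MB\n', sum([files.bytes]) / 2^20);
```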
In R2018a update 3, running GPU-related commands with CUDA_FORCE_PTX_JIT=1 produces a different result: the ComputeCache remains empty, there's almost no delay, and it fails on convn:
>> getenv('CUDA_FORCE_PTX_JIT')
ans =
'1'
>> gpuDevice(1);
Warning: The CUDA driver must recompile the GPU libraries because CUDA_FORCE_PTX_JIT is set to '1'. Recompiling can take several minutes. Learn more.
> In parallel.internal.gpu.selectDevice
In parallel.gpu.GPUDevice.select (line 58)
In gpuDevice (line 21)
>> gpuDevice(2);
Warning: The CUDA driver must recompile the GPU libraries because CUDA_FORCE_PTX_JIT is set to '1'. Recompiling can take several minutes. Learn more.
> In parallel.internal.gpu.selectDevice
In parallel.gpu.GPUDevice.select (line 58)
In gpuDevice (line 21)
>> a=gpuArray(zeros([9 9 9]));
>> b=gpuArray(zeros([3 3 3]));
>> c=convn(a,b)
Error using gpuArray/convn
An unexpected error occurred trying to launch a kernel. The CUDA error was:
invalid device symbol
Probably it fails because R2018a is designed for CUDA 9, and my current GPU driver does not support it. That part is as expected. The strange part is that the ComputeCache is just empty.
-----------------------------------------------------------------------------------------------
I had a colleague check their dual GTX 1080 system and they saw no issues, with 16b or with the current version with a forced JIT.
Thanks for testing this! But could you please also specify which NVidia driver version that was?
Provided that it works in the "current version", that's probably some newer driver with CUDA 9 support. Maybe this means the issue is fixed in newer NVidia drivers. I think I should test this myself as well.
However, maybe, after all, this issue does depend on "mixing Pascal and Maxwell". It looks like at least some aspects depend on it. I've recently noticed that this fails:
% clear CUDA JIT cache and restart Matlab first
gpuDevice(1);
parpool(16);
fetchOutputs( parfevalOnAll(@gpuDevice,1) )
but this works (tested twice):
% clear CUDA JIT cache and restart Matlab first
gpuDevice(1);
parpool(16);
fetchOutputs( parfevalOnAll(@() gpuDevice(1),1) )
this works as well (tested twice):
% clear CUDA JIT cache and restart Matlab first
gpuDevice(1);
parpool(16);
spmd
gpuDevice(mod(labindex,2)+1);
gather(gpuArray(1));
end
but this fails:
% clear CUDA JIT cache and restart Matlab first
gpuDevice(1);
parpool(16);
spmd
gpuDevice(mod(labindex,2)+2);
gather(gpuArray(1));
end
Starting parallel pool (parpool) using the 'local' profile ... connected to 16 workers.
Warning: An error has occurred during SPMD execution. An attempt has been made to interrupt execution on the workers. If this situation persists, it may be necessary to
interrupt execution using CTRL-C and then deleting and restarting the parallel pool.
The error that occurred on worker 13 is:
Error using gpuDevice (line 26)
An unexpected error occurred during CUDA execution. The CUDA error was:
unknown error
.
> In spmdlang.RemoteSpmdExecutor/maybeWarnIfInterruptedAndWaiting (line 300)
In spmdlang.RemoteSpmdExecutor/isComputationComplete (line 131)
In spmdlang.spmd_feval_impl (line 19)
In spmd_feval (line 8)
Error detected on worker 13.
Caused by:
Error using gpuDevice (line 26)
An unexpected error occurred during CUDA execution. The CUDA error was:
unknown error
-----------------------------------------------------------------------------------------------
UPD:
I've just tried Nvidia 397.93 driver. And now the original issue is gone, and this:
% clear CUDA JIT cache and restart Matlab first
parpool(16);
fetchOutputs( parfevalOnAll(@() gather(gpuArray(1)),1) )
works OK in R2016b (tested twice). And the ComputeCache size is much smaller - only ~140MB.
So, after all, it looks like the issue does not exist in newer driver versions. Sorry for the buzz - I should have checked this before. :)
(But the CUDA_FORCE_PTX_JIT in R2017b still behaves the same for me, by the way.)


1 Answer

Answer by Igor Varfolomeev on 25 Nov 2018
 Accepted Answer

As noted in comments, it looks like the issue does not exist in newer driver versions. So, I'm sorry for the buzz.

  0 Comments
