I run the following code:
T = randn(10000,64);
data = randn(1000,64,10);
Tg = gpuArray(T);
datag = gpuArray(data);
res = zeros(10000,1000);
resg = gpuArray(res);
% accumulate on the CPU
for i = 1:10
    res = res + T*data(:,:,i)';
end
% accumulate on the GPU
for i = 1:10
    resg = resg + Tg*datag(:,:,i)';
end
resg = gather(resg);
norm(res-resg,'fro')/norm(res,'fro')
I would expect "res" (computed on the CPU) and "resg" (computed on the GPU) to be identical, but they are not.
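To get a sense of scale, here is a CPU-only check that can be run on top of the script above (res2 is just a scratch variable introduced for this test): floating-point addition is not associative, so accumulating the same products in the opposite order will typically give a small nonzero relative difference, which the GPU discrepancy can be compared against.
% CPU-only sanity check: sum the same products in reverse order
res2 = zeros(10000,1000);
for i = 10:-1:1
    res2 = res2 + T*data(:,:,i)';
end
norm(res-res2,'fro')/norm(res,'fro')   % difference caused by reordering alone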
I am running this on a Tesla card:
gpuDevice
ans =
parallel.gpu.CUDADevice handle
Package: parallel.gpu
Properties:
Name: 'Tesla C1060'
Index: 1
ComputeCapability: '1.3'
SupportsDouble: 1
DriverVersion: 3.2000
MaxThreadsPerBlock: 512
MaxShmemPerBlock: 16384
MaxThreadBlockSize: [512 512 64]
MaxGridSize: [65535 65535]
SIMDWidth: 32
TotalMemory: 4.2948e+09
FreeMemory: 4.0671e+09
MultiprocessorCount: 30
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 0
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
Methods, Events, Superclasses
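(For completeness, the card supports double precision and the gpuArrays really hold doubles; a quick check using the same properties shown above, assuming classUnderlying is available in this release:)
d = gpuDevice;            % the currently selected device
d.SupportsDouble          % 1, so double-precision arithmetic runs on the card
classUnderlying(Tg)       % 'double', i.e. the gpuArray was not demoted to single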