Making a custom way to train CNNs, and I am noticing that avgpool is SIGNIFICANTLY faster than maxpool in forward and backward passes…
I’m designing a custom training procedure for a CNN that differs from backpropagation in that I use manual update rules for layers or sets of layers. I’m studying my gradient for two types of layers, “conv + actfun + maxpool” and “conv + actfun + avgpool”, which are identical except for the pooling type at the end.
I compared the two layer types with identical data dimensions to see the time differences between maxpool and avgpool, in both the forward and backward passes of the pooling layers. All other steps in calculating the gradient were exactly the same between the two layers and showed the same time costs. But looking specifically at the pooling operations’ forward and backward passes, I get significantly different times (average of 5000 runs of the gradient; each measurement is in milliseconds):
gradient step | AvgPool | MaxPool | Difference
--------------------------|---------|---------|----------
pooling (forward pass) | 0.4165 | 38.6316 | +38.2151
unpooling (backward pass) | 9.9468 | 46.1667 | +36.2199
For reference, all my data arrays are dlarrays on the GPU (gpuArrays in dlarrays), all single precision, and the pooling operations convert 32 by 32 feature maps (across 2 channels and 16384 batch size) to 16 by 16 feature maps (of same # channels and batch size), so just a 2 by 2 pooling operation.
You can see here that the maxpool forward pass (using the “maxpool” function) is about 92 times slower than the avgpool forward pass (using “avgpool”), and the maxpool backward pass (using “maxunpool”) is about 4.6 times slower than the avgpool backward pass (using a custom “avgunpool” function that Anthropic’s Claude had to create for me, since MATLAB has no “avgunpool”).
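MATLAB has no built-in “avgunpool”, but the average-pool gradient itself is simple: each upstream gradient value is spread uniformly over its pooling window. A minimal sketch of what such a function can look like for the tiled case (pool size equal to stride, no padding, plain gpuArray/unformatted input; illustrative only, not the exact generated code):
function dLdX = avgunpool_sketch(dLdY, poolSize)
% Average-pool backward pass for tiled pooling: replicate each upstream
% gradient value over its poolSize(1)-by-poolSize(2) window, scaled by
% the window area (each input contributes 1/prod(poolSize) to its output).
dLdX = repelem(dLdY, poolSize(1), poolSize(2), 1, 1) / prod(poolSize);
end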
These results are extremely suspect to me. For the forward pass, comparing MATLAB’s built-in “maxpool” to the built-in “avgpool” gives a 92x difference, yet people online generally claim that max pooling should train faster than average pooling, which contradicts the results here.
For simplicity, see the code example below, which runs only “maxpool” and “avgpool” (no other functions) and compares their times:
function analyze_pooling_timing()
% GPU setup
g = gpuDevice();
fprintf('GPU: %s\n', g.Name);
% Parameters matching your test
H_in = 32; W_in = 32; C_in = 3; C_out = 2;
N = 16384; % batch size. Try N = 32 small or N = 16384 big
kH = 3; kW = 3;
pool_params.pool_size = [2, 2];
pool_params.pool_stride = [2, 2];
pool_params.pool_padding = 0;
conv_params.stride = [1, 1];
conv_params.padding = 'same';
conv_params.dilation = [1, 1];
% Initialize data
Wj = dlarray(gpuArray(single(randn(kH, kW, C_in, C_out) * 0.01)), 'SSCU');
Bj = dlarray(gpuArray(single(zeros(C_out, 1))), 'C');
Fjmin1 = dlarray(gpuArray(single(randn(H_in, W_in, C_in, N))), 'SSCB');
% Number of iterations for averaging
num_iter = 1000;
fprintf('Running %d iterations for each timing measurement...\n\n', num_iter);
%% setup everything in forward pass before the pooling:
% Forward convolution
Sj = dlconv(Fjmin1, Wj, Bj, ...
'Stride', conv_params.stride, ...
'Padding', conv_params.padding, ...
'DilationFactor', conv_params.dilation);
% activation function (and derivative)
Oj = max(Sj, 0); Fprimej = sign(Oj);
%% Time AVERAGE POOLING
fprintf('=== AVERAGE POOLING (conv_af_ap) ===\n');
times_ap = struct();
for iter = 1:num_iter
% Average pooling
tic;
Oj_pooled = avgpool(Oj, pool_params.pool_size, ...
'Stride', pool_params.pool_stride, ...
'Padding', pool_params.pool_padding);
wait(g);
times_ap.pooling(iter) = toc;
end
%% Time MAX POOLING
fprintf('\n=== MAX POOLING (conv_af_mp) ===\n');
times_mp = struct();
for iter = 1:num_iter
% Max pooling with indices
tic;
[Oj_pooled, max_indices] = maxpool(Oj, pool_params.pool_size, ...
'Stride', pool_params.pool_stride, ...
'Padding', pool_params.pool_padding);
wait(g);
times_mp.pooling(iter) = toc;
end
%% Compute statistics and display results
fprintf('\n=== TIMING RESULTS (milliseconds) ===\n');
fprintf('%-25s %12s %12s %12s\n', 'Step', 'AvgPool', 'MaxPool', 'Difference');
fprintf('%s\n', repmat('-', 1, 65));
steps_common = { 'pooling'};
total_ap = 0;
total_mp = 0;
for i = 1:length(steps_common)
step = steps_common{i};
if isfield(times_ap, step) && isfield(times_mp, step)
mean_ap = mean(times_ap.(step)) * 1000; % times 1000 to convert seconds to milliseconds
mean_mp = mean(times_mp.(step)) * 1000; % times 1000 to convert seconds to milliseconds
total_ap = total_ap + mean_ap;
total_mp = total_mp + mean_mp;
diff = mean_mp - mean_ap;
fprintf('%-25s %12.4f %12.4f %+12.4f\n', step, mean_ap, mean_mp, diff);
end
end
fprintf('%s\n', repmat('-', 1, 65));
%fprintf('%-25s %12.4f %12.4f %+12.4f\n', 'TOTAL', total_ap, total_mp, total_mp - total_ap);
fprintf('%-25s %12s %12s %12.2fx\n', 'max/avg', '', '', total_mp/total_ap);
end
Now you can see one main difference between my maxpool and avgpool calls: with avgpool I request only 1 output (the pooled values), but with maxpool I request 2 outputs (the pooled values and the index locations of those max values).
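In code, the difference is just the number of requested outputs (a sketch, where sz is the 2-by-2 pool size from above):
% One output: pooled values only (the fast case)
Yp = maxpool(Oj, sz, 'Stride', sz);
% Two or three outputs: also request the linear indices of the maxima
% (and optionally the input size) -- the slow case
[Yp, idx] = maxpool(Oj, sz, 'Stride', sz);
[Yp, idx, inSize] = maxpool(Oj, sz, 'Stride', sz);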
This matters because if I instead call maxpool requesting only the pooled values (1 output), maxpool is faster than avgpool, as expected:
>> analyze_pooling_timing
GPU: NVIDIA GeForce RTX 5080
Running 1000 iterations for each timing measurement...
=== AVERAGE POOLING (conv_af_ap) ===
=== MAX POOLING (conv_af_mp) ===
=== TIMING RESULTS (milliseconds) ===
Step AvgPool MaxPool Difference
-----------------------------------------------------------------
pooling 0.4358 0.2330 -0.2028
-----------------------------------------------------------------
max/avg 0.53x
>>
However, if I call maxpool with 2 outputs (pooled values and indices), or with 3 outputs (pooled values, indices, and inputSize), maxpool becomes extremely slow:
>> analyze_pooling_timing
GPU: NVIDIA GeForce RTX 5080
Running 1000 iterations for each timing measurement...
=== AVERAGE POOLING (conv_af_ap) ===
=== MAX POOLING (conv_af_mp) ===
=== TIMING RESULTS (milliseconds) ===
Step AvgPool MaxPool Difference
-----------------------------------------------------------------
pooling 0.4153 38.9818 +38.5665
-----------------------------------------------------------------
max/avg 93.86x
>>
So this appears to be the reason maxpool is so much slower: asking the maxpool function to also return the indices causes a massive slowdown.
Unfortunately, the indices are needed later to differentiate (backward pass) through the maxpool layer... so I need the indices...
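For reference, the standard backward route that makes the indices necessary looks something like this (a sketch; dLdY stands for the upstream loss gradient with respect to the pooled output):
% Forward pass: keep the indices and input size for later unpooling
[Yp, idx, inSize] = maxpool(Oj, sz, 'Stride', sz);
% ... compute dLdY, the loss gradient w.r.t. Yp ...
% Backward pass: maxunpool scatters dLdY back to the max locations
dLdX = maxunpool(dLdY, idx, inSize);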
I'd assume that anyone training a CNN in MATLAB with a maxpool layer would have to call maxpool with indices, and thus would see a similar slowdown...
Can anyone here comment on this? My application requires training maxpool layers, so I need both the forward and backward passes through them, and thus the indices. But it appears that MATLAB's maxpool may not be the most efficient implementation of the operation... perhaps there is something inefficient about how the indices are obtained?
Accepted Answer
Matt J
on 25 Aug 2025
Edited: Matt J
on 26 Aug 2025
I think the pooling operations are simply not optimized for the specific situation when the pooling is tiled (i.e., poolsize=stride). This is ironic, I guess, since this is a very common case.
Below I demonstrate how to optimize the forward pass for both kinds of pooling using sepblockfun(), downloadable from the MATLAB File Exchange.
For the maxpool backward pass, it's not really the maximizing indices you need. The more direct thing you need is a binary map of the maxima locations. When you have tiled pooling, this can be generated much faster, as done below in tiledmaxpoolBACK. There are ways to eliminate the second repelem for an even faster implementation, but this should make the point.
analyze_pooling_timing()
GPU: NVIDIA GeForce RTX 4080 SUPER
Milliseconds =
1×4 table
AvgPool(Forw) AvgPool(Back) MaxPool(Forw) MaxPool(Back)
_____________ _____________ _____________ _____________
3.6987 1.121 1.1704 2.0492
function analyze_pooling_timing()
% GPU setup
g = gpuDevice();
fprintf('GPU: %s\n', g.Name);
% Parameters matching your test
H_in = 32; W_in = 32; C_in = 3; C_out = 2;
N = 16384; % batch size. Try N = 32 small or N = 16384 big
kH = 3; kW = 3;
pool_params.pool_size = [2, 2];
pool_params.pool_stride = [2, 2];
pool_params.pool_padding = 0;
conv_params.stride = [1, 1];
conv_params.padding = 'same';
conv_params.dilation = [1, 1];
% Initialize data
Wj = dlarray(gpuArray(single(randn(kH, kW, C_in, C_out) * 0.01)), 'SSCU');
Bj = dlarray(gpuArray(single(zeros(C_out, 1))), 'C');
Fjmin1 = dlarray(gpuArray(single(randn(H_in, W_in, C_in, N))), 'SSCB');
%% setup everything in forward pass before the pooling:
% Forward convolution
Sj = dlconv(Fjmin1, Wj, Bj, ...
'Stride', conv_params.stride, ...
'Padding', conv_params.padding, ...
'DilationFactor', conv_params.dilation);
% activation function (and derivative)
Oj = max(Sj, 0); Fprimej = sign(Oj);
%% Time Tests
millisec = 1000; % Conversion factor
sz=pool_params.pool_size;
Y=tiledmaxpool(Oj, sz); %fake input to back pass
dLdY=Y;
apforwTime = gputimeit(@() tiledavgpool(Oj, sz) )*millisec;
apbackTime = gputimeit(@() tiledavgpoolBACK(dLdY,Oj,Y, sz) )*millisec;
mpforwTime = gputimeit(@() tiledmaxpool(Oj, sz) )*millisec;
mpbackTime = gputimeit(@() tiledmaxpoolBACK(dLdY,Oj,Y, sz) )*millisec;
Milliseconds = table(apforwTime, apbackTime, mpforwTime,mpbackTime,...
'Var',{'AvgPool(Forw)' 'AvgPool(Back)',...
'MaxPool(Forw)', 'MaxPool(Back)'})
end
function Y=tiledavgpool(X,s)
% Tiled average pooling: mean over non-overlapping s(1)-by-s(2) blocks
Y=sepblockfun(X,[s,1,1],'mean');
end
function Q=tiledavgpoolBACK(dLdY,X,Y,s)
% Backward pass: spread each upstream gradient uniformly over its block
% (X and Y are unused; kept for a signature uniform with the max version)
Q=repelem(dLdY/prod(s),s(1),s(2));
end
function Y=tiledmaxpool(X,s)
% Tiled max pooling: max over non-overlapping s(1)-by-s(2) blocks
Y=sepblockfun(X,[s,1,1],'max');
end
function Q=tiledmaxpoolBACK(dLdY,X,Y,s)
% Backward pass: a binary map of the maxima (X==repelem(Y,...)) gates
% the upstream gradient expanded back to the input size
Q=repelem(dLdY,s(1),s(2)) .* (X==repelem(Y,s(1),s(2)));
end
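One possible way to eliminate the repelem calls is to reshape into pooling blocks and rely on implicit expansion (a sketch assuming unformatted arrays -- apply stripdims to formatted dlarrays first -- and spatial sizes evenly divisible by the pool size):
function Q=tiledmaxpoolBACK2(dLdY,X,Y,s)
% Repelem-free backward pass for tiled max pooling
[H,W,C,N] = size(X);
Xb = reshape(X, s(1), H/s(1), s(2), W/s(2), C, N); % input split into blocks
Yb = reshape(Y, 1, H/s(1), 1, W/s(2), C, N);       % pooled maxima
Gb = reshape(dLdY, 1, H/s(1), 1, W/s(2), C, N);    % upstream gradient
% Implicit expansion over the block dims builds the binary maxima map
% and gates the expanded gradient in one pass
Q = reshape((Xb==Yb).*Gb, H, W, C, N);
end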
More Answers (1)
Catalytic
on 26 Aug 2025
Edited: Catalytic
on 26 Aug 2025
As @Matt J says, it may be worthwhile to make customized pooling layers specifically for the case of tiled pools. However, it might also be that your analysis is tainted because you are using the avgpool and maxpool functions to implement pooling instead of implementing them with actual layers. When I test that, I don't see such a big difference between the average pool and max pool performance -
analyze_pooling_timing()
Milliseconds =
1×2 table
AvgPool MaxPool
_______ _______
5.0346 4.4899
function analyze_pooling_timing()
% Parameters matching your test
H_in = 32; W_in = 32; C_in = 3; C_out = 2;
N = 16384; % batch size. Try N = 32 small or N = 16384 big
kH = 3; kW = 3;
pool_params.pool_size = [2, 2];
pool_params.pool_stride = [2, 2];
pool_params.pool_padding = 0;
conv_params.stride = [1, 1];
conv_params.padding = 'same';
conv_params.dilation = [1, 1];
% Initialize data
Wj = dlarray(gpuArray(single(randn(kH, kW, C_in, C_out) * 0.01)), 'SSCU');
Bj = dlarray(gpuArray(single(zeros(C_out, 1))), 'C');
Fjmin1 = dlarray(gpuArray(single(randn(H_in, W_in, C_in, N))), 'SSCB');
%% setup everything in forward pass before the pooling:
% Forward convolution
Sj = dlconv(Fjmin1, Wj, Bj, ...
'Stride', conv_params.stride, ...
'Padding', conv_params.padding, ...
'DilationFactor', conv_params.dilation);
% activation function (and derivative)
Oj = max(Sj, 0); Fprimej = sign(Oj);
%% Time Tests
millisec = 1000; % Conversion factor
sz=pool_params.pool_size;
for i=1:25
net=dlnetwork(averagePooling2dLayer(sz, Stride=sz),Oj);
fun = @() dlfeval(@modelGradients, net, Oj);
apforwTime(i) = gputimeit( fun,2 )*millisec;
net=dlnetwork(maxPooling2dLayer(sz, Stride=sz),Oj);
fun = @() dlfeval(@modelGradients, net, Oj);
mpforwTime(i) = gputimeit(fun,2 )*millisec;
end
Milliseconds = table(mean(apforwTime), mean(mpforwTime),...
'Var',{'AvgPool','MaxPool'})
end
function [loss, gradients] = modelGradients(net, X)
% Forward pass and loss
Y = forward(net, X);
loss = Y(1); % arbitrary scalar output so dlgradient has a scalar to differentiate
% Backpropagation (automatic differentiation)
gradients = dlgradient(loss, X);
end
4 Comments
Matt J
on 17 Oct 2025 at 1:06
Edited: Matt J
on 17 Oct 2025 at 1:20
my case is different though ... my "hand-derived" gradient function written in MATLAB first performs the forward pass: dlconv, then activation function, then maxpool. Then I take the derivative of my objective function w.r.t. the output features.
Nothing you mention about "your case" sounds like a deviation from what @Catalytic has presented. His examples also consist of hand-derived forward and backward computations.
It is clear that you want to implement max pooling (and its gradient) as an isolated step. Both @Catalytic and I have demonstrated fast alternatives to maxpool() for doing that.