Making a custom way to train CNNs, and I am noticing that avgpool is SIGNIFICANTLY faster than maxpool in forward and backward passes…

I’m designing a custom training procedure for a CNN that differs from backpropagation in that I use manual update rules for layers or sets of layers. I'm studying my gradient for two types of layers, “conv + actfun + maxpool” and “conv + actfun + avgpool”, which are identical except that the last operation is a different pooling type.
I compared the two layer types with identical data dimensions to see the time differences between maxpool and avgpool, both in the forward pass and the backward pass of the pooling layers. All other steps in calculating the gradient were exactly the same between the two layers and showed the same time costs. But when looking specifically at the time costs of the pooling operations’ forward and backward passes, I get significantly different times (average of 5000 runs of the gradient, each measurement in milliseconds):
gradient step             | AvgPool | MaxPool | Difference
--------------------------|---------|---------|-----------
pooling (forward pass)    |  0.4165 | 38.6316 |   +38.2151
unpooling (backward pass) |  9.9468 | 46.1667 |   +36.2199
For reference, all my data arrays are dlarrays on the GPU (gpuArrays in dlarrays), all single precision, and the pooling operations convert 32 by 32 feature maps (across 2 channels and 16384 batch size) to 16 by 16 feature maps (of same # channels and batch size), so just a 2 by 2 pooling operation.
You can see that the maxpool forward pass (using the “maxpool” function) is about 92 times slower than the avgpool forward pass (using “avgpool”), and the maxpool backward pass (using “maxunpool”) is about 4.6 times slower than the avgpool backward pass (using a custom “avgunpool” function that Anthropic’s Claude had to create for me, since MATLAB has no “avgunpool”).
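(For reference, a minimal sketch of what such an avgunpool can look like for tiled pooling; my actual generated function may differ in details:)
function dLdX = avgunpool(dLdY, poolSize)
% Backward pass of tiled (non-overlapping) average pooling: each upstream
% gradient is spread evenly over its pooling window.
dLdX = repelem(dLdY, poolSize(1), poolSize(2)) / prod(poolSize);
end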
These results are extremely suspect to me. For the forward pass, comparing MATLAB's built-in "maxpool" to the built-in "avgpool" gives a 92x difference, yet searching online, people generally claim that max pooling should train faster than average pooling, which contradicts the results here.
For simplicity, see the code example below that runs just "maxpool" and "avgpool" only (no other functions) and compares their times:
function analyze_pooling_timing()
% GPU setup
g = gpuDevice();
fprintf('GPU: %s\n', g.Name);
% Parameters matching your test
H_in = 32; W_in = 32; C_in = 3; C_out = 2;
N = 16384; % batch size. Try N = 32 small or N = 16384 big
kH = 3; kW = 3;
pool_params.pool_size = [2, 2];
pool_params.pool_stride = [2, 2];
pool_params.pool_padding = 0;
conv_params.stride = [1, 1];
conv_params.padding = 'same';
conv_params.dilation = [1, 1];
% Initialize data
Wj = dlarray(gpuArray(single(randn(kH, kW, C_in, C_out) * 0.01)), 'SSCU');
Bj = dlarray(gpuArray(single(zeros(C_out, 1))), 'C');
Fjmin1 = dlarray(gpuArray(single(randn(H_in, W_in, C_in, N))), 'SSCB');
% Number of iterations for averaging
num_iter = 100;
fprintf('Running %d iterations for each timing measurement...\n\n', num_iter);
%% setup everything in forward pass before the pooling:
% Forward convolution
Sj = dlconv(Fjmin1, Wj, Bj, ...
'Stride', conv_params.stride, ...
'Padding', conv_params.padding, ...
'DilationFactor', conv_params.dilation);
% activation function (and derivative)
Oj = max(Sj, 0); Fprimej = sign(Oj);
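% Note: sign(Oj) is the ReLU derivative here, since Oj = max(Sj,0) >= 0,
% so sign(Oj) is 1 where Sj > 0 and 0 elsewhere.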
%% Time AVERAGE POOLING
fprintf('=== AVERAGE POOLING (conv_af_ap) ===\n');
times_ap = struct();
for iter = 1:num_iter
% Average pooling
tic;
Oj_pooled = avgpool(Oj, pool_params.pool_size, ...
'Stride', pool_params.pool_stride, ...
'Padding', pool_params.pool_padding);
wait(g);
times_ap.pooling(iter) = toc;
end
%% Time MAX POOLING
fprintf('\n=== MAX POOLING (conv_af_mp) ===\n');
times_mp = struct();
for iter = 1:num_iter
% Max pooling with indices
tic;
[Oj_pooled, max_indices] = maxpool(Oj, pool_params.pool_size, ...
'Stride', pool_params.pool_stride, ...
'Padding', pool_params.pool_padding);
wait(g);
times_mp.pooling(iter) = toc;
end
%% Compute statistics and display results
fprintf('\n=== TIMING RESULTS (milliseconds) ===\n');
fprintf('%-25s %12s %12s %12s\n', 'Step', 'AvgPool', 'MaxPool', 'Difference');
fprintf('%s\n', repmat('-', 1, 65));
steps_common = { 'pooling'};
total_ap = 0;
total_mp = 0;
for i = 1:length(steps_common)
step = steps_common{i};
if isfield(times_ap, step) && isfield(times_mp, step)
mean_ap = mean(times_ap.(step)) * 1000; % times 1000 to convert seconds to milliseconds
mean_mp = mean(times_mp.(step)) * 1000; % times 1000 to convert seconds to milliseconds
total_ap = total_ap + mean_ap;
total_mp = total_mp + mean_mp;
diff = mean_mp - mean_ap;
fprintf('%-25s %12.4f %12.4f %+12.4f\n', step, mean_ap, mean_mp, diff);
end
end
fprintf('%s\n', repmat('-', 1, 65));
%fprintf('%-25s %12.4f %12.4f %+12.4f\n', 'TOTAL', total_ap, total_mp, total_mp - total_ap);
fprintf('%-25s %12s %12s %12.2fx\n', 'max/avg', '', '', total_mp/total_ap); % ratio of maxpool time to avgpool time
end
Now you can see one main difference between my calls to maxpool and avgpool: for avgpool I only request 1 output (the pooled values), but for maxpool I request 2 outputs (the pooled values, and the index locations of those max values).
This is important because if I replace the call to maxpool with one that only requests the pooled values (1 output), maxpool is faster, as expected:
>> analyze_pooling_timing
GPU: NVIDIA GeForce RTX 5080
Running 1000 iterations for each timing measurement...
=== AVERAGE POOLING (conv_af_ap) ===
=== MAX POOLING (conv_af_mp) ===
=== TIMING RESULTS (milliseconds) ===
Step                           AvgPool      MaxPool   Difference
-----------------------------------------------------------------
pooling                         0.4358       0.2330      -0.2028
-----------------------------------------------------------------
max/avg                                                     0.53x
>>
However, if I call maxpool as in the code above with 2 outputs (pooled values and indices), or with 3 outputs (pooled values, indices, and inputSize), maxpool is extremely slow:
>> analyze_pooling_timing
GPU: NVIDIA GeForce RTX 5080
Running 1000 iterations for each timing measurement...
=== AVERAGE POOLING (conv_af_ap) ===
=== MAX POOLING (conv_af_mp) ===
=== TIMING RESULTS (milliseconds) ===
Step                           AvgPool      MaxPool   Difference
-----------------------------------------------------------------
pooling                         0.4153      38.9818     +38.5665
-----------------------------------------------------------------
max/avg                                                    93.86x
>>
So it appears that this is the reason maxpool is so much slower: asking the maxpool function to also output the indices leads to a massive slowdown.
Unfortunately, the indices are needed later to differentiate through the maxpool layer (the backward pass via maxunpool), so I do need them.
I'd assume that whenever someone wants to train a CNN in MATLAB using a maxpool layer, they would have to call maxpool with indices, and thus I'd expect a similar slowdown.
Can anyone here comment on this? My application requires training maxpool layers, so I need both the forward and the backward pass through them, and thus I need these indices. But it appears that MATLAB's maxpool may not be the most efficient implementation of a maxpool operation, perhaps because of some inefficiency in how the indices are obtained?
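For clarity, here is the pattern I mean (roughly, with the calls simplified): the values-only call is fast, but the backward pass via maxunpool needs the indices and input size from the multi-output call:
% Fast, but not enough to differentiate through the pooling:
Yp = maxpool(Oj, [2 2], 'Stride', [2 2]);
% Needed for the backward pass, but ~90x slower in my timings:
[Yp, idx, inSz] = maxpool(Oj, [2 2], 'Stride', [2 2]);
% ... later, given dLdY (the gradient w.r.t. the pooled output):
% dLdX = maxunpool(dLdY, idx, inSz);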
  3 Comments
Cal on 15 Oct 2025 at 16:26
Edited: Cal on 15 Oct 2025 at 16:27
Hi Richard,
This is outdated now since I will be using custom cuDNN MEX files to implement this in MATLAB, but to answer your question "If you can share some more details about how you are using the network fragments and what you need the gradient of then we may be able to suggest code that can do that with dlgradient":
I derived an update rule for local learning in a CNN, for a single layer instead of all layers, and my derivation leads to an analytic update rule instead of relying on automatic differentiation / backpropagation. This update rule should be more efficient, since directly coded formulas for the update are faster than automatic differentiation: you don't have to evaluate the cost, among various other factors, and can just calculate the derivative formulas directly.
So in a sense, the point of this is to NOT use dlgradient or any of the other functionality that relies on automatic differentiation. The maxpool function allows the forward pass without AD, and maxunpool allows the backward pass without AD. It's just that the maxunpool function appears to have optimization issues as currently implemented, because the vast majority of people aren't doing what I'm doing; they are using backpropagation, which dlgradient handles naturally, and the backward pass through maxpool is, as of right now, better designed for backpropagation / dlgradient-based functionality.
What I'm doing now is to have a MATLAB wrapper function that performs either the forward pass or the backward pass over a single maxpool operation.
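For concreteness, a minimal sketch of what such a wrapper might look like (the name and argument list here are just illustrative, not my actual code):
function varargout = maxpoolOp(mode, varargin)
% Hypothetical wrapper around a single maxpool operation.
% 'forward':  maxpoolOp('forward', X, poolSize, stride)
%             returns the pooled values plus the indices/inputSize
%             needed later by maxunpool.
% 'backward': maxpoolOp('backward', dLdY, idx, inputSize)
%             scatters upstream gradients back to the max locations.
switch mode
    case 'forward'
        [X, poolSize, stride] = deal(varargin{:});
        [Y, idx, inputSize] = maxpool(X, poolSize, 'Stride', stride);
        varargout = {Y, idx, inputSize};
    case 'backward'
        [dLdY, idx, inputSize] = deal(varargin{:});
        varargout = {maxunpool(dLdY, idx, inputSize)};
end
end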


Accepted Answer

Matt J on 25 Aug 2025
Edited: Matt J on 26 Aug 2025
I think the pooling operations are simply not optimized for the specific situation where the pooling is tiled (i.e., poolsize = stride). This is ironic, I guess, since this is a very common case.
Below I demonstrate how to optimize the forward pass for both kinds of pooling using sepblockfun(), downloadable from the File Exchange.
For the maxpool backward pass, it's not really the maximizing indices you need. The more direct thing you need is a binary map of the maxima locations. When you have tiled pooling, this can be generated much faster, as done below in tiledmaxpoolBACK. There are ways to eliminate the second repelem for an even faster implementation (one such variant is sketched after the code), but this should make the point.
analyze_pooling_timing()
GPU: NVIDIA GeForce RTX 4080 SUPER
Milliseconds =
  1×4 table
    AvgPool(Forw)    AvgPool(Back)    MaxPool(Forw)    MaxPool(Back)
    _____________    _____________    _____________    _____________
        3.6987           1.121            1.1704           2.0492
function analyze_pooling_timing()
% GPU setup
g = gpuDevice();
fprintf('GPU: %s\n', g.Name);
% Parameters matching your test
H_in = 32; W_in = 32; C_in = 3; C_out = 2;
N = 16384; % batch size. Try N = 32 small or N = 16384 big
kH = 3; kW = 3;
pool_params.pool_size = [2, 2];
pool_params.pool_stride = [2, 2];
pool_params.pool_padding = 0;
conv_params.stride = [1, 1];
conv_params.padding = 'same';
conv_params.dilation = [1, 1];
% Initialize data
Wj = dlarray(gpuArray(single(randn(kH, kW, C_in, C_out) * 0.01)), 'SSCU');
Bj = dlarray(gpuArray(single(zeros(C_out, 1))), 'C');
Fjmin1 = dlarray(gpuArray(single(randn(H_in, W_in, C_in, N))), 'SSCB');
%% setup everything in forward pass before the pooling:
% Forward convolution
Sj = dlconv(Fjmin1, Wj, Bj, ...
'Stride', conv_params.stride, ...
'Padding', conv_params.padding, ...
'DilationFactor', conv_params.dilation);
% activation function (and derivative)
Oj = max(Sj, 0); Fprimej = sign(Oj);
%% Time Tests
millisec = 1000; % Conversion factor
sz=pool_params.pool_size;
Y=tiledmaxpool(Oj, sz); %fake input to back pass
dLdY=Y;
apforwTime = gputimeit(@() tiledavgpool(Oj, sz) )*millisec;
apbackTime = gputimeit(@() tiledavgpoolBACK(dLdY,Oj,Y, sz) )*millisec;
mpforwTime = gputimeit(@() tiledmaxpool(Oj, sz) )*millisec;
mpbackTime = gputimeit(@() tiledmaxpoolBACK(dLdY,Oj,Y, sz) )*millisec;
Milliseconds = table(apforwTime, apbackTime, mpforwTime,mpbackTime,...
'Var',{'AvgPool(Forw)' 'AvgPool(Back)',...
'MaxPool(Forw)', 'MaxPool(Back)'})
end
function Y=tiledavgpool(X,s)
% Tiled (non-overlapping) average pooling: mean over each s(1)-by-s(2) block
Y=sepblockfun(X,[s,1,1],'mean');
end
function Q=tiledavgpoolBACK(dLdY,X,Y,s)
% Backward pass: spread each upstream gradient evenly over its block
Q=repelem(dLdY/prod(s),s(1),s(2));
end
function Y=tiledmaxpool(X,s)
% Tiled max pooling: max over each s(1)-by-s(2) block
Y=sepblockfun(X,[s,1,1],'max');
end
function Q=tiledmaxpoolBACK(dLdY,X,Y,s)
% Backward pass: route gradients only to locations matching the block maxima
Q=repelem(dLdY,s(1),s(2)) .* (X==repelem(Y,s(1),s(2)));
end
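A rough sketch of the "eliminate the second repelem" idea mentioned above (assuming the pool size divides the spatial dimensions exactly and plain numeric/gpuArray inputs; formatted dlarrays may need their labels stripped first):
function Q = tiledmaxpoolBACK_noRepelemMask(dLdY, X, Y, s)
% Sketch: backward pass of tiled max pooling without the second repelem.
% Assumes size(X,1) and size(X,2) are exact multiples of s(1) and s(2).
H = size(X,1); W = size(X,2); C = size(X,3); N = size(X,4);
Xblocks = reshape(X, s(1), H/s(1), s(2), W/s(2), C, N);  % expose each pooling block
Yblocks = reshape(Y, 1,    H/s(1), 1,    W/s(2), C, N);  % one max value per block
mask    = reshape(Xblocks == Yblocks, H, W, C, N);       % binary map of maxima locations
Q       = repelem(dLdY, s(1), s(2)) .* mask;             % route upstream gradients to the maxima
end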

More Answers (1)

Catalytic on 26 Aug 2025
Edited: Catalytic on 26 Aug 2025
As @Matt J says, it may be worthwhile to make customized pooling layers specifically for the case of tiled pools. However, it might also be that your analysis is tainted because you are using the avgpool and maxpool functions to implement pooling instead of implementing them with actual layers. When I test that, I don't see such a big difference between the average pool and max pool performance -
analyze_pooling_timing()
Milliseconds =
  1×2 table
    AvgPool    MaxPool
    _______    _______
    5.0346     4.4899
function analyze_pooling_timing()
% Parameters matching your test
H_in = 32; W_in = 32; C_in = 3; C_out = 2;
N = 16384; % batch size. Try N = 32 small or N = 16384 big
kH = 3; kW = 3;
pool_params.pool_size = [2, 2];
pool_params.pool_stride = [2, 2];
pool_params.pool_padding = 0;
conv_params.stride = [1, 1];
conv_params.padding = 'same';
conv_params.dilation = [1, 1];
% Initialize data
Wj = dlarray(gpuArray(single(randn(kH, kW, C_in, C_out) * 0.01)), 'SSCU');
Bj = dlarray(gpuArray(single(zeros(C_out, 1))), 'C');
Fjmin1 = dlarray(gpuArray(single(randn(H_in, W_in, C_in, N))), 'SSCB');
%% setup everything in forward pass before the pooling:
% Forward convolution
Sj = dlconv(Fjmin1, Wj, Bj, ...
'Stride', conv_params.stride, ...
'Padding', conv_params.padding, ...
'DilationFactor', conv_params.dilation);
% activation function (and derivative)
Oj = max(Sj, 0); Fprimej = sign(Oj);
%% Time Tests
millisec = 1000; % Conversion factor
sz=pool_params.pool_size;
for i=1:25
net=dlnetwork(averagePooling2dLayer(sz, Stride=sz),Oj);
fun = @() dlfeval(@modelGradients, net, Oj);
apforwTime(i) = gputimeit( fun,2 )*millisec;
net=dlnetwork(maxPooling2dLayer(sz, Stride=sz),Oj);
fun = @() dlfeval(@modelGradients, net, Oj);
mpforwTime(i) = gputimeit(fun,2 )*millisec;
end
Milliseconds = table(mean(apforwTime), mean(mpforwTime),...
'Var',{'AvgPool','MaxPool'})
end
function [loss, gradients] = modelGradients(net, X)
% Forward pass and loss
Y = forward(net, X);
loss = Y(1);
% Backpropagation (automatic differentiation)
gradients = dlgradient(loss, X);
end
  4 Comments
Matt J on 17 Oct 2025 at 1:06
Edited: Matt J on 17 Oct 2025 at 1:20
my case is different though ... my "hand-derived" gradient function written in MATLAB first performs the forward pass: dlconv, then activation function, then maxpool. Then I take the derivative of my objective function w.r.t. the output features.
Nothing you mention about "your case" sounds like a deviation from what @Catalytic has presented. His examples also consist of hand-derived forward and backward computations.
It is clear that you want to implement max pooling (and its gradient) as an isolated step. Both @Catalytic and I have demonstrated fast alternatives to maxpool() for doing that.
Cal on 17 Oct 2025 at 17:03
To my understanding, all of Catalytic's examples use automatic differentiation, which is what I'm trying to avoid.
I see your examples, Matt, e.g. tiledmaxpool and tiledmaxpoolBACK. Thank you for those functions; I'll try them out. I would prefer more general, improved functions like maxpool and maxunpool, but this may work.
