Making a custom way to train CNNs, and I am noticing that avgpool is SIGNIFICANTLY faster than maxpool in forward and backward passes…

I’m designing a custom training procedure for a CNN that differs from backpropagation in that I use manual update rules for layers or sets of layers. I'm studying my gradient for two types of layers, “conv + actfun + maxpool” and “conv + actfun + avgpool”, which are identical except that the last operation is a different pooling type.
I compared the two layer types with identical data dimensions to see the time differences between maxpool and avgpool, both in the forward pass and the backward pass of the pooling layers. All other steps in calculating the gradient were exactly the same between the two layers and showed the same time costs. But when looking specifically at the time costs of the pooling operations’ forward and backward passes, I get significantly different times (average of 5000 runs of the gradient, each measurement in milliseconds):
gradient step             | AvgPool | MaxPool | Difference
--------------------------|---------|---------|-----------
pooling (forward pass)    |  0.4165 | 38.6316 |   +38.2151
unpooling (backward pass) |  9.9468 | 46.1667 |   +36.2199
For reference, all my data arrays are dlarrays on the GPU (gpuArrays in dlarrays), all single precision, and the pooling operations convert 32 by 32 feature maps (across 2 channels and 16384 batch size) to 16 by 16 feature maps (of same # channels and batch size), so just a 2 by 2 pooling operation.
You can see that the maxpool forward pass (using the “maxpool” function) is about 92 times slower than the avgpool forward pass (using “avgpool”), and the maxpool backward pass (using “maxunpool”) is about 4.6 times slower than the avgpool backward pass (using a custom “avgunpool” function that Anthropic’s Claude had to create for me, since MATLAB has no “avgunpool”).
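(For reference, a minimal sketch of what such an avgunpool can look like for tiled pooling; my actual generated function may differ in details:)
function dLdX = avgunpool(dLdY, poolSize)
% Backward pass of tiled (non-overlapping) average pooling: each upstream
% gradient is spread evenly over its pooling window.
dLdX = repelem(dLdY, poolSize(1), poolSize(2)) / prod(poolSize);
end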
These results are extremely suspect to me. For the forward pass, comparing MATLAB's built-in "maxpool" to the built-in "avgpool" gives a 92x difference, yet searching online, people generally claim that max pooling should train faster than average pooling, which contradicts the results here.
For simplicity, see the code example below that runs just "maxpool" and "avgpool" only (no other functions) and compares their times:
function analyze_pooling_timing()
% GPU setup
g = gpuDevice();
fprintf('GPU: %s\n', g.Name);
% Parameters matching your test
H_in = 32; W_in = 32; C_in = 3; C_out = 2;
N = 16384; % batch size. Try N = 32 small or N = 16384 big
kH = 3; kW = 3;
pool_params.pool_size = [2, 2];
pool_params.pool_stride = [2, 2];
pool_params.pool_padding = 0;
conv_params.stride = [1, 1];
conv_params.padding = 'same';
conv_params.dilation = [1, 1];
% Initialize data
Wj = dlarray(gpuArray(single(randn(kH, kW, C_in, C_out) * 0.01)), 'SSCU');
Bj = dlarray(gpuArray(single(zeros(C_out, 1))), 'C');
Fjmin1 = dlarray(gpuArray(single(randn(H_in, W_in, C_in, N))), 'SSCB');
% Number of iterations for averaging
num_iter = 100;
fprintf('Running %d iterations for each timing measurement...\n\n', num_iter);
%% setup everything in forward pass before the pooling:
% Forward convolution
Sj = dlconv(Fjmin1, Wj, Bj, ...
'Stride', conv_params.stride, ...
'Padding', conv_params.padding, ...
'DilationFactor', conv_params.dilation);
% activation function (and derivative)
Oj = max(Sj, 0); Fprimej = sign(Oj);
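% Note: sign(Oj) is the ReLU derivative here, since Oj = max(Sj,0) >= 0,
% so sign(Oj) is 1 where Sj > 0 and 0 elsewhere.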
%% Time AVERAGE POOLING
fprintf('=== AVERAGE POOLING (conv_af_ap) ===\n');
times_ap = struct();
for iter = 1:num_iter
% Average pooling
tic;
Oj_pooled = avgpool(Oj, pool_params.pool_size, ...
'Stride', pool_params.pool_stride, ...
'Padding', pool_params.pool_padding);
wait(g);
times_ap.pooling(iter) = toc;
end
%% Time MAX POOLING
fprintf('\n=== MAX POOLING (conv_af_mp) ===\n');
times_mp = struct();
for iter = 1:num_iter
% Max pooling with indices
tic;
[Oj_pooled, max_indices] = maxpool(Oj, pool_params.pool_size, ...
'Stride', pool_params.pool_stride, ...
'Padding', pool_params.pool_padding);
wait(g);
times_mp.pooling(iter) = toc;
end
%% Compute statistics and display results
fprintf('\n=== TIMING RESULTS (milliseconds) ===\n');
fprintf('%-25s %12s %12s %12s\n', 'Step', 'AvgPool', 'MaxPool', 'Difference');
fprintf('%s\n', repmat('-', 1, 65));
steps_common = { 'pooling'};
total_ap = 0;
total_mp = 0;
for i = 1:length(steps_common)
step = steps_common{i};
if isfield(times_ap, step) && isfield(times_mp, step)
mean_ap = mean(times_ap.(step)) * 1000; % times 1000 to convert seconds to milliseconds
mean_mp = mean(times_mp.(step)) * 1000; % times 1000 to convert seconds to milliseconds
total_ap = total_ap + mean_ap;
total_mp = total_mp + mean_mp;
diff = mean_mp - mean_ap;
fprintf('%-25s %12.4f %12.4f %+12.4f\n', step, mean_ap, mean_mp, diff);
end
end
fprintf('%s\n', repmat('-', 1, 65));
%fprintf('%-25s %12.4f %12.4f %+12.4f\n', 'TOTAL', total_ap, total_mp, total_mp - total_ap);
fprintf('%-25s %12s %12s %12.2fx\n', 'max/avg', '', '', total_mp/total_ap); % ratio of maxpool time to avgpool time
end
Now you can see one main difference between my calls to maxpool and avgpool: for avgpool I only request 1 output (the pooled values), but for maxpool I request 2 outputs (the pooled values, and the index locations of those max values).
This is important because if I replace the call to maxpool with one that only requests the pooled values (1 output), maxpool is faster, as expected:
>> analyze_pooling_timing
GPU: NVIDIA GeForce RTX 5080
Running 1000 iterations for each timing measurement...
=== AVERAGE POOLING (conv_af_ap) ===
=== MAX POOLING (conv_af_mp) ===
=== TIMING RESULTS (milliseconds) ===
Step                           AvgPool      MaxPool   Difference
-----------------------------------------------------------------
pooling                         0.4358       0.2330      -0.2028
-----------------------------------------------------------------
max/avg                                                     0.53x
>>
However, if I call maxpool as in the code above with 2 outputs (pooled values and indices), or with 3 outputs (pooled values, indices, and inputSize), maxpool is extremely slow:
>> analyze_pooling_timing
GPU: NVIDIA GeForce RTX 5080
Running 1000 iterations for each timing measurement...
=== AVERAGE POOLING (conv_af_ap) ===
=== MAX POOLING (conv_af_mp) ===
=== TIMING RESULTS (milliseconds) ===
Step                           AvgPool      MaxPool   Difference
-----------------------------------------------------------------
pooling                         0.4153      38.9818     +38.5665
-----------------------------------------------------------------
max/avg                                                    93.86x
>>
So it appears that this is the reason maxpool is so much slower: asking the maxpool function to also output the indices leads to a massive slowdown.
Unfortunately, the indices are needed later to differentiate through the maxpool layer (the backward pass via maxunpool), so I do need them.
I'd assume that whenever someone wants to train a CNN in MATLAB using a maxpool layer, they would have to call maxpool with indices, and thus I'd expect a similar slowdown.
Can anyone here comment on this? My application requires training maxpool layers, so I need both the forward and the backward pass through them, and thus I need these indices. But it appears that MATLAB's maxpool may not be the most efficient implementation of a maxpool operation, perhaps because of some inefficiency in how the indices are obtained?
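For clarity, here is the pattern I mean (roughly, with the calls simplified): the values-only call is fast, but the backward pass via maxunpool needs the indices and input size from the multi-output call:
% Fast, but not enough to differentiate through the pooling:
Yp = maxpool(Oj, [2 2], 'Stride', [2 2]);
% Needed for the backward pass, but ~90x slower in my timings:
[Yp, idx, inSz] = maxpool(Oj, [2 2], 'Stride', [2 2]);
% ... later, given dLdY (the gradient w.r.t. the pooled output):
% dLdX = maxunpool(dLdY, idx, inSz);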
  3 Comments
Cal on 15 Oct 2025 at 16:26
Edited: Cal on 15 Oct 2025 at 16:27
Hi Richard,
This is outdated now since I will be using custom cuDNN MEX files to implement this in MATLAB, but to answer your question "If you can share some more details about how you are using the network fragments and what you need the gradient of then we may be able to suggest code that can do that with dlgradient":
I derived an update rule for local learning in a CNN, for a single layer instead of all layers, and my derivation leads to an analytic update rule instead of relying on automatic differentiation / backpropagation. This update rule should be more efficient, since directly coded formulas for the update are faster than automatic differentiation: you don't have to evaluate the cost, among various other factors, and can just calculate the derivative formulas directly.
So in a sense, the point of this is to NOT use dlgradient or any of the other functionality that relies on automatic differentiation. The maxpool function allows the forward pass without AD, and maxunpool allows the backward pass without AD. It's just that the maxunpool function appears to have optimization issues as currently implemented, because the vast majority of people aren't doing what I'm doing; they are using backpropagation, which dlgradient handles naturally, and the backward pass through maxpool is, as of right now, better designed for backpropagation / dlgradient-based functionality.
What I'm doing now is to have a MATLAB wrapper function that performs either the forward pass or the backward pass over a single maxpool operation.
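For concreteness, a minimal sketch of what such a wrapper might look like (the name and argument list here are just illustrative, not my actual code):
function varargout = maxpoolOp(mode, varargin)
% Hypothetical wrapper around a single maxpool operation.
% 'forward':  maxpoolOp('forward', X, poolSize, stride)
%             returns the pooled values plus the indices/inputSize
%             needed later by maxunpool.
% 'backward': maxpoolOp('backward', dLdY, idx, inputSize)
%             scatters upstream gradients back to the max locations.
switch mode
    case 'forward'
        [X, poolSize, stride] = deal(varargin{:});
        [Y, idx, inputSize] = maxpool(X, poolSize, 'Stride', stride);
        varargout = {Y, idx, inputSize};
    case 'backward'
        [dLdY, idx, inputSize] = deal(varargin{:});
        varargout = {maxunpool(dLdY, idx, inputSize)};
end
end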


Accepted Answer

Matt J on 25 Aug 2025
Edited: Matt J on 26 Aug 2025
I think the pooling operations are simply not optimized for the specific situation where the pooling is tiled (i.e., poolsize = stride). This is ironic, I guess, since this is a very common case.
Below I demonstrate how to optimize the forward pass for both kinds of pooling using sepblockfun(), downloadable from the File Exchange.
For the maxpool backward pass, it's not really the maximizing indices you need. The more direct thing you need is a binary map of the maxima locations. When you have tiled pooling, this can be generated much faster, as done below in tiledmaxpoolBACK. There are ways to eliminate the second repelem for an even faster implementation (one such variant is sketched after the code), but this should make the point.
analyze_pooling_timing()
GPU: NVIDIA GeForce RTX 4080 SUPER
Milliseconds =
  1×4 table
    AvgPool(Forw)    AvgPool(Back)    MaxPool(Forw)    MaxPool(Back)
    _____________    _____________    _____________    _____________
        3.6987           1.121            1.1704           2.0492
function analyze_pooling_timing()
% GPU setup
g = gpuDevice();
fprintf('GPU: %s\n', g.Name);
% Parameters matching your test
H_in = 32; W_in = 32; C_in = 3; C_out = 2;
N = 16384; % batch size. Try N = 32 small or N = 16384 big
kH = 3; kW = 3;
pool_params.pool_size = [2, 2];
pool_params.pool_stride = [2, 2];
pool_params.pool_padding = 0;
conv_params.stride = [1, 1];
conv_params.padding = 'same';
conv_params.dilation = [1, 1];
% Initialize data
Wj = dlarray(gpuArray(single(randn(kH, kW, C_in, C_out) * 0.01)), 'SSCU');
Bj = dlarray(gpuArray(single(zeros(C_out, 1))), 'C');
Fjmin1 = dlarray(gpuArray(single(randn(H_in, W_in, C_in, N))), 'SSCB');
%% setup everything in forward pass before the pooling:
% Forward convolution
Sj = dlconv(Fjmin1, Wj, Bj, ...
'Stride', conv_params.stride, ...
'Padding', conv_params.padding, ...
'DilationFactor', conv_params.dilation);
% activation function (and derivative)
Oj = max(Sj, 0); Fprimej = sign(Oj);
%% Time Tests
millisec = 1000; % Conversion factor
sz=pool_params.pool_size;
Y=tiledmaxpool(Oj, sz); %fake input to back pass
dLdY=Y;
apforwTime = gputimeit(@() tiledavgpool(Oj, sz) )*millisec;
apbackTime = gputimeit(@() tiledavgpoolBACK(dLdY,Oj,Y, sz) )*millisec;
mpforwTime = gputimeit(@() tiledmaxpool(Oj, sz) )*millisec;
mpbackTime = gputimeit(@() tiledmaxpoolBACK(dLdY,Oj,Y, sz) )*millisec;
Milliseconds = table(apforwTime, apbackTime, mpforwTime,mpbackTime,...
'Var',{'AvgPool(Forw)' 'AvgPool(Back)',...
'MaxPool(Forw)', 'MaxPool(Back)'})
end
function Y=tiledavgpool(X,s)
% Tiled (non-overlapping) average pooling: mean over each s(1)-by-s(2) block
Y=sepblockfun(X,[s,1,1],'mean');
end
function Q=tiledavgpoolBACK(dLdY,X,Y,s)
% Backward pass: spread each upstream gradient evenly over its block
Q=repelem(dLdY/prod(s),s(1),s(2));
end
function Y=tiledmaxpool(X,s)
% Tiled max pooling: max over each s(1)-by-s(2) block
Y=sepblockfun(X,[s,1,1],'max');
end
function Q=tiledmaxpoolBACK(dLdY,X,Y,s)
% Backward pass: route gradients only to locations matching the block maxima
Q=repelem(dLdY,s(1),s(2)) .* (X==repelem(Y,s(1),s(2)));
end
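A rough sketch of the "eliminate the second repelem" idea mentioned above (assuming the pool size divides the spatial dimensions exactly and plain numeric/gpuArray inputs; formatted dlarrays may need their labels stripped first):
function Q = tiledmaxpoolBACK_noRepelemMask(dLdY, X, Y, s)
% Sketch: backward pass of tiled max pooling without the second repelem.
% Assumes size(X,1) and size(X,2) are exact multiples of s(1) and s(2).
H = size(X,1); W = size(X,2); C = size(X,3); N = size(X,4);
Xblocks = reshape(X, s(1), H/s(1), s(2), W/s(2), C, N);  % expose each pooling block
Yblocks = reshape(Y, 1,    H/s(1), 1,    W/s(2), C, N);  % one max value per block
mask    = reshape(Xblocks == Yblocks, H, W, C, N);       % binary map of maxima locations
Q       = repelem(dLdY, s(1), s(2)) .* mask;             % route upstream gradients to the maxima
end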

More Answers (1)

Catalytic on 26 Aug 2025
Edited: Catalytic on 26 Aug 2025
As @Matt J says, it may be worthwhile to make customized pooling layers specifically for the case of tiled pools. However, it might also be that your analysis is tainted because you are using the avgpool and maxpool functions to implement pooling instead of implementing them with actual layers. When I test that, I don't see such a big difference between the average pool and max pool performance -
analyze_pooling_timing()
Milliseconds =
  1×2 table
    AvgPool    MaxPool
    _______    _______
    5.0346     4.4899
function analyze_pooling_timing()
% Parameters matching your test
H_in = 32; W_in = 32; C_in = 3; C_out = 2;
N = 16384; % batch size. Try N = 32 small or N = 16384 big
kH = 3; kW = 3;
pool_params.pool_size = [2, 2];
pool_params.pool_stride = [2, 2];
pool_params.pool_padding = 0;
conv_params.stride = [1, 1];
conv_params.padding = 'same';
conv_params.dilation = [1, 1];
% Initialize data
Wj = dlarray(gpuArray(single(randn(kH, kW, C_in, C_out) * 0.01)), 'SSCU');
Bj = dlarray(gpuArray(single(zeros(C_out, 1))), 'C');
Fjmin1 = dlarray(gpuArray(single(randn(H_in, W_in, C_in, N))), 'SSCB');
%% setup everything in forward pass before the pooling:
% Forward convolution
Sj = dlconv(Fjmin1, Wj, Bj, ...
'Stride', conv_params.stride, ...
'Padding', conv_params.padding, ...
'DilationFactor', conv_params.dilation);
% activation function (and derivative)
Oj = max(Sj, 0); Fprimej = sign(Oj);
%% Time Tests
millisec = 1000; % Conversion factor
sz=pool_params.pool_size;
for i=1:25
net=dlnetwork(averagePooling2dLayer(sz, Stride=sz),Oj);
fun = @() dlfeval(@modelGradients, net, Oj);
apforwTime(i) = gputimeit( fun,2 )*millisec;
net=dlnetwork(maxPooling2dLayer(sz, Stride=sz),Oj);
fun = @() dlfeval(@modelGradients, net, Oj);
mpforwTime(i) = gputimeit(fun,2 )*millisec;
end
Milliseconds = table(mean(apforwTime), mean(mpforwTime),...
'Var',{'AvgPool','MaxPool'})
end
function [loss, gradients] = modelGradients(net, X)
% Forward pass and loss
Y = forward(net, X);
loss = Y(1);
% Backpropagation (automatic differentiation)
gradients = dlgradient(loss, X);
end
  4 Comments
Matt J on 17 Oct 2025 at 1:06
Edited: Matt J on 17 Oct 2025 at 1:20
my case is different though ... my "hand-derived" gradient function written in MATLAB first performs the forward pass: dlconv, then activation function, then maxpool. Then I take the derivative of my objective function w.r.t. the output features.
Nothing you mention about "your case" sounds like a deviation from what @Catalytic has presented. His examples also consist of hand-derived forward and backward computations.
It is clear that you want to implement max pooling (and its gradient) as an isolated step. Both @Catalytic and I have demonstrated fast alternatives to maxpool() for doing that.
Cal on 17 Oct 2025 at 17:03
To my understanding, all of Catalytic's examples use automatic differentiation, which is what I'm trying to avoid.
I see your examples, Matt, e.g. tiledmaxpool and tiledmaxpoolBACK. Thank you for those functions; I'll try them out. I would prefer more general, improved functions like maxpool and maxunpool, but this may work.
