CONVOLVE2 can be used wherever CONV2 is used, taking the same arguments and returning the same results to within a small tolerance. The computation is speeded up by using the singular value decomposition of the mask to express it as a sum of outer products. Each of these can be computed efficiently as convolution with a row and a column vector. CONV2 is used to carry out the individual convolutions.
Separable masks are a particular case and are handled by CONVOLVE2 much as FILTER2 does. Many other masks which are not separable have low rank (e.g. Gabor function masks) and are handled more efficiently by CONVOLVE2.
The function will also compute a reducedrank approximation to a given mask if required and will use this if it will speed up the computation. An extra argument allows control over the error.
Additional shape options allow: (a) 2D "circular" convolution  that is, the input array is taken to be periodic rather than surrounded by zeros; (b) a "reflection" boundary condition  that is, the input array is taken to be surrounded by reflected copies of itself.
See 'Computer and Robot Vision' Vol I, by R.M. Haralick and L.G. Shapiro (AddisonWesley 1992), pp. 298299.
David Young (2020). Fast 2D convolution (https://www.mathworks.com/matlabcentral/fileexchange/22619fast2dconvolution), MATLAB Central File Exchange. Retrieved .
1.3.0.0  Alternative names for shape options for consistency with other functions. 

1.2.0.0  Minor correction to help comments; updated exindex to current version. 

1.1.0.0  Now uses exindex to simplify the boundary condition computation. 
Create scripts with code, output, and formatted text in a single executable document.
I can't believe I have wasted so long time for Conv2 before I met this.
ORZ...
I have read the 'Computer and Robot Vision' Vol I in page 298 and 299, in fact I do not know why decompose the kernel could reduce the computation burden. For example, let w be a 2x2 kernel, by writing
a_11 w_11 + a_12 w_12 + a_21 w_21 + a_22 w_22
= a_11 u_1 v_1 + a_12 u_1 v_2 + a_21 u_2 v_1 + a_22 u_2 v_2
= (a_11 v_1 + a_12 v_2) u_1 + (a_21 v_1 + a_22 v_2 ) u_2
does not save the number of multiplications.
Fast and accurate!
Great submission. Works as described without a flaw and is well documented in code.
A Suggestion if anyone feels like working on this great code further:
One could clearly seperate the code that analyses the filter from the actual filtering (convolution). This would make the benefits clearer for iterative variational methods and/or realtime video processing.
cheers
Thank you David! I use this a lot and it always improves dramatically the speed of my algorithms. Great job!
Rob, you are right that the tolerance parameter makes no difference if the mask is ones(n). The reason is that the mask can be decomposed into 1D masks without any loss of accuracy, so this is done regardless of the tolerance. The same is true of many simple masks, e.g. those produced by the 'gauss' and 'log' arguments to fspecial, and Gabor function masks.
The tolerance becomes useful if the mask is complex  for example, if it is obtained by snipping out part of an image. Then, to get full accuracy, a full 2D convolution is needed and there will be no gain over conv2 (indeed, a slight loss of time); however, if the tolerance is large enough an approximation to the mask will be used, allowing a speedup at a cost of accuracy.
I get the following speed improvements. I didn't see an effect of changing the tolerance value. It would be nice to have usage examples where this parameter makes a difference to computation speed.
>> tic,for i=1:100;C=conv2(randn(1000),ones(10),'same');end,toc
>> Elapsed time is 17.121093 seconds.
>> tic,for i=1:100;C=convolve2(randn(1000),ones(10),'same',1);end,toc
>> Elapsed time is 10.800977 seconds.
>> tic,for i=1:100;C=convolve2(randn(1000),ones(10),'same',0);end,toc
>> Elapsed time is 10.419711 seconds.
I'm fairly sure this used to have loads of 5* comments? Anyway, this script is great. The increase in speed over conv2 for me was 10 times. Really easy to use  outstanding!
I use this a lot. It is very fast and well documented. A great replacement for conv2().
In some cases, a fft based convolution is faster. I've seen this be the case when the filter and image are the same size.
Does anybody know a c/c++ code which is as useful as this one? Multipurpose, fast, reliable, 2d convolution.