Loss of precision in wave equations

Hi,
I'm stumped by the following loss of precision. Both of the command-line snippets below should return the same result, 0 — you can check this by writing it out by hand for a small N :).
The problem is that MATLAB behaves very differently depending on whether s1 and s2 are scalars or arrays. Both snippets perform on the order of ~4*N*K arithmetic operations, and the snippet that produces the correct result in fact performs more arithmetic operations overall, yet still returns exactly 0.
It's only when s1 and s2 are Nx1 arrays that the error starts popping up, and it is there irrespective of the value of the scaling factor A.
Any thoughts on how to mitigate this? I need s1 and s2 to be arrays and can't afford the noise this introduces.
Thanks!!!
K = 1e6; N = 1e1; A=1e30; rz = zeros(K,1); r1 = 2:N+1; r2=1:N; p1 = 1:N; p2 = 2:N+1;
for i=1:K,
s2 = A.*rand(1,1); s1 = A.*rand(1,1); d= zeros(N+1,1);
d(r1) = d(r1) - s2;
d(p1) = d(p1) + s2;
d(r2) = d(r2)-s1;
d(p2) = d(p2) + s1;
rz(i) = abs(sum(d));
end;
disp(sum(rz)./A)
0
K = 1e1; N = 1e4; A=1e30; rz = zeros(K,1); r1 = 2:N+1; r2=1:N; p1 = 1:N; p2 = 2:N+1;
for i=1:K,
s2 = A.*rand(N,1); s1 = A.*rand(N,1); d= zeros(N+1,1);
d(r1) = d(r1) - s2;
d(p1) = d(p1) + s2;
d(r2) = d(r2)-s1;
d(p2) = d(p2) + s1;
rz(i) = abs(sum(d));
end;
disp(sum(rz)./A)
7.62234236932456e-13
(I've edited the above single-line command-prompt sequence into a more readable block.)
PS. I've played around with this a bit more. If I change the calculation of the residual rz and eliminate the abs(sum(d)), I still get the same type of behavior...
K = 1e1; N = 1e4; A=1e30; rz = zeros(K,1); r1 = 2:N+1; r2=1:N; p1 = 1:N; p2 = 2:N+1;
for i=1:K,
s2 = A.*rand(N,1); s1 = A.*rand(N,1); d= zeros(N+1,1);
d(r1) = d(r1) - s2;
d(p1) = d(p1) + s2;
d(r2) = d(r2)-s1;
d(p2) = d(p2) + s1;
rz(i) = sum(d);
end;
disp(sum(rz)./A)
-3.2172589838028e-13
And it's not about averaging...
K = 1e4; N = 1e4; A=1e30; rz = zeros(K,1); r1 = 2:N+1; r2=1:N; p1 = 1:N; p2 = 2:N+1;
for i=1:K,
s2 = A.*rand(N,1); s1 = A.*rand(N,1); d= zeros(N+1,1);
d(r1) = d(r1) - s2;
d(p1) = d(p1) + s2;
d(r2) = d(r2)-s1;
d(p2) = d(p2) + s1;
rz(i) = sum(d);
end;
disp(sum(rz)./A)
-3.8100276925694e-12
Nor does it seem to be about rational/irrational numbers...
K = 1e6; N = 3; A=1e30; rz = zeros(K,1); r1 = 2:N+1; r2=1:N; p1 = 1:N; p2 = 2:N+1;
for i=1:K,
s2 = A.*[pi/3423890482309; pi/8.3242342; pi/7.23492034]; s1 = A.*[0.3; 0.5; 0.7]; d= zeros(N+1,1);
d(r1) = d(r1) - s2;
d(p1) = d(p1) + s2;
d(r2) = d(r2)-s1;
d(p2) = d(p2) + s1;
rz(i) = sum(d);
end;
disp(sum(rz)./A)
0
K = 1e6; N = 3; A=1e30; rz = zeros(K,1); r1 = 2:N+1; r2=1:N; p1 = 1:N; p2 = 2:N+1;
for i=1:K,
s2 = A.*[pi/10; pi/8.3242342; pi/7.23492034]; s1 = A.*[0.3; 0.5; 0.7]; d= zeros(N+1,1);
d(r1) = d(r1) - s2;
d(p1) = d(p1) + s2;
d(r2) = d(r2)-s1;
d(p2) = d(p2) + s1;
rz(i) = sum(d);
end;
disp(sum(rz)./A)
-3.5184372088832e-11

 Accepted Answer

This is expected behavior.
By default MATLAB uses IEEE 754 double precision. This has finite precision and cannot exactly represent many real values; round-off is inherent in all arithmetic operations. See the MATLAB documentation on floating-point numbers.
Floating-point addition is not associative: changing the order of additions changes the rounding error. MATLAB's sum(x) can execute partial sums in different orders (and may even parallelize), so the same algebraic sum over a vector can yield small non-zero results due to accumulated rounding error. Cleve Moler's blog post “SuperSum, In Defense of Floating Point Arithmetic” discusses how order, round-off, and the summation algorithm affect results.
The small residuals you see (~1e-13 relative to A) are exactly the round-off errors expected when summing O(N) floating-point values; they are not a bug. Similar explanations appear in MathWorks Answers threads about floating-point differences between sum and cumsum.
Let's try it now:
% Example showing (a + b) + c ≠ a + (b + c)
a = +1e30;
b = -1e30;
c = 1.0;
one = (a + b) + c; % round( (a + b) ) then + c
two = a + (b + c); % round( (b + c) ) then + a
fprintf(' (a + b) + c = %.17g\n', one)
(a + b) + c = 1
fprintf(' a + (b + c) = %.17g\n', two)
a + (b + c) = 0
one==two
ans = logical
0
If you need numerically accurate cancellation over large arrays, typical approaches include compensated summation (e.g., Kahan summation) or exact summation techniques, both of which reduce accumulated error in vector sums.
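As an illustration (a minimal sketch, not the only or best implementation — the function name kahansum is just a placeholder), compensated summation can be written in a few lines of plain MATLAB:
```
function s = kahansum(x)
% KAHANSUM  Compensated (Kahan) summation of a vector.
% Carries a correction term c that recovers the low-order bits
% lost when a small addend meets a large running sum.
s = 0;   % running sum
c = 0;   % running compensation
for k = 1:numel(x)
    y = x(k) - c;      % apply the stored correction
    t = s + y;         % low-order bits of y may be lost here...
    c = (t - s) - y;   % ...but can be recovered algebraically
    s = t;             % new running sum
end
end
```
Applied to the d vector in your snippets, this typically shrinks the residual by several orders of magnitude, though it cannot make the cancellation exact in general.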
As an alternative, if your logic requires mathematically exact cancellation (and the values really should sum to zero in exact arithmetic), you can compare against a tolerance rather than relying on exact floating-point equality. The isapprox helper in recent MATLAB releases encapsulates this pattern.
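For example (the tolerance value here is an assumption; choose one appropriate to your problem's scale, and note that rz and A come from the snippets above):
```
residual = sum(rz)./A;      % scale-free residual, as in the snippets above
tol = 1e-9;                 % problem-specific tolerance (an assumption)
isZeroEnough = abs(residual) <= tol;
% In R2024b or newer, the same pattern can be written as:
% isZeroEnough = isapprox(residual, 0, AbsoluteTolerance=tol);
```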

More Answers (2)

What @Stephen23 says is entirely correct.
That said:
K = 1e1; N = 1e4; A=sym(10)^30; rz = zeros(K,1,'sym'); r1 = 2:N+1; r2=1:N; p1 = 1:N; p2 = 2:N+1;
for i=1:K,
s2 = A.*sym(rand(N,1)); s1 = A.*sym(rand(N,1)); d= zeros(N+1,1,'sym');
d(r1) = d(r1) - s2;
d(p1) = d(p1) + s2;
d(r2) = d(r2)-s1;
d(p2) = d(p2) + s1;
rz(i) = abs(sum(d));
end;
disp(sum(rz)./A)
0
That is, if you use the Symbolic Toolbox to carry out the calculations using rational numbers, then the calculations will be exact.
Jan on 20 Feb 2026
The sum is numerically unstable, so your observation is expected.
Besides the Symbolic Toolbox or a toolbox for arbitrary-precision numbers, there are stabilized algorithms for calculating the sum. See https://www.mathworks.com/matlabcentral/fileexchange/26800-xsum , which uses some additional bits (Kahan) or quadruple precision (Knuth) for the calculations. This will increase the precision, so that the rounding problem occurs later, but the general problem remains. See Stephen's answer.

5 Comments

Thank you all for your quick answers. I found them very helpful and thorough. Much appreciated! I will look more into the symbolic and Kahan-type algorithms.
What I find really intriguing (even after taking out A) is that if s1 and s2 are scalars, or in the case s1 = rand(1).*ones(N,1) (and similarly for s2), the code always (in my finite set of examples of K and N values, running to a total of over N*K > 1e9 fp operations) returns the correct answer, 0! It is only when s1 = rand(N,1) that the answer is not correct... I wonder if there's some more structure to be had in this problem...
Remember that RAND returns numbers in the interval [0, 1] with 52-bit resolution. This means that abs(rand - rand) returns values that are either 0 or >= 2^-52 (give or take a factor of two, which does not affect the argument in general). Therefore, adding a small set of random values is exact as long as no rounding occurs.
The standard example of these effects is:
1e15 + 1 - 1e15
ans = 1
1e16 + 1 - 1e16
ans = 0
Beyond a certain magnitude, rounding occurs. For a large random vector there is a higher chance that an intermediate value will exceed this limit. To understand the problem, remove the randomness:
sum(ones(1, 1e16)) + 1 - sum(ones(1, 1e16))
ans = 0
The order of elements matters. If the above is reordered, no rounding occurs:
sum([1, -1, 1, -1, 1, -1...]) + 1
ans = 1
If your computer suffers from creating a vector of 1e16 doubles (80 petabytes...), a smaller set works as well:
5e15 + 5e15 + 1 - 5e15 - 5e15
5e15 - 5e15 + 1 + 5e15 - 5e15
ans = 0
ans = 1
The computations are performed from left to right, so the intermediate values lose accuracy in the first case, but not in the second:
cumsum([5e15, 5e15, 1, -5e15, -5e15])
ans = [5e+15, 1e+16, 1e+16, 5e+15, 0]
cumsum([5e15, -5e15, 1, 5e15, -5e15])
ans = [5e+15, 0, 1, 5e+15, 1]
The same will happen with a list of random numbers, once it reaches a certain length.
Thank you all for your quick, thoughtful, and thorough answers. I was just trying to understand the sources of numerical noise in integrating what amounts to a mass-conserving wave propagation equation. I should've known that Knuth had something to say about this back in the day. Anyway, it turns out that part of the problem was that for very small values, dx/x also had a lot of noise in it, and dealing with that helped.
I don't currently have the Symbolic Toolbox, but I do wish that sometime in the future MATLAB would also offer the option of long (64-bit) computation. :)
Thank you all again, much appreciated!
Double precision uses 64 bits, divided into 1 sign bit, 11 exponent bits, and 52 mantissa bits.
MATLAB does not offer 80 bit floating point, or 100 bit floating point, or 128 bit floating point.
Note that although the x64 hardware supported by modern versions of MATLAB happens to provide 80-bit floating point, Apple Silicon (ARM) CPUs have no native 80-bit floating-point hardware, and neither does Windows ARM (MATLAB R2026a is partially supported on Windows ARM through the Prism emulator).
"Matlab would also offer the option of long (64bit) computation"
Double already is 64-bit.
Using more bits would not eliminate the behaviors of binary floating-point numbers.
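A quick way to see those 52 mantissa bits in action:
```
2^53 == 2^53 + 1   % true: 2^53 + 1 is not representable in double
2^52 == 2^52 + 1   % false: integers below 2^53 are exact
eps(1e16)          % spacing between adjacent doubles near 1e16 is 2
```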


Release: R2025b
Asked: 20 Feb 2026
Commented: 28 Feb 2026
