# How to calculate error of trapz() function?

76 views (last 30 days)
Emil Doyle on 6 Feb 2021
Edited: David Goodmanson on 15 Feb 2021
I have a set of data points, together with the error at each point in an error bar. My question is how can you calculate the error of the integral calculated using trapz() from the error bars I have at each point?

John D'Errico on 6 Feb 2021
This is not a question about MATLAB, but a question about understanding numerical integration, and perhaps a bit of statistics. Not only that, you need to appreciate that the result that trapz produces is only an approximation, because if you knew the EXACT values at each point, trapz will return only an approximation to the "true" integral of that function. In fact, the true integral could be almost anything, since we don't know what the function does between those points.
So at best, all you can ask is the uncertainty on the result that trapz would have generated, If the data had error bars of zero width at each point. Again, that still has nothing to do with MATLAB, because we still have not yet defined the meaning of your question.
So lets assume that we have a set of points, as (x,y) pairs. Assume that x is a monotonic set of values, so x(i+1) > x(i). At each point, y(i) is given, but assume a set of error bars around y(i) exist and are known.
Now, let me consider what an error bar means. Does that say that the true function is uniformly ANY value within the limits of those error bars? I'll assume that is true, but you may prefer to assume the function is normally distributed, with say 99% of the values lay inside the error bars. That would change the analysis in a somewhat minor way.
Regardless, the important thing to remember is that integration is a LINEAR operator! What does that mean? That tells us
int( F(x) + G(x) ) = int(F(x)) + int(G(x))
Here int can be used as a substitute for trapz, or any such tool.
So assume the function is given as a set of limits, lower and upper bounds on the function at each point. So now we have a vector x, and vectors ylo and yhi. The analysis is simpler if the vector x is uniformly spaced, since then trapezoidal integration is way simpler to write.
So next, we can write the problem as a SUM of two integrals. That is, break down the problem as the integral:
int(ylo(x)) + int(e(x))
where e(x) is a strictly positve, but random function that varies uniformly beween the limits 0 and yhi(x)-ylo(x).
Now we can try to discuss the distribution of the result. How do those limits vary? Are the width of those error bars uniform, so all the same size? If so, then yhi(x)-ylo(x) is a constant. In fact, now the central limit theorem starts to come into the problem. We can now compute the mean and variance of the result.
All of this is pure speculation, since I've now made a LOT of assumptinos, only some of which are probably valid.
However, I would suggest that you talk to someone with an understanding of both numerical integration and at least some knowledge of statistics. While I am an excellent person in that respect, I don't do consulting. (So please don't send me e-mail asking for me to help you directly. I won't.)
Emil Doyle on 6 Feb 2021

David Goodmanson on 7 Feb 2021
Edited: David Goodmanson on 15 Feb 2021
Hi Emil,
Although there are uncertainties, you can draw some conclusions by using reasonable assumptions. First, you have to be reasonably confident that the points are a fair representation of the function, with point spacing small enough that unusual behavior in between points is unlikely. Second, you have to know what the error bars signify. Do they give you the standard deviaton of the 'error' at each point? If not, can they be related in a definitive way to the standard deviation? [For example, if the error bars correspond to 10% and 90% bounds for a normal distribution, you can find the standard deviation with that information]. And is each point located right at the mean of each error bar, which is the usual situation? In that case you can make progress.
The case where the (x,y) points are equally spaced in x is the simplest. Let the spacing between points be h. Then ignoring some second-order details with the trapezoidal rule, the integral is approximately
I = h*(Sum{i} y(i))
Let the standard deviation be denoted by sigma, and let each point y(i) be characterized by sig(i). The errors for the points are assumed to be statistically independent. Then for N points, a standard statistical result for the standard deviation of the sum (you should check this) is
sig( Sum{i} y(i) ) = sqrt(N)*sig_rms
where sig_rms is the rms average of all the sigmas. If all the sigmas are identical, sig(i) = sig, then
sig( (Sum{i} y(i) ) = sqrt(N)*sig
which is a standard result. The integral itself includes an extra factor of h, so
sig_I = h*sqrt(N)*sig_rms
or
sig_I = h*sqrt(N)*sig
if the sigmas are all the same. This is not something you can take to the bank, but rather a decent and indicative approximation of the 'error' of the integral.
Suppose the sigmas are all the same and you extend the range of x, keeping the same h. Then sig_I goes up as sqrt(N). On the other hand, depending on the integrand, the integral itself may go up proportional to N, meaning that the relative error would go down. This illustrates the tendency for an integral to do smoothing.
Or, suppose you take new data halfway in between all the points you already have. It would appear at first glace that because of the sqrt(N) factor, sig_I goes up and there is a penalty for increased information, which hardly seems fair. But now h --> h/2 so sig_I decreases by a factor of sqrt(2).