How to calculate the confidence interval
Hi
I have a vector x with e.g. 100 data points. I can easily calculate the mean, but now I want the 95% confidence interval. I can calculate the 95% confidence interval as follows:
CI = mean(x) +- t * (s / sqrt(n))
where s is the standard deviation and n the sample size (= 100).
Is there a method in MATLAB where I can just feed in the vector and get the confidence interval?
Or I can write my own method, but I need at least the value of t (the critical value of the t-distribution) because it depends on the number of samples and I don't want to look it up in a table every time. Is this possible?
Would be very nice if somebody could give an example.
Last but not least, I want 95% confidence in a 5% interval around the mean. For checking that I just have to calculate the 95% confidence interval and then check if the retrieved value is less than 5% of my mean, right?
4 Comments
Andrei Keino
on 18 Jul 2017
How to calculate confidence interval for linear model. It's the same for mean value, but the number of degree of freedom (dof) is equal to 1
Jennifer Wade
on 15 Feb 2022
I use something like this for a generic data vector, A.....
N = length(A)
STDmean = std(A)/sqrt(N) % standard error of the mean (standard deviation, not mean, in the numerator)
dof = N - 1; %Depends on the problem but this is standard for a CI around a mean.
studentst = tinv([.025 0.975],dof) %tinv is the student's t lookup table for the two-tailed 95% CI ...
CI = mean(A) + studentst*STDmean % lower and upper confidence limits
I'm looking into bootci now!
Jennifer Wade
on 15 Feb 2022
Sorry, just saw the same answer below!
Accepted Answer
Star Strider
on 20 Oct 2014
This works:
x = randi(50, 1, 100); % Create Data
SEM = std(x)/sqrt(length(x)); % Standard Error
ts = tinv([0.025 0.975],length(x)-1); % T-Score
CI = mean(x) + ts*SEM; % Confidence Intervals
You have to have the Statistics Toolbox to use the tinv function. If you do not have it, I can provide you with a few lines of my code that will calculate the t-probability and its inverse.
25 Comments
Sepp
on 20 Oct 2014
Edited: Sepp
on 20 Oct 2014
Thank you so much. I have the statistics toolbox. I will try it out now.
Will this compute the 97.5% confidence interval? Just asking because we have tinv([0.025 0.975]...
And is this two-sided? I need it on both sides (+ and -). I think that's what two-sided means.
Star Strider
on 20 Oct 2014
Edited: Star Strider
on 20 Oct 2014
My pleasure!
It will give you the 95% confidence interval using a two-tailed t-distribution. This is the centre 95%, so the lower and upper 2.5% tails of the distribution are not included.
Note that it also considers that you are only estimating one parameter (the mean) and so has n-1 degrees-of-freedom.
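As a minimal sketch of why the probabilities [0.025 0.975] correspond to a 95% interval, the snippet below derives them from the confidence level. The names `alpha`, `probs`, `tcrit` and `coverage` are illustrative and not part of the answer, and `tinv` requires the Statistics Toolbox.

```matlab
% Sketch: map a confidence level to the two tail probabilities passed to tinv.
% 'alpha', 'probs', 'tcrit' and 'coverage' are illustrative names.
alpha = 0.05;                          % 1 - confidence level (95% CI)
n     = 100;                           % sample size
probs = [alpha/2, 1 - alpha/2];        % [0.025 0.975]: 2.5% left in each tail
tcrit = tinv(probs, n - 1);            % critical values with n-1 degrees of freedom
coverage = diff(probs);                % central probability mass between the two tails
fprintf('t critical values: [%.4f %.4f], coverage: %.2f\n', tcrit, coverage);
```

Because the t-distribution is symmetric, the two critical values differ only in sign, which is why the interval extends equally on both sides of the mean.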
Sepp
on 21 Oct 2014
Thank you for the clarification. So CI has now two values, one above the mean and one below. So if I want to plot the confidence interval I just add (upper bound) and subtract (lower bound) the ts*SEM to the mean and plot it, right?
And if I want to check whether my measurements (one parameter) are within a 5% interval, I just calculate ts*SEM and check if it is less than 5% of the mean. Is this right?
Star Strider
on 21 Oct 2014
My pleasure.
Yes. That is what I did in my calculation of the ‘CI’ variable.
If you want to determine if a value is within the 95% confidence limits, test to see if it is >=CI(1) and <=CI(2).
Also, it is termed a 95% confidence interval, not 5%. I mention this to avoid confusion.
Sepp
on 21 Oct 2014
Edited: Sepp
on 21 Oct 2014
Hmmm ok, but I have to use the following: "For experiments, fix a target (typically 95% confidence in a 5 - 10% interval around the mean) and repeat the experiments until the level of confidence is reached."
Does this not mean just checking if CI(2) - CI(1) is < 5-10% of mean?
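One possible reading of that target is sketched here: the 95% CI half-width must not exceed 5% of the sample mean. This interpretation, the example data, and the names `relTol`, `halfWidth`, `relHalfWidth` and `targetMet` are assumptions made for illustration; `tinv` requires the Statistics Toolbox.

```matlab
% Sketch of a relative-precision check: is the 95% CI half-width within a
% chosen fraction of the sample mean? Data and names are illustrative.
rng(1);                                    % reproducible example data
x      = 100 + 20*randn(1, 200);           % hypothetical throughput measurements
relTol = 0.05;                             % target: half-width <= 5% of the mean
n            = numel(x);
SEM          = std(x)/sqrt(n);             % standard error of the mean
halfWidth    = tinv(0.975, n-1)*SEM;       % 95% CI half-width
relHalfWidth = halfWidth/abs(mean(x));     % half-width relative to the mean
targetMet    = relHalfWidth <= relTol;     % true once the target precision is reached
fprintf('Relative half-width: %.2f%% (target met: %d)\n', 100*relHalfWidth, targetMet);
```

If the target is meant as the full width CI(2) - CI(1) instead, compare 2*halfWidth against the tolerance. Either way the ratio is only meaningful when the mean is well away from zero.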
Star Strider
on 21 Oct 2014
I’d have to know more about what you’re doing. The statement "For experiments, fix a target (typically 95% confidence in a 5 - 10% interval around the mean) and repeat the experiments until the level of confidence is reached." makes no sense to me. The confidence interval is defined by the parameter (or parameters) you are estimating. You can’t play fast and loose with the definition!
Sepp
on 21 Oct 2014
If I collect more data points, i.e. if N increases, then the confidence interval will most likely get narrower. I have to do experiments where I'm measuring the throughput over e.g. 10 minutes. If I ran it for 30 minutes I would get more data points and thus the CI would be narrower.
Star Strider
on 21 Oct 2014
True, but that also assumes the standard deviation (SD) does not change. If you collect 3x as many samples, and your SD remains the same, your standard error (SE) would decrease by about 40%, 1-1/sqrt(3). The t-distribution approaches the normal distribution above about 30 degrees-of-freedom, so the t-statistic would not change significantly.
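The two numerical claims here can be checked with a short sketch. The degrees of freedom chosen are illustrative; `tinv` requires the Statistics Toolbox, while the normal limit is computed from `erfinv` in base MATLAB.

```matlab
% Sketch: effect of tripling the sample size when the SD stays the same,
% and how the t critical value approaches the normal one.
seReduction = 1 - 1/sqrt(3);               % fractional drop in standard error
dof   = [29 99 299];                       % illustrative degrees of freedom
tcrit = tinv(0.975, dof);                  % two-sided 95% critical values
zcrit = sqrt(2)*erfinv(0.95);              % normal-distribution limit (0.975 quantile)
fprintf('SE reduction: %.1f%%\n', 100*seReduction);
fprintf('dof = %3d: t = %.4f\n', [dof; tcrit]);
fprintf('normal limit: z = %.4f\n', zcrit);
```

The critical value moves only slightly beyond about 30 degrees of freedom, so the narrowing of the interval comes almost entirely from the 1/sqrt(n) factor in the standard error.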
Star Strider
on 22 Oct 2014
You could certainly do that, but I’m not sure how meaningful it would be. Consider two vectors of random numbers with the same (normal) distribution, the only difference between them being a fixed offset:
x1 = randn(1,100);
x2 = randn(1,100)+10;
SEM1 = std(x1)/sqrt(length(x1));
SEM2 = std(x2)/sqrt(length(x2));
RR1 = SEM1/mean(x1);
RR2 = SEM2/mean(x2);
Taking the ratio as in ‘RR1’ and ‘RR2’ would produce a significantly lower ratio for ‘RR2’ (in this illustration) in spite of the data themselves being essentially the same.
The way I understand your latest comment (no promises that I do), you might want to compare CI values with increasing numbers of data points and compare them. I suspect they will become asymptotic to some non-zero value and not decrease further.
To illustrate:
x = randn(1,1E+6);
st = 10000;
for k1 = 1:st:length(x)
    xs = x(1:st+(k1-1));
    SE = std(xs)/sqrt(length(xs));
    xss(1+fix((k1-1)/st)) = length(xs);
    CI(1+fix((k1-1)/st)) = SE*tinv(0.975,length(xs)-1)*2;
end
figure(1)
stairs(xss,CI)
grid
xlabel('Sample Size')
ylabel('CI')
I’m certain there is an analytic proof of this available, but I’m not up to looking for it just now.
Sepp
on 24 Oct 2014
Edited: Sepp
on 24 Oct 2014
Hi. Sorry for disturbing you again. I have now played around with the confidence interval and I noticed some very strange behaviour.
I have the following values: length(x) = 4508338, std(x) = 2036.04818, mean(x) = 1246.88844.
With your formula I get ts*SEM = +- 1.879438088697457, which is incredibly small compared to the standard deviation.
Why? I thought that when having a very large standard deviation I also must have a large confidence interval.
Star Strider
on 24 Oct 2014
That is a correct result.
This is the confidence interval for the mean, indicating that these are the limits based on the sample that would include the mean of the population. So the larger your sample, the more likely you are to estimate the mean of the population, and therefore the confidence interval decreases with increasing sample size. As the sample size approaches infinity, the standard error and therefore the confidence interval approach zero.
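As a rough check, the quoted numbers can be reproduced in base MATLAB. This sketch assumes the large-sample normal approximation: with about 4.5 million degrees of freedom the t critical value is essentially the normal one, so `erfinv` is used in place of `tinv`.

```matlab
% Check of the values quoted in this thread using the normal approximation
% (base MATLAB only; no toolbox needed).
n = 4508338;                               % length(x) as reported
s = 2036.04818;                            % std(x) as reported
SEM       = s/sqrt(n);                     % standard error of the mean
zcrit     = sqrt(2)*erfinv(0.95);          % about 1.96, two-sided 95%
halfWidth = zcrit*SEM;                     % CI half-width
fprintf('SEM = %.4f, half-width = %.4f\n', SEM, halfWidth);
```

The half-width comes out at roughly 1.88, matching the value reported, because the large standard deviation is divided by sqrt(n), which is about 2123 here.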
Sepp
on 25 Oct 2014
Edited: Sepp
on 25 Oct 2014
Ok, but why then is the standard deviation twice as high as the mean? Do the standard deviation and the confidence interval not correlate? If I look only at the standard deviation, it tells me that the measurement is very bad (inaccurate), but if I look only at the confidence interval, it tells me that the measurement is very good (accurate).
Star Strider
on 25 Oct 2014
The standard deviation tells you the dispersion of your data. The confidence interval on the mean is calculated from the standard deviation, so in that sense they definitely correlate. However the confidence interval on the mean is an estimate of the dispersion of the true population mean, and since you are usually comparing means of two or more populations to see if they are different, or to see if the mean of one population is different from zero (or some other constant), that is appropriate.
A standard deviation twice the mean indicates that the data can go negative a large part of the time (about 27% based on my normcdf calculation). If they cannot — if they are always positive — then the normal distribution is not appropriate and you have to use an alternate distribution, depending on the nature of your data and the distribution that best describes it (lognormal for instance).
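The 27% figure can be reproduced without `normcdf`. The sketch below assumes the data are normally distributed with the mean and standard deviation reported earlier in the thread, and evaluates the normal CDF at zero through `erfc`; the variable names are illustrative.

```matlab
% Sketch: fraction of a normal distribution that lies below zero,
% using the mean and SD quoted in the thread (base MATLAB only).
m = 1246.88844;                            % mean(x) as reported
s = 2036.04818;                            % std(x) as reported
z = (0 - m)/s;                             % standardised distance of zero from the mean
pNegative = 0.5*erfc(-z/sqrt(2));          % normal CDF evaluated at z
fprintf('P(x < 0) = %.1f%%\n', 100*pNegative);
```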
Adam Danz
on 21 Aug 2019
Edited: Adam Danz
on 12 Jul 2022
Here's an anonymous function based on Star Strider's answer. It uses tinv() which means the stats toolbox is required. This function also uses "omitnan" flags so that NaN values are ignored which requires r2016a or later. Note that the t-distribution method assumes the data form an approximately normal distribution but this can be fairly robust to skewed data.
% x is a vector, matrix, or any numeric array of data. NaNs are ignored.
% p is the confidence level (ie, 95 for 95% CI)
% The output is 1x2 vector showing the [lower,upper] interval values.
CIFcn = @(x,p)std(x(:),'omitnan')/sqrt(sum(~isnan(x(:)))) * tinv(abs([0,1]-(1-p/100)/2),sum(~isnan(x(:)))-1) + mean(x(:),'omitnan');
Alternatively, you could compute CI of the mean using bootstrapping along with the percentile method. This approach does not assume a normal distribution and is more robust than the t-distribution method.
Here's a demo comparing both methods to show a small difference in CI.
Generate skewed data
rng('default')
x = raylrnd(5,[1,2000]); % requires stats & machine learning toolbox
Compute CI using the t-distribution method
CIFcn = @(x,p)std(x(:),'omitnan')/sqrt(sum(~isnan(x(:)))) * tinv(abs([0,1]-(1-p/100)/2),sum(~isnan(x(:)))-1) + mean(x(:),'omitnan');
p = 95;
CItdist = CIFcn(x,p)
CItdist = 1×2
6.1236 6.4105
Compute CI using bootstrapping & percentile method
bootci requires the stats & machine learning toolbox. However, this is fairly easy to compute without the bootci function. Simply create a for-loop with n iterations for n bootstraps (I've chosen 1000 here). In each iteration of the for-loop, sample your data with replacement (use the randi function) and store the mean of the resampled data. After you have n means, compute the 95% CI of the means using prctile.
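A minimal sketch of that manual loop follows. The example data, `nBoot`, `bootMeans` and `CIboot` are illustrative names and assumptions; the skewed data are generated from base functions, but `prctile` still requires the Statistics and Machine Learning Toolbox.

```matlab
% Sketch of the manual percentile bootstrap described in the text.
% Example data and variable names are illustrative.
rng('default')                             % reproducible resampling
x = 5*sqrt(-2*log(rand(1, 2000)));         % skewed (Rayleigh-distributed) example data
nBoot = 1000;                              % number of bootstrap resamples
n = numel(x);
bootMeans = zeros(1, nBoot);               % preallocate storage for the resampled means
for k = 1:nBoot
    idx = randi(n, 1, n);                  % sample indices with replacement
    bootMeans(k) = mean(x(idx));           % statistic of the resampled data
end
CIboot = prctile(bootMeans, [2.5 97.5]);   % 95% percentile interval of the means
disp(CIboot)
```

Swapping `mean` for `median` inside the loop gives the corresponding interval for the median.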
Note, I would not use mean as the statistic for a non-normal distribution. The median would be a much better approach.
[CIbsMean, CImeans] = bootci(1000, {@mean, x}, 'type','per','alpha', 0.05);
disp(CIbsMean')
6.1355 6.4138
Plot the results.
The first axes show the distribution of the raw data and both sets of CIs. You can see that they are so close that they nearly overlap. The second axes show the same sets of CIs but magnified to show the difference. The last axes show the distribution of the means from the bootstrap. Notice that they are approximately normally distributed even though the underlying data are not normally distributed. Herein lies the magic of bootstrapping with the percentile method. Thanks to the central limit theorem, the distribution of bootstrapped means will be approximately normal for a sufficiently large sample, provided the underlying distribution has finite variance.
figure()
tiledlayout(3,1,'TileSpacing','Compact');
nexttile
histogram(x)
x1 = xline(CItdist,'k:','LineWidth',1,'DisplayName','tinv');
x2 = xline(CIbsMean,'m--','LineWidth',1,'DisplayName','BootMean');
x3 = xline(mean(x),'k-','DisplayName','mean');
legend([x1(1),x2(1),x3],'Location','EastOutside')
title('CI and underlying data')
nexttile
x1 = xline(CItdist,'k:','LineWidth',1,'DisplayName','tinv');
x2 = xline(CIbsMean,'m--','LineWidth',1,'DisplayName','BootMean');
x3 = xline(mean(x),'k-','DisplayName','mean');
legend([x1(1),x2(1),x3],'Location','EastOutside')
title('CIs')
box on
nexttile
histogram(CImeans)
xline(CIbsMean,'m--','LineWidth',1,'DisplayName','BootMean');
title('Bootstrapped means')
Lastly, for people looking to compute the bootstrapped CI on the distribution rather than the mean of the distribution, you can simply use the prctile function:
% x is a vector, matrix, or any numeric array of data. NaNs are ignored.
% p is the confidence level (ie, 95 for 95% CI)
% The output is 1x2 vector showing the [lower,upper] interval values.
CIFcn = @(x,p)prctile(x,abs([0,100]-(100-p)/2));
Demo:
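figure
x = pearsrnd(0,1,1,4,100,1); % requires stats & machine learning toolbox
histogram(x);
p = 95;
% CI of the mean (t-distribution method), named CIMeanFcn so that it is
% not overwritten by the prctile version of CIFcn defined above
CIMeanFcn = @(x,p)std(x(:),'omitnan')/sqrt(sum(~isnan(x(:)))) * tinv(abs([0,1]-(1-p/100)/2),sum(~isnan(x(:)))-1) + mean(x(:),'omitnan');
CItdist = CIMeanFcn(x,p);
xline(CItdist,'k:','LineWidth',1,'DisplayName','CI Mean')
% Percentile interval of the data themselves
CIFcn = @(x,p)prctile(x,abs([0,100]-(100-p)/2));
CIDist = CIFcn(x,95);
xline(CIDist,'m--','LineWidth',1,'DisplayName','CI Distribution')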
Phi Phan
on 18 Feb 2022
Hi @Star Strider, thank you for the useful comments. I'd like to calculate the CI but I do not have the Statistics Toolbox. It would be a great help if you could share the code you offered to provide. Thank you very much in advance!
Star Strider
on 18 Feb 2022
Phi Phan —
Sure!
I am posting them as commented code in order to include the ‘documentation’ for them (such as it is), however it is only necessary to remove the comments in the appropriate lines (anonymous functions) to use the functions —
% % % % % T-DISTRIBUTIONS —
% % Variables:
% % t: t-statistic
% % v: degrees of freedom
%
% tdist2T = @(t,v) (1-betainc(v/(v+t^2),v/2,0.5)); % 2-tailed t-distribution
% tdist1T = @(t,v) 1-(1-tdist2T(t,v))/2; % 1-tailed t-distribution
%
% % This calculates the inverse t-distribution (parameters given the
% % probability ‘alpha’ and degrees of freedom ‘v’):
% t_inv = @(alpha,v) fzero(@(tval) (max(alpha,(1-alpha)) - tdist1T(tval,v)), 5); % T-Statistic Given Probability ‘alpha’ & Degrees-Of-Freedom ‘v’
These use only basic MATLAB functions. Another option for ‘t_inv’ could be interp1, however I have never used it in this context.
Ishmaal Erekson
on 12 Jul 2022
@Adam Danz, in your example provided 21 Aug 2019, you compare the tinv method to the prctile method, stating that the prctile method is more robust because it doesn't assume a normal distribution. I am somewhat new to these statistical methods, so please correct me if I am wrong, but aren't those methods calculating two different things? The tinv method you use provides the confidence interval of the mean and as explained by Star Strider, will decrease with increased number of samples. The prctile method you use will simply tell you the bounds in which lies 95% of your samples -- assuming sufficient samples are taken, the confidence intervals using this method will not change much as the number of samples increases.
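The distinction can be illustrated with a short sketch. It assumes standard normal data and two illustrative sample sizes, and uses `tinv` and `prctile` from the Statistics Toolbox; the variable names are not from the thread.

```matlab
% Sketch: the CI of the mean narrows with sample size, while the central
% 95% range of the data themselves does not. Sizes and names are illustrative.
rng(0);
sizes     = [100 10000];                   % two sample sizes to compare
ciWidth   = zeros(size(sizes));            % width of the 95% CI of the mean
dataWidth = zeros(size(sizes));            % width of the central 95% of the data
for k = 1:numel(sizes)
    x = randn(1, sizes(k));                % standard normal samples
    n = numel(x);
    ciWidth(k)   = 2*tinv(0.975, n-1)*std(x)/sqrt(n);
    dataWidth(k) = diff(prctile(x, [2.5 97.5]));
end
fprintf('n = %5d: CI width = %.3f, data range width = %.3f\n', [sizes; ciWidth; dataWidth]);
```

With 100 times more samples the CI of the mean should shrink by roughly a factor of 10, while the percentile range of the data stays near 3.92, the width of the central 95% of a standard normal distribution.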
Is there a way to calculate the confidence interval on the mean (such as is done in the tinv method), but without assuming a normal distribution?
Adam Danz
on 12 Jul 2022
Thanks for the comment, @Ishmaal Erekson. You're correct. My 2019 comment was misleading and I'll update it to avoid confusion.
> Is there a way to calculate the confidence interval on the mean (such as is done in the tinv method), but without assuming a normal distribution?
Yes. I'll update my comment in a moment to include this.
Agata Oskroba
on 26 Nov 2022
@Star Strider could you explain how to get the t-score without the toolbox?
Star Strider
on 26 Nov 2022
Try this —
% Variables:
% t: t-statistic
% v: degrees of freedom
tdist2T = @(t,v) (1-betainc(v/(v+t^2),v/2,0.5)); % 2-tailed t-distribution
tdist1T = @(t,v) 1-(1-tdist2T(t,v))/2; % 1-tailed t-distribution
% This calculates the inverse t-distribution (parameters given the
% probability ‘alpha’ and degrees of freedom ‘v’):
t_inv = @(alpha,v) fzero(@(tval) (max(alpha,(1-alpha)) - tdist1T(tval,v)), 5); % T-Statistic Given Probability ‘alpha’ & Degrees-Of-Freedom ‘v’
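A usage sketch for these functions follows, assuming the same kind of data as in the accepted answer. The three definitions are repeated so the block runs on its own; `tcrit`, `SEM` and `CI` are illustrative names. Note that `t_inv` as written returns only the positive critical value, so the sketch mirrors it around the mean.

```matlab
% Usage sketch for the toolbox-free functions (base MATLAB only).
tdist2T = @(t,v) (1-betainc(v/(v+t^2),v/2,0.5));       % 2-tailed t-distribution
tdist1T = @(t,v) 1-(1-tdist2T(t,v))/2;                 % 1-tailed t-distribution
t_inv   = @(alpha,v) fzero(@(tval) (max(alpha,(1-alpha)) - tdist1T(tval,v)), 5);

rng(0);                                    % reproducible example data
x = randi(50, 1, 100);                     % same kind of data as in the accepted answer
n = numel(x);
tcrit = t_inv(0.975, n-1);                 % positive critical value for 99 degrees of freedom
SEM   = std(x)/sqrt(n);                    % standard error of the mean
CI    = mean(x) + [-1 1]*tcrit*SEM;        % mirror the critical value for the two-sided CI
fprintf('t = %.4f, CI = [%.3f %.3f]\n', tcrit, CI);
```

Since `tdist2T` uses `t^2`, it accepts scalar `t` only, which is what `fzero` supplies here.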
Niraj Desai
on 25 Aug 2023
@Star Strider Thank you so much for your answers (over the course of eight years !!!) I realize this thread started in 2014, but I only found it today. It clarified something that I had been confused about. I'm grateful.
Star Strider
on 25 Aug 2023
@Niraj Desai — My pleasure!