using chi2gof to determine sample representativeness

I am trying to use the chi2gof function to test whether the collected sample data are representative of the population data. Say we have 8 bins, with the population and sample counts for each bin. Is this the correct way to do the test?
Population = [996, 749, 370, 53, 9, 3, 1, 0];
Sample = [647, 486, 100, 22, 0, 0, 0, 0];

Answers (1)

William Rose on 26 Oct 2022
Your code (below) does not work because chi2gof expects a vector x containing the observed values of the variable, not the count of observations in each cell, which is what you have provided.
There are (at least) 2 solutions.
  1. Create an x vector and a vector of bin edges, so that the count in each cell comes out to the values in your vector "Sample". Use those as inputs to chi2gof().
  2. Compute the chi-squared test statistic yourself and compare it to a critical value, using the correct degrees of freedom.
Furthermore, cells with an expected count of 0 make the chi-squared statistic blow up (division by zero). Cells with fewer than 4-5 expected observations should be combined as needed, until every cell has at least 4-5 expected. Therefore combine cells 4-8 into a single cell:
Population = [996, 749, 370, 53, 9, 3, 1, 0];
Sample = [647, 486, 100, 22, 0, 0, 0, 0];
pop2 = [996, 749, 370, sum(Population(4:8))]
pop2 = 1×4
996 749 370 66
sample2 = [647, 486, 100, sum(Sample(4:8))]
sample2 = 1×4
647 486 100 22
Now let's try method 1 above:
x=[];   % initialize x before appending to it
for i=1:length(sample2), x=[x,i*ones(1,sample2(i))]; end   % sample2(i) copies of the value i
Now run the chi-squared test using chi2gof(). The output k holds the test statistics, so we inspect it to make sure the observed counts ("O") are what we want them to be.
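The call itself is not shown in this post. A reconstruction that is consistent with the output printed below (the edges and the "E" vector) might look like the following. It is a sketch, not necessarily the exact line that was run: the name-value pairs 'Edges' and 'Expected' are assumed from the printed struct, and x is rebuilt with repelem so the snippet runs on its own. chi2gof requires the Statistics and Machine Learning Toolbox.

```matlab
% Assumed reconstruction of the missing call (not taken verbatim from the post).
pop2    = [996, 749, 370, 66];                 % pooled population counts
sample2 = [647, 486, 100, 22];                 % pooled sample counts
x     = repelem(1:numel(sample2), sample2);    % 647 ones, 486 twos, 100 threes, 22 fours
edges = 0.5:1:numel(sample2)+0.5;              % unit-wide bins centered on 1, 2, 3, 4
[h, p, k] = chi2gof(x, 'Edges', edges, 'Expected', pop2)
```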
h = 1
p = 2.9069e-95
k = struct with fields:
    chi2stat: 440.9990
          df: 3
       edges: [0.5000 1.5000 2.5000 3.5000 4.5000]
           O: [647 486 100 22]
           E: [996 749 370 66]
The observed vector "O" has the values in the "sample2" vector. That means our x vector and the edges vector worked as desired.
h=1 means the null hypothesis (that the sample data match the population) is rejected at the default 5% significance level.
The very low p-value means it is highly improbable to get the observed data from this population.
Method 2: Compute the chi2 test statistic ourselves, then compare it to the critical value with the correct degrees of freedom.
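The line that computes the statistic is not shown in the post. Assuming it is the usual Pearson sum, with pop2 used directly as the expected counts (which reproduces the value printed below), it would be along these lines:

```matlab
% Assumed reconstruction: Pearson chi-squared statistic, sum over cells of (O-E)^2/E.
pop2    = [996, 749, 370, 66];   % expected counts as used in this answer
sample2 = [647, 486, 100, 22];   % observed counts
chi2stat = sum((sample2 - pop2).^2 ./ pop2)
```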
chi2stat = 440.9990
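df=length(pop2)-1; pcrit=.05; chi2crit=chi2inv(1-pcrit,df);   % upper-tail critical value at the 5% level
h2=chi2stat>chi2crit; p2=chi2cdf(chi2stat,df,'upper');         % upper-tail p-value; avoids round-off in 1-cdf
fprintf('h=%d, p=%.3f\n',h2,p2);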
h=1, p=0.000
The chi-squared statistic, h, and p match the values we found above with Method 1 (p prints as 0.000 only because it is far smaller than three decimal places can show).
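One caveat worth checking: both methods use the raw population counts as the expected counts, although the population total (2181) is larger than the sample total (1255). A common convention for a goodness-of-fit test is to rescale the expected counts so that they sum to the sample size. The sketch below shows that variant. It is not what the answer above computes, and the variable names expected, chi2scaled, and pScaled are introduced here only for illustration.

```matlab
% Variant (assumption): scale expected counts to the sample total before testing.
pop2    = [996, 749, 370, 66];
sample2 = [647, 486, 100, 22];
expected   = pop2 / sum(pop2) * sum(sample2);            % same proportions, sums to 1255
chi2scaled = sum((sample2 - expected).^2 ./ expected);   % Pearson statistic
df      = numel(sample2) - 1;
pScaled = chi2cdf(chi2scaled, df, 'upper');              % upper-tail p-value
fprintf('chi2=%.2f, df=%d, p=%.3g\n', chi2scaled, df, pScaled);
```

With scaled expected counts the statistic measures only the difference in proportions between sample and population, not the difference in total size.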




